Abstract / Description of output
Extracting relevant information from legal documents is a challenging task due to the technical complexity and volume of their content. These factors also increase the costs of annotating large datasets, which are required to train state-of-the-art summarization systems. To address these challenges, we introduce CivilSum, a collection of 23,350 legal case decisions from the Supreme Court of India and Indian High Courts paired with human-written summaries. Compared to previous datasets such as IN-Abs, CivilSum not only has more legal decisions but also provides shorter and more abstractive summaries, thus offering a challenging benchmark for legal summarization. Our analysis shows that, unlike in other domains such as news articles, the most important content tends to appear at the end of the documents. We measure the effect of this tail bias on summarization performance using strong architectures for long-document abstractive summarization, and the results highlight the importance of long sequence modeling for the proposed task. CivilSum and related code are publicly available to the research community to advance text summarization in the legal domain.
Original language | English |
---|---|
Title of host publication | SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval |
Publisher | ACM |
Pages | 2241-2250 |
Number of pages | 10 |
ISBN (Electronic) | 9798400704314 |
DOIs | |
Publication status | Published - 11 Jul 2024 |
Event | 47th International ACM SIGIR Conference on Research and Development in Information Retrieval - Washington D. C., United States. Duration: 14 Jul 2024 → 18 Jul 2024. https://sigir-2024.github.io/ |
Conference
Conference | 47th International ACM SIGIR Conference on Research and Development in Information Retrieval |
---|---|
Abbreviated title | SIGIR 2024 |
Country/Territory | United States |
City | Washington D. C. |
Period | 14/07/24 → 18/07/24 |
Internet address | |
Keywords / Materials (for Non-textual outputs)
- abstractive text summarization
- dataset
- legal document summarization
- legal IR