CivilSum: A dataset for abstractive summarization of Indian court decisions

Manuj Malik, Zheng Zhao, Marcio Fonseca, Shrisha Rao, Shay B. Cohen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Extracting relevant information from legal documents is a challenging task due to the technical complexity and volume of their content. These factors also increase the costs of annotating large datasets, which are required to train state-of-the-art summarization systems. To address these challenges, we introduce CivilSum, a collection of 23,350 legal case decisions from the Supreme Court of India and other Indian High Courts paired with human-written summaries. Compared to previous datasets such as IN-Abs, CivilSum not only has more legal decisions but also provides shorter and more abstractive summaries, thus offering a challenging benchmark for legal summarization. Unlike other domains such as news articles, our analysis shows the most important content tends to appear at the end of the documents. We measure the effect of this tail bias on summarization performance using strong architectures for long-document abstractive summarization, and the results highlight the importance of long sequence modeling for the proposed task. CivilSum and related code are publicly available to the research community to advance text summarization in the legal domain.
Original language: English
Title of host publication: SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: ACM
Pages: 2241-2250
Number of pages: 10
ISBN (Electronic): 9798400704314
Publication status: Published - 11 Jul 2024
Event: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval - Washington D. C., United States
Duration: 14 Jul 2024 to 18 Jul 2024
https://sigir-2024.github.io/

Conference

Conference: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Abbreviated title: SIGIR 2024
Country/Territory: United States
City: Washington D. C.
Period: 14/07/24 to 18/07/24

Keywords / Materials (for Non-textual outputs)

  • abstractive text summarization
  • dataset
  • legal document summarization
  • legal IR
