Do Language Models Learn about Legal Entity Types during Pretraining?

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Language Models (LMs) have proven their ability to acquire diverse linguistic knowledge during the pretraining phase, potentially serving as a valuable source of incidental supervision for downstream tasks. However, there has been limited research on the retrieval of domain-specific knowledge, and of legal knowledge in particular. We propose to explore the task of Entity Typing as a proxy for evaluating legal knowledge: it is an essential aspect of text comprehension and a foundational task for numerous downstream legal NLP applications. Through systematic evaluation and analysis with two types of prompting (cloze sentences and QA-based templates), and to clarify the nature of the acquired cues, we compare diverse types and lengths of entities (both general and domain-specific), semantic and syntactic signals, different LM pretraining corpora (generic and legal-oriented), and different architectures (BERT-based encoders and the decoder-only Llama2). We show that (1) Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates, (2) law-oriented LMs show inconsistent performance, possibly due to variations in their training corpora, (3) LMs demonstrate the ability to type entities even in the case of multi-token entities, (4) all models struggle with entities belonging to sub-domains of the law, and (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures. The code of the experiments is available at probing_legal_entity_types.
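
As a rough illustration of the two prompting styles named in the abstract, the sketch below builds one cloze sentence and one QA-style prompt for a given entity. The template wording, the function names, and the example entity are assumptions made for illustration only; the abstract does not give the templates actually used in the paper.

```python
# Hypothetical templates: the wording is an assumption, not taken from the paper.
CLOZE_TEMPLATE = "{entity} is a {mask}."
QA_TEMPLATE = "Question: What type of entity is {entity}?\nAnswer:"


def build_cloze_prompt(entity: str, mask_token: str = "[MASK]") -> str:
    """Build a cloze sentence; a masked LM would fill the mask with a type label."""
    return CLOZE_TEMPLATE.format(entity=entity, mask=mask_token)


def build_qa_prompt(entity: str) -> str:
    """Build a QA-style prompt; a decoder-only LM would continue after 'Answer:'."""
    return QA_TEMPLATE.format(entity=entity)


if __name__ == "__main__":
    example_entity = "The Supreme Court"  # illustrative entity, not from the paper's data
    print(build_cloze_prompt(example_entity))
    print(build_qa_prompt(example_entity))
```

The cloze form suits BERT-style models that predict a masked token, while the QA form suits a generative model such as Llama2 that completes the text after the prompt.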
Original language: English
Title of host publication: Proceedings of the Natural Legal Language Processing Workshop (NLLP 23)
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 13
ISBN (Electronic): 979-8-89176-054-7
Publication status: Published - 7 Dec 2023
Event: The 5th Natural Legal Language Processing Workshop 2023, Singapore
Duration: 7 Dec 2023 → …
Conference number: 5


Workshop: The 5th Natural Legal Language Processing Workshop 2023
Abbreviated title: NLLP 2023
Period: 7/12/23 → …

