Abstract
Modern computational text classification methods have brought social scientists tantalizingly close to unlocking the insights buried in text data, from centuries of historical documents to streams of social media posts. Yet three barriers still stand in the way: the tedious labor of manual text annotation; the technical complexity that keeps these tools out of reach for many researchers; and, perhaps most critically, the gap between sophisticated algorithms and the deep theoretical understanding social scientists have already developed of human interactions, social structures, and institutions. To address these limitations, we propose an approach to large-scale text analysis that requires substantially less human-labeled data and no machine learning expertise, and that efficiently integrates the social scientist into critical steps of the workflow. This approach, which detects statements in text, relies on large language models pre-trained for natural language inference and on a “few-shot” threshold-tuning algorithm rooted in active learning principles. We describe and showcase our approach by analyzing tweets collected during the 2020 U.S. presidential election campaign, and we benchmark it against various computational approaches across three datasets.
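
To make the idea of threshold tuning with a few labels concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that a natural language inference model has already assigned each text an entailment score between 0 and 1 for the statement of interest. The function names, the F1 selection criterion, and the uncertainty-sampling rule are all illustrative assumptions swapped in as generic stand-ins.

```python
from typing import Dict, List


def f1_at_threshold(scores: List[float], labels: List[bool], threshold: float) -> float:
    """F1 of the rule 'score >= threshold means the statement is present'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_threshold(scores: List[float], labels: List[bool]) -> float:
    """Pick the observed score that gives the best F1 on the few labeled texts.

    Assumption: F1 is the selection criterion; the paper may use another one.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


def next_to_label(pool_scores: Dict[str, float], threshold: float) -> str:
    """Generic uncertainty sampling: return the id of the unlabeled text
    whose score lies closest to the current threshold."""
    return min(pool_scores, key=lambda k: abs(pool_scores[k] - threshold))


# Hypothetical entailment scores and human labels for four annotated texts.
labeled_scores = [0.1, 0.4, 0.6, 0.9]
labeled_truth = [False, False, True, True]
threshold = tune_threshold(labeled_scores, labeled_truth)

# Hypothetical pool of unlabeled texts; the researcher would annotate this one next.
query = next_to_label({"a": 0.2, "b": 0.55, "c": 0.95}, threshold)
```

In a loop of this kind, the researcher labels the queried text, the threshold is re-tuned on the enlarged labeled set, and the process repeats until the threshold stabilizes or the labeling budget runs out.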
Original language | English |
---|---|
Pages (from-to) | 1-48 |
Number of pages | 48 |
Journal | Sociological Methods & Research |
Publication status | Published - 18 Apr 2025 |

Keywords
- text analysis
- natural language processing
- computational methods
- active learning
- few-shot learning
- large language models