A document-feature matrix (DFM) containing word frequencies for 1,118,397 news articles that mention the term "humanitarian*", published in English-language media between 1 January 2010 and 15 August 2020. Features in the DFM have not been lowercased, but the following elements have been removed: punctuation, symbols, numbers, separators, commonly used stopwords and words with two or fewer characters. Stopwords were removed using Porter's Snowball list of 175 common English-language words. The DFM includes 79,229 features.
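For readers who want to prepare comparable text of their own, the preprocessing described above can be approximated in quanteda roughly as follows. This is only a sketch and not the authors' actual pipeline: the example sentence and the object names (`toy_text`, `toy_tokens`, `toy_dfm`) are hypothetical, and the exact options and order of steps used to build the published DFM are not documented here.

```r
library(quanteda)

# Hypothetical example sentence, not taken from the dataset.
toy_text <- c(doc1 = "The humanitarian response in 2015 reached 3 regions, an NGO said.")

# Drop punctuation, symbols, numbers and separators while tokenising.
toy_tokens <- tokens(
  toy_text,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_separators = TRUE
)

# Remove English stopwords from the Snowball list
# (matching is case-insensitive by default, so "The" is removed too).
toy_tokens <- tokens_remove(toy_tokens, stopwords("en", source = "snowball"))

# Keep only tokens with at least three characters.
toy_tokens <- tokens_keep(toy_tokens, pattern = "*", min_nchar = 3)

# Build the DFM without lowercasing, as in the dataset description.
toy_dfm <- dfm(toy_tokens, tolower = FALSE)

featnames(toy_dfm)
```

Because `tolower = FALSE`, capitalised and lowercase variants of the same word remain separate features, which is worth keeping in mind when counting terms in the published DFM.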
For each news article in the DFM, the following metadata are included: news organisation where it was published (news_source_name), country of the news organisation in ISO-3 format (country_iso3), continent of the news organisation in ISO-2 format (continent_iso2), reach of the news organisation (media_reach) and publication date in YYYYMMDD format (publication_date).
The file is in RDS format, which must be read using the R programming language. It contains a single object called humanitarian_dfm, which was created with R's quanteda package. This object can be converted to other formats commonly used for text mining, NLP and computational text analysis using quanteda's "convert" function.
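A minimal sketch of loading and inspecting the object in R is given below. The file name `humanitarian_dfm.rds` is an assumption (use the name of the downloaded file), and the small fallback DFM with made-up documents and metadata values is purely illustrative so that the snippet runs without the download. The docvar names are the ones listed in the description above; how `publication_date` is stored (number or string) is not stated, so the date parsing handles both.

```r
library(quanteda)

# Assumed local file name; adjust to match the downloaded file.
rds_path <- "humanitarian_dfm.rds"

if (file.exists(rds_path)) {
  humanitarian_dfm <- readRDS(rds_path)
} else {
  # Hypothetical stand-in with the same docvar names, for illustration only.
  toy_corpus <- corpus(
    c(d1 = "Humanitarian aid reached the region", d2 = "Humanitarian crisis deepened"),
    docvars = data.frame(
      news_source_name = c("Source A", "Source B"),
      country_iso3 = c("GBR", "USA"),
      publication_date = c("20100101", "20200815")
    )
  )
  humanitarian_dfm <- dfm(tokens(toy_corpus), tolower = FALSE)
}

# Basic dimensions and most frequent features.
ndoc(humanitarian_dfm)
nfeat(humanitarian_dfm)
topfeatures(humanitarian_dfm, 10)

# Article-level metadata is stored as document variables.
meta <- docvars(humanitarian_dfm)
pub_dates <- as.Date(as.character(meta$publication_date), format = "%Y%m%d")

# Subset before converting: a dense data frame of the full matrix
# would be far too large to hold in memory.
gbr_dfm <- dfm_subset(humanitarian_dfm, country_iso3 == "GBR")
gbr_df <- convert(head(gbr_dfm, 100), to = "data.frame")
```

Other targets accepted by `convert()` include `"matrix"`, `"tm"`, `"stm"` and `"topicmodels"`; some of these require the corresponding package to be installed.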
| Date made available | 27 Mar 2024 |
|---|---|
| Publisher | Edinburgh DataShare |
| Temporal coverage | 1 Jan 2010 - 15 Aug 2020 |
| Geographical coverage | Global |