Abstract
We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10¹² URIs from over 5×10⁹ pages crawled in April 2014 and April 2017; the second study adds a further 3×10⁹ pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
Original language | English |
---|---|
Title of host publication | Companion Proceedings of the The Web Conference 2018 |
Place of Publication | Lyon, France |
Publisher | International World Wide Web Conferences Steering Committee |
Pages | 1749-1755 |
Number of pages | 7 |
ISBN (Electronic) | 978-1-4503-5640-4 |
Publication status | Published - 23 Apr 2018 |
Event | The Web Conference 2018, Lyon, France. Duration: 23 Apr 2018 → 27 Apr 2018. https://www2018.thewebconf.org/ |
Conference
Conference | The Web Conference 2018 |
---|---|
Abbreviated title | TheWebConf 2018 |
Country/Territory | France |
City | Lyon |
Period | 23/04/18 → 27/04/18 |
Keywords / Materials (for Non-textual outputs)
- temporal web analytics
- Digital Object Identifier
- longitudinal web crawl analysis
- Uniform Resource Identifier
- persistent identifier
- Common Crawl
Fingerprint
Dive into the research topics of 'Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time'. Together they form a unique fingerprint.
Profiles
- Henry Thompson
- School of Informatics - Personal Chair in Web Informatics
- Institute of Language, Cognition and Computation
Person: Academic: Research Active