Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 x 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 x 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
Original language: English
Title of host publication: Companion Proceedings of the The Web Conference 2018
Place of Publication: Lyon, France
Publisher: International World Wide Web Conferences Steering Committee
Number of pages: 7
ISBN (Electronic): 978-1-4503-5640-4
Publication status: Published - 23 Apr 2018
Event: The Web Conference 2018 - Lyon, France
Duration: 23 Apr 2018 to 27 Apr 2018


Conference: The Web Conference 2018
Abbreviated title: TheWebConf 2018
Keywords / Materials (for Non-textual outputs)

  • temporal web analytics
  • Digital Object Identifier
  • longitudinal web crawl analysis
  • Uniform Resource Identifier
  • persistent identifier
  • Common Crawl

