Edinburgh Research Explorer

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Final published version, 1.44 MB, PDF document

Licence: Creative Commons: Attribution (CC-BY)

Original language: English
Title of host publication: Companion Proceedings of the The Web Conference 2018
Place of Publication: Lyon, France
Publisher: International World Wide Web Conferences Steering Committee
Number of pages: 7
ISBN (Electronic): 978-1-4503-5640-4
Publication status: Published - 23 Apr 2018
Event: The Web Conference 2018 - Lyon, France
Duration: 23 Apr 2018 to 27 Apr 2018

Publication series

Name: WWW '18

Conference: The Web Conference 2018
Abbreviated title: TheWebConf 2018

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 x 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 x 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
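
As a rough illustration of the kind of longitudinal comparison the abstract describes, the sketch below computes the share of URIs per crawl that point at a DOI resolver. It is not the authors' tooling: the function names, the toy input, and the choice to treat only `doi.org` and `dx.doi.org` as persistent-identifier hosts are all assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: only these two DOI resolver hosts count as persistent-identifier
# URIs here; the criteria used in the paper are not stated in this record.
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def is_doi_uri(uri: str) -> bool:
    """Return True if the URI's host is one of the assumed DOI resolvers."""
    try:
        host = urlsplit(uri).hostname
    except ValueError:
        # Malformed URIs (e.g. broken IPv6 brackets) are treated as non-DOI.
        return False
    return host is not None and host in DOI_HOSTS


def doi_share_by_crawl(uris_by_crawl):
    """Map each crawl label to the fraction of its URIs that are DOI URIs."""
    shares = {}
    for crawl, uris in uris_by_crawl.items():
        counts = Counter(is_doi_uri(u) for u in uris)
        total = sum(counts.values())
        shares[crawl] = counts[True] / total if total else 0.0
    return shares


# Hypothetical toy input, not data from the study.
sample = {
    "2014-04": ["http://dx.doi.org/10.1000/182", "http://example.com/a"],
    "2017-04": [
        "https://doi.org/10.1000/182",
        "https://doi.org/10.1000/183",
        "http://example.com/b",
        "http://example.com/c",
    ],
}

if __name__ == "__main__":
    print(doi_share_by_crawl(sample))
```

Comparing shares rather than raw counts is a deliberate choice in this sketch, since the crawls differ in size; whether shares are comparable across crawls at all is one of the reliability questions the paper raises.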

    Research areas

  • temporal web analytics, Digital Object Identifier, longitudinal web crawl analysis, Uniform Resource Identifier, persistent identifier, Common Crawl




ID: 59251483