Edinburgh Research Explorer

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access permissions

Open

Documents

    Final published version, 1.44 MB, PDF document

    Licence: Creative Commons: Attribution (CC-BY)

https://dl.acm.org/citation.cfm?doid=3184558.3191636
Original language: English
Title of host publication: Companion Proceedings of the The Web Conference 2018
Place of Publication: Lyon, France
Publisher: International World Wide Web Conferences Steering Committee
Pages: 1749-1755
Number of pages: 7
ISBN (Electronic): 978-1-4503-5640-4
Publication status: Published - 23 Apr 2018
Event: The Web Conference 2018 - Lyon, France
Duration: 23 Apr 2018 to 27 Apr 2018
https://www2018.thewebconf.org/

Publication series

Name: WWW '18

Conference

Conference: The Web Conference 2018
Abbreviated title: TheWebConf 2018
Country: France
City: Lyon
Period: 23/04/18 to 27/04/18

Abstract

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5×10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3×10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.

    Research areas

  • temporal web analytics, Digital Object Identifier, longitudinal web crawl analysis, Uniform Resource Identifier, persistent identifier, Common Crawl

ID: 59251483