TY - JOUR
T1 - Exploring data provenance in handwritten text recognition infrastructure
T2 - Sharing and reusing ground truth data, referencing models, and acknowledging contributions. Starting the conversation on how we could get it done
AU - Romein, C. Annemieke
AU - Hodel, Tobias
AU - Gordijn, Femke
AU - van Zundert, Joris
AU - Chagué, Alix
AU - van Lange, Milan
AU - Strandgaard Jensen, Helle
AU - Stauder, Andy
AU - Purcell, Jake
AU - Terras, Melissa
AU - van den Heuvel, Pauline
AU - Keijzer, Carlijn
AU - Rabus, Achim
AU - Sitaram, Chantal
AU - Bhatia, Aakriti
AU - Depuydt, Katrien
AU - Afolabi, Mary Aderonke
AU - Anikina, Anastasiia
AU - Bastianello, Elisa
AU - Benzinger, Lukas Vincent
AU - Bosse, Arno
AU - Brown, David
AU - Charlton, Ashleigh
AU - Nilsson Dannevig, André
AU - van Gelder, Klaas
AU - Go, Sabine C. P. J.
AU - Goh, Marcus J.C.
AU - Gstrein, Silvia
AU - Hasan, Sewa
AU - von der Heide, Stefan
AU - Hindermann, Maximilian
AU - Huff, Dorothee
AU - Huysman, Ineke
AU - Idris, Ali
AU - Keijser, Liesbeth
AU - Kemper, Simon
AU - Koenders, Sanne
AU - Kuijpers, Erika
AU - Rønsig Larsen, Lisette
AU - Lepa, Sven
AU - Link, Tommy O.
AU - van Nispen, Annalies
AU - Nockels, Joe
AU - van Noort, Laura M.
AU - Oosterhuis, Joost Johannes
AU - Popken, Vivien
AU - Puertollano, María Estrella
AU - Puusaag, Joosep J.
AU - Sheta, Ahmed
AU - Stoop, Lex
AU - Strutzenbladh, Ebba
AU - van der Sijs, Nicoline
AU - van der Spek, Jan Paul
AU - Trouw, Barry Benaissa
AU - Van Synghel, Geertrui
AU - Vuckovic, Vladimir
AU - Wilbrink, Heleen
AU - Weiss, Sonia
AU - Wrisley, David Joseph
AU - Zweistra, Riet
PY - 2024
Y1 - 2024/3/18
N2 - This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.
AB - This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.
KW - automatic text recognition
KW - handwritten text recognition
KW - data publication
KW - open data
KW - data curation
KW - ground truth
KW - sharing
UR - https://jdmdh.episciences.org/13242
U2 - 10.46298/jdmdh.10403
DO - 10.46298/jdmdh.10403
M3 - Article
SN - 2416-5999
VL - Historical Documents and automatic text recognition
SP - 1
EP - 26
JO - Journal of Data Mining and Digital Humanities
JF - Journal of Data Mining and Digital Humanities
M1 - 10403
ER -