Coding for Demographic Categories in the Creation of Legacy Corpora: Asian American Ethnic Identities

Lauren Hall-Lew, Amy Wing-mei Wong

Research output: Contribution to journal › Article


A set of shared coding conventions for speaker ethnicity is necessary for open-source data sharing and cross-study compatibility between linguistic corpora. However, ethnicity, like many other aspects of speaker identity, is continually negotiated and reproduced in discourse, and therefore a challenge to code representatively. This paper discusses some of the challenges facing researchers who want to use, create, or contribute to existing corpora that are annotated for the ethnic identity of a speaker. We specifically problematize the macro-social label ‘Asian American’ and propose that researchers should consider different levels and types of specificity of ‘Asianness’ in order to ensure that the corpora best represent the reality of ethnic identity in the community sampled. This is particularly important given the limited incorporation of different Asian groups in most existing linguistic research. We argue that more rigorous coding for Asian American ethnicities in corpora will improve the utility of archived corpora and enhance sociolinguistic research on language variation and ethnic identity.
Original language: English
Pages (from-to): 564-576
Number of pages: 12
Journal: Language and Linguistics Compass
Issue number: 11
Publication status: Published - Nov 2014


  • corpora
  • metadata
  • methods
  • asian american
  • sociolinguistics
  • data management

