A set of shared coding conventions for speaker ethnicity is necessary for open-source data sharing and cross-study compatibility between linguistic corpora. However, ethnicity, like many other aspects of speaker identity, is continually negotiated and reproduced in discourse, and is therefore a challenge to code representatively. This paper discusses some of the challenges facing researchers who want to use, create, or contribute to existing corpora that are annotated for the ethnic identity of a speaker. We specifically problematize the macro-social label ‘Asian American’ and propose that researchers should consider different levels and types of specificity of ‘Asianness’ in order to ensure that the corpora best represent the reality of ethnic identity in the community sampled. This is particularly important given the limited incorporation of different Asian groups in most existing linguistic research. We argue that more rigorous coding for Asian American ethnicities in corpora will improve the utility of archived corpora and enhance sociolinguistic research on language variation and ethnic identity.
- asian american
- data management