AUTHOR=Kendall Tyler, Vaughn Charlotte, Farrington Charlie, Gunter Kaylynn, McLean Jaidan, Tacata Chloe, Arnson Shelby

TITLE=Considering Performance in the Automated and Manual Coding of Sociolinguistic Variables: Lessons From Variable (ING)

JOURNAL=Frontiers in Artificial Intelligence

VOLUME=Volume 4 - 2021

YEAR=2021

URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2021.648543

DOI=10.3389/frai.2021.648543

ISSN=2624-8212

ABSTRACT=Impressionistic coding of sociolinguistic variables like English (ING), the alternation between pronunciations like talkin’ and talking, has been a central part of the analytic workflow in studies of language variation and change for over a half-century. Techniques for automating the measurement and coding of a wide range of sociolinguistic data have been on the rise over recent decades, but procedures for coding some features, especially those without clearly defined acoustic correlates like (ING), have lagged behind others, such as vowels and sibilants. This paper explores computational methods for automatically coding variable (ING) in speech recordings, examining the use of automatic speech recognition procedures related to forced alignment (using the Montreal Forced Aligner) as well as supervised machine learning algorithms (linear and radial support vector machines, and random forests). Considering the automated coding of pronunciation variables like (ING) raises broader questions for sociolinguistic methods, such as how much different human analysts agree in their impressionistic codes for such variables and what data might act as the “gold standard” for training and testing. This paper explores several of these considerations in the automated and manual coding of sociolinguistic variables and provides baseline performance data for automated and manual coding methods. We consider multiple ways of assessing algorithms’ performance, including agreement with human coders, as well as the impact on the outcome of an analysis of (ING) that includes linguistic and social factors. Our results show promise for automated coding methods but also highlight that variability in results should be expected even with careful human-coded data. All data for our study come from the public Corpus of Regional African American Language, and code and derivative datasets (including our hand-coded data) are available as Supplementary Materials.