REVIEW article

Front. Artif. Intell.
Sec. Natural Language Processing
Volume 7 - 2024 | doi: 10.3389/frai.2024.1472411

The Sociolinguistic Foundations of Language Modeling

Provisionally accepted
Jack Grieve*, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter
  • University of Birmingham, Birmingham, United Kingdom

The final, formatted version of the article will be published soon.

    In this paper, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. We argue that, to maximize the performance and societal value of language models, it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics.

    Keywords: AI ethics, artificial intelligence, computational sociolinguistics, corpus linguistics, large language models, natural language processing, varieties of language

    Received: 29 Jul 2024; Accepted: 30 Nov 2024.

    Copyright: © 2024 Grieve, Bartl, Fuoli, Grafmiller, Huang, Jawerbaum, Murakami, Perlman, Roemling and Winter. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Jack Grieve, University of Birmingham, Birmingham, United Kingdom

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.