Theoretical Syntax is on the verge of a change in its data-gathering paradigm through the incorporation of big-data analysis and the use of citizen science and crowdsourcing to collect linguistic data. The goal of this Research Topic is to provide as comprehensive an overview as possible of the opportunities for innovation and the challenges inherent in the emerging paradigm, so as to help researchers navigate these changes. While these tools provide cost-efficient access to data, which can help the field adopt stronger quantitative standards, we particularly encourage contributors to explore how Theoretical Syntax would also benefit from the systematic investigation of potential issues, e.g., with data quality, data-granularity mismatches, or ethics. In particular, we welcome methodological contributions and detailed case studies on specific languages that explore the following issues and beyond:
Big Data: How to utilise this resource fruitfully and what are its limits?
The use of big data in Theoretical Linguistics provides unique access to subpopulations and linguistic phenomena under-represented in the theoretical literature. At the same time, there are potential granularity mismatches between the data sets typically studied within Theoretical Syntax, which may include highly infrequent, unattested, or even ungrammatical examples constructed while controlling for independent factors, and the data available through the study of big data.
Crowdsourcing and Citizen Science: How to utilise this resource while controlling for data quality and avoiding potential ethical issues?
The use of this resource entails analyzing data from untrained informants. This may help avoid potential shortcomings such as experimental bias, but potential data-quality issues call for extensive validation studies, such as the one Sprouse conducted for Amazon’s Mechanical Turk (MTurk) in 2011, as well as research on strategies to improve data quality. Furthermore, ethical concerns may arise because (a) popular crowdsourcing services such as MTurk do not give their workers any employment benefits, and payment rates fall below minimum wage, as noted by Fort, Adda and Cohen in 2011; and (b) citizen science, while helping to promote scientific awareness and to gather data that would otherwise be hard to obtain, tends to be combined with volunteerism and the gamification of data gathering, thus potentially treating citizens as a source of cheap labor.
Statistics vs. Traditional Theoretical Research: How can Theoretical Syntax benefit from this paradigm change?
The hypothesized link between big data, citizen science, and crowdsourcing, on the one hand, and statistical scrutiny, on the other, invites a debate on the advantages and limits of applying statistical tools standard in Cognitive Science to the study of Syntax, including but not limited to issues concerning the rate of convergence between data collected using traditional methods and data gathered under this new paradigm. Likewise, big data, crowdsourcing and citizen science provide an opportunity to tap into varieties and phenomena that may not be present in current corpora or the theoretical literature, thus opening the door to combining the study of microvariation with statistical methods (Dialectometry) and thereby broadening its empirical basis.