Large Corpora and Historical Syntax: Consequences for the Study of Morphosyntactic Diffusion in the History of Spanish

Octavio de Toledo y Huerta, Álvaro S.

doi:10.3389/fpsyg.2019.00780

HYPOTHESIS AND THEORY article

Front. Psychol., 17 April 2019

Sec. Psychology of Language

Volume 10 - 2019 | https://doi.org/10.3389/fpsyg.2019.00780

This article is part of the Research Topic Theoretical Syntax at the Crossroads: Big Data, Citizen Science and Crowdsourcing

Álvaro S. Octavio de Toledo y Huerta*

Department of Spanish (Filología Española), Universidad Autónoma de Madrid, Madrid, Spain

Since the turn of the 21st century, the use of data from large electronic corpora has changed research on Spanish historical syntax, spurring interest in long-range evolutions and the shape of the corresponding diachronic curves. However, general reflections on diffusion and the factors that drive and influence it are still largely lacking. In this paper, I reflect on the research possibilities opened up by the availability of such large masses of data, focusing particularly on new knowledge on syntactic change brought about by the study of low-frequency phenomena and of recessive changes, as well as on the exploration of changes conditioned by dialect contact and textual traditions. I conclude with some remarks on the general typology of diffusion in syntactic change.

It is widely recognized that the field of historical morphosyntax in Spanish has been gaining ground in the last four decades to the extent that it now occupies a position of clear dominance in the area of diachronic linguistic research on that language (cf. Cano, 1991, 1995, 2000; Company Company, 2005; Girón, 2005, 2006; Kabatek, 2012, amongst others). This is reflected in the persistent prominence of work in this field as recorded in the proceedings of the conferences organized by the Association of Spanish Historical Linguistics (AHLE, Asociación de Historia de la Lengua Española¹), a meeting held since 1987². Looking at the contributions to the Historical Morphology and Syntax section in AHLE proceedings and taking only those in which primary data are used (i.e., data have not been merely collated from previously published sources), it is possible to group those contributions according to the time period studied, as shown in Table 1.

Table 1. Studies based on primary data collection included in the historical morphology and syntax section of the AHLE in the proceedings of the first seven conferences (1987–2006).

From Table 1 two important conclusions are easily drawn: first, the body of work dedicated exclusively to the study of medieval Spanish has been diminishing at a steady rate; and secondly, in contrast, those works seeking to capture the complete historical evolution of the language (holochronic studies) have increased dramatically in frequency – the largest increase occurring around the time of the 6th Conference (Madrid, 2003). This parallel and complementary change in focus may have numerous causes, of which two are, in my opinion, clearly prevailing: a growing tendency to carry out (and, what is more, accept the validity of) studies dealing with very long timescales based on very few sources for the individual historical periods studied – these sources being considered representative due to their iconic literary or cultural status – and, above all, the increasing ease of access to large databases of digitized resources. A few years ago, Rolf Eberenz made a timely observation on both these trends, adding a warning concerning the implications of both:

We have at our disposal a body of texts, the majority of which are “literary,” from which we select two or three works for each century to extract the data that interests us […]. The digital corpus and the tools provided by information technology permit both statistical and linguistic analysis of increasing precision. However, the plethora of information stored on computer brings with it a certain danger, that is, the data tend to become a shapeless mass in which the differences between texts and within individual texts become lost from view [Eberenz, 2009: 189 (my translation, ÁOdT)].

The Diachronic Corpus of Spanish (CORDE³) is by far the digital corpus that has had the most impact on the discipline, as Table 1 suggests, since the major leap upward in the trend for holochronic studies coincides with the granting of general access to this corpus at the beginning of this century. CORDE presents many pitfalls for the researcher, and not just those of the type which Eberenz outlines. It has been noted, for example, that there is a lack of philological quality in some of the texts available. There are also issues concerning how texts are dated (the date is fixed according to the original composition rather than the date of the much later manuscripts upon which the versions in the corpus are often based), the searchability of the database itself (an especially important consideration for syntactic searches), and the distribution of the texts by historical period (with some periods being far better represented than others)⁴. Despite these shortcomings, CORDE represents, as stated by Guillermo Rojo, its main promoter, an “instrumental revolution” (Rojo, 2012: 433–434) that has transformed the discipline of historical linguistics at its foundations, i.e., from the access to data itself. A strong indication of CORDE’s influence can be found in the 36 studies of morphosyntax in the most recent AHLE (Cádiz, 2012, proceedings from 2015), of which only a quarter (9/36) were completed without access to an electronic corpus, while over half (19/36) used data mined from CORDE⁵. In a branch of linguistics so dependent upon the bulk of data, this step change in available tools has brought with it a considerable reassessment of its objectives (cf. also Jenset and McGillivray, 2017).
Within only 20 years, a large proportion of scholars in this area appear to have moved their focus from the detailed study of medieval morphosyntax (with very few excursions into other periods) to the tracing of and analysis of holochronic trajectories (evolutionary curves) of their chosen linguistic phenomena.

The exponential increase in available data and the transformation in form and substance that the exploitation of these data have instigated thus constitute one of the most substantial changes to the field of historical morphosyntax in Spanish in recent years. Over the rest of this article, I will present an overview of the benefits of this abundance of data – deliberately leaving aside the disadvantages⁶. I will conventionally divide these advantages into two categories (although in practice they frequently overlap), namely quantitative ones, such as the ability to explore very low frequency phenomena or the establishment of more precise evolutionary trends, and qualitative ones, such as the opportunity to observe correlations between certain well defined trends or, supported by better data, further explore the patterns of change in a particular phenomenon in terms of its diatopic variation or its variation throughout different textual classes (discourse traditions), for instance.

As I will show, a resource like CORDE allows us to study constructions which appear with vanishingly low frequency such as, for example, the schema in which an infinitive is placed before the auxiliary verb tener “to hold, to possess,” either with an intermediate object clitic (le in example 1c) or without it (examples 1a and b), and either with a linking preposition de “of” next to the infinitive (example 1a) or without the preposition (examples 1b and c)⁷.

(1) a. non solamente a su coamante de dar tyene, mas a otras çyento ha de contentar

not only to his lover did he have to give, but another hundred were there to satisfy

[Alfonso Martínez de Toledo, Corbacho, 1438 (ms. of 1466)].

b. ¿quién pensar pudiera que así las fuerças de mi propósito enflaquecer tenían?

who would have thought that this is how the forces of my purpose had to get thinner?

(Diego de San Pedro, Arnalte, 1446–1447, 124).

c. no puedo más, seguirle tengo; somos de un mismo lugar

I can do no more, I have to follow him, we come from the same place

(Quijote, II, 33, 906).

The evolution of the use of this syntactic schema is shown in Figure 1 (cf. Octavio de Toledo y Huerta, 2016b), from which we can determine two main points: firstly, whilst this construction was in use (that is, from the middle of the 15th century to around 1650), it saw sustained growth: see the thickest line on the graph, which represents, as a percentage, the proportion with which all variants of the schema appear in a particular time period out of the total number of cases observed⁸; and secondly, between the end of the 15th century and the middle of the 16th century there was a rapid increase in the use of variants involving no linking preposition and of those including a clitic (see the fine, continuous line and the large dashed line, respectively), which thus became increasingly analogous to the so-called “analytic future” of the type cantarlo he “I will sing it,” literally “to sing it (I) have,” which obligatorily includes a clitic and has no linking preposition. Furthermore, and in concert with these formal changes, a distributional change also took place as the construction came to be more often found toward the beginning of main clauses, further converging with “analytic futures,” which are almost exclusively found in that position⁹.

Figure 1. Percentage values (for each time period) for different schemas and global, relative frequency of all schemas of the Infinitive + tener construction.

In early Classical Spanish, haber “to have” had to a large extent lost its possessive meaning and was being lexically replaced by tener “to hold, to possess.” This substitution had reached modal deontic periphrases such as haber de + INF, which was losing ground to tener de + INF (cf. Garachana, 2011). Since the construction in (1), with tener, also increasingly converged with the syntactic behavior of the already available “analytic future” with haber, it seems reasonable to assume that speakers still had some way of recognizing an autonomous auxiliary-like element at the end of cantarlo he, as suggested by Anipa (2000). This is in line with the idea – already formulated by Nebrija, no less – that the so-called “analytic future” is not a strange form of “interrupted future” with a disrupting clitic, but rather a periphrasis similar to those formed using other modal verbs, which until 1650 are also commonly “postposed” (i.e., placed after the infinitive) and can be accompanied by clitics, as in the case of decirlo + {debo/puedo/quiero} “say-it (I) {must/can/want}.” After 1650, the whole set of constructions using the infinitive followed by an inflected modal verb practically disappears (cf. Octavio de Toledo y Huerta, 2015a). This mass extinction includes those schemas involving the auxiliary verb tener; yet, as Figure 1 demonstrates, at the moment of its disappearance this construction appears to be gaining frequency rather than showing a decline in usage. In this way, the loss of sequences containing INF (+ clitic) + modal verb (including the “analytic future” cantarlo he) seems to have occurred almost catastrophically (cf. Bernárdez, 1994; López García, 1996, 2011), possibly because the motivations behind this decline do not lie in the gradual loss of enclisis in V1 positions, as has been traditionally argued, but rather in a relatively rapid change in sentence structure, specifically, in the information-structural properties of the leftmost (non-peripheral) edge of a (main) sentence, a position from which, between the end of the 16th and the middle of the 17th century, non-quantified phrases bearing focus seem to have been excluded (cf. Mackenzie, 2010; Sitaridou, 2011; Sitaridou and Eide, 2014; Batllori and Hernanz, 2015; Batllori, 2016). Be that as it may, it is not without interest – and this is what I would like to underline here – that some of the strongest indications of the periphrastic character of cantarlo he (and hence of its close ties to a wide group of schemas all corresponding to a common basic configuration and disappearing simultaneously) come directly from the very rare group of schemas illustrated in (1), the systematic study of which would not have been possible without the resources of an immense digitized corpus. Thus, the analysis of low frequency phenomena can have an impact that reaches far beyond the merely descriptive and, on occasion, may go on to open doors to the formulation of new hypotheses concerning the evolution of much larger groups of constructions.

Looking more widely, the interest in poorly documented changes will, I believe, also change the way in which we collect data, pushing us to search further afield than our conventional corpora (in this context, conventional is meant in the sense of controlled and closed, with a finite number – however large that might be – of representative sources that have been selected according to certain criteria). It will often be the case that only a handful of data on a particular construction or syntactic schema can be extracted from resources such as CORDE (or its recent upgrade, CDH), the Corpus of Spanish (Corpus del Español), CODEA+, CORDIAM or the search engine provided by the Biblioteca Virtual Miguel de Cervantes (BVMC), to mention only the major corpora available that allow exploring not only Medieval Spanish but also later periods¹⁰. In such situations, researchers may become increasingly inclined to search through databases that are open – i.e., constantly growing in number – and not mediated, in the sense that the works they hold are not selected according to philological criteria, but if anything, rather on bibliological grounds, and contain versions in substantially original format (untouched by a modern editor), like Google Books¹¹. It is possible to find on this platform, for example, dozens of examples which confirm the exclusively eastern¹² character of the prepositional use of bajo + NP (bajo la cama, “under the bed”) in Pre-classical and Classical Spanish [example (2); cf. Octavio de Toledo y Huerta, 2015b], which shows, in turn, that it is not trivial but urgent to increase the efforts to extend dialectal studies on Iberian ground to include all available printed sources and into the modern era.

(2) Mi marido está baxo la cama

My husband is under the bed

(Exemplario contra los engaños y peligros del mundo, Zaragoza, 1493).

Estavan baxo el árbol confundidos hombres y brutos

They were under the tree, men and beasts all together

(Baltasar Gracián, Criticón, II, 205).

On the other hand, in the wake of the pioneering work of scholars such as Morala (2002) and García de Paredes (2011), there has been a growth in studies looking at low frequency phenomena by using the broadest and most unrestricted corpus available, i.e., the Internet, a particularly useful move whenever a phenomenon’s low incidence in standard corpora may be connected to its evident diatopic or diaphasic markedness, which will tend to preclude its appearance in corpora largely dominated by works conforming to the linguistic standards of highly elaborated literature. The Internet, of course, yields extremely heterogeneous data, the correct discrimination and contextualization of which requires huge philological effort. However, it is not difficult to find pearls in this deep electronic ocean where the consultation of conventional corpora only offers a glimpse of a tantalizing phenomenon. For instance, whilst CORDE finds just one example of the quantifier algotro “some other (thing or person)” (3), by using the geographical associations provided by data from Google¹³, we find that it is a dialectal feature of Extremadura (whence Felipe Trigo came) and western La Mancha (rather than of Andalusia, pace the Real Academia Española’s dictionary).

(3) Unas cosa las vide yo mesmo, por mis ojo; algotras de endenantes, y de las que hición los tres en la ermita con aquellas probe

Some things I saw myself, with my own eyes; some others a while ago, and what the three of them did in the chapel with those poor women

(Felipe Trigo, Jarrapellejos, 1914).

Turning (just once) to the lexicon instead of syntax, the diminutive form mengajo (originally “rag,” then also “little child”) is assigned a Murcian origin in the old and venerable Diccionario de Autoridades (1726–1739), but neither CORDE/CDH nor their contemporary counterparts, CREA/CORPES XXI, contribute one single example; its persistence as a south-eastern hallmark is confirmed via Google searches, which also reveal the contemporary spread of this term into the eastern regions of La Mancha. Map 1 shows the prevalence of algotro (toward the west, circles, and triangles) and mengajo (toward the southeast, ovals, and squares) as found using Google (the basic parameters, restrictions, and limitations of the searches have been detailed elsewhere: cf. Octavio de Toledo y Huerta, 2016c).

Map 1. Spanish results that can be localized geographically from Google searches for algotro “some other” (circles and triangles) and mengajo “rag, little child” (ovals and squares), apud Octavio de Toledo y Huerta, 2016c.

These two examples suffice, in my opinion, as evidence of the degree to which these new data sources will facilitate our access to linguistic phenomena difficult to trace until now. This is likely to modify our views on phenomena still disregarded as residual or marginal, thus shining a light on the evolution of broader processes: the western distribution of algotro, for instance, perfectly coincides with that of the demonstratives estotro/esotro “this other/that other” (cf. Octavio de Toledo y Huerta, 2018a), similarly formed on otro “another,” and at the same time with the western origins of alguien “somebody” (Malkiel, 1948; Pato, 2009), another indefinite quantifier which uses the prefix alg-, as well as with the presence in Galician and Portuguese of other indefinite lexemes containing alg-, such as algures “somewhere,” and on the other hand with the historical scarcity or absence of examples of Sp. and Port. algo “something” toward the eastern territories (note that standard Catalan displays but one alg- form, as present in algú/alguna cosa “someone/something,” cf. Sp. algún, Port. algum). Thus, both a full-fledged system of indefinites with alg- and the formation of pronominal compounds with otro appear to be western features, as confirmed by the decidedly western status of algotro, where both the alg- radical and the compounding procedure converge.

With respect to establishing more precise evolutionary curves, it is not my intention here to discuss technical improvements, such as the possibilities for multifactorial analysis brought along with the programming language R (cf. a.o. Gries, 2009; Bivand et al., 2013: 151–166; Arnold and Tilton, 2015 or Levshina, 2015) or refinements in quantitative approaches, particularly in the area of inferential statistics [cf., with particular reference to the history of Spanish, the splendid book by Rosemeyer (2014)]. I will instead focus on a complementary aspect of this endeavor which may have greater theoretical importance. Works such as Rosemeyer’s (2014) analysis of the extinction in Spanish of the non-passive constructions with ser “to be” and a past participle, Marco and Marín’s (2015) remarks on the expansion of estar “to stand, to stay, to be” + participle or the already classic study by Rodríguez Molina (2004) about the diffusion of haber “to have” + participle all clearly show, through a wealth of data, that the progression or regression of these auxiliaries is a function of their progressive adoption or rejection by specific groups of predicates. For example, participles that convey the continuity or emergence of an event, such as permanecer “to remain” or suceder “to happen,” lose their ability to combine with ser at an earlier stage than others (e.g., participles indicating change of state); the participles of verbs of transfer, like dar “to give,” become associated with auxiliary haber “to have” sooner and more frequently than those of other verbs; and the huge increase in the use of estar + participle in Classical Spanish (roughly, the 16th and 17th centuries) largely coincides with its adoption by psychological predicates involving an experiencer, such as preocupar “to worry.” In all these cases, the observed diffusion or regression of the constructions follows a logistic curve, or S-curve (cf. Kroch, 1989; by way of example, Figure 2 shows the decline in usage of ser + resultative participle), and it does seem that such a curve is indeed characteristic of syntactic change that involves a form of diffusion (whether progress or decline) mediated by lexical permeability.
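
The slow-fast-slow profile that defines such a curve can be made explicit with a small numerical sketch. The code below is only an illustration under stated assumptions: the function name `logistic_share`, the midpoint `T0` and the rate `K` are hypothetical and are not fitted to any of the corpus data discussed here.

```python
import math

def logistic_share(t, t0, k):
    """Share of a variant at time t (between 0 and 1).

    t0 is the midpoint (share = 0.5); k is the rate of change.
    A negative k yields a descending S-curve (a recessive change).
    """
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

# Hypothetical parameters, chosen only to illustrate the shape.
T0, K = 1550.0, 0.05

def growth_over(start, span=25):
    """Increase in share over `span` years starting at `start`."""
    return logistic_share(start + span, T0, K) - logistic_share(start, T0, K)

early = growth_over(1400)   # shallow initial phase
middle = growth_over(1538)  # steep intermediate phase
late = growth_over(1675)    # shallow final phase
```

With these invented values, the gain over the central 25 years is far larger than the gain over either of the outer intervals, which is the signature of the S-curve; reversing the sign of `K` would model a decline of the kind shown for ser + participle.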

Figure 2. Evolution of the relative frequency of the construction ser “to be” + resultative participle (from Sánchez Marco, 2012: 99, reproduced by kind permission of the author).

It might certainly be the case that the S-curve is actually the only function which can properly model the diffusion of syntactic change, as asserted by Blythe and Croft (2012) (cf. also Nevalainen, 2015 or Feltgen et al., 2017)¹⁴, who attribute the peculiar curvature to how a community of speakers evaluates a linguistic variant in terms of its prestige within their social group, regardless of the structure of said social group.

The overall structure of the trajectory of a language change [is] an S-curve, no matter how it [is] propagated through grammatical contexts, words, speakers, texts, geographical regions, or social classes. This overall trajectory appears to be determined by differential weighting of variants (replicator selection) (Blythe and Croft, 2012: 294).

Besides lexical diffusion (i.e., “through words”), these authors explicitly mention a further mode of diffusion via syntactic patterns (i.e., “through grammatical contexts”). The latter would appear to be in action in the expansion of the shorter variant hemos at the expense of the older form habemos (both meaning “we have”), which unfolds along another sinuous S-curve principally during the Classical Spanish period (Figure 3; cf. Bustos and Moreno, 1992; Rodríguez Molina, 2012). The process clearly appears to have been guided by the progressive expansion of the variant hemos out of its natural environment of origin in Old Castilian (as part of the “analytic future,” i.e., the INF + Clitic + AUX (have) construction: cantarlo hemos “we will sing it”) to the formally and, most importantly, semantically related deontic periphrasis haber de + INF “have to + INF,” and from there, in successive waves, to the perfect tenses with haber + Past Participle “have PP,” less related to the source constructions both in form and meaning. This is suggested by the data in Table 2, the result of an exhaustive search in CORDE for 1500–1530, the period when the shorter variant hemos saw its initial boom. In global terms, Table 2 shows that hemos and habemos occur with similar frequencies (50% of the total for each form) within the corpus; however, the variant hemos shows the greatest affiliation to haber de + INF (70% of examples of this periphrasis use the shorter variant, while it is the preferred form in only 41% of cases of haber + PP), whereas the variant habemos is shown to be largely preferred in non-auxiliary contexts (where haber is used as a verb of possession), clearly more distant from the periphrastic futurate construction where short hemos originated.

Figure 3. Change in relative frequency (expressed as a percentage) for hemos (rising curve) vs. habemos (descending curve) throughout Classical Spanish.

Table 2. Frequencies of use (total instances and as a percentage) of hemos and habemos in different syntactic contexts.

Rosemeyer (2016), for example, also observes an expansion through syntactic contexts in the competition between ser + PP and haber + PP with reflexive predicates¹⁵. As De Smet (2012, 2013) suggests, such processes evidence the importance of extension via similarities between related syntactic contexts during the enactment of a change taking place over the medium or long term. However, of most interest here is that it is not always certain that this type of syntactic extension will behave in exactly the same way as lexical extension in terms of its diffusion. In fact, the curve shown in Figure 3 does not at first glance show the typical characteristics of an S-curve, beginning and ending with a shallow gradient while showing a much higher rate of change in its intermediate range. On the contrary, its central region suggests a period of relative stability after an initial surge and it finishes on a similarly steep trajectory (although, of course, both a slow inception before 1450 and a slow tapering-off after 1650 can be reasonably assumed)¹⁶. The reason for this may lie in the fact that the S-curve is a natural feature of lexical diffusion: initially, very few words will adopt a change; then, during the intermediate phase, given that the semantic connections between lexical elements form a complex network, groups of interconnected words will add together in a cascade with a cumulative, snowballing effect; finally, only a few isolated areas of resistance remain, which explains the slow trailing off of the last phase. Purely syntactic context expansion (cf. Himmelmann, 2004: 32–33), by contrast, may follow a more irregular pattern: at some point the variant may come into use in several contexts simultaneously (or successively, but with very short time intervals between each), and hence expand at a very high rate.
However, once all the available areas of use have been accessed, its progress may stagnate as, in contrast to lexical diffusion, it does not receive an impulse from a sustained adoption on the part of a growing paradigm class (host-class expansion; cf. again Himmelmann, 2004: 32–33). The existence of a final stage of accelerated mutation – i.e., the abandonment or diasystematic isolation of one of the competing variants, as described by Coseriu (1983: 55) – is in all probability the result of a (half-)conscious, socially motivated bias on the part of speakers. Certainly, Blythe and Croft’s (2012) proposal refers not only to “pure” S-curves but also to any course of development compatible with such curves (Blythe and Croft, 2012: 293): the trajectory in Figure 3 could be seen as a two-staged curve made up of two successive S-curves, the second of which includes a significant phase of initial delay or “latency” (Feltgen et al., 2017)¹⁷. Such irregularities in the shape of S-curves, however, could be indicative of a differential intervention of endogenous vs. exogenous factors of change (cf. Ghanbarnejad et al., 2014), or of lexical vs. syntactic extension, or both. In any case, the study of further trajectories corresponding to other examples of syntactic extension should help narrow down the extent to which the observed differences can be generalized to a wider group of linguistic developments.
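
One way to picture a trajectory made up of two successive S-curves is as a weighted sum of two logistic functions. The following is a minimal sketch under that assumption, which is my own illustrative choice and not a model proposed in the works cited; the function names, the weight `w`, the midpoints `t1` and `t2` and the rate `k` are all hypothetical and are not estimated from the hemos/habemos data.

```python
import math

def logistic(t, t0, k):
    """Single S-curve rising from 0 to 1 with midpoint t0 and rate k."""
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

def two_stage_share(t, w=0.5, t1=1500.0, t2=1640.0, k=0.1):
    """Weighted sum of two successive logistic curves.

    w is the share reached after the first surge; the remaining
    1 - w is gained in the second. All defaults are hypothetical.
    """
    return w * logistic(t, t1, k) + (1.0 - w) * logistic(t, t2, k)

def change(start, end):
    """Net change in share between two years."""
    return two_stage_share(end) - two_stage_share(start)

first_surge = change(1480, 1520)   # rapid initial rise
plateau = change(1550, 1590)       # relative stability ("latency")
second_surge = change(1620, 1660)  # steep final phase
```

Under these invented settings the share barely moves across the central interval while rising steeply on either side of it, which reproduces the surge, stagnation and renewed acceleration described above. Whether such a composite still counts as "compatible with" an S-curve is precisely the interpretive question at stake.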

The establishment of an ever-increasing set of frequency curves of sufficient precision invites comparison between them, an exercise that might uncover correlations poorly studied until recently. The GRADIA project, to which I belong¹⁸, has investigated the development of a wide group of modal periphrases, as shown in Figure 4. Interestingly, the increased use of tener que + INF “to have to,” a periphrasis traditionally blamed for the regression of haber de + INF, does not seem to directly undermine the use of this construction until the dawn of the 20th century, which can be taken as an indication that, at first, these two constructions did not compete excessively to express the same values (tener que + INF emerges as a periphrasis clearly expressing obligation: cf. Garachana, 2016). On the other hand, both the increase and the decline in usage of haber de + INF “have to” are inversely related, from 1500 onward, to the curves showing the use of deber (de) + INF “must, ought to,” a fact which suggests that haber de + INF became engaged in Classical Spanish in a competition to express not only deontic, but also epistemic values, since deber (de) + INF could convey both (cf. Garachana and Hernández Díaz, 2017).

Figure 4. Evolution of the relative frequencies (per million words) of nine Spanish modal constructions with an infinitive, all of them deontic except with parecer “to seem” (elaboration by Malte Rosemeyer using data from various members of the GRADIA project: reproduced by kind permission of the author). The data were recovered from the GRADIA corpus: http://gradiadiacronia.wixsite.com/gradia/corpus-gradia.

Thus, Figure 4 helps us to understand the importance of co-evolution in groups of constructions that are similar both in form and meaning: a specific trajectory might be accelerated (or slowed down) by the appearance of others within its local variational environment or “envelope of variation,” that is, its constructional network (for analogical effects of attraction and differentiation within such networks, cf. now De Smet et al., 2018). The effects of co-evolution are also felt in the case of the periphrasis with fronted infinitive, for example, cantar(lo) tengo (see Figure 1 above): the curve shows how the presence of clitics within the schema grows significantly until the middle of the 16th century, which, as already stated, clearly indicates a convergence with the “analytic future” cantarlo he, a construction in which the clitic is obligatory. However, this tendency never reaches completion; instead, the curve levels off and even shows a clear decline in the 17th century, most probably due to the fact that the construction with tener departs from the model with haber (which was receding at great speed under competition from the “synthetic” solution cantarelo “I will sing it”) and becomes attracted to analogous sequences using the auxiliaries deber “must, ought to,” poder “can, be able to” and querer “want to,” in which the clitic can be used but is not obligatory (Octavio de Toledo y Huerta, 2016b). In any case, what is important here is that the formulation of these hypotheses emerges directly from the comparison of evolutionary paths, whether they describe the coevolution of a whole network (Figure 4) or the syntactic properties of a single given phenomenon (Figure 1). Without observing these frequency curves, it would have been far more difficult for scholars to implement these new possibilities of analysis.

On the other hand, the trajectory of haber de + INF “to have to” (Figure 4) makes us wonder whether phenomena that become recessive at a certain point always follow a descending S-curve even during this phase of recession (cf. e.g., Figure 2). Blythe and Croft (2012) do not deal with this type of change:

There are […] changes in our survey that appear to stop and go in reverse. These may be interpreted as changes following an S-curve trajectory that are then interrupted; we do not analyze such changes here (Blythe and Croft, 2012: 279).

Thus, like many other scholars, Blythe and Croft (2012) concentrate only on changes resulting in successful diffusion. The tendency to focus purely on the ascendant phase of certain changes is very common, for example, with remarks on the relationship between grammaticalization and frequency:

As long as frequency is on the rise, changes will move in a consistent direction […]. When a grammaticalization construction ceases to rise in frequency, various things happen, but none of them is the precise reverse of the process (Bybee, 2011: 77).

The increase in frequency, then, is a symptom of grammaticalization, but we are still none the wiser as to how to interpret the downturn in terms of that very same model of morphosyntactic change. Given that there is no need to assume that the changes which become generalized [i.e., those that reach Coseriu’s (1983) stage of mutation] will be more abundant than those that stall part way through their trajectory, it seems obvious that studies of grammaticalization have still not managed to produce an unbiased model of diffusion, that is to say, one not restricted to the time period over which the grammaticalized element or schema is seen to expand¹⁹.

However, it is not only the recessive phases that cast doubt on the generalizability of S-curves²⁰. The expansion in use of the article before subordinate clauses headed by que “that” (Figure 5; cf. Lapesa, 1984; Herrero, 2013; Octavio de Toledo y Huerta, 2014b) shows a pattern of diffusion that is difficult to fit to a function of this type. This phenomenon’s explosive surge conforms rather to an exponential curve, with a prolonged and shallow initial phase followed by a brisk upturn and without a third phase of moderate growth (note that the phenomenon becomes regressive after reaching its maximum value, gently falling back in compliance with the S-curve pattern).
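The difference between the two shapes can be made concrete with a minimal sketch. The functions, parameter values and the ten time periods below are illustrative assumptions and are not fitted to the CORDE data behind Figure 5; the point is only that a logistic trajectory grows fastest in its middle phase and then tapers off, whereas an exponential trajectory keeps accelerating until its last recorded period.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=4.5):
    """S-curve: shallow start, steep middle phase, moderate final growth."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def exponential(t, start=0.01, rate=0.5):
    """Exponential growth: shallow start, then an ever steeper upturn."""
    return start * math.exp(rate * t)


def increments(values):
    """Growth from each time period to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Ten equally spaced, purely hypothetical time periods (not CORDE data).
periods = range(10)
s_steps = increments([logistic(t) for t in periods])
e_steps = increments([exponential(t) for t in periods])

# Position of the fastest growth within each trajectory.
s_peak = s_steps.index(max(s_steps))
e_peak = e_steps.index(max(e_steps))

print("logistic: fastest growth at step", s_peak, "of", len(s_steps) - 1)
print("exponential: fastest growth at step", e_peak, "of", len(e_steps) - 1)
```

Under these assumed parameters the logistic series reaches its largest increment in the middle step, while the exponential series reaches it in the final step, which is the absence of a "third phase of moderate growth" described above.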

FIGURE 5

Figure 5. Diffusion of the use of the article (el “the.M.SG”) before clauses introduced by que “that” presented as a weighted frequency by time period. Data from CORDE (Octavio de Toledo y Huerta, 2014b).

Blythe and Croft (2012) explicitly disregard the existence of such trajectories, at least in the case of competing constructions of the kind envisaged by Kroch (1989) and themselves:

To our knowledge there are no clearly documented cases of a change going toward completion that follows […] an exponential curve (either slow start with a rapid completion and no tapering off, or an immediate rapid increase followed by a slow completion rate) (Blythe and Croft, 2012: 280).

But the construction in Figure 5 emerged as an alternative to other complementizer schemas without displacing any of them. It is worth noting here that we are dealing with a rather uncommon kind of curve, its main interest being that it invites reflection on whether this form of diffusion results from some special circumstances. In my opinion, the answer could be yes. The phenomenon in Figure 5 is best explained as an extension in the use of the article as a syntactic marker from a similar and pre-existent construction in which the article precedes an infinitive clause (where the infinitive exhibits a clear verbal value: cf. Torres, 2009). As shown in Figure 6, the “contagion” of the article to subordinate clauses introduced by que “that” occurred when its use with the infinitive clause was at its height (indicated with the lightest colored bar on Figure 6); when this schema enters its decline, the derived construction with el + que also decays. The growth of the schema in Figure 5 can thus be seen to rely on the success of another semantically similar schema serving as a supporting construction (cf. De Smet and Fischer, 2017). The phenomenon is by no means unique: it can be found in similar examples, such as the semantic extension of sino es “if not, but” from an exceptive meaning (all were tall but Paul) to a corrective adversative linking sequence (Paul was not tall, but short: cf. Octavio de Toledo y Huerta, 2008). As Figure 7 shows, the adversative value starts to flourish quite abruptly at the same time as its exceptive use reaches its apex, and subsequently dies away, following an S pattern, at the same rate as the parent schema.

FIGURE 6

Figure 6. Article placed before infinitive clauses (data from CORDE, infinitives beginning with a- and r-).

FIGURE 7

Figure 7. Evolution of weighted frequencies for sino es “if not, but”: (A) with an exceptive meaning (sino es 1: no se casan sino es con permiso, “they won’t marry but with permission”), and (B) with corrective adversative meaning (sino es 2: no son pobres sino es ricos, “they are not poor but rich”). Data from CORDE.

We might be confronted here with a specific form of diffusion that could be termed “parasitic,” given the dependence of the derived construction upon the schema on which it is based. This type of extension to new semantic values or syntactic schemas appears to be typical of secondary grammaticalization (that which affects elements or sequences that already have a grammatical value; cf. especially Norde, 2012; Breban, 2014; Killie, 2015) and could produce very abruptly rising curves, which would thus be symptomatic of the mode of expansion and ensuing regression that Haspelmath (2004) terms retraction, i.e., the appearance and subsequent elimination of a function – in the sense of a new form-meaning pairing – toward the end of a grammaticalization chain (for the characteristic structures of such chains, cf. Heine, 1992). In any event, the formulation of this hypothesis, which surely needs further proof, is once again made possible by the observation of correlations between curves describing the trajectories of related phenomena.
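How such a correlation between a parent and a derived trajectory might be checked can be sketched as follows. The two series are invented weighted frequencies, not the values plotted in Figures 5 to 7, and the helper names `pearson` and `peak_index` are hypothetical; the sketch simply tests whether the derived schema peaks in the same period as its parent and whether both decline in step afterwards.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def peak_index(series):
    """Index of the period in which the series reaches its maximum."""
    return max(range(len(series)), key=series.__getitem__)


# Invented weighted frequencies per period (assumption, NOT corpus data).
parent = [2.0, 3.0, 5.0, 8.0, 9.0, 6.0, 3.0]    # supporting construction
derived = [0.0, 0.1, 0.2, 0.5, 4.0, 2.5, 1.0]   # "parasitic" construction

same_peak = peak_index(parent) == peak_index(derived)

# Correlation over the shared phase of decline (from the peak onwards).
start = peak_index(parent)
decline_r = pearson(parent[start:], derived[start:])
overall_r = pearson(parent, derived)

print("peaks coincide:", same_peak)
print("correlation in decline:", round(decline_r, 3))
print("overall correlation:", round(overall_r, 3))
```

A coincidence of peaks and a high correlation in the recessive phase would be compatible with the dependence described above, although with so few data points per curve such figures can only be suggestive.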

The comparison of cognate trajectories is the strategy behind Postma’s (2010) claim that a low-frequency, aborted change can still fuel other changes with greater impact. For instance, the superposition of the curve showing the usage of the definite article before Wh-questions and that showing the use of the article before a subordinate clause introduced by que “that” (Figure 8) confirms Lapesa’s (1984: 542–543) suggestion that the former schema, although short-lived and never very frequent, could stimulate the expansion of the latter (cf. Octavio de Toledo y Huerta, 2014b). Moreover, the rise of these clauses where an article precedes a complementizer que (el + que) could have buttressed (according to Girón, 2004) the emergence of the homophonic sequence el que “which,” a newly grammaticalized compound relative pronoun formed with the article and a relative que (Figure 9). This possibility, however likely, naturally raises an additional question about the role of the replication of sequences already familiar to the speaker (i.e., sequences that are part of the speaker’s competence and can serve as a model for the processing and production of new sequences) as a triggering mechanism of grammatical change (entrenchment via priming: cf. e.g., Szmrecsanyi, 2005; Jaeger and Rosenbach, 2008; Schmid, 2016; for some other historical phenomena in Spanish, cf. Torres, 2015; Mackenzie, 2017; Rosemeyer and Schwenter, 2017 in print).

FIGURE 8

Figure 8. Weighted frequency curves showing the constructions ART+Wh (me explicó el cómo lo hacían, “he explained ART how it was done”: lower curve) and ART+C (te agradezco el que vengas, “I am grateful to you ART that you came”: curve with peak frequency in 18th Century). Data from CORDE.

FIGURE 9

Figure 9. Weighted frequency curves of ART+C (striped band), non-oblique (subject/object) compound relative pronouns (solid band), and oblique relative pronouns (band with light dots on dark background). Data from CORDE.

Finally, with regard to the opportunity for improving our diasystemic (or variational) characterization of linguistic changes²¹, large corpora once again allow us to make important progress in little time. Thus, Rodríguez Molina (2012) was able to demonstrate, through the use of a substantial corpus which he painstakingly gathered himself, that the early (i.e., before 1450) reduction in the use of the longer variant habemos in favor of the shorter form hemos in compound tenses is an eastern phenomenon (Map 2A). The large Ibero-Romance corpora now available online allow us to establish in a few minutes that this decline was also early toward the west (in the possessive uses and in the modal periphrasis with the infinitive, since compound tenses are not used in these varieties of Spanish: Map 2B), and that the restriction of the shorter form hemos to analytic futures until the middle of the 1400s is, therefore, a phenomenon unique to central Castilian Spanish.

MAP 2

MAP 2. (A,B) Points where the shorter form hemos is found before 1450 according to Rodríguez Molina (2012) (top) and additional points established through consultation of the online corpora CORDE, CODEA+, TMILG, and CICA (bottom: western data is indicated with triangles).

Now, the study of how a change is adopted across discursive traditions is not only a methodological requirement of any investigation that seeks to take a non-trivial historical perspective²², but it can also provide the key for explaining the propagation of a given phenomenon. Such is the case with the schema in which the negative quantifier nada “nothing” is placed before the main verb, as in nada sé “nothing I know,” rather than the more common negative agreement construction, no sé nada “NEG I know nothing.” This schema expands and declines in use in accordance with the degree of acceptance of a syntactic rule imported into cultivated Spanish prose from the written tradition and grammar of Latin – a language with no negative agreement and typically preverbal negative quantifiers (nihil scio “Nothing I know”). As discussed elsewhere (Octavio de Toledo y Huerta, 2014a), the adoption of this rule can be easily tracked from its initial success amongst the first Spanish humanists in the 1400s, through highly elaborated works in a variety of genres, until the Romantic era, when, as part of a rejection of the classical rhetorical paradigm, the schema starts to fall into disuse, a trend which has continued to this day (Figure 10). This phenomenon illustrates the qualitative and quantitative gain derived from careful study of the discourse-traditional characteristics displayed by linguistic items and constructions (cf. the German adjective diskurstraditionell: Kabatek, 2015; Winter-Froemel et al., 2015; Varga, 2017; Octavio de Toledo y Huerta, 2018b), which more often than not signals a unique trajectory for each phenomenon throughout its recorded textual history²³. This endeavor, which departs from the itineraries traced by the phenomena themselves, seems more profitable to the interests of historical morphosyntax than that of establishing what might be called a taxonomy of discursive traditions and attempting to ascribe to them certain (allegedly) characteristic morphosyntactic features.

FIGURE 10

Figure 10. Proportion of appearances of the schema nada sé (as a percentage of the sum of these cases and those of the type no sé nada: % antep.), proportion of texts where nada “nothing” appears before the verb in more than 50% of cases (% tx ant. >50%) and proportion of texts where there are no cases of nada “nothing” before the verb (% ant. = 0%). Data from CORDE.
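The three measures named in this caption can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `figure10_measures` and the per-text token counts are hypothetical, every text is assumed to contain at least one relevant token, and the actual CORDE counts for each period are not reproduced here.

```python
def figure10_measures(texts):
    """Compute the three proportions for one time period.

    texts: list of (preverbal, postverbal) token counts, one pair per text,
    where preverbal counts the type "nada sé" and postverbal the type
    "no sé nada". Each text is assumed to have at least one token.
    """
    total_pre = sum(pre for pre, _ in texts)
    total_all = sum(pre + post for pre, post in texts)
    # A preverbal share above 50% is equivalent to pre > post.
    majority_texts = sum(1 for pre, post in texts if pre > post)
    zero_texts = sum(1 for pre, _ in texts if pre == 0)
    n_texts = len(texts)
    return {
        "pct_preverbal": 100 * total_pre / total_all,       # % antep.
        "pct_texts_over_50": 100 * majority_texts / n_texts,  # % tx ant. >50%
        "pct_texts_zero": 100 * zero_texts / n_texts,         # % ant. = 0%
    }


# Invented counts for five texts of one period (assumption, not CORDE data).
sample_period = [(3, 1), (0, 4), (1, 3), (2, 2), (0, 2)]
measures = figure10_measures(sample_period)

for name, value in measures.items():
    print(name, round(value, 1))
```

Keeping a token-based measure alongside two text-based ones guards against a single long text with many preverbal cases inflating the overall percentage for its period.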

I am well aware that I have presented more dilemmas and possibilities than solutions throughout this paper. However, I want to finish by introducing yet another challenge: perhaps the time is already ripe to start building the foundations for a classification of linguistic diffusion, distinguishing, for example, the regular S-curves, which one expects in the case of diffusion linked to competing pairs, from other curves that are possibly less uniform in their final stages such as those associated with diffusion via syntactic extension, or from ascending phases that do not seem to follow a logistic curve, and considering not simply the dynamics of these phases of rampant ascent but also regressive phases, which have been rather neglected until now. Indeed, we must attempt to profile the specific forms of linguistic diffusion that fit specific manifestations of linguistic change, as in the class of propagation we have termed “parasitic,” whose relationship with expansion and retraction at the extremes of grammaticalization chains places it in opposition to, for example, forms of propagation in which a new schema diffuses at the expense of similar forms or constructions that preceded it, which it ultimately replaces. This is exactly what happened in the evolution of the temporal conjunction ínterin “while,” which from its first appearance replaces, at an ever-increasing rate, its competitor ínterin que “meanwhile that,” which in turn had previously ousted the equivalent conjunctive phrases en (el) ínterin que “in (the) meantime that” (Figure 11; cf. Octavio de Toledo y Huerta, 2007). This form of diffusion could perhaps be dubbed “phagocytic” or “cannibalistic.” In this way, it becomes possible to start a catalog of the specific forms of change corresponding to particular forms of linguistic diffusion²⁴.

FIGURE 11

Figure 11. Evolution of relational uses of ínterin “while” [lo acabaremos (en (el)) ínterin (que) los demás llegan “we will finish it (in (the)) meantime the others arrive”]: Relative frequency percentages for each construction within each period. Data from CORDE.

Possibly, the best way to reach valid generalizations in research on linguistic diffusion is through trial and error. The process is thus fraught with error, but nevertheless worth the effort: the alternative would be to remain content with current remarks on diffusion, such as the increasing frequency of grammaticalized elements and constructions or the (alleged) universality of S-curve trajectories in cases of competing minimal pairs, which, as we hope to have illustrated, fall short of accounting for all the interesting ups and downs observed during propagation, are not well suited to describing the final phases (regressive or otherwise) of trajectories, and do not allow us to correlate types of change and modes of diffusion²⁵. Furthermore, researchers of S-curve effects, despite typically attributing the curve’s peculiar sinuosity to social interaction between speakers, do not seem particularly inclined to a more detailed exploration of how the diffusion of individual changes manifests in distinct variational dimensions [as described e.g., by Koch and Oesterreicher (1990/2011)], or of how far such changes can be attributed to concrete textual traditions. The relentless regularity of the S-curve has a distinct appeal in its elegant simplicity and uniformity, but, for many historians of language, it also arouses the desire to transcend its monotony in search of the unstable, brittle heterogeneity that is in all probability as intrinsic to the diffusion of linguistic changes as to all other social activities engaged in by human beings.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Funding

The work presented here was possible thanks to funding from the “Ramón y Cajal” Research Programme and from the Spanish National Projects “Diccionario histórico de las perífrasis verbales del español: Gramática, pragmática y discurso (II). Perífrasis temporales y aspectuales” (FFI2016-77397-P) and “Procesos de gramaticalización en la historia del español (V): Gramaticalización, lexicalización y análisis del discurso desde una perspectiva histórica” (FFI2015-64080-P), both financed by the Ministerio de Economía y Competitividad.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

I would like to thank my colleagues Maria Josep Cuenca, Mar Garachana, and Javier Rodríguez Molina for their insightful commentaries on the first draft of this work.

Footnotes

^https://ahle.webnode.es/
^Cano (1995: 324) already noted the preponderance of papers dedicated to historical morphosyntax in the proceedings of the first two of these conferences, a dominance which has been maintained and indeed increased in subsequent proceedings. According to Concepción Company Company (2017), in the last nine proceedings up until the present, of the three major areas of study, morphosyntax was the subject of 29% of contributions (437/1514), whilst the following two, Lexicology and semantics, on one hand, and Historiography and history of the language, on the other, received far less attention: 22% (339/1514) and 19% (286/1514) publications, respectively.
^www.rae.es
^Concerning these topics, cf. Lucía (2003) and more recently, Garachana and Artigas (2012), Kabatek (2012, 2016), Lleal (2013), Octavio de Toledo y Huerta (2016a), or Octavio de Toledo y Huerta and Rodríguez Molina (2017). However, these problems occur to a much lesser degree in CORDE than in other competing corpora. This applies particularly to Mark Davies’ Corpus del Español, which offers clear benefits in terms of searchability, but is far less complete and includes texts of a (much) lower philological quality. For a critical comparison between CORDE and the Corpus del Español, cf. Davies (2009) and Rojo (2010); also of interest for an analysis of certain specific phenomena on both corpora are Nieuwenhuijsen (2009) and García Salido and Vázquez Rozas (2012). Concerning the general characteristics of CORDE, cf. also Sánchez and Domínguez (2007) and Rojo (2012), and for its use as a tool for syntactical analysis, cf. e.g., Sánchez Lancis (2009); Buenafuentes and Sánchez Lancis (2012), and Octavio de Toledo y Huerta (2016a).
^Another 6 studies (17%) use Davies’ Corpus del Español, half of which (3/6) also use CORDE. There are 4 studies (11%) based on CODEA and 2 (5.5%) which rely on Biblia Medieval (one of these also uses CODEA). Only two studies use the holochronic methodology of selecting only two or three works per century; in one case this approach is supported by CORDE. Of those studies completed without recourse to an electronic corpus, the vast majority (6/9) are exclusively focussed on medieval Spanish, a fact which shows how studies which, exceptionally, do not use this type of corpus tend to adhere to the most traditional methods of data collection and are concerned with the language before the 1500s. For the purposes of comparison with Table 1, the largest group of contributions concerning morphosyntax at IX CIHLE are holochronic studies (17/36, 47%): these are followed by the studies focussing on the Middle Ages (14/36, 39%) and the remaining few are, as ever, those dedicated to Classical Spanish (3/36, 8%) and Modern Spanish (2/36, 6%). This confirms the findings concerning recent trends as shown in Table 1.
^For further criticism of the “darker” aspects in conducting research with CORDE, I refer the reader to Octavio de Toledo y Huerta (2014a, 2016a), Fernández Alcaide et al. (2016), or Octavio de Toledo y Huerta and Rodríguez Molina (2017).
^Three hundred and sixty-seven cases were retrieved through an exhaustive search of CORDE; cf. Octavio de Toledo y Huerta (2016b).
^Thus, this line reflects the fact that in the 1485–1524 period, for instance, only around 5% of the total tokens are found, whereas almost 45% of the tokens concentrate in the last described period, 1605–1660.
^For an analysis of the syntactic behavior and other properties of “analytic” futures (and conditionals), cf. Castillo (2002), Concepción Company Company (2006), Girón (2007); Bouzouita (2011), Octavio de Toledo y Huerta (2015a), and Batllori (2016).
^For a more complete list of digital libraries available on the web, cf. Kabatek (2016: 15–16).
^This digital portal often contains all those ancient books from an individual library that its librarians have decided to scan based on the book’s accessibility, state of preservation, date or place of publication, adscription to a certain collection or series, material value, and rarity, etc., without particular reference to the “literary” quality of these works. This clearly results in a much more diverse selection than usual in standard, closed diachronic corpora. For an analysis of the effects of esthetic prejudice on the preference for certain periods and works, which in turn influences editorial practice and thus the set of texts available in a philologically reliable form, cf. Pons (2006) and Montaner (2011). The main characteristics of a corpus that is authentically representative as suited to the needs of historical linguistics have been described by Kabatek (2013).
^i.e., Characteristic of the eastern half of the Iberian Peninsula.
^As a reviewer rightly points out, empirical research conducted directly on Google and similar platforms raises issues concerning the representativity of the data obtained, particularly when searching for competing variants. I address some such issues in Octavio de Toledo y Huerta (2016c), as do other contributions in that special issue.
^A parallel ongoing discussion concerns the universality of the Constant Rate Hypothesis (CRH), equally introduced by Kroch (1989), which predicts that syntactic change takes place at the same pace across different contexts: cf. now Kauhanen and Walkden (2018) for a refined version of the CRH.
^“The expansion in the frequency of use of haber + PP in the period 1425–1524 was completely dependent on the expansion of haber + PP to new syntactic contexts. The relevance of this result lies in the fact that it permits the formulation of a hypothesis about the causes for the substitution of ser + PP by haber + PP in Spanish: our analysis suggests that the expansion of haber + PP into contexts of use where ser was previously used was caused by the syntactic factor of reflexivity” (Rosemeyer, 2016: 499 [my translation, ÁOdT]).
^Of course, a slow inception of the curve before 1450 and a slow tapering-off after 1650 can be reasonably assumed: my focus here is on the rather unexpected quick-slow-quick pace in the central stages of the diffusion. The possibility of a difference between lexical expansion and other types of diffusion was already suggested by Denison (2003).
^Alternatively, the first curve could correspond to the type that, instead of a tendency to generalization in the final stages, leads to “reasonably stable variation with the variants fluctuating around a mean percentage value” (Blythe and Croft, 2012: 280), a configuration which these authors subsume into the general category of the S-curve. Note that, consequently, the form taken by the progression of the final generalization, should there be any, is irrelevant to the consideration of a trajectory as an S-curve.
^See the project’s web page at: http://gradiadiacronia.wixsite.com/gradia.
^Incidentally, as indicated by Nevalainen (2015), in contrast to the phase of expansion in which the presence of S-curves is expected, their presence in the recessive phase requires additional explanation from a sociolinguistic point of view: “If the outcome is expected (with the benefit of hindsight, for example), the diffusion of linguistic change along an S-shaped curve does not necessarily call for an explanation, but a change reversal normally does.”
^I deliberately leave aside those phenomena that Concepción Company Company (2017) dubs “continuous changes,” that is, those that do not display noticeable frequency alternations throughout their evolution. For other objections to the alleged universality of the S-curve as regards the diffusion of linguistic changes, cf. Denison (2003) or Winter-Froemel (2014).
^On the theoretical necessity of doing so, cf. Fernández-Ordóñez (2011) for diastratic, and Eberenz (2009) or Kabatek (2012) for diaphasic, considerations.
^Suffice it to quote the much missed Wulf Oesterreicher: “questions about innovation strategies and so-called paths of grammaticalization should always be followed by questions concerning the discursive avenues of diffusion and successive adoption of these innovations by speakers” [Oesterreicher, 2006: 146 (original emphasis; my translation, ÁOdT)].
^Interest in the interaction between frequency and the preference of individual phenomena for certain (groups of) texts and textual forms is breaking through in historical linguistics from several different methodological standpoints [cf. e.g., Szmrecsanyi, 2016 or Simonenko et al. (2018)]. The discourse traditions (or DT) framework (as expounded e.g., in Kabatek, 2017) offers, however, a useful specific set of theoretical notions and insights with which to track historical routes of diffusion across texts.
^I expand on this proposal in more detail in Octavio de Toledo y Huerta (2016a).
^In other words, “S-curves provide only a very coarse-grained description of the spreading of linguistic innovations in a population,” but S-curves themselves “can be used to discriminate between different mechanistic descriptions and to quantify the importance of different factors known to act on language change” (Ghanbarnejad et al., 2014: 8b).

References

Academia Mexicana de la Lengua (2014). Corpus Diacrónico y Diatópico del Español de América. Available at: http://www.cordiam.org/

Enrique-Arias, A. (2004). Corpus Biblia Medieval. Available at: http://www.bibliamedieval.es/

Anipa, K. (2000). A study of the analytic future / conditional in Golden-Age Spanish. Bull. Hisp. Stud. 77, 325–338.
