- 1 Neurocognitive Kinesiology Lab, Department of Kinesiology and Community Health, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- 2 NeuroLearn Lab, Department of Psychology, Georgia State University, Atlanta, GA, USA
Recent studies have demonstrated participants’ ability to learn cross-modal associations during statistical learning tasks. However, these studies are all similar in that the cross-modal associations to be learned occur simultaneously, rather than sequentially. In addition, the majority of these studies focused on learning across sensory modalities but not across perceptual categories. To test both cross-modal and cross-categorical learning of sequential dependencies, we used an artificial grammar learning task consisting of a serial stream of auditory and/or visual stimuli containing both within- and cross-domain dependencies. Experiment 1 examined within-modal and cross-modal learning across two sensory modalities (audition and vision). Experiment 2 investigated within-categorical and cross-categorical learning across two perceptual categories within the same sensory modality (e.g., shape and color; tones and non-words). Our results indicated that individuals demonstrated learning of the within-modal and within-categorical but not the cross-modal or cross-categorical dependencies. These results stand in contrast to the previous demonstrations of cross-modal statistical learning, and highlight the presence of modality constraints that limit the effectiveness of learning in a multimodal environment.
Introduction
Many organisms have the ability to detect invariant patterns and associations from a seemingly chaotic environment. One such ability, statistical–sequential learning, involves the learning of statistical patterns across items presented in sequence (Saffran et al., 1996; Daltrozzo and Conway, 2014). Statistical learning appears to be central to the development of many cognitive functions, especially language (Saffran et al., 1996; Conway et al., 2010; Misyak et al., 2010; Nemeth et al., 2011; Arciuli and Simpson, 2012; Kidd, 2012; Misyak and Christiansen, 2012). Traditionally, statistical learning has been studied in a unimodal manner, presenting participants with stimuli to a single sensory modality, such as audition, vision, or touch (Saffran et al., 1996; Fiser and Aslin, 2001; Kirkham et al., 2002; Conway and Christiansen, 2005). However, in many natural circumstances, such as spoken language, multiple sensory modalities are involved. For example, sighted individuals make extensive use of visual facial information, such as the movement of the mouth, to aid in speech perception (Rosenblum, 2008).
Despite the importance of multisensory integration in language processing and other areas of cognition, only recently has multisensory integration been investigated in the context of statistical learning. Toward this end, Mitchel and Weiss (2011) presented unimodal auditory and visual input streams simultaneously to participants and manipulated the audiovisual correspondence across the two modalities. They found that learners could extract the statistical associations in each input stream independently of the other (consistent with the findings of Seitz et al., 2007) except when the triplet boundaries were desynchronized across the visual and auditory streams. In such conditions, learning was disrupted, suggesting that statistical learning is affected by cross-modal contingencies. Other studies have similarly shown that input presented in one modality can affect learning in a second concurrently presented modality. For instance, Cunillera et al. (2010) showed that simultaneous visual information could improve auditory statistical learning if the visual cues were presented near transition boundaries (see also Robinson and Sloutsky, 2007; Sell and Kaschak, 2009; Mitchel and Weiss, 2010; Thiessen, 2010). More recently, Mitchel et al. (2014) used the McGurk illusion to demonstrate that learners can integrate auditory and visual input during a statistical learning task, suggesting that statistical computations can be performed on an integrated multimodal representation.
Although these studies are all clear demonstrations of multimodal integration during statistical learning tasks, they all used concurrent auditory and visual input. That is, the visual and auditory inputs were presented simultaneously, and learners were tested on their ability to learn these simultaneous cross-modal associations. No studies to our knowledge have tested the extent to which cross-modal statistical associations can be learned and integrated across time as elements in a sequence, in which an auditory stimulus (e.g., a tone) might be associated with the next occurrence of a particular visual stimulus (e.g., a shape) or vice versa. In addition, previous studies have used multi-sensory patterns containing cross-modal regularities across sensory modalities, but none to our knowledge have tested learning of dependencies across different perceptual categories within the same sensory modality (e.g., color and shape or tones and non-words). It is possible that learning cross-modal dependencies may have different computational demands than the learning of cross-categorical dependencies, perhaps due to differences in perceptual or attentional requirements.
The aim of the present study, therefore, was to investigate the limits of cross-domain statistical–sequential learning. From a purely associative learning framework, it might be hypothesized that statistical patterns should be learned just as readily between stimuli regardless of their modality or perceptual characteristics (i.e., learning a dependency between items A and B should not be any different than learning a dependency between items A and C). Such an unconstrained view of statistical learning was common in its early formulations (see Frensch and Runger, 2003 and Conway et al., 2007 for discussion). However, it is now known that statistical learning is constrained by attentional and perceptual factors. For example, statistical learning of non-adjacent relationships is heavily influenced by perceptual similarity, with learning improving when the non-adjacent elements are perceptually similar to one another (i.e., have a similar pitch range or share some other perceptual cue; Creel et al., 2004; Newport and Aslin, 2004; Gebhart et al., 2009). Likewise, Conway and Christiansen (2006) proposed that statistical learning is analogous to perceptual priming, in which networks of neurons in modality-specific brain regions show decreased activity when processing other items within the same modality that have similar underlying regularities or structure (see also Reber et al., 1998; Chang and Knowlton, 2004; Conway et al., 2007). Recent neuroimaging evidence confirms that statistical learning is mediated at least in part by processing in unimodal, modality-specific brain regions (Turk-Browne et al., 2009) – in addition to involving “downstream” brain regions that appear less tied to a specific perceptual modality, including Broca’s area, the basal ganglia, and the hippocampus (Lieberman et al., 2004; Opitz and Friederici, 2004; Petersson et al., 2004; Abla and Okanoya, 2008; Karuza et al., 2013; Schapiro et al., 2014). Thus, the existing literature suggests that statistical learning involves both bottom-up perceptual processing via unimodal, modality-specific mechanisms and more domain-general learning and integration processes that perhaps occur further downstream (Keele et al., 2003; Conway and Pisoni, 2008; Daltrozzo and Conway, 2014; Frost et al., 2015).
Thus, the learning of sequential patterns appears to be at least partly constrained by the nature of the sensory and perceptual processes that are engaged. Another way to think of this is that statistical learning is likely influenced by Gestalt-like principles that make it easier to learn associations between items in the same modality or that share perceptual features (Newport and Aslin, 2004). Consequently, statistical learning of cross-modal or cross-categorical sequential associations might be more challenging than the previous empirical research seems to indicate. It is possible that the previous demonstrations of multisensory integration during statistical learning tasks that used concurrent auditory and visual input (e.g., Cunillera et al., 2010; Mitchel and Weiss, 2011; Mitchel et al., 2014) were less cognitively demanding than learning elements across a temporal sequence. It is currently an open question to what extent statistical–sequential cross-modal and cross-categorical dependencies can be learned.
To test cross-modal and cross-categorical statistical learning, we employed an artificial grammar learning (AGL) paradigm, commonly used to study implicit and statistical learning (Seger, 1994; Perruchet and Pacton, 2006), in which stimuli are determined by a finite state grammar. Unlike previous statistical learning or AGL tasks, our paradigm used a series of inputs from different sensory modalities and/or perceptual categories, with each individual unit presented in succession. In this manner, we could test whether participants can learn cross-domain dependencies across the temporal sequence. The grammar itself (see Figure 1), created by Jamieson and Mewhort (2005) and also used by Conway et al. (2010), has certain advantages over other artificial grammars commonly used. First, unlike most other grammars, including the classic “Reber” grammar (Reber, 1967) and countless others, there are no positional constraints. That is, each element of the grammar can occur at any position, with equal frequency, preventing position information – such as which elements or pairs of elements occur at the beginning versus the ending of sequences – from becoming a confound. Second, there are also no constraints on sequence length. A large set of stimuli can be generated at a particular length (such as length 6 used in the present study), preventing sequence length from becoming a confound. Finally, the grammar describes the probability with which a successive element (n+1) can occur given the previous element (n). This means that primarily first-order element transitions are contained in the grammar; thus, “learning the grammar” in this case generally means one thing: learning the forward-transition, adjacent element statistics1, making interpretation about what is learned or not learned relatively straightforward. Consequently, this also makes it easy to design sequences containing both cross-domain and within-domain dependencies.
FIGURE 1. The artificial grammar used in both Experiments. “V” and “A” refer to visual and auditory stimuli, respectively.
In Experiment 1, participants were exposed to input sequences generated from the artificial grammar that were composed of tones interspersed with pictures of shapes. Importantly, the sequences consisted of both within-modal (e.g., tone–tone or shape–shape) and cross-modal associations (e.g., tone–shape or shape–tone). In Experiment 2 the sequences were composed of stimuli from two different perceptual categories within the same sensory modality (shapes and colors for the visual stimuli and tones and single syllable non-words for the auditory stimuli), allowing us to test cross-categorical learning. By incorporating a combination of within- and cross-modality stimuli (Experiment 1) and within- and cross-category stimuli (Experiment 2), we were able to examine to what extent participants naturally learn statistical–sequential patterns across sensory domains and perceptual categories.
Experiment 1: Learning Across Sensory Modalities
Materials and Methods
Participants
Fifteen undergraduate students from a Midwest university participated (Age Range = 18–23; Mean Age = 18.93; Females = 9). All were fluent English speakers. All participants were enrolled in college at the time of their participation. Participants received credit toward partial fulfillment of an undergraduate course as compensation for their time. The study was carried out in accordance with the recommendations of the Saint Louis University Institutional Review Board. All participants gave written informed consent in accordance with the Declaration of Helsinki.
Stimulus Materials
We used an artificial grammar consisting of three visual and three auditory elements. The visual elements were abstract black and white shapes used previously in a study by Joseph et al. (2005) and considered difficult to verbally label. The auditory elements were three pure tones that were generated using Audacity software, having frequencies of 210, 286, and 389 Hz, which neither conform to standard musical notes nor have standard musical intervals between them (as used in Conway and Christiansen, 2005).
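Although the original tones were generated in Audacity, comparable stimuli can be synthesized in a few lines of Python; the sketch below is only an equivalent reconstruction, with the 1000 ms duration taken from the Procedure and the sample rate and amplitude chosen arbitrarily.

```python
import numpy as np
from scipy.io import wavfile

def pure_tone(freq_hz, duration_s=1.0, sample_rate=44100, amplitude=0.5):
    """Synthesize a pure sine tone; the study used 210, 286, and 389 Hz tones,
    each presented for 1000 ms."""
    t = np.linspace(0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return (amplitude * np.sin(2 * np.pi * freq_hz * t)).astype(np.float32)

# Write one WAV file per tone (file names are placeholders)
for freq in (210, 286, 389):
    wavfile.write(f"tone_{freq}Hz.wav", 44100, pure_tone(freq))
```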
Each sequence was generated by an artificial grammar with constrained probabilities (similar to those used in Jamieson and Mewhort, 2005 and Conway et al., 2010; See Figure 1). The grammar dictates that any given element can be followed by one element from the same sensory modality and one element from the other sensory modality. For example, if V1 is the starting element, it can be followed by either A2 or V2 with an equal probability (50/50%). Thus, V1–A2–A3–V3–A1–V1 is an example of a sequence that could be generated by this grammar; it contains four cross-modal dependencies (V1–A2; A3–V3; V3–A1; A1–V1) and one within-modal dependency (A2–A3; see Figure 2).
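To make the generative process concrete, the sketch below draws a learning stream from a transition table of this form. Only the transitions stated explicitly in the text and footnote 1 (V1→V2/A2, A1→V1/A2, A2→A3, A3→V3, V3→A1) are taken from the article; the remaining entries are illustrative placeholders that simply respect the constraint that each element has one within-modal and one cross-modal successor, each followed with probability 0.5. The actual transition table is the one shown in Figure 1.

```python
import random

# Illustrative transition table: each element is followed by one within-modal and
# one cross-modal element, each with probability 0.5. Entries marked "assumed"
# are placeholders, not taken from the article; see Figure 1 for the real grammar.
GRAMMAR = {
    "V1": ["V2", "A2"],  # stated in the text
    "V2": ["V3", "A3"],  # assumed
    "V3": ["V1", "A1"],  # V3 -> A1 stated; within-modal successor assumed
    "A1": ["A2", "V1"],  # stated in the text and footnote 1
    "A2": ["A3", "V2"],  # A2 -> A3 stated; cross-modal successor assumed
    "A3": ["A1", "V3"],  # A3 -> V3 stated; within-modal successor assumed
}

def generate_stream(length=180, seed=None):
    """Generate a continuous learning stream of grammar elements."""
    rng = random.Random(seed)
    element = rng.choice(list(GRAMMAR))         # any element may begin the stream
    stream = [element]
    while len(stream) < length:
        element = rng.choice(GRAMMAR[element])  # 50/50 choice between successors
        stream.append(element)
    return stream

print(generate_stream(12, seed=1))
```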
Using the grammar presented in Figure 1, a single “learning” stream was generated and used for all participants, consisting of 180 stimuli presented in sequence. In addition, three types of six-item test sequences were constructed: grammatical sequences, ungrammatical sequences containing within-modal violations, and ungrammatical sequences containing cross-modal violations. To create within-modal violation sequences, all within-modal dependencies were altered so that they violated the grammar, with the cross-modal dependencies remaining grammatical. For cross-modal violation sequences, all cross-modal dependencies were altered so that they violated the grammar, with the within-modal dependencies remaining grammatical. For example, in the case of a within-modal violation sequence, if the grammatical sequence was V1–A2–A3, the element A3 would be replaced with the other auditory element, so that the sequence would become V1–A2–A1. From that point, the grammar would be renewed and would continue correctly until another within-modal transition occurred. We constructed 20 grammatical test sequences, 10 within-modal ungrammatical test sequences, and 10 cross-modal ungrammatical test sequences. The total number of violations was roughly equal across the within-modal violation stimulus set (28 violations in total, or 2.8 per sequence on average) and the cross-modal violation stimulus set (25 violations in total, or 2.5 per sequence on average), and the difference was not statistically significant (t = 0.669, p = 0.512). All test sequences are listed in the Appendix (Table A1).
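The substitution step for a within-modal violation item can be sketched as follows, using the illustrative element labels from above. This is a simplification: it performs only the replacement described in the text (e.g., V1–A2–A3 becomes V1–A2–A1), whereas the actual items were further constrained so that the grammar “renewed” and continued correctly after each substitution; the cross-modal violation items were built with the complementary substitution at cross-modal transitions.

```python
# Domain membership for the six grammar elements
MODALITY = {"V1": "V", "V2": "V", "V3": "V", "A1": "A", "A2": "A", "A3": "A"}
ELEMENTS = {"V": ["V1", "V2", "V3"], "A": ["A1", "A2", "A3"]}

def make_within_modal_violations(grammatical_seq):
    """Replace the second element of each within-modal transition so that it
    violates the grammar, leaving cross-modal transitions untouched."""
    seq = list(grammatical_seq)
    for i in range(1, len(seq)):
        prev, cur = seq[i - 1], seq[i]
        if MODALITY[prev] == MODALITY[cur]:
            # Swap in the remaining element of that modality (neither prev nor cur)
            seq[i] = next(e for e in ELEMENTS[MODALITY[cur]] if e not in (prev, cur))
    return seq

# V1-A2-A3-V3-A1-V1 -> V1-A2-A1-..., as in the example given in the text
print(make_within_modal_violations(["V1", "A2", "A3", "V3", "A1", "V1"]))
```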
Procedure
All participants completed a learning phase and a test phase. In the learning phase, participants were seated in front of a computer monitor with a pair of headphones. They were instructed to pay attention to the pictures and sounds that were displayed. Participants were exposed to the continuous stream of 180 shapes and tones that was generated using the grammar. The durations for both the auditory and visual stimuli were 1000 ms each, with an ISI of 1000 ms, giving a total learning phase duration of 6 min.
In the test phase of the experiment, participants were told that the input stream they had observed was created according to certain rules that determined the order that each element was presented. Participants were then presented with each of the six-item test sequences and were asked to determine if each item “followed the rules” (i.e., was grammatical) or “did not follow the rules” (i.e., was ungrammatical). Participants responded by pressing one of two buttons to indicate their choice. Participants were exposed to the novel grammatical, within-modal ungrammatical, and cross-modal ungrammatical sequences in random order. Within each test sequence, the stimulus durations (1000 ms) and ISI (1000 ms) were the same as used in the learning phase. Participants had as much time as needed to make their response, after which the next test trial began. Note that for both the learning and test phases the auditory and visual tokens were randomly assigned and mapped to the elements of the grammar. For one participant A1 might be the 210 Hz tone, but for another participant A1 might be the 286 Hz tone, etc. Thus, even though each participant received the same learning and test items in terms of their underlying structural patterns, the actual tokens that mapped onto these patterns differed for each participant, determined randomly.
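The per-participant randomization of tokens onto grammar elements amounts to a simple shuffle. In the sketch below the token labels are hypothetical placeholders for the actual shape images and tone files.

```python
import random

def assign_tokens(visual_tokens, auditory_tokens, seed=None):
    """Randomly map concrete stimuli onto the abstract grammar elements for one
    participant (e.g., which pure tone counts as A1, A2, or A3)."""
    rng = random.Random(seed)
    v, a = list(visual_tokens), list(auditory_tokens)
    rng.shuffle(v)
    rng.shuffle(a)
    return {"V1": v[0], "V2": v[1], "V3": v[2],
            "A1": a[0], "A2": a[1], "A3": a[2]}

# Hypothetical token labels, for illustration only
print(assign_tokens(["shape_1", "shape_2", "shape_3"],
                    ["tone_210Hz", "tone_286Hz", "tone_389Hz"], seed=7))
```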
Results and Discussion
Results are shown in Table 1, displaying the percentage of test items classified correctly for each test item type. Performance on the within-modal violation sequences was numerically the highest (M = 64.00%), followed by performance on the grammatical sequences (M = 60.67%), and lastly the cross-modal violations (M = 48.67%). To explore the accuracy of participants’ performance on the three item types, a repeated measures analysis of variance (ANOVA) was conducted, indicating a significant main effect of sequence type [F(2,28) = 4.893, p = 0.015, ηp2 = 0.259]. A test of simple comparisons with a Bonferroni correction indicated that there was a statistically significant difference between performance on the within-modal violations and the cross-modal violations (p < 0.05). A series of single-sample t-tests was run to compare average performance on each item type to chance (50%). The analysis indicated that participants performed significantly above chance on the grammatical items (t = 4.00, p ≤ 0.05, Cohen’s d = 1.04) and the within-modal ungrammatical items (t = 4.18, p ≤ 0.001, d = 1.08), but not on the cross-modal ungrammatical items (t = –0.31, p ≥ 0.10, d = 0.08).2
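For readers who want to reproduce this style of analysis, a minimal script along these lines is sketched below, assuming a long-format table with one accuracy score per participant and sequence type; the file and column names are hypothetical, not those used in the study.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Hypothetical columns: 'subject', 'seq_type' ('grammatical', 'within_violation',
# or 'cross_violation'), and 'accuracy' (percentage of test items classified correctly)
df = pd.read_csv("exp1_accuracy.csv")  # placeholder file name

# Repeated measures ANOVA: effect of sequence type on classification accuracy
anova = AnovaRM(df, depvar="accuracy", subject="subject", within=["seq_type"]).fit()
print(anova)

# Single-sample t-tests comparing each sequence type to chance (50%)
for seq_type, scores in df.groupby("seq_type")["accuracy"]:
    t, p = stats.ttest_1samp(scores, 50)
    print(f"{seq_type}: t = {t:.2f}, p = {p:.3f}")
```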
The findings from Experiment 1 indicate that participants were more proficient at detecting within-modal violations – that is, violations occurring between stimuli in the same sensory modality – than at detecting cross-modal violations that occurred between stimuli in different modalities. In fact, participants were completely unable to successfully detect violations of cross-modal contingencies. The lack of cross-modal learning stands in contrast to previous studies of multimodal statistical learning, which differed from the present study by their use of simultaneous rather than sequential cross-modal dependencies.
Experiment 2: Learning Across Perceptual Categories within a Single Sensory Modality
The results of Experiment 1 demonstrate that when exposed to multimodal sequential patterns, within-modal but not cross-modal violations can be detected. In Experiment 1, the cross-modal associations were multisensory (i.e., consisting of audio–visual or visual–audio links). Another way to probe multimodal learning is to test the ability to learn associations that are within the same sensory modality (e.g., vision or audition) but that exist between different perceptual categories (e.g., tones and words or colors and shapes).
Materials and Methods
Participants
A new group of 32 undergraduate students from the same Midwestern university participated in the study (Age Range = 18–22; Mean Age = 19.54, 1 not reported; Females = 20, 2 not reported). All students were fluent speakers of English. All participants were enrolled in college at the time of their participation and received credit toward partial fulfillment of an undergraduate course as compensation for their time. Participants were randomly assigned to one of two conditions, auditory or visual. The study was carried out in accordance with the recommendations of the Saint Louis University Institutional Review Board. All participants gave written informed consent in accordance with the Declaration of Helsinki.
Stimulus Materials
For Experiment 2, four types of stimuli were used: the same black and white shapes and same pure tones as before, as well as three colored circles (red, green, and blue) and three single-syllable non-words (“dak,” “pel,” and “vot”). This provided us with four sets of perceptual categories, two sets (tones, non-words) for the auditory modality and another two sets (shapes, colors) for the visual modality.
The learning stream and test sequences were the same as used in Experiment 1, except that the inputs were altered to reflect the two new stimulus sets. In the visual condition, A and V elements of the original grammar were replaced with shapes and colors (see Figure 3). In the auditory condition, A and V elements of the grammar were replaced with tones and non-words (see Figure 4). Mirroring the design of Experiment 1, the new stimuli formed a set of grammatical sequences, within-categorical ungrammatical sequences, and cross-categorical ungrammatical sequences.
FIGURE 3. An example of a grammatical sequence used in the visual condition of Experiment 2 (V1–A2–A3–V3–A1–V1). Note that in this case, “A” no longer refers to an auditory stimulus but to the second category of visual stimuli (colors).
FIGURE 4. An example of a grammatical sequence used in the auditory condition of Experiment 2 (V1–A2–A3–V3–A1–V1). Note that in this case, “V” no longer refers to a visual stimulus but to the second category of auditory stimuli (non-words).
Procedure
The procedure was the same as in Experiment 1, except that participants were assigned to one of two groups (auditory or visual). Participants assigned to the auditory group received input sequences composed of the two categories of auditory stimuli (pure tones and non-words), while participants assigned to the visual condition received sequences composed of the two categories of visual stimuli (abstract shapes and colored circles).
Results and Discussion
As in Experiment 1, the accuracy scores are displayed as percentages (Table 1). To investigate the performance of the two groups, a repeated measures, mixed factor ANOVA was conducted with group (visual and auditory) as the between-subjects variable and sequence type (grammatical, within-category violations, and cross-category violations) as the within-subjects variable. The results of Mauchly’s test of sphericity suggested that the sphericity assumption of a repeated measures ANOVA was violated [Mauchly’s W = 0.508, p ≤ 0.001]. Therefore, in all subsequent analyses, the Greenhouse–Geisser corrected results are reported. The analysis revealed significant main effects of sequence type [F(1.34,40.20) = 13.56, p ≤ 0.001, ηp2 = 0.311] and group [F(1,30) = 4.794, p ≤ 0.05, ηp2 = 0.138]. A test of simple comparisons with a Bonferroni correction was run to further explore the main effects. It revealed that participants across the two groups performed significantly higher on the within-category violation sequences than either the grammatical sequences (p ≤ 0.001) or the cross-category violation sequences (p ≤ 0.01). Across sequence type, participants in the auditory condition performed better than participants in the visual condition.
A series of single-sample t-tests was run within each group of participants, comparing performance to chance for each sequence type. The analysis showed that participants in the visual group performed better than chance only on the within-category violation items (t = 3.162, p ≤ 0.01, d = 0.79). However, participants in the auditory group showed better than chance performance for both the within-category violation items (t = 7.806, p ≤ 0.001, d = 1.95) and the grammatical items (t = 3.597, p ≤ 0.01, d = 0.90). Performance for cross-category items was at chance levels for both groups.3
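A comparable mixed-design analysis can be sketched with the pingouin package, again using hypothetical file and column names; with the correction option enabled, mixed_anova reports a sphericity test and Greenhouse–Geisser corrected p-values for the within-subject factor.

```python
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical columns: 'subject', 'group' ('auditory' or 'visual'),
# 'seq_type', and 'accuracy'
df = pd.read_csv("exp2_accuracy.csv")  # placeholder file name

# Mixed ANOVA: sequence type (within subjects) x group (between subjects),
# with a sphericity correction applied to the within-subject factor
aov = pg.mixed_anova(data=df, dv="accuracy", within="seq_type",
                     subject="subject", between="group", correction=True)
print(aov)

# Chance-level (50%) comparisons for each group and sequence type
for (group, seq_type), scores in df.groupby(["group", "seq_type"])["accuracy"]:
    t, p = stats.ttest_1samp(scores, 50)
    print(f"{group} / {seq_type}: t = {t:.2f}, p = {p:.3f}")
```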
The pattern of results mirrors that seen in Experiment 1 but extends it to the learning of statistical patterns across perceptual categories within the same sensory modality. Participants were able to detect statistical–sequential violations within the same perceptual category but were unable to detect violations across perceptual categories. Furthermore, a modality effect was observed, consistent with previous research showing that audition displays higher levels of learning for sequentially presented patterns (Conway and Christiansen, 2005; Emberson et al., 2011).
General Discussion
The findings from this study suggest that learning statistical–sequential associations within a perceptual or sensory domain is easier than learning across domains. In Experiment 1, participants displayed significantly higher accuracy for identifying sequential violations that occurred between elements within the same sensory modality (e.g., tone–tone or shape–shape) than they did identifying violations at cross-modal boundaries (e.g., tone–shape). Likewise, participants in Experiment 2 showed significantly better performance identifying violations between elements in the same perceptual category (e.g., tone–tone, word–word, shape–shape, or color–color) than they did identifying violations at category boundaries within the same sensory modality (e.g., tone–word or color–shape). It appears that statistical learning is biased to operate first within a particular perceptual category, before integrating items across categories or across modalities.
The test sequences were constructed such that some items contained within-domain violations of the grammar whereas other items contained cross-domain violations. All of these violations involved adjacent dependencies. However, the grammar also contains subtle non-adjacent regularities that participants possibly could have learned. The statistical strength of these non-adjacent dependencies, however, is relatively weak compared to the strength of the adjacent-item dependencies, with transitional probabilities being 0.33 for the former and 0.5 for the latter (that is, the grammar stipulates two adjacent-item links for each stimulus and three non-adjacent item links for each stimulus). Because Gómez (2002) demonstrated that the learning of non-adjacent dependencies occurs only when the adjacent-item statistics are unreliable, which was not the case in the present study, it is more likely that participants’ performance was based primarily on the learning of adjacent-item dependencies. Regardless, even if some amount of non-adjacent item statistics were learned, it would not substantially change the overall finding of this study, which is that cross-domain sequential dependencies appear to be difficult to learn.
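The 0.5 figure for adjacent transitions is easy to verify numerically. The self-contained snippet below regenerates a long stream from the illustrative (partly assumed) transition table used in the Methods sketch and tabulates the empirical first-order transitional probabilities, which converge on 0.5 for each of an element’s two grammatical successors.

```python
import random
from collections import Counter, defaultdict

# Illustrative transition table from the Methods sketch (entries not stated in
# the article are assumed; see Figure 1 for the actual grammar)
GRAMMAR = {"V1": ["V2", "A2"], "V2": ["V3", "A3"], "V3": ["V1", "A1"],
           "A1": ["A2", "V1"], "A2": ["A3", "V2"], "A3": ["A1", "V3"]}

# Generate a long stream so the empirical probabilities stabilize
rng = random.Random(3)
stream, element = [], rng.choice(list(GRAMMAR))
for _ in range(20000):
    stream.append(element)
    element = rng.choice(GRAMMAR[element])

# Estimate P(next | current) for adjacent elements; each grammatical successor
# should come out near 0.5
counts = defaultdict(Counter)
for prev, cur in zip(stream, stream[1:]):
    counts[prev][cur] += 1
for prev in sorted(counts):
    total = sum(counts[prev].values())
    print(prev, {nxt: round(n / total, 2) for nxt, n in sorted(counts[prev].items())})
```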
These findings furthermore provide evidence that is not entirely consistent with a purely domain-general view of statistical learning (Altmann et al., 1995; Manza and Reber, 1997; Kirkham et al., 2002), in which all dependencies would be expected to be treated the same and learned at comparable levels. Instead, the present findings are consistent with previous research suggesting the presence of modality constraints and perceptual grouping principles affecting statistical learning (Creel et al., 2004; Conway and Christiansen, 2005, 2006; Gebhart et al., 2009; Emberson et al., 2011). Although learning across domains may operate via similar computational principles, it has been argued that there may exist a distributed network of modality-constrained learning mechanisms (Conway and Christiansen, 2005; Frost et al., 2015). Under this view, learning associations between stimuli sharing the same sensory and perceptual characteristics takes precedence over the learning of associations across perceptual categories and modalities. Furthermore, previous evidence suggests that each sensory modality computes statistical associations for its particular input type only, with patterns learned through one sensory modality or perceptual category staying representationally bound to that particular domain (Conway and Christiansen, 2006). However, likely due to differences in attentional or memory requirements, cross-modal statistical learning appears to be possible when the cross-modal contingencies occur together in time, rather than across a temporal sequence as was the case in the present study (Seitz et al., 2007; Sell and Kaschak, 2009; Cunillera et al., 2010; Thiessen, 2010; Mitchel and Weiss, 2011; Mitchel et al., 2014).
This type of hierarchical arrangement for statistical learning perhaps is not surprising given what we know about basic brain organization. In general, hierarchically lower level (i.e., upstream) brain areas mediate the processing of specific stimulus properties (e.g., color, motion, pitch, etc.) whereas at increasingly hierarchically higher (i.e., downstream) brain areas, more abstract and multimodal properties are integrated (e.g., speech, complex visual objects, etc.). Although certainly perception is not entirely modular, with downstream “multimodal” regions able to influence upstream areas through feedback connections (e.g., Driver and Spence, 2000), it is clear that both segregation (at upstream levels) and integration (at downstream levels) are foundational aspects of brain organization and processing (Friston et al., 1995). From a neurobiological standpoint, it is likely that statistical–sequential learning recapitulates these general principles of segregation and integration in the brain.
Although the present study found no evidence for multimodal integration of cross-modal sequential dependencies, this does not mean that it cannot occur under different experimental conditions, for example with a longer learning phase duration or manipulations to promote attention to the cross-modal dependencies. In fact, a prominent theory of sequence learning, based upon neuroimaging and behavioral findings using the serial reaction time task, posits the existence of two partially dissociable neurocognitive learning mechanisms: an implicit unidimensional system that operates over inputs within the same perceptual modality, and a multidimensional system that operates over inputs across perceptual modalities or categories (Keele et al., 2003). Importantly, the latter system appears to require attentional resources to learn the cross-modal or cross-categorical associations (see Daltrozzo and Conway, 2014, for a similar two-system view of statistical learning). Applying such a dual-system perspective to the present findings would seem to indicate that only the implicit unidimensional system was active during this task, not the multidimensional system, presumably due to a lack of attentional focus on the cross-domain dependencies. Regardless of whether one adopts the dual-system view, it appears that even if multimodal sequential integration is possible, it is not the initial gateway to learning under standard incidental learning conditions as used in the present study. Instead, input modalities during statistical learning appear to be initially percept specific, and perhaps only become integrated at subsequent levels of processing when additional cognitive resources are deployed.
Finally, it could be argued that the manner in which multisensory statistical learning was tested, with cross-modal dependencies occurring across a temporal sequence, is not ecologically representative of what humans or other complex organisms typically encounter in the world. To this point, we offer two considerations. First, in the present study we aimed to create a learning situation that would probe the limits of multisensory statistical learning across a temporal sequence. Even if the patterns presented to participants are not ecologically realistic, the findings provide insight into possible limitations constraining cross-domain statistical learning and provide hints to the underlying architecture of the learning mechanisms themselves. Second, the issue of ecological validity implies a certain chain of reasoning: that humans have difficulty learning cross-domain sequential patterns because they are not exposed to such patterns in the world. On the other hand, the direction of causality could in fact go in the other direction: perhaps the reason why we observe minimal cross-domain sequential dependencies in our environments is precisely because of the limitations inherent in our learning faculties. For instance, it is conceivable that natural language could have evolved to capitalize on cross-domain sequential dependencies (e.g., it is logically possible that sentences could be composed not just of sequences of auditory–vocal units, but sequences of spoken words interleaved with hand and arm gestures). The lack of such cross-domain sequential dependencies in human communication could be due to the inability of humans to effectively learn such sequential patterns, consistent with the view that natural language has evolved to adapt to the processing constraints and limitations of the human brain (Christiansen and Chater, 2008).
In sum, the results of these two experiments show that when statistical–sequential input is composed of elements from two different perceptual categories or sensory modalities, participants can detect violations that occur between elements within a single domain, but not violations that occur between domains. These findings stand in contrast to previous demonstrations of cross-modal statistical learning and provide new insights about the difficulties facing learners exposed to complex multisensory environments.
Author Contributions
AW, CC both contributed to the conception and design of this work and interpretation of data; the drafting and revising of the manuscript; and final approval of the version to be published. AW furthermore carried out the acquisition and analysis of the data.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
This project was supported by the following grant from the National Institute on Deafness and Other Communication Disorders: R01DC012037. The sponsor had no role in any of the following aspects of this study: the study design; collection, analysis, or interpretation of data; writing of the report; decision to submit the paper for publication. We wish to thank Adam Abegg for his programming expertise, as well as Caroline Hoyniak, Pooja Parupalli, Katherine Smith, and Ryan Town for their help with this project.
Supplementary Material
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00250
Footnotes
- ^ As pointed out by one of the reviewers of the current manuscript, non-adjacent dependencies are also contained in the grammar that could possibly be learned by participants. Each element in the grammar has two within-domain non-adjacent dependencies and one cross-domain non-adjacent dependency, with each following with equal probability (0.33 each). For example, A1 can be followed by A3 (with an intervening A2), V2 (with an intervening V1 or A2), or A2 (with an intervening V1). To what extent these non-adjacent dependencies impact performance will be discussed in more detail in the Section “Discussion.”
- ^ Due to a design error, two within-modal violation test items (#25 and #26) were found to be problematic. Instead of all within-modal transitions being ungrammatical, each had one within-modal dependency that was not violated. In addition, these two items began with a repetition of a single stimulus, which could add unwanted salience to those items. Thus, all statistical analyses were rerun after taking out participant responses to these two items. The overall pattern of results remained similar: participants performed statistically better than chance on the grammatical items (M = 60.67%, t = 4.00, p = 0.001, d = 1.03) and the within-modal violation items (M = 60.00%, t = 2.70, p = 0.017, d = 0.70) but not on the cross-modal violation items (M = 48.67%, t = –0.31, p = 0.764).
- ^ As in Experiment 1, all analyses were re-run after removing two problematic test items (#25 and #26) in the within-modal condition. The overall pattern of results was similar to before, with the ANOVA yielding the same significant effects: Sequence type [F(2,29) = 13.96, p < 0.001, ηp2 = 0.194] and Group [F(1,30) = 5.322, p = 0.028, ηp2 = 0.151]. In the auditory condition, participants scored significantly better than chance when responding to grammatical (M = 60.63%, t = 3.60, p = 0.003, d = 0.90) and within-modal ungrammatical items (M = 79.69%, t = 5.69, p < 0.001, d = 0.89), but not cross-modal ungrammatical items (M = 50.63%, t = 0.13, p = 0.90, d = 0.03). In the visual condition, participants did not show scores significantly better than chance for any of the three sequence types: grammatical (M = 52.81%), within-modal ungrammatical (M = 59.38%), or cross-modal ungrammatical (M = 51.87%), although numerically the same overall trend is observed, with the highest performance on the within-modal ungrammatical items.
References
Abla, D., and Okanoya, K. (2008). Statistical segmentation of tone sequences activates the left inferior frontal cortex: a near-infrared spectroscopy study. Neuropsychologia 46, 2787–2795. doi: 10.1016/j.neuropsychologia.2008.05.012
Altmann, G. T. M., Dienes, Z., and Goode, A. (1995). Modality independence of implicitly learned grammatical knowledge. J. Exp. Psychol. Learn. Mem. Cogn. 21, 899–912.
Arciuli, J., and Simpson, I. C. (2012). Statistical learning is related to reading ability in children and adults. Cogn. Sci. 36, 286–304. doi: 10.1111/j.1551-6709.2011.01200.x
Chang, G. Y., and Knowlton, B. J. (2004). Visual feature learning in artificial grammar classification. J. Exp. Psychol. Learn. Mem. Cogn. 30, 714–722.
Christiansen, M. H., and Chater, N. (2008). Language as shaped by the brain. Behav. Brain Sci. 31, 489–558. doi: 10.1017/S0140525X08004998
Conway, C. M., Bauernschmidt, A., Huang, S. S., and Pisoni, D. B. (2010). Implicit statistical learning in language processing: word predictability is the key. Cognition 114, 356–371. doi: 10.1016/j.cognition.2009.10.009
Conway, C. M., and Christiansen, M. H. (2005). Modality-constrained statistical learning of tactile, visual, and auditory sequences. J. Exp. Psychol. Learn. Mem. Cogn. 31, 24–39.
Conway, C. M., and Christiansen, M. H. (2006). Statistical learning within and between modalities: pitting abstract against stimulus-specific representations. Psychol. Sci. 17, 905–912. doi: 10.1111/j.1467-9280.2006.01801.x
Conway, C. M., Goldstone, R. L., and Christiansen, M. H. (2007). “Spatial constraints on visual statistical learning of multi-element scenes,” in Proceedings of the 29th Annual Meeting of the Cognitive Science Society, eds D. S. McNamara and J. G. Trafton (Austin, TX: Cognitive Science Society), 185–190.
Conway, C. M., and Pisoni, D. B. (2008). Neurocognitive basis of implicit learning of sequential structure and its relation to language processing. Ann. N. Y. Acad. Sci. 1145, 113–131. doi: 10.1196/annals.1416.009
Creel, S. C., Newport, E. L., and Aslin, R. N. (2004). Distant melodies: statistical learning of nonadjacent dependencies in tone sequences. J. Exp. Psychol. Learn. Mem. Cogn. 30, 1119–1130.
Cunillera, T., Càmara, E., Laine, M., and Rodríguez-Fornells, A. (2010). Speech segmentation is facilitated by visual cues. Q. J. Exp. Psychol. 64, 1021–1040.
Daltrozzo, J., and Conway, C. M. (2014). Neurocognitive mechanisms of statistical-sequential learning: what do event-related potentials tell us? Front. Hum. Neurosci. 8:437. doi: 10.3389/fnhum.2014.00437
Driver, J., and Spence, C. (2000). Multisensory perception: beyond modularity and convergence. Curr. Biol. 10, R731–R735. doi: 10.1016/S0960-9822(00)00740-5
Emberson, L. L., Conway, C. M., and Christiansen, M. H. (2011). Timing is everything: changes in presentation rate have opposite effects on auditory and visual implicit statistical learning. Q. J. Exp. Psychol. 64, 1021–1040. doi: 10.1080/17470218.2010.538972
Fiser, J., and Aslin, R. (2001). Unsupervised statistical learning of higher-order spatial structures from visual scenes. Psychol. Sci. 12, 499–504.
Frensch, P. A., and Runger, D. (2003). Implicit learning. Curr. Dir. Psychol. Sci. 12, 13–18. doi: 10.1111/1467-8721.01213
Friston, K. J., Tononi, G., Sporns, O., and Edelman, G. M. (1995). Characterizing the complexity of neuronal interactions. Hum. Brain Mapp. 3, 302–314. doi: 10.1002/hbm.460030405
Frost, R., Armstrong, B. C., Siegelman, N., and Christiansen, M. H. (2015). Domain generality versus domain specificity: the paradox of statistical learning. Trends Cogn. Sci. 19, 117–125. doi: 10.1016/j.tics.2014.12.010
Gebhart, A. L., Newport, E. L., and Aslin, R. N. (2009). Statistical learning of adjacent and non-adjacent dependencies among non-linguistic sounds. Psychon. Bull. Rev. 16, 486–490. doi: 10.3758/PBR.16.3.486
Gómez, R. L. (2002). Variability and detection of invariant structure. Psychol. Sci. 13, 431–436. doi: 10.1111/1467-9280.00476
Jamieson, R. K., and Mewhort, D. J. K. (2005). The influence of grammatical, local, and organizational redundancy on implicit learning: an analysis using information theory. J. Exp. Psychol. Learn. Mem. Cogn. 31, 9–23.
Joseph, R. M., Steele, S. D., Meyer, E., and Tager-Flusberg, H. (2005). Self-ordered pointing in children with autism: failure to use verbal mediation in the service of working memory. Neuropsychologia 43, 1400–1411. doi: 10.1016/j.neuropsychologia.2005.01.010
Karuza, E. A., Newport, E. L., Aslin, R. A., Starling, S. J., Tivarus, M. E., and Bavelier, D. (2013). The neural correlates of statistical learning in a word segmentation task: an fMRI study. Brain Lang. 127, 46–54. doi: 10.1016/j.bandl.2012.11.007
Keele, S. W., Ivry, R., Mayr, U., Hazeltine, E., and Heuer, H. (2003). The cognitive and neural architecture of sequence representation. Psychol. Rev. 110, 316–339. doi: 10.1037/0033-295X.110.2.316
Kidd, E. (2012). Implicit statistical learning is directly associated with the acquisition of syntax. Dev. Psychol. 48, 171–184. doi: 10.1037/a0025405
Kirkham, N. Z., Slemmer, J. A., and Johnson, S. P. (2002). Visual statistical learning in infancy: evidence for a domain general learning mechanism. Cognition 83, B25–B42. doi: 10.1016/S0010-0277(02)00004-5
Lieberman, M. D., Chang, G. Y., Chiao, J., Bookheimer, S. Y., and Knowlton, B. J. (2004). An event-related fMRI study of artificial grammar learning in a balanced chunk strength design. J. Cogn. Neurosci. 16, 427–438. doi: 10.1162/089892904322926764
Manza, L., and Reber, A. S. (1997). “Representing artificial grammars: transfer across stimulus forms and modalities,” in How Implicit is Implicit Learning?, ed. D. C. Berry (New York, NY: Oxford University Press), 73–106.
Misyak, J. B., and Christiansen, M. H. (2012). Statistical learning and language: an individual differences study. Lang. Learn. 62, 302–331. doi: 10.1111/j.1476-9922.2010.00626.x
Misyak, J. B., Christiansen, M. H., and Tomblin, J. B. (2010). Sequential expectations: the role of prediction-based learning in language. Top. Cogn. Sci. 2, 138–153. doi: 10.1111/j.1756-8765.2009.01072.x
Mitchel, A. D., Christiansen, M. H., and Weiss, D. J. (2014). Multimodal integration in statistical learning: evidence from the McGurk illusion. Front. Psychol. 5:407. doi: 10.3389/fpsyg.2014.00407
Mitchel, A. D., and Weiss, D. J. (2010). What’s in a face? Visual contributions to speech segmentation. Lang. Cogn. Process. 25, 456–482. doi: 10.1080/01690960903209888
Mitchel, A. D., and Weiss, D. J. (2011). Learning across senses: cross-modal effects in multisensory statistical learning. J. Exp. Psychol. Learn. Mem. Cogn. 37, 1081–1091. doi: 10.1037/a0023700
Nemeth, D., Janacsek, K., Csifcsak, G., Szvoboda, G., Howard, J. H., and Howard, D. V. (2011). Interference between sentence processing and probabilistic implicit sequence learning. PLoS ONE 6:e17577. doi: 10.1371/journal.pone.0017577
Newport, E. L., and Aslin, R. N. (2004). Learning at a distance I. Statistical learning of non-adjacent dependencies. Cogn. Psychol. 48, 127–162. doi: 10.1016/S0010-0285(03)00128-2
Opitz, B., and Friederici, A. D. (2004). Brain correlates of language learning: the neuronal dissociation of rule-based versus similarity-based learning. J. Neurosci. 24, 8436–8440. doi: 10.1523/JNEUROSCI.2220-04.2004
Perruchet, P., and Pacton, S. (2006). Implicit learning and statistical learning: one phenomenon, two approaches. Trends Cogn. Sci. 10, 233–238. doi: 10.1016/j.tics.2006.03.006
Petersson, K. M., Forkstam, C., and Ingvar, M. (2004). Artificial syntactic violations activate Broca’s region. Cogn. Sci. 28, 383–407. doi: 10.1016/j.cogsci.2003.12.003
Reber, A. S. (1967). Implicit learning of artificial grammars. J. Verbal Learn. Verbal Behav. 6, 855–863. doi: 10.1016/S0022-5371(67)80149-X
Reber, P. J., Stark, C. E. L., and Squire, L. R. (1998). Cortical areas supporting category learning identified using functional MRI. Proc. Natl. Acad. Sci. U.S.A. 95, 747–750. doi: 10.1073/pnas.95.2.747
Robinson, C. W., and Sloutsky, V. M. (2007). “Visual statistical learning: getting some help from the auditory modality,” in Proceedings of the 29th Annual Cognitive Science Society (Austin, TX: Cognitive Science Society), 611–616.
Rosenblum, L. D. (2008). Speech perception as a multimodal phenomenon. Curr. Dir. Psychol. Sci. 17, 405–409. doi: 10.1111/j.1467-8721.2008.00615.x
Saffran, J. R., Aslin, R. N., and Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science 274, 1926–1928. doi: 10.1126/science.274.5294.1926
Schapiro, A. C., Gregory, E., Landau, B., McCloskey, M., and Turk-Browne, N. B. (2014). The necessity of the medial temporal lobe for statistical learning. J. Cogn. Neurosci. 26, 1736–1747. doi: 10.1162/jocn_a_00578
Seger, C. A. (1994). Implicit learning. Psychol. Bull. 115, 163–196. doi: 10.1037/0033-2909.115.2.163
Seitz, A. R., Kim, R., Wassenhoven, V., and Shams, L. (2007). Simultaneous and independent acquisition of multisensory and unisensory associations. Perception 36, 1445–1453. doi: 10.1068/p5843
Sell, A. J., and Kaschak, M. P. (2009). Does visual speech information affect word segmentation? Mem. Cogn. 37, 889–894. doi: 10.3758/MC.37.6.889
Thiessen, E. D. (2010). Effects of visual information on adults’ and infants’ auditory statistical learning. Cogn. Sci. 34, 1093–1106. doi: 10.1111/j.1551-6709.2010.01118.x
Keywords: statistical learning, implicit learning, sequential learning, cross-modal learning, multisensory integration, modality constraints, artificial grammar learning
Citation: Walk AM and Conway CM (2016) Cross-Domain Statistical–Sequential Dependencies Are Difficult to Learn. Front. Psychol. 7:250. doi: 10.3389/fpsyg.2016.00250
Received: 08 October 2015; Accepted: 08 February 2016;
Published: 25 February 2016.
Edited by:
Tom Carr, Michigan State University, USA
Reviewed by:
Christopher I. Petkov, Newcastle University, UK
Herbert Heuer, Leibniz Research Centre for Working Environment and Human Factors, Germany
Copyright © 2016 Walk and Conway. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Christopher M. Conway, cconway@gsu.edu