Grant holders:
Elena Lieven, Bob McMurray, Jeffrey Elman, Gert Westermann, Morten H Christiansen, Thea Cameron-Faulkner, Fernand Gobet, Ludovica Serratrice, Sabine Stoll, Meredith Rowe, Padraic Monaghan, Michael Tomasello, Ben Ambridge, Silke Brandt, Anna Theakston, Eugenio Parise, Caroline Frances Rowland, Colin James Bannard, Grzegorz Krajewski, Franklin Chang, Floriana Grasso, Evan James Kidd, Julian Mark Pine, Arielle Borovsky, Vincent Michael Reid, Katherine Alcock, Daniel Freudenthal
Collection period:
Date from: 25 January 2019
Date to: 27 June 2019
Country:
United Kingdom
Data collection method:
The data collection method for the corpus analysis is described in detail below. The method for the experimental part of the study will be made available once we have submitted that study for publication, at which point it will be downloadable at: https://osf.io/74urw/. In brief, the method involves children repeating the experimenter’s questions to a talking dog (who “doesn’t talk to grown-ups”), who then answers. The target questions, and the dog’s answers, are shown below. Each pair is matched for the n-gram frequencies of the well-formed questions, but varies in the n-gram frequencies of the non-inverted question (e.g., “*What I will get?”): low (top item in each pair) vs. high (bottom item in each pair).
The corpus analysis consisted of three general phases: extraction of all child-produced wh- questions from a set of target corpora, followed by semi-automated identification of uninversion errors; collection of n-gram statistics for child-directed speech in English; and mixed-effects logistic regression modeling to determine which n-gram statistics predicted uninversion errors in the extracted questions.
Target Corpus Selection and Preparation Procedure
We began by extracting, from the English-language portion of the CHILDES database (MacWhinney, 2000), the 12 corpora with the highest number of wh- questions. Each corpus followed a single target child and spanned at least one year of development. Each corpus was then prepared for analysis using an automated procedure which removed codes, tags, and punctuation, leaving only speaker identifiers and the original sequence of words. Lines consisting solely of morphological tags (included as standard in CHILDES corpora) were unaffected by this procedure and retained for later use in extracting uninversion errors.
As part of this procedure, contractions were split into their component words: e.g., “what’s he doing” was re-coded as “what is he doing.” This step ensured that the modeling work reflected accurate n-gram frequencies for wh- words and auxiliaries across all questions. As a further step, we collapsed the pronouns “she” and “he” into a single form, to control for individual differences in children’s exposure to gendered pronouns.
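As an illustration, a minimal sketch of this preparation step (assuming a handful of contractions; the actual procedure covered the full range of CHAT codes and contracted forms):

```python
import re

# Toy contraction map; the actual procedure covered many more forms.
CONTRACTIONS = {"what's": "what is", "where's": "where is", "she's": "she is"}

def prepare_line(line: str) -> str:
    """Strip punctuation and CHAT codes, split contractions, collapse 'she' into 'he'."""
    line = re.sub(r"\[[^\]]*\]", " ", line)  # drop bracketed CHAT codes
    line = re.sub(r"[.?!,]", " ", line)      # drop punctuation
    words = []
    for w in line.lower().split():
        w = CONTRACTIONS.get(w, w)           # "what's" -> "what is"
        w = "he" if w == "she" else w        # collapse gendered pronouns
        words.extend(w.split())
    return " ".join(words)

print(prepare_line("what's she doing ?"))    # -> "what is he doing"
```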
Wh-Question and Uninversion Error Candidate Extraction and Coding
Child-produced wh- questions were automatically extracted from the target corpora using the standard morphological tagging included in CHILDES. All extracted questions featured a wh- word in the first position, followed immediately by an auxiliary. This yielded approximately 13,000 child-produced wh- questions across the 12 corpora.
For the purpose of automatically identifying possible uninversion errors, we extracted, from the full corpora, all child questions featuring a wh- word in the initial position that was not immediately followed by an auxiliary. These candidate items were then coded for error type by hand, yielding a total of 300 identified uninversion errors across the target children. Wh- questions featuring an error type other than uninversion (such as doubling or omission errors) were excluded from our dataset. Importantly, our analyses were restricted to questions produced within the first five years of life.
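The extraction logic can be sketched as follows, assuming prepared utterance strings and a simplified word-list check in place of the CHILDES morphological tags used in the actual procedure:

```python
WH_WORDS = {"what", "where", "who", "why", "when", "how", "which"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can",
               "will", "would", "could", "should", "has", "have", "had"}

def classify(utterance):
    """Classify a prepared child utterance as a correctly inverted wh- question,
    an uninversion-error candidate (to be hand-coded), or neither."""
    words = utterance.split()
    if len(words) < 2 or words[0] not in WH_WORDS:
        return None
    if words[1] in AUXILIARIES:
        return "inverted"            # e.g., "what is he doing"
    return "error candidate"         # e.g., "what he is doing"

print(classify("what is he doing"))  # inverted
print(classify("what he is doing"))  # error candidate
```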
N-Gram Data Collection
In order to capture n-gram statistics which accurately reflected the nature of child-directed speech in English, we gathered n-gram frequencies for the entire English (UK and US) portion of the CHILDES database. This allowed us to overcome issues of data sparseness arising from corpus size (cf. Manning & Schütze, 1999).
The aggregated corpus was prepared for data collection following the same procedure described in the above subsection. Frequencies were then collected for unigrams (single words), bigrams (word pairs), and trigrams (word triplets), and these were then looked up for each of the wh- questions extracted from the 12 target child corpora. To this end, n-gram statistics were calculated for each position (separate unigram counts for each word, separate bigram counts for each word pair, and so forth). Thus, for the question “what is that,” three unigram counts (one for each of three word positions), two bigram counts (one for each of two word-pair positions), and one trigram count (for the single word-triplet position) were available.
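A sketch of the positional n-gram counting, with a three-question toy corpus standing in for the counts aggregated over the full English portion of CHILDES:

```python
from collections import Counter

def position_ngrams(words, n):
    """Return the n-grams of a question in positional order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy corpus; the real counts come from all English (UK and US) CHILDES data.
corpus = ["what is that", "what is he doing", "where is my ball"]
freqs = {n: Counter(ng for q in corpus for ng in position_ngrams(q.split(), n))
         for n in (1, 2, 3)}

question = "what is that".split()
for n in (1, 2, 3):
    print(n, [freqs[n][ng] for ng in position_ngrams(question, n)])
# 1 [2, 3, 1]  <- one unigram count per word position
# 2 [2, 1]     <- one bigram count per word-pair position
# 3 [1]        <- one trigram count for the single word-triplet position
```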
Because our statistical analyses aimed to explore the role of multiword chunk frequency in shaping children’s uninversion errors, we sought to directly compare the correctly inverted “target question” for children’s uninversion errors to the correctly inverted questions which made up the rest of the dataset. To achieve this, we calculated n-gram frequencies for the correctly inverted forms of the uninverted questions identified by the earlier procedure. Uninversion errors were “corrected” by hand in order to achieve this.
By the same token, we also sought to explore the role of multiword sequence frequencies for the relevant uninverted question forms in determining error rates. For this, we retained the original child uninversion errors and employed an automated procedure to produce the errorful, uninverted form corresponding to each correctly inverted question in the corpus. The second and third words could not simply be swapped because a large number of questions featured multiword subject noun phrases, such as “where is my red ball?” Thus, to automatically achieve a realistic uninverted form across such a large number of questions, we first chunked utterances using a shallow parser (Punyakanok & Roth, 2001). Shallow parsers are widely used tools in the field of natural language processing which segment out the non-overlapping, non-embedded phrases in a text. For instance, the shallow parser output for the previous example would be: “[where] [is] [my red ball].” After submitting all correctly inverted questions to the shallow parser, we merely switched the second and third chunks, yielding the relevant, uninverted errorful forms, such as “where my red ball is?”
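A minimal sketch of the chunk-swap step, taking the shallow parser’s output as given (the chunking itself, via Punyakanok & Roth’s parser, is not reproduced here):

```python
def uninvert(chunks):
    """Swap the auxiliary and subject chunks of a correctly inverted
    wh- question to yield its errorful, uninverted counterpart."""
    if len(chunks) < 3:
        return chunks                     # nothing to swap
    wh, aux, subject, *rest = chunks
    return [wh, subject, aux, *rest]

# Shallow-parser output for "where is my red ball?"
chunks = ["where", "is", "my red ball"]
print(" ".join(uninvert(chunks)))         # -> "where my red ball is"
```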
Thus, we collected unigram, bigram, and trigram statistics for each position across all correctly inverted questions (and, in the case of uninversion errors, the correctly inverted target questions), alongside a separate set of n-gram statistics for the uninversion errors (and, in the case of correctly inverted questions, the relevant errorful form).
Mixed-Effects Logistic Regression Analysis
In order to evaluate the predictive relationship between multiword chunk frequency and uninversion errors, we used mixed-effects logistic regression modeling (cf. Agresti, 2002). We carried out a set of model comparisons to determine which n-gram frequencies were uniquely predictive of uninversion errors. This involved selecting predictors at each n-gram level separately, starting at the unigram level before moving to the bigram level, followed by the trigram level.
Questions originally produced by the target children in their correctly inverted form were coded as 0, while questions produced in an errorful, uninverted form were coded as 1. N-gram frequencies were then used as predictors for this binary variable. All models included a random intercept for child, reflecting the possibility that the 12 target children may differ in the extent to which their errors could be predicted by n-gram frequencies.
Our model comparisons sought to evaluate n-gram frequencies for both the correctly inverted questions and their corresponding uninverted (errorful) forms as predictors of child uninversion errors. The model comparison procedure was designed to minimize the risk of false positives. Importantly, all predictors were log-transformed and scaled (in order to aid model convergence). All model comparisons were carried out using log-likelihood ratio tests.
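A sketch of the predictor transformation and a single log-likelihood ratio test (the log-likelihood values below are invented for illustration, and the record does not name the fitting software):

```python
import numpy as np
from scipy.stats import chi2

def log_scale(freqs):
    """Log-transform and z-scale a vector of n-gram frequencies."""
    x = np.log(np.asarray(freqs, dtype=float) + 1)  # add-one smoothing (assumption)
    return (x - x.mean()) / x.std()

def lr_test(loglik_reduced, loglik_full, df_diff):
    """Log-likelihood ratio test between two nested models."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)

# Illustrative values only: one fixed effect removed from the full model.
stat, p = lr_test(loglik_reduced=-812.4, loglik_full=-807.1, df_diff=1)
print(f"chi2(1) = {stat:.2f}, p = {p:.4f}")  # chi2(1) = 10.60, p = 0.0011
```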
Starting at the unigram level, we used a leave-one-out procedure to determine which predictors explained variance over and above that explained by any other variable. The full baseline model at this level included random effects of the first five unigrams (by child) as well as fixed effects for all five unigrams. This was then compared to five subsequent models, each leaving out the fixed-effect term for a different unigram (random effects by child were included for every unigram in each model). Only removal of the first two unigrams harmed model fit to a significant extent, according to the log-likelihood ratio tests. Thus, these two unigrams were held over for the next level of model comparisons.
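Schematically, the leave-one-out loop might look like this, with `fit_mixed_logit` as a hypothetical stand-in for the mixed-effects fitting routine, reusing the `lr_test` helper sketched above:

```python
def leave_one_out(fit_mixed_logit, data, predictors, alpha=0.05):
    """Retain each predictor whose removal significantly harms model fit.

    fit_mixed_logit(data, fixed, random) is a hypothetical interface assumed
    to return a fitted model exposing a .loglik attribute.
    """
    full = fit_mixed_logit(data, fixed=predictors, random=predictors)
    kept = []
    for p in predictors:
        reduced = fit_mixed_logit(data,
                                  fixed=[q for q in predictors if q != p],
                                  random=predictors)  # random effects stay in
        _, pval = lr_test(reduced.loglik, full.loglik, df_diff=1)
        if pval < alpha:        # removing p harmed fit, so p is retained
            kept.append(p)
    return kept
```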
The same procedure described for unigrams was then carried out for the first four bigrams, but with random (by child) and fixed effects for the first two unigrams also included in each model (as unigrams are identical across the inverted and uninverted forms, only one set was included in the previous step). Importantly, bigrams from both the correctly inverted and the corresponding errorful forms were included at this second step.
For correctly inverted question forms, removal of the third and fourth bigrams harmed model fit to a statistically significant extent, according to the log-likelihood ratio tests, while for the uninverted forms, removal of the second, third, and fourth bigrams harmed model fit. Thus, in addition to the first two unigrams from the previous step, the third and fourth bigrams from the correctly inverted question forms and the second, third, and fourth bigrams from the errorful (uninverted) forms were held over for the final set of model comparisons.
For the first three trigrams, the same procedure was followed once more (with random and fixed effects for the first two unigrams and first two bigrams). Only removal of the second and third trigrams from the uninverted/errorful question forms harmed model fit to a significant extent.
Thus, the final set of predictors included the first two unigrams, the third and fourth bigrams from the correctly inverted forms, the second, third, and fourth bigrams from the uninverted forms, and the second and third trigrams from the uninverted forms.
Observation unit:
Individual

Kind of data:
Numeric, Text, Still image, Audio, Software

Type of data:
Experimental data

Resource language:
English