Grant holders:
Elena Lieven, Bob McMurray, Jeffrey Elman, Gert Westermann, Morten H Christiansen, Thea Cameron-Faulkner, Fernand Gobet, Ludovica Serratrice, Sabine Stoll, Meredith Rowe, Padraic Monaghan, Michael Tomasello, Ben Ambridge, Silke Brandt, Anna Theakston, Eugenio Parise, Caroline Frances Rowland, Colin James Bannard, Grzegorz Krajewski, Franklin Chang, Floriana Grasso, Evan James Kidd, Julian Mark Pine, Arielle Borovsky, Vincent Michael Reid, Katherine Alcock, Daniel Freudenthal
Collection period:
Date from: 25 January 2019
Date to: 27 June 2019
Country:
United Kingdom
Data collection method:
The data collection method for the corpus analysis is described in detail below. The method for the experimental part of the study will be made available once we have submitted that study for publication, at which point it will be downloadable at: https://osf.io/74urw/. In brief, the method involves children repeating the experimenter’s questions to a talking dog (who “doesn’t talk to grown-ups”), who then answers. The target questions, and the dog’s answers, are shown below. Each pair is matched for the n-gram frequencies of the well-formed questions, but varies in the n-gram frequencies of the non-inverted question (e.g., “*What I will get?”): low (top item in each pair) vs. high (bottom item in each pair).
The corpus analysis consisted of three general phases: extraction of all child-produced wh- questions from a set of target corpora, followed by semi-automated identification of uninversion errors; collection of n-gram statistics for child-directed speech in English; and mixed-effects logistic regression modeling to determine which n-gram statistics predicted uninversion errors in the extracted questions.
Target Corpus Selection and Preparation Procedure
We began by extracting, from the English-language portion of the CHILDES database (MacWhinney, 2000), the 12 corpora with the highest number of wh- questions. Each corpus followed a single target child and spanned at least one year of development. Each corpus was then prepared for analysis using an automated procedure which removed codes, tags, and punctuation, leaving only speaker identifiers and the original sequence of words. Lines consisting solely of morphological tags (included as standard in CHILDES corpora) were unaffected by this procedure and retained for later use in extracting uninversion errors.
As part of this procedure, contractions were split into their component words: e.g., “what’s he doing” was re-coded as “what is he doing.” This step ensured that the modeling work reflected accurate n-gram frequencies for wh- words and auxiliaries across all questions. As a further step, we collapsed the pronouns “she” and “he” into a single form, to control for individual differences in children’s exposure to gendered pronouns.
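As an illustration, a minimal sketch of this preparation step (assuming a handful of contractions; the actual procedure covered the full range of CHAT codes and contracted forms):

```python
import re

# Toy contraction map; the actual procedure covered many more forms.
CONTRACTIONS = {"what's": "what is", "where's": "where is", "she's": "she is"}

def prepare_line(line: str) -> str:
    """Strip punctuation and CHAT codes, split contractions, collapse 'she' into 'he'."""
    line = re.sub(r"\[[^\]]*\]", " ", line)  # drop bracketed CHAT codes
    line = re.sub(r"[.?!,]", " ", line)      # drop punctuation
    words = []
    for w in line.lower().split():
        w = CONTRACTIONS.get(w, w)           # "what's" -> "what is"
        w = "he" if w == "she" else w        # collapse gendered pronouns
        words.extend(w.split())
    return " ".join(words)

print(prepare_line("what's she doing ?"))    # -> "what is he doing"
```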
Wh-Question and Uninversion Error Candidate Extraction and Coding
Child-produced wh- questions were automatically extracted from the target corpora using the standard morphological tagging included in CHILDES. All extracted questions featured a wh- word in the first position, followed immediately by an auxiliary. This yielded approximately 13,000 child-produced wh- questions across the 12 corpora.
For the purpose of automatically identifying possible uninversion errors, we extracted, from the full corpora, all child questions featuring a wh- word in the initial position that was not immediately followed by an auxiliary. These candidate items were then coded for error type by hand, yielding a total of 300 identified uninversion errors across the target children. Wh- questions featuring an error type other than uninversion (such as doubling or omission errors) were excluded from our dataset. Importantly, our analyses were restricted to questions produced within the first five years of life.
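The extraction logic can be sketched as follows, assuming prepared utterance strings and a simplified word-list check in place of the CHILDES morphological tags used in the actual procedure:

```python
WH_WORDS = {"what", "where", "who", "why", "when", "how", "which"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can",
               "will", "would", "could", "should", "has", "have", "had"}

def classify(utterance):
    """Classify a prepared child utterance as a correctly inverted wh- question,
    an uninversion-error candidate (to be hand-coded), or neither."""
    words = utterance.split()
    if len(words) < 2 or words[0] not in WH_WORDS:
        return None
    if words[1] in AUXILIARIES:
        return "inverted"            # e.g., "what is he doing"
    return "error candidate"         # e.g., "what he is doing"

print(classify("what is he doing"))  # inverted
print(classify("what he is doing"))  # error candidate
```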
N-Gram Data Collection
In order to capture n-gram statistics which accurately reflected the nature of child-directed speech in English, we gathered n-gram frequencies for the entire English (UK and US) portion of the CHILDES database. This allowed us to overcome issues of data sparseness arising from corpus size (cf. Manning & Schütze, 1999).
The aggregated corpus was prepared for data collection following the same procedure described in the above subsection. Frequencies were then collected for unigrams (single words), bigrams (word pairs), and trigrams (word triplets), and these were then looked up for each of the wh- questions extracted from the 12 target child corpora. To this end, n-gram statistics were calculated for each position (separate unigram counts for each word, separate bigram counts for each word pair, and so forth). Thus, for the question “what is that,” three unigram counts (one for each of three word positions), two bigram counts (one for each of two word-pair positions), and one trigram count (for the single word-triplet position) were available.
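A sketch of the positional n-gram counting, with a three-question toy corpus standing in for the counts aggregated over the full English portion of CHILDES:

```python
from collections import Counter

def position_ngrams(words, n):
    """Return the n-grams of a question in positional order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy corpus; the real counts come from all English (UK and US) CHILDES data.
corpus = ["what is that", "what is he doing", "where is my ball"]
freqs = {n: Counter(ng for q in corpus for ng in position_ngrams(q.split(), n))
         for n in (1, 2, 3)}

question = "what is that".split()
for n in (1, 2, 3):
    print(n, [freqs[n][ng] for ng in position_ngrams(question, n)])
# 1 [2, 3, 1]  <- one unigram count per word position
# 2 [2, 1]     <- one bigram count per word-pair position
# 3 [1]        <- one trigram count for the single word-triplet position
```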
Because our statistical analyses aimed to explore the role of multiword chunk frequency in shaping children’s uninversion errors, we sought to directly compare the correctly inverted “target question” for children’s uninversion errors to the correctly inverted questions which made up the rest of the dataset. To achieve this, we calculated n-gram frequencies for the correctly inverted forms of the uninverted questions identified by the earlier procedure. Uninversion errors were “corrected” by hand in order to achieve this.
By the same token, we also sought to explore the role of multiword sequence frequencies for the relevant uninverted question forms in determining error rates. For this, we retained the original child uninversion errors and employed an automated procedure to produce the errorful, uninverted form corresponding to each correctly inverted question in the corpus. The second and third words could not simply be swapped because a large number of questions featured multiword subject noun phrases, such as “where is my red ball?” Thus, to automatically achieve a realistic uninverted form across such a large number of questions, we first chunked utterances using a shallow parser (Punyakanok & Roth, 2001). Shallow parsers are widely used tools in the field of natural language processing which segment out the non-overlapping, non-embedded phrases in a text. For instance, the shallow parser output for the previous example would be: “[where] [is] [my red ball].” After submitting all correctly inverted questions to the shallow parser, we merely switched the second and third chunks, yielding the relevant, uninverted errorful forms, such as “where my red ball is?”
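A minimal sketch of the chunk-swap step, taking the shallow parser’s output as given (the chunking itself, via Punyakanok & Roth’s parser, is not reproduced here):

```python
def uninvert(chunks):
    """Swap the auxiliary and subject chunks of a correctly inverted
    wh- question to yield its errorful, uninverted counterpart."""
    if len(chunks) < 3:
        return chunks                     # nothing to swap
    wh, aux, subject, *rest = chunks
    return [wh, subject, aux, *rest]

# Shallow-parser output for "where is my red ball?"
chunks = ["where", "is", "my red ball"]
print(" ".join(uninvert(chunks)))         # -> "where my red ball is"
```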
Thus, we collected unigram, bigram, and trigram statistics for each position across all correctly inverted questions (and, in the case of uninversion errors, the correctly inverted target questions), alongside a separate set of n-gram statistics for the uninversion errors (and, in the case of correctly inverted questions, the relevant errorful form).
Mixed-Effects Logistic Regression Analysis
In order to evaluate the predictive relationship between multiword chunk frequency and uninversion errors, we used mixed-effects logistic regression modeling (cf. Agresti, 2002). We carried out a set of model comparisons to determine which n-gram frequencies were uniquely predictive of uninversion errors. This involved selecting predictors at each n-gram level separately, starting at the unigram level before moving to the bigram level, followed by the trigram level.
Questions originally produced by the target children in their correctly inverted form were coded as 0, while questions produced in an errorful, uninverted form were coded as 1. N-gram frequencies were then used as predictors for this binary variable. All models included a random intercept for child, reflecting the possibility that the 12 target children may differ in the extent to which their errors could be predicted by n-gram frequencies.
Our model comparisons sought to evaluate n-gram frequencies for both the correctly inverted questions and their corresponding uninverted (errorful) forms as predictors of child uninversion errors. The model comparison procedure was designed to minimize the risk of false positives. Importantly, all predictors were log-transformed and scaled (in order to aid model convergence). All model comparisons were carried out using log-likelihood ratio tests.
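A sketch of the predictor transformation and a single log-likelihood ratio test (the log-likelihood values below are invented for illustration, and the record does not name the fitting software):

```python
import numpy as np
from scipy.stats import chi2

def log_scale(freqs):
    """Log-transform and z-scale a vector of n-gram frequencies."""
    x = np.log(np.asarray(freqs, dtype=float) + 1)  # add-one smoothing (assumption)
    return (x - x.mean()) / x.std()

def lr_test(loglik_reduced, loglik_full, df_diff):
    """Log-likelihood ratio test between two nested models."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)

# Illustrative values only: one fixed effect removed from the full model.
stat, p = lr_test(loglik_reduced=-812.4, loglik_full=-807.1, df_diff=1)
print(f"chi2(1) = {stat:.2f}, p = {p:.4f}")  # chi2(1) = 10.60, p = 0.0011
```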
Starting at the unigram level, we used a leave-one-out procedure to determine which predictors explained variance over and above that explained by any other variable. The full baseline model at this level included random effects of the first five unigrams (by child) as well as fixed effects for all five unigrams. This was then compared to five subsequent models, each leaving out the fixed-effect term for a different unigram (random effects by child were included for every unigram in each model). Only removal of the first two unigrams harmed model fit to a significant extent, according to the log-likelihood ratio tests. Thus, these two unigrams were held over for the next level of model comparisons.
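Schematically, the leave-one-out loop might look like this, with `fit_mixed_logit` as a hypothetical stand-in for the mixed-effects fitting routine, reusing the `lr_test` helper sketched above:

```python
def leave_one_out(fit_mixed_logit, data, predictors, alpha=0.05):
    """Retain each predictor whose removal significantly harms model fit.

    fit_mixed_logit(data, fixed, random) is a hypothetical interface assumed
    to return a fitted model exposing a .loglik attribute.
    """
    full = fit_mixed_logit(data, fixed=predictors, random=predictors)
    kept = []
    for p in predictors:
        reduced = fit_mixed_logit(data,
                                  fixed=[q for q in predictors if q != p],
                                  random=predictors)  # random effects stay in
        _, pval = lr_test(reduced.loglik, full.loglik, df_diff=1)
        if pval < alpha:        # removing p harmed fit, so p is retained
            kept.append(p)
    return kept
```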
The same procedure described for unigrams was then carried out for the first four bigrams, but with random (by child) and fixed effects for the first two unigrams also included in each model (as unigrams are identical across the inverted and uninverted forms, only one set was included in the previous step). Importantly, bigrams from both the correctly inverted and the corresponding errorful forms were included at this second step.
For correctly inverted question forms, removal of the third and fourth bigrams harmed model fit to a statistically significant extent, according to the log-likelihood ratio tests, while for the uninverted forms, removal of the second, third, and fourth bigrams harmed model fit. Thus, in addition to the first two unigrams from the previous step, the third and fourth bigrams from the correctly inverted question forms and the second, third, and fourth bigrams from the errorful (uninverted) forms were held over for the final set of model comparisons.
For the first three trigrams, the same procedure was followed once more (with random and fixed effects for the first two unigrams and first two bigrams). Only removal of the second and third trigrams from the uninverted/errorful question forms harmed model fit to a significant extent.
Thus, the final set of predictors included the first two unigrams, the third and fourth bigrams from the correctly inverted forms, the second, third, and fourth bigrams from the uninverted forms, and the second and third trigrams from the uninverted forms.
Observation unit:
Individual

Kind of data:
Numeric, Text, Still image, Audio, Software

Type of data:
Experimental data

Resource language:
English