Lappin, Shalom (2017). The probabilistic representation of linguistic knowledge: Linguistic data sets annotated for grammatical acceptability. [Data Collection]. Colchester, Essex: UK Data Archive. DOI: 10.5255/UKDA-SN-851856
SMOG (Statistical Models of Grammaticality) is exploring the construction of an enriched stochastic model that represents the syntactic knowledge that native speakers of English have of their language.
We hope that this kind of model will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than by imposing a sharp boundary between acceptable and unacceptable sentences.
We are experimenting with different sorts of language models that contain a variety of parameters encoding properties of sentences and probability distributions over corpora.
We are training these models on subsets of the British National Corpus (BNC), and we are testing them on additional subsets of the BNC into which we have introduced grammatical deformations and infelicities of varying degrees of severity and subtlety.
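As a toy illustration of the kind of scoring such a setup supports, the sketch below trains an add-one-smoothed bigram model on a tiny corpus and assigns each sentence a length-normalised log probability, so that a scrambled "deformation" of a sentence scores below a grammatical one. This is purely illustrative: the corpus, the function names, and the choice of a bigram model with mean log probability as the acceptability score are assumptions for the sketch, not the project's actual models or metrics.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigram histories and bigrams over boundary-padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])              # history counts for the denominators
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(set(unigrams) | {"</s>"})       # vocabulary size for smoothing
    return unigrams, bigrams, vocab

def mean_logprob(sent, model):
    """Log probability per bigram, with add-one (Laplace) smoothing."""
    unigrams, bigrams, vocab = model
    toks = ["<s>"] + sent + ["</s>"]
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(toks, toks[1:])
    )
    return total / (len(toks) - 1)              # normalise for sentence length

# Illustrative three-sentence training "corpus".
corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the mat",
]]
model = train_bigram(corpus)

grammatical = "the dog slept on the mat".split()
scrambled = "mat the on slept dog the".split()
# The word-order deformation receives a lower normalised score.
assert mean_logprob(grammatical, model) > mean_logprob(scrambled, model)
```

Length normalisation matters here: without it, longer sentences would be penalised simply for containing more bigrams, conflating length with acceptability.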
We hope to show that a sufficiently complex enriched language model can encode a fair amount of what native speakers know about the syntax of their language.
This research holds out the prospect of significant impact in two areas:
(1) It can shed light on the relationship between the representation and acquisition of linguistic knowledge on the one hand, and learning and the encoding of knowledge in other cognitive domains on the other. This can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition.
(2) This work can contribute to the development of more effective language technology by providing insight into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations, they will provide more efficient tools for parsing and interpreting text and speech.
Data description (abstract)
The files contain crowd-sourced (Amazon Mechanical Turk) speaker-annotated sentences in several domains, and for several languages. The annotations are mean acceptability judgements in several modes of presentation.
Full documentation of the experimental protocols through which the annotation of these data sets was obtained is provided on the Statistical Models of Grammaticality (SMOG) website; please see the Related Resources section for a link to the SMOG website.
This data collection contains the linguistic data sets in Excel format, together with two papers that explain the project, data, and experiments in greater detail.
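Since the released annotations are mean acceptability judgements, a minimal sketch of how per-worker ratings aggregate into such means may be useful. The sentences, ratings, and 1-4 scale below are invented placeholders, not values from the actual data sets.

```python
from statistics import mean

# Hypothetical raw annotations: sentence -> ratings from different
# Mechanical Turk workers on an assumed 1-4 acceptability scale.
ratings = {
    "the cat sat on the mat": [4, 4, 3, 4],
    "mat the on sat cat the": [1, 2, 1, 1],
}

# One mean acceptability judgement per sentence, the form in which
# annotations of this kind are typically released.
mean_judgements = {sent: mean(r) for sent, r in ratings.items()}
```

Averaging over workers smooths out individual rater noise, which is why such data sets report a single mean score per sentence rather than raw per-worker ratings.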
Data creators:
Contributors:
Sponsors: Economic and Social Research Council
Grant reference: ES/J022969/1
Topic classification: Science and technology; Psychology
Keywords: linguistic data
Project title: The Probabilistic Representation of Linguistic Knowledge
Grant holders: Shalom Lappin
Project dates:
Date published: 28 May 2015 16:09
Last modified: 18 Apr 2017 09:19
Available Files
Data and documentation bundle
Related Resources
Data collections: Amazon Mechanical Turk: Sentence annotation experiments
Website: The Probabilistic Representation of Linguistic Knowledge: ESRC award information; Project website