The national corpus of contemporary Welsh, 2016-2020

Knight, Dawn and Morris, Steve and Fitzpatrick, Tess and Rayson, Paul and Spasić, Irena and Thomas, Enlli Môn and Lovell, Alex and Morris, Jonathan and Evas, Jeremy and Stonelake, Mark and Arman, Laura and Davies, Joshua and Ezeani, Ignatius and Neale, Steven and Needs, Jennifer and Piao, Scott and Rees, Mair and Watkins, Gareth and Williams, Lowri and Muralidaran, Vignesh and Tovey-Walsh, Bethan and Anthony, Laurence and Cobb, Tom and Deuchar, Margaret and Donnelly, Kevin and McCarthy, Michael and Scannell, Kevin (2021). The national corpus of contemporary Welsh, 2016-2020. [Data Collection]. Colchester, Essex: UK Data Service. 10.5255/UKDA-SN-854531

CorCenCC is an inter-disciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. A corpus, in this context, is a collection of spoken, written and/or e-language examples from real-life contexts that allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it ‘should’ be used. Corpora let us investigate how we use language across different genres and communicative mediums (i.e. spoken, written or digital), and how it varies according to the speaker/writer and the communicative purpose. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, speech recognition and web search tools. CorCenCC will provide societal, economic and academic benefits by: (1) Facilitating uses of Welsh in public, commercial, educational and governmental settings. (2) Redefining the scope, relevance and design infrastructure of corpus development methodology. CorCenCC is open-source and publicly accessible, with user interfaces for specific groups. It enables, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change. The project team comprised experts in corpus linguistics, Welsh, and language pedagogy and assessment, who specialise in the application of linguistic tools to real-world issues. Working with an advisory body of stakeholder representatives, they were optimally placed to meet the project aims: creating a permanent, sustainable and fit-for-purpose record of the living language, and pioneering an approach to content generation and user-driven applications that will provide a model for future corpus creation.

Data description (abstract)

The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse. Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context. To access this tool, see Related Resources.

Data creators:
Creator Name | Affiliation | ORCID (as URL)
Knight, Dawn | Cardiff University | https://orcid.org/0000-0002-4745-6502
Morris, Steve | Swansea University
Fitzpatrick, Tess | Swansea University
Rayson, Paul | Lancaster University
Spasić, Irena | Cardiff University
Thomas, Enlli Môn | Bangor University
Lovell, Alex | Swansea University
Morris, Jonathan | Cardiff University
Evas, Jeremy | Cardiff University
Stonelake, Mark | Swansea University
Arman, Laura | Cardiff University
Davies, Joshua | Bangor University
Ezeani, Ignatius | Lancaster University
Neale, Steven | Cardiff University
Needs, Jennifer | Swansea University
Piao, Scott | Lancaster University
Rees, Mair | Swansea University
Watkins, Gareth | Cardiff University
Williams, Lowri | Cardiff University
Muralidaran, Vignesh | Cardiff University
Tovey-Walsh, Bethan | Swansea University
Anthony, Laurence | Waseda University
Cobb, Tom | University of Quebec at Montreal
Deuchar, Margaret | University of Cambridge
Donnelly, Kevin | N/A
McCarthy, Michael | The University of Nottingham
Scannell, Kevin | Saint Louis University
Sponsors: Economic and Social Research Council, Arts and Humanities Research Council
Grant reference: ES/M011348/1
Topic classification: Media, communication and language; Society and culture
Keywords: LINGUISTICS, WELSH (LANGUAGE), PEDAGOGY, TEACHING, COMMUNITIES
Project title: Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction
Grant holders: Dawn Knight, Alexander Lovell, Steven Dyfrig Morris, Enlli Thomas, Jonathan Morris, Edmund Stonelake, Irena Spasic, Tess Fitzpatrick, Jeremy Evas, Paul Rayson
Project dates: 1 March 2016 to 30 November 2020
Date published: 27 Jan 2021 10:58
Last modified: 31 Jan 2021 21:37
