Knight, Dawn and Morris, Steve and Fitzpatrick, Tess and Rayson, Paul and Spasić, Irena and Thomas, Enlli Môn and Lovell, Alex and Morris, Jonathan and Evas, Jeremy and Stonelake, Mark and Arman, Laura and Davies, Joshua and Ezeani, Ignatius and Neale, Steven and Needs, Jennifer and Piao, Scott and Rees, Mair and Watkins, Gareth and Williams, Lowri and Muralidaran, Vignesh and Tovey-Walsh, Bethan and Anthony, Laurence and Cobb, Tom and Deuchar, Margaret and Donnelly, Kevin and McCarthy, Michael and Scannell, Kevin
(2021).
The national corpus of contemporary Welsh, 2016-2020.
[Data Collection]. Colchester, Essex:
UK Data Service.
10.5255/UKDA-SN-854531
CorCenCC is an inter-disciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. A corpus, in this context, is a collection of examples of spoken, written and/or e-language from real-life contexts that allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it ‘should’ be used. Corpora let us investigate how we use language across different genres and communicative mediums (i.e. spoken, written or digital), and how it varies according to the speaker/writer and the communicative purpose. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, speech recognition and web search tools.
CorCenCC will provide societal, economic and academic benefits by:
(1) Facilitating uses of Welsh in public, commercial, educational and governmental settings. (2) Redefining the scope, relevance and design infrastructure of corpus development methodology.
CorCenCC is open-source and publicly accessible, with user interfaces for specific groups. It enables, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change.
The project team comprised experts in corpus linguistics, Welsh, and language pedagogy and assessment, who specialise in the application of linguistic tools to real world issues. Working with an advisory body of stakeholder representatives, they were optimally placed to meet the project aims: creating a permanent, sustainable and fit-for-purpose record of the living language, and pioneering an approach to content generation and user-driven applications that will provide a model for future corpus creation.
Data description (abstract)
The CorCenCC corpus contains over 11 million words (circa 14.4m tokens). CorCenCC is the first corpus of the Welsh language that covers all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language). It offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, in education, in the various published media, and in public spaces. It includes examples of news headlines, personal and professional emails and correspondence, academic writing, formal and informal speech, blog posts and text messaging. Language data was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales. In this way, the CorCenCC corpus provides the means for empowering users of Welsh to better understand and observe the language across diverse settings, and creates a solid evidence base for the teaching of contemporary Welsh to those who aspire to use it. Over time, the corpus has the potential to make a significant contribution to the transformation of Welsh as the language of public, commercial, education and governmental discourse.
Beta versions of some bilingual corpus query tools have also been created as part of the CorCenCC project (see Related Resources). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context. To access this tool, see Related Resources.
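As an illustration of what two of these query functions compute, the following is a minimal Python sketch of a frequency list and an n-gram count over a tokenised text. It is not the CorCenCC implementation: the function names `frequency_list` and `ngrams` and the short sample sentence are assumptions made for this example only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_list(tokens: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (token, count) pairs, most frequent first; ties are ordered alphabetically."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count contiguous sequences of n tokens (case-folded)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    lowered = [t.lower() for t in tokens]
    return Counter(tuple(lowered[i:i + n]) for i in range(len(lowered) - n + 1))


# Hypothetical sample sentence, not taken from the corpus.
sample = "Mae hi yn braf heddiw ac mae hi yn oer".split()

top = frequency_list(sample)   # word frequency list
bigrams = ngrams(sample, 2)    # two-word sequences and their counts
```

Whitespace splitting stands in for real tokenisation here; the corpus itself distinguishes words from tokens, so counts produced by a sketch like this would not match figures reported by the CorCenCC tools.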
Data creators: |
Creator Name | Affiliation | ORCID (as URL) |
---|
Knight Dawn | Cardiff University | https://orcid.org/0000-0002-4745-6502 |
Morris Steve | Swansea University | |
Fitzpatrick Tess | Swansea University | |
Rayson Paul | Lancaster University | |
Spasić Irena | Cardiff University | |
Thomas Enlli Môn | Bangor University | |
Lovell Alex | Swansea University | |
Morris Jonathan | Cardiff University | |
Evas Jeremy | Cardiff University | |
Stonelake Mark | Swansea University | |
Arman Laura | Cardiff University | |
Davies Joshua | Bangor University | |
Ezeani Ignatius | Lancaster University | |
Neale Steven | Cardiff University | |
Needs Jennifer | Swansea University | |
Piao Scott | Lancaster University | |
Rees Mair | Swansea University | |
Watkins Gareth | Cardiff University | |
Williams Lowri | Cardiff University | |
Muralidaran Vignesh | Cardiff University | |
Tovey-Walsh Bethan | Swansea University | |
Anthony Laurence | Waseda University | |
Cobb Tom | University of Quebec at Montreal | |
Deuchar Margaret | University of Cambridge | |
Donnelly Kevin | N/A | |
McCarthy Michael | The University of Nottingham | |
Scannell Kevin | Saint Louis University | |
|
Sponsors: |
Economic and Social Research Council, Arts and Humanities Research Council
|
Grant reference: |
ES/M011348/1
|
Topic classification: |
Media, communication and language Society and culture
|
Keywords: |
LINGUISTICS, WELSH (LANGUAGE), PEDAGOGY, TEACHING, COMMUNITIES
|
Project title: |
Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction
|
Grant holders: |
Knight Dawn, Lovell Alexander, Morris Steven Dyfrig, Thomas Enlli, Morris Jonathan, Stonelake Edmund, Spasic Irena, Fitzpatrick Tess, Evas Jeremy, Rayson Paul
|
Project dates: |
From | To |
---|
1 March 2016 | 30 November 2020 |
|
Date published: |
27 Jan 2021 10:58
|
Last modified: |
31 Jan 2021 21:37
|
Collection period: |
Date from: | Date to: |
---|
1 March 2016 | 30 November 2020 |
|
Country: |
Wales |
Data collection method: |
A sampling frame was created to underpin the data collection for the project, to ensure that we captured a range of different speakers across different discourse contexts and geographical locations. The sampling frame was designed to reflect the current demographics of Welsh speakers, so that the corpus represents the contemporary sociolinguistic situation of the language as accurately as possible.
Spoken data was sourced via two main approaches: (i) recruitment of participants to be recorded and (ii) recruitment of participants to contribute spoken data via a novel CorCenCC crowdsourcing app. The scope of (i) included not only research assistants going into the field to record speakers but also participants recording themselves in various interactions. This was facilitated through a network of local 'champions' (active language animateurs in targeted areas) or the Mentrau Iaith (local language initiatives; each local authority in Wales has an associated Menter Iaith, i.e. a community-based organisation dedicated to raising the profile of the Welsh language). Recruitment for (ii) was achieved by publicising the app (for example through social media, television appearances and publicity materials) in order to reach a different cohort of participants who would be recording individually and in more private domains. Large Welsh language events such as the National Eisteddfod and Tafwyl provided opportunities for the team to reach a large cross-section of participants as well as raise general awareness of the project. The crowdsourcing app was made available on iOS, Android and via a web interface, and was promoted through media campaigns (e.g. appearances on television programmes such as S4C's Prynhawn Da and on both Welsh- and English-medium radio) and through local engagement events. Promotional material included pens, coasters, leaflets and postcard-size information sheets. An 'unofficial' mascot, based on a cat called Cor-pws, was designed to facilitate the participation of those under 18 and proved popular with contributors of all ages. Facebook and Twitter accounts for CorCenCC were set up in the first months of the project to further enhance the recruitment and participation of contributors.
Novel transcription conventions were devised for processing CorCenCC's spoken data (which was captured via the CorCenCC crowdsourcing app or manually, using audio recording devices). These conventions enabled us to reflect the whole spectrum of dialect/register variation captured in our speech data (making the transcripts more useful to academic researchers) and to represent participants' speech more accurately.
In terms of written data, the good relationship forged at the beginning of the project with Welsh language publishers such as Gwasg y Lolfa led to the incorporation into the corpus of many up-to-date novels and books. A unique source of written data in the Welsh language is the locally based Papurau Bro (i.e. local community Welsh-language newspapers). Engagement with other project stakeholders during the planning process enabled fairly rapid data capture, for example sampling from the Welsh language academic journal Gwerddon through the Coleg Cymraeg Cenedlaethol, and from adult L2 pedagogical resources and examination papers through the Welsh Joint Education Committee.
Regarding e-language data, website owners and blog authors cooperated generously and targets were exceeded. Contributors of SMS messages and emails were recruited in the same way as for the spoken data.
All relevant participant information and descriptive metadata was recorded at the time of data collection. Permissions to share the data in an online public resource were essential to the development of CorCenCC. These permissions were obtained from the relevant legal entities (e.g. the copyright owner; the speaker themselves) before the data was collected and locally stored. The raw data, together with the corresponding permissions and metadata, were deposited into a local file storage system. |
Observation unit: |
Individual, Organization, Family, Family: Household family |
Kind of data: |
Text |
Type of data: |
Qualitative and mixed methods data |
Resource language: |
Welsh |
|
Data sourcing, processing and preparation: |
We developed a computational infrastructure to support the systematic collection and storage of this large quantity of text and analytic data together with a user-friendly interface to enable interaction with this data online.
All raw data (written, spoken and electronic) was stored within a predefined folder structure, which corresponded to the sampling frame. From there, the data underwent the relevant cleaning and curation processes (including transcription, in the case of spoken data, turning the audio into text).
Once the texts (i.e. spoken, written and e-language) had been converted to plain text format, they were marked up with layers of sociolinguistic metadata (e.g. source, genre, geographical origin) that would be used to query the data, and automatically tagged. First, CyTag (developed by the CorCenCC team) was used for text segmentation, including sentence splitting and tokenisation, as well as part-of-speech (POS) tagging and lemmatisation. Second, to facilitate the semantic analysis of Welsh language data on a large scale, all pre-processed data was further marked up according to semantic categories using the CySemTagger (also developed by the CorCenCC team).
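The layered annotation described above can be pictured with a small Python sketch. It is illustrative only: the `Token` and `Document` classes, the field names, the metadata keys, the naive `tokenise` stand-in and the `query` helper are all assumptions made for this example, and they do not reproduce the actual output formats or tagsets of CyTag or CySemTagger.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str                    # surface form from tokenisation
    lemma: Optional[str] = None  # layer a lemmatiser would fill in
    pos: Optional[str] = None    # layer a POS tagger would fill in
    sem: Optional[str] = None    # layer a semantic tagger would fill in


@dataclass
class Document:
    metadata: Dict[str, str]     # e.g. source, genre, geographical origin
    sentences: List[List[Token]] = field(default_factory=list)


def tokenise(text: str) -> List[List[Token]]:
    """Naive stand-in: split sentences after . ! ? and tokens into words and punctuation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [[Token(form=m) for m in re.findall(r"\w+|[^\w\s]", s)] for s in sentences]


def query(docs: List[Document], **criteria: str) -> List[Document]:
    """Select documents whose metadata matches every given key/value pair."""
    return [d for d in docs if all(d.metadata.get(k) == v for k, v in criteria.items())]


# Hypothetical document with hypothetical metadata values.
doc = Document(metadata={"genre": "blog", "source": "example"},
               sentences=tokenise("Bore da. Sut wyt ti?"))
blogs = query([doc], genre="blog")
```

Keeping each annotation layer as an optional field on the token means segmentation, tagging and semantic mark-up can be applied as separate passes, while the metadata stays at document level where it is used for filtering.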
For more information on CyTag, CySemTagger and the CorCenCC infrastructure, please visit the CorCenCC GitHub page and main project website.
To share the data online, we implemented a web-based interface to the database and have made the corpus dataset available (see Related Resources).
The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool. When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged.
|
Rights owners: |
Name | Affiliation | ORCID (as URL) |
---|
Knight Dawn | Cardiff University | |
Morris Steve | Swansea University | |
Fitzpatrick Tess | Swansea University | |
Rayson Paul | Lancaster University | |
Spasić Irena | Cardiff University | |
Thomas Enlli Môn | Bangor University | |
Lovell Alex | Swansea University | |
Morris Jonathan | Cardiff University | |
Evas Jeremy | Cardiff University | |
Stonelake Mark | Swansea University | |
Arman Laura | Cardiff University | |
Davies Joshua | Bangor University | |
Ezeani Ignatius | Lancaster University | |
Neale Steven | Cardiff University | |
Needs Jennifer | Swansea University | |
Piao Scott | Lancaster University | |
Rees Mair | Swansea University | |
Watkins Gareth | Cardiff University | |
Williams Lowri | Cardiff University | |
Muralidaran Vignesh | Cardiff University | |
Tovey-Walsh Bethan | Swansea University | |
Anthony Laurence | Waseda University | |
Cobb Tom | University of Quebec at Montreal | |
Deuchar Margaret | University of Cambridge | |
Donnelly Kevin | None | |
McCarthy Michael | The University of Nottingham | |
Scannell Kevin | Saint Louis University | |
|
Contact: |
Name | Email | Affiliation | ORCID (as URL) |
---|
Knight, Dawn | KnightD5@cardiff.ac.uk | Cardiff University | Unspecified |
|
Notes on access: |
The Data Collection is available from an external repository. Access is available via Related Resources.
|
Publisher: |
UK Data Service
|
|
Available Files
No Files to display