Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019

Stuart-Smith, Jane and Sonderegger, Morgan and Mielke, Jeff (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019. [Data Collection]. Colchester, Essex: UK Data Service. 10.5255/UKDA-SN-854959

Anyone can now obtain a data visualization of a text search within seconds using generic, large-scale search tools such as the Google Ngram Viewer. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural contexts. Researchers in computational linguistics, speech technology, and forensic and clinical linguistics, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions which have been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.

Data description (abstract)

The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years.
We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only.
Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.

Data creators:

Creator Name	Affiliation	ORCID (as URL)
Stuart-Smith Jane	University of Glasgow	https://orcid.org/0000-0001-7400-9436
Sonderegger Morgan	McGill University	https://orcid.org/0000-0001-7675-2370
Mielke Jeff	North Carolina State University

Contributors:

Name	Affiliation	ORCID (as URL)
McAuliffe Michael	McGill University
Macdonald Rachel	University of Glasgow
Tanner James	McGill University
Willerton Savanna	McGill University

Sponsors:

Economic and Social Research Council, AHRC (UK), SSHRC/CRSH (Canada), NSERC/CRSNG (Canada), NSF (USA)

Grant reference:

ES/R003963/1

Topic classification:

Media, communication and language
Science and technology
Social stratification and groupings
Society and culture

Keywords:

LINGUISTIC ANALYSIS, LINGUISTICS, SPEECH, MEASUREMENTS

Project title:

SPeech Across Dialects of English (SPADE): large-scale digital analysis of a spoken language across space and time

Alternative title:

SPADE

Grant holders:

Jane Stuart-Smith, Josef Fruehwald, Morgan Sonderegger, Jeff Mielke

Project dates:

From	To
31 August 2017	30 August 2020

Date published:

31 Aug 2021 10:43

Last modified:

21 Feb 2024 13:09

Coverage and Methodology

Temporal coverage:

From	To
1949	2019

Collection period:

Date from:	Date to:
31 August 2017	30 August 2020

Country:

United Kingdom, Ireland, Canada, United States

Spatial unit:

Administrative > Council Areas
Administrative > Counties
Administrative > Countries
Administrative > Regions

Data collection method:

The acoustic measures provided were obtained from speech corpora collected as part of the SPADE project. Many of these were shared by a Data Guardian, an individual or institution with particular responsibility for one or more speech datasets, which they have collected personally for a specific purpose, overseen the collection of, or now curate.
The corpora are either public or private. Public corpora are either freely accessible or available for a fee. Private corpora have been collected for a specific purpose, often sociolinguistic or phonetic. Together, the corpora feature speech from the UK, Ireland, Canada and the USA and were sourced in order to obtain good dialect coverage across a variety of social dimensions (e.g. age, gender, class, ethnicity). The speech is in a variety of formats including read speech, public speeches, oral histories and sociolinguistic interviews.
Either the corpora were already force-aligned, or alignment was carried out as part of the SPADE project. Software developed as part of the SPADE project was then used to obtain vowel durations, static vowel formant measures and sibilant measures from the speech.

Observation unit:

Individual, Text unit

Kind of data:

Numeric, Text

Type of data:

Experimental data, Geospatial data, Historical data, Qualitative and mixed methods data

Resource language:

English

Access and Administration

Data sourcing, processing and preparation:

The vowel and sibilant measures are obtained from speech corpora collected as part of the SPADE project. Further details about these corpora and the measures can be found at the SPADE project OSF data repository (URL below). Corpus Data Guardians indicated specific requirements regarding the use of their corpora, such as the exclusion of person and place names from analysis. For example, including the name of a tiny village in a measures dataset might lead to identification. We met this requirement by minimally “whitelisting” the measures datasets: retaining only words that are (1) listed in large electronic English lexicons (Subtlex-US, -UK) and (2) not marked as possible person/place names in those lexicons, and anonymising all other words. This aside, the datasets are in their raw form and will require further processing (e.g. outlier removal) before analysis.
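As a rough illustration of the whitelisting step described above, the following Python sketch keeps a word only if it appears in a lexicon and is not flagged as a possible person or place name, and replaces every other word with a placeholder. It is not the SPADE project's own code: the function names, the `<ANON>` placeholder, and the toy lexicon and name lists are all hypothetical, and a real run would presumably load the word lists from the Subtlex-US and Subtlex-UK lexicons.

```python
from typing import Iterable, List, Set

# Hypothetical placeholder token; the token actually used in the datasets may differ.
PLACEHOLDER = "<ANON>"


def build_whitelist(lexicon_words: Iterable[str], possible_names: Iterable[str]) -> Set[str]:
    """Return the words that are in the lexicon and not flagged as possible names."""
    lexicon = {word.lower() for word in lexicon_words}
    names = {name.lower() for name in possible_names}
    return lexicon - names


def anonymise(words: Iterable[str], whitelist: Set[str], placeholder: str = PLACEHOLDER) -> List[str]:
    """Keep whitelisted words unchanged; replace every other word with the placeholder."""
    return [word if word.lower() in whitelist else placeholder for word in words]


# Toy data standing in for a lexicon and its name flags (assumed, for illustration only).
toy_lexicon = ["the", "cat", "sat", "in", "glasgow", "jane"]
toy_names = ["glasgow", "jane"]

whitelist = build_whitelist(toy_lexicon, toy_names)
result = anonymise(["The", "cat", "sat", "in", "Auchtermuchty"], whitelist)
print(result)
```

In this sketch, a word missing from the lexicon (such as the name of a small village) and a word flagged as a possible name are treated the same way: both are replaced, so only common vocabulary survives in the output.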

Rights owners:

Name	Affiliation	ORCID (as URL)
Stuart-Smith Jane	University of Glasgow	https://orcid.org/0000-0001-7400-9436
Sonderegger Morgan	McGill University	https://orcid.org/0000-0001-7675-2370
Mielke Jeff	North Carolina State University