Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019

Stuart-Smith, Jane and Sonderegger, Morgan and Mielke, Jeff (2024). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019. [Data Collection]. Colchester, Essex: UK Data Service. 10.5255/UKDA-SN-854959

Anyone can obtain a data visualization of a text search within seconds using generic, large-scale search tools such as the Google Ngram Viewer. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but they require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language. Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is a partnership of three teams based in Scotland, Canada and the US. Together we combine methods from computing science with tools and methods from speech science, linguistics and digital humanities to discover how much the sounds of English on both sides of the Atlantic vary over space and time. We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale, integrated speech corpus analysis across many datasets together.
The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural contexts. Researchers in computational linguistics, speech technology, and forensic and clinical linguistics, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone. Our project has developed our new software and applied it to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions, which have now been investigated for the first time on a much larger scale, through standardized data across many different varieties of English. Our large-scale study complements current-scale studies by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website.
In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of different formats and structures.

Data description (abstract)

The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below), which holds acoustic measures datasets for sibilants, and for durations and static formants of vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), together with information about dataset generation. The OSF site also provides Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. We used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
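
As an illustration of the kind of cleaning mentioned above, the following is a minimal sketch of one common approach to outlier removal: dropping measurements that lie more than a chosen number of standard deviations from the mean of their group. It is not the SPADE team's own procedure. The column names (`speaker`, `phone`, `F1`) and the 2.5 SD threshold are assumptions made for the example and will need to be adapted to the actual columns in each measures dataset.

```python
from collections import defaultdict
from statistics import mean, stdev


def drop_outliers(rows, value_key="F1", group_keys=("speaker", "phone"), max_sd=2.5):
    """Return the rows whose value lies within max_sd standard deviations
    of the mean of their group.

    rows is a list of dicts, e.g. as produced by csv.DictReader.
    All key names here are hypothetical, not the real dataset columns.
    """
    # Group rows, by default one group per speaker and phone.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)

    kept = []
    for members in groups.values():
        values = [float(r[value_key]) for r in members]
        # Too few tokens to estimate spread reliably: keep them all.
        if len(values) < 3:
            kept.extend(members)
            continue
        centre, spread = mean(values), stdev(values)
        # No variation in the group means nothing can be an outlier.
        if spread == 0:
            kept.extend(members)
            continue
        kept.extend(
            r for r in members
            if abs(float(r[value_key]) - centre) <= max_sd * spread
        )
    return kept
```

Grouping by speaker and segment before filtering avoids treating ordinary between-speaker differences as outliers. Whether a standard-deviation cutoff is appropriate at all, and at what threshold, depends on the measure and the research question.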

Data creators:
Creator name; Affiliation; ORCID (as URL)
Stuart-Smith, Jane; University of Glasgow; https://orcid.org/0000-0001-7400-9436
Sonderegger, Morgan; McGill University; https://orcid.org/0000-0001-7675-2370
Mielke, Jeff; North Carolina State University
Contributors:
Name; Affiliation
McAuliffe, Michael; McGill University
Macdonald, Rachel; University of Glasgow
Tanner, James; McGill University
Willerton, Savanna; McGill University
Sponsors: Economic and Social Research Council, AHRC (UK), SSHRC/CRSH (Canada), NSERC/CRSNG (Canada), NSF (USA)
Grant reference: ES/R003963/1
Topic classification: Media, communication and language; Science and technology; Social stratification and groupings; Society and culture
Keywords: LINGUISTIC ANALYSIS, LINGUISTICS, SPEECH, MEASUREMENTS
Project title: SPeech Across Dialects of English (SPADE): large-scale digital analysis of a spoken language across space and time
Alternative title: SPADE
Grant holders: Jane Stuart-Smith, Josef Fruehwald, Morgan Sonderegger, Jeff Mielke
Project dates: 31 August 2017 to 30 August 2020
Date published: 31 Aug 2021 10:43
Last modified: 21 Feb 2024 13:09
