Coleman, John (2016). Transcription textGrids for the audio edition of the British National Corpus. [Data Collection]. Colchester, Essex: UK Data Archive. 10.5255/UKDA-SN-851496

In this research project, Professor John Coleman and his co-workers at the Oxford University Phonetics Laboratory and the University of Pennsylvania will study how words are joined together in natural, fluent, everyday speech. In particular, they will make detailed acoustic measurements of numerous recordings to see how English speakers change the last consonants of words to link them up to the next word, and when and in what circumstances people "drop" final 't's and 'd's. The recordings they will use are from thousands of naturally occurring conversations collected in the 1990s for the British National Corpus. To find specific portions of speech, they will use automatic speech recognition technologies. This makes the project one of the most ambitious applications of speech recognition technology ever attempted, so the methods they develop should help future work on search tools for audio-visual data, such as sound libraries and movie databases. It will open up the audio recordings from the British National Corpus for other researchers, and for anyone interested in English speech, not just academics, to find whatever they may be looking for in that vast collection of recordings.

This collection comprises the Praat TextGrids for time-aligned transcriptions of the Audio BNC sound files. Transcriptions are time-aligned at the word and phoneme levels. The collection reflects the state of our transcriptions at the end date of the project. The files, together with the .wav files to which they relate, are also available from the Audio BNC server.

To use the data deposited in this zipfile:
1) Unzip the zipfile. This yields a large folder of Praat TextGrids.
2) View the TextGrids using the freely available Praat software or any simple text editor. Praat can also display the TextGrid annotation files time-aligned to the Audio BNC audio .wav files. (These audio files are available separately; we do not have the rights to upload them to the UK Data Service.)

The TextGrid file names combine the alphanumeric filename of the corresponding .wav audio file, the 6-digit conversation number employed in the previously published BNC transcripts, and the 3-character alphanumeric transcription/recording code. Thus, 021A-C0897X0004XX-AAZZP0_000406_KDP_1.TextGrid cross-refers to the corresponding .wav file and to conversation 000406 from recording KDP, division (<div>) 1.

A summary index to all the transcriptions (arranged by three-character BNC code) is available online, as are further details and links about the complete corpus, file naming conventions and online locations. Publications documenting how these data were collected and prepared, and how we have used them in our research, are also available.
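The file-naming convention described above can be illustrated with a short Python sketch that splits a TextGrid file name into its parts. The pattern is inferred only from the single example name given in this description; the underscore separators, the trailing division number, and the names `TEXTGRID_NAME` and `parse_textgrid_name` are assumptions made for illustration and are not part of the deposited data or its documentation.

```python
import re

# Pattern inferred from the one example file name in the description.
# Assumptions: fields are separated by underscores, and the final field
# before ".TextGrid" is the division (<div>) number.
TEXTGRID_NAME = re.compile(
    r"^(?P<wav_stem>[A-Za-z0-9-]+)"    # name of the corresponding .wav file
    r"_(?P<conversation>\d{6})"        # 6-digit BNC conversation number
    r"_(?P<bnc_code>[A-Za-z0-9]{3})"   # 3-character transcription/recording code
    r"_(?P<division>\d+)"              # division number
    r"\.TextGrid$"
)


def parse_textgrid_name(filename):
    """Return the parts of a TextGrid file name as a dict, or None if it does not fit."""
    match = TEXTGRID_NAME.match(filename)
    return match.groupdict() if match else None


parts = parse_textgrid_name("021A-C0897X0004XX-AAZZP0_000406_KDP_1.TextGrid")
print(parts)
```

A function like this could be used to group the unzipped TextGrids by BNC code or conversation number. File names that do not follow the assumed pattern return None rather than raising an error, so they can be inspected by hand.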

Creators: Coleman, John (Affiliation: University of Oxford; ORCID: Unspecified)
Research funders: ESRC
Grant reference: RES-062-23-2566
Subjects: Media, communication and language; Major studies and data; Population, vital statistics and censuses
Keywords: dialects, sound recordings, english (language)
Project title: Word joins in real-life speech: a large corpus-based study
Grant holders: John Coleman, Rosalind Temple, Jiahong Yuan
Project dates: 1 November 2010 to 30 April 2014
Date published: 21 Aug 2014 11:19
Last modified: 12 May 2016 14:45

