Transcription textGrids for the audio edition of the British National Corpus 1993

Coleman, John (2019). Transcription textGrids for the audio edition of the British National Corpus 1993. [Data Collection]. Colchester, Essex: UK Data Archive. 10.5255/UKDA-SN-851496

In this research project, Professor John Coleman and his co-workers at the Oxford University Phonetics Laboratory and the University of Pennsylvania will study how words are joined together in natural, fluent, everyday speech. In particular, they will make detailed acoustic measurements of numerous recordings to see how English speakers change the last consonants of words to link them to the next word, and when and in what circumstances people "drop" final 't's and 'd's. The recordings they will use come from thousands of naturally occurring conversations collected in the 1990s for the British National Corpus. To search for and find specific portions of speech, they will use automatic speech recognition technologies. This makes the project one of the most ambitious applications of speech recognition technology ever attempted, so the methods they develop should help future work on search tools for audio-visual data such as sound libraries and movie databases. It will also open up the audio recordings from the British National Corpus so that other researchers, and anyone interested in English speech, not just academics, can find whatever they may be looking for in that vast collection of recordings.

Data description (abstract)

This collection comprises the Praat TextGrids for time-aligned transcriptions of the Audio BNC sound files. Transcriptions are time-aligned at the word and phoneme levels. The collection reflects the state of our transcriptions at the end date of the project. The files, together with the .wav files to which they relate, are also available from the Audio BNC server, http://bnc.phon.ox.ac.uk/.

To use the data deposited in this zipfile:

1) Unzip the zipfile. This yields a large folder of Praat TextGrids.
2) View the Praat TextGrids using Praat software (freely available from www.praat.org) or any simple text editor. Praat can also display the TextGrid annotation files time-aligned to the Audio BNC audio .wav files. (These audio files are separately available from http://www.phon.ox.ac.uk/AudioBNC; we do not have the rights to upload them to the UK Data Service.)

Each TextGrid file name combines the alphanumeric filename of the corresponding .wav audio file, the 6-digit conversation number employed in the previously published BNC transcripts, and the 3-character alphanumeric transcription/recording code. Thus, 021A-C0897X0004XX-AAZZP0_000406_KDP_1.TextGrid cross-refers to the .wav file http://bnc.phon.ox.ac.uk/data/021A-C0897X0172XX-ABZZP0.wav, and to conversation 000406 from recording KDP, division (<div>) 1.

A summary index to all the transcriptions (arranged by three-character BNC code) is given at http://bnc.phon.ox.ac.uk/transcripts-html/, and further details and links about the complete corpus, file naming conventions and on-line locations are given at http://www.phon.ox.ac.uk/AudioBNC. Publications documenting how this data was collected and prepared, and how we have used it in our research, are available at http://gtr.rcuk.ac.uk/project/CD8C7191-EF60-41B8-BC80-A015ACCEC8EB#tabPublications.
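The file naming convention described above can be unpacked programmatically when working with the unzipped folder. The following Python sketch is illustrative only and is not part of the deposited collection. It assumes, based solely on the single example file name given above, that the fields are separated by underscores and that the trailing number is the division. The names `TextGridName` and `parse_textgrid_name` are hypothetical.

```python
import re
from typing import NamedTuple


class TextGridName(NamedTuple):
    """Fields recovered from an Audio BNC TextGrid file name (hypothetical type)."""
    wav_stem: str       # alphanumeric name of the corresponding .wav file
    conversation: str   # 6-digit conversation number, kept as text to preserve leading zeros
    bnc_code: str       # 3-character transcription/recording code
    division: int       # division (<div>) number


# Assumed layout, inferred from one example: <wav stem>_<6 digits>_<3 chars>_<div>.TextGrid
_PATTERN = re.compile(
    r"^(?P<wav_stem>.+)_(?P<conversation>\d{6})_"
    r"(?P<bnc_code>[A-Za-z0-9]{3})_(?P<division>\d+)\.TextGrid$"
)


def parse_textgrid_name(filename: str) -> TextGridName:
    """Split a TextGrid file name into its components, or raise ValueError."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected TextGrid file name: {filename!r}")
    return TextGridName(
        wav_stem=match["wav_stem"],
        conversation=match["conversation"],
        bnc_code=match["bnc_code"],
        division=int(match["division"]),
    )


info = parse_textgrid_name("021A-C0897X0004XX-AAZZP0_000406_KDP_1.TextGrid")
print(info)
```

File names that do not follow the assumed layout raise a `ValueError` rather than being silently mis-parsed, which makes it easier to spot any names in the collection that deviate from the example.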

Data creators: