Orthography, phonology and morphology in the Arabic lexicon

Cahill, Lynne (2017). Orthography, phonology and morphology in the Arabic lexicon. [Data Collection]. Colchester, Essex: Economic and Social Research Council. 10.5255/UKDA-SN-850541

Data description (abstract)

Arabic script is essentially alphabetic: its characters are chosen according to the pronunciation of words. However, much Arabic writing includes only the consonants, so a single written word is often ambiguous and could represent many different words or forms of those words.
This project aims to apply a framework previously developed for mapping between spelling and pronunciation in European languages (English, Dutch, German and French) to define the relations between written and spoken forms in Modern Standard Arabic. It then aims to apply a set of probabilities, extracted from Arabic corpora, to determine which of the possible pronunciations of a particular written form is the most likely.
The resulting lexicon will be useful for a range of Arabic NLP (Natural Language Processing) applications, and the structure of the lexicon means that it will be possible to extend it to cover different varieties of Arabic.
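
As a rough illustration of the disambiguation step described above, the following Python sketch picks the most probable pronunciation for a consonant-only written form. It is a minimal sketch under stated assumptions, not the project's framework or lexicon format: the transliterated forms, the counts, and the names CANDIDATE_COUNTS, pronunciation_probabilities and most_likely_pronunciation are all hypothetical and invented for this example.

```python
from collections import Counter

# Hypothetical toy data: the counts below are invented for illustration.
# They are NOT taken from any Arabic corpus or from the project's lexicon.
# Key: consonant-only written form (transliterated).
# Value: how often each candidate pronunciation was observed.
CANDIDATE_COUNTS = {
    "ktb": Counter({"kataba": 60, "kutub": 30, "kutiba": 10}),
}


def pronunciation_probabilities(written):
    """Turn raw counts for a written form into relative frequencies."""
    counts = CANDIDATE_COUNTS.get(written, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: n / total for form, n in counts.items()}


def most_likely_pronunciation(written):
    """Return the candidate with the highest probability, or None if unknown."""
    probs = pronunciation_probabilities(written)
    if not probs:
        return None
    return max(probs, key=probs.get)


if __name__ == "__main__":
    print(pronunciation_probabilities("ktb"))
    print(most_likely_pronunciation("ktb"))
```

A realistic system would presumably also condition on context rather than on the written form alone; this sketch uses simple relative frequencies only to make the idea concrete.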

Data creators:
Cahill, Lynne (University of Brighton)
Sponsors: Economic and Social Research Council
Grant reference: RES-000-22-3868
Topic classification: Media, communication and language
Date published: 29 Sep 2011 12:34
Last modified: 11 Jul 2017 09:48

Available Files

Data and documentation bundle
