Orthography, phonology and morphology in the Arabic lexicon

Cahill, Lynne (2017). Orthography, phonology and morphology in the Arabic lexicon. [Data Collection]. Colchester, Essex: Economic and Social Research Council. 10.5255/UKDA-SN-850541

Data description (abstract)

Arabic script is essentially alphabetic; that is, its characters are chosen according to the pronunciation of words. However, much Arabic writing includes only the consonants, which creates considerable ambiguity: a single written word could represent many different words or forms of those words.
This project aims to apply a framework previously developed for mapping between spelling and pronunciation in European languages (English, Dutch, German and French) to define the relations between written and spoken forms in Modern Standard Arabic. It then aims to apply a set of probabilities, extracted from Arabic corpora, to determine which of the possible pronunciations of a particular written form is the most likely.
The resulting lexicon will be useful for a range of Arabic NLP (Natural Language Processing) applications, and the structure of the lexicon means that it will be possible to extend it to cover different varieties of Arabic.
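
The disambiguation step described above can be illustrated with a minimal sketch. It is not the project's framework or data format: the function names, the romanised example forms and the counts are all hypothetical and invented for illustration. It assumes only that each consonant-only written form is associated with candidate pronunciations and their corpus frequencies, and it picks the candidate with the highest relative frequency.

```python
from collections import Counter

# Hypothetical toy data: written (consonant-only) form -> candidate
# pronunciations with corpus counts. The counts are invented for
# illustration and are NOT taken from the project's corpora.
TOY_COUNTS = {
    "ktb": Counter({"kataba": 60, "kutub": 30, "kutiba": 10}),
}


def rank_pronunciations(written, counts=TOY_COUNTS):
    """Return (pronunciation, probability) pairs, most likely first."""
    candidates = counts.get(written)
    if not candidates:
        return []  # unknown written form: no candidates to rank
    total = sum(candidates.values())
    # Relative frequency serves as a simple probability estimate.
    return [(form, n / total) for form, n in candidates.most_common()]


def most_likely(written, counts=TOY_COUNTS):
    """Return the single most probable pronunciation, or None if unknown."""
    ranked = rank_pronunciations(written, counts)
    return ranked[0][0] if ranked else None


if __name__ == "__main__":
    for form, prob in rank_pronunciations("ktb"):
        print(f"{form}: {prob:.2f}")
```

Keeping the candidate sets separate from the counts used to rank them means that, in a sketch like this, probabilities from a different corpus or variety could be substituted without changing the mapping itself.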

Data creators: