Cross Language Information Transfer Between Modern Standard Arabic and Its Dialects – a Framework for Automatic Speech Recognition System Language Model
Date of Award
Doctor of Philosophy
Dr. Ikhlas Abdel-Qader
Dr. Imed Zitouni
Dr. John Kapenga
Dr. Robert Trenary
Word embedding, speech recognition, language modeling, dialect ontology, natural language processing, words classification
Significant advances have been made in Automatic Speech Recognition (ASR) applications for Modern Standard Arabic (MSA). Yet ASR for dialectal conversation still trails behind because of limited language resources. As in most cultures, the formal language (here, MSA) is not used in daily life. Instead, a variety of regional dialects are spoken, which creates an urgent need for dialect ASR systems. Processing MSA already poses considerable challenges, and these carry over to the processing of its derived dialects. In the dialects, many words have gradually drifted from their MSA pronunciations and often have different usages. In addition, a significant amount of vocabulary has been borrowed from foreign languages. On top of these issues, dialects have too few resources to support meaningful natural language processing (NLP) research. Therefore, there is a pressing need for an efficient language model (LM) for deployment in Arabic conversational speech recognition systems.
In this thesis, we explore building a language model for Iraqi dialect conversational speech by utilizing MSA data. Because there is no predefined annotated vocabulary set, our main approach uses word embedding for unsupervised clustering of MSA and Iraqi dialect words. Clustering the dialect words with their related MSA words yields a class-based LM, which allows MSA data to compensate for the insufficiency of the dialect data. The model uses the dialect word’s statistical history together with the statistics of related MSA words to predict the intended spoken word sequence. Efficient word embedding is therefore essential for producing a reliable LM.
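The class-based idea can be illustrated with a minimal sketch. It is not the model built in the thesis: it assumes a standard class-based bigram factorization, P(w | w_prev) = P(c(w) | c(w_prev)) * P(w | c(w)), with add-alpha smoothing, and it replaces the embedding-based clustering with a hand-written word-to-class map. The class name ClassBigramLM, the class labels, and the tokens (msa_go, irq_go, and so on) are hypothetical placeholders, not real data.

```python
from collections import Counter


class ClassBigramLM:
    """Toy class-based bigram LM (illustrative assumption, not the thesis model)."""

    def __init__(self, word2class, alpha=1.0):
        self.word2class = word2class          # word -> cluster label
        self.alpha = alpha                    # add-alpha smoothing constant
        self.class_size = Counter(word2class.values())  # members per class
        self.class_bigrams = Counter()        # (prev_class, class) counts
        self.class_history = Counter()        # prev_class counts
        self.word_counts = Counter()          # word counts
        self.class_counts = Counter()         # total word tokens per class

    def train(self, sentences):
        for sent in sentences:
            classes = ["<s>"] + [self.word2class[w] for w in sent]
            for prev, cur in zip(classes, classes[1:]):
                self.class_bigrams[(prev, cur)] += 1
                self.class_history[prev] += 1
            for w in sent:
                self.word_counts[w] += 1
                self.class_counts[self.word2class[w]] += 1

    def prob(self, word, prev_word=None):
        """P(word | prev_word) = P(class | prev_class) * P(word | class)."""
        prev_c = "<s>" if prev_word is None else self.word2class[prev_word]
        c = self.word2class[word]
        a = self.alpha
        p_class = (self.class_bigrams[(prev_c, c)] + a) / (
            self.class_history[prev_c] + a * len(self.class_size)
        )
        p_word = (self.word_counts[word] + a) / (
            self.class_counts[c] + a * self.class_size[c]
        )
        return p_class * p_word


# Hypothetical clusters: each dialect token shares a class with an MSA token.
word2class = {
    "msa_i": "PRON", "irq_i": "PRON",
    "msa_go": "VERB", "irq_go": "VERB",
    "msa_home": "NOUN", "irq_home": "NOUN",
}

# Mostly MSA training data; "irq_go" and "irq_home" never occur in it.
sentences = [
    ["msa_i", "msa_go", "msa_home"],
    ["msa_i", "msa_go", "msa_home"],
    ["irq_i", "msa_go", "msa_home"],
]

lm = ClassBigramLM(word2class)
lm.train(sentences)

# The unseen dialect verb inherits the PRON -> VERB statistics learned from MSA.
p_verb = lm.prob("irq_go", prev_word="irq_i")
p_noun = lm.prob("irq_home", prev_word="irq_i")
```

In this toy setup the unseen dialect verb is scored higher after a pronoun than the unseen dialect noun, because the class transition counts come from the MSA sentences. That is the sense in which class sharing lets MSA data stand in for missing dialect data.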
To achieve efficient word embedding, we first analyze the MSA and Iraqi dialect vocabulary sets and their context intersection. For this purpose, we propose the Dialect Fast Stemming Algorithm (DFSA), which utilizes the MSA data and a predefined set of dialect suffixes. The intersection set grew from 42.8% to 54% of the Iraqi vocabulary and from 8% to 13% of the MSA vocabulary. Second, the syntactic and semantic feature vectors produced by the distributional-theory-based word embedding word2vec contained noise from contexts that appear solely in MSA or solely in the dialect; applying principal component analysis (PCA) reduced the perplexity (PP) by 6.7%. Finally, we propose the novel Wasf-Vec topological word embedding algorithm, which rests on the hypothesis that, for a morphologically rich language like Arabic, a word’s topological features are highly significant. This new feature extraction technique addresses the rich morphology and reduces PP by 7% when using distributional-theory-based word embedding. Moreover, a deep analysis of the words’ syntagmatic and paradigmatic relations is presented, based on established Arabic and Greek linguistic theories that demonstrate the need for topological word embedding.
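The effect of suffix stripping on the vocabulary intersection can be sketched as follows. This is only an assumption about the general shape of such a step, not the DFSA itself: a dialect word that is not in the MSA vocabulary is mapped to an MSA word if removing one suffix from a predefined set leaves a string found in the MSA vocabulary. The function names, the Latin-letter toy vocabularies, and the suffix set are hypothetical and are not taken from the thesis data.

```python
def strip_dialect_suffix(word, msa_vocab, suffixes, min_stem_len=2):
    """Return an MSA word reachable by removing one suffix, else the word itself."""
    if word in msa_vocab:
        return word
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= min_stem_len and stem in msa_vocab:
                return stem
    return word


def intersection_coverage(dialect_vocab, msa_vocab, suffixes=()):
    """Fraction of dialect vocabulary that maps onto the MSA vocabulary."""
    matched = [
        w for w in dialect_vocab
        if strip_dialect_suffix(w, msa_vocab, suffixes) in msa_vocab
    ]
    return len(matched) / len(dialect_vocab)


# Toy, transliteration-style strings for illustration only.
msa_vocab = {"kitab", "bayt", "qalam", "madrasa"}
dialect_vocab = {"kitab", "baytna", "qalamhum", "shlon", "hwaya"}
toy_suffixes = {"na", "hum"}  # hypothetical suffix set

coverage_before = intersection_coverage(dialect_vocab, msa_vocab)
coverage_after = intersection_coverage(dialect_vocab, msa_vocab, toy_suffixes)
```

On these toy sets, one of the five dialect entries matches the MSA vocabulary directly and three match after stripping. The percentages reported in the abstract come from the actual corpora and the actual DFSA, which this sketch does not reproduce.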
The three studies that make up this dissertation demonstrate the feasibility of utilizing MSA resources to enhance dialect processing. Further, combining distributional-theory-based and topology-based word embedding is of great interest for future investigation.
Abdulhameed, Tiba Zaki, "Cross Language Information Transfer Between Modern Standard Arabic and Its Dialects – a Framework for Automatic Speech Recognition System Language Model" (2020). Dissertations. 3629.