Date of Award

12-2018

Degree Name

Doctor of Philosophy

Department

Computer Science

First Advisor

Dr. Alvis Fong

Second Advisor

Dr. Elise de Doncker

Third Advisor

Dr. Ikhlas Abdel-Qader

Keywords

sentiment, Arabic, analysis, lexicon, stemmer, tokens

Abstract

Sentiment analysis is a type of text mining that uses Natural Language Processing (NLP) tools to identify and label opinionated text. There are two main approaches to sentiment analysis: lexicon-based and statistical. In our research, we use the lexicon-based approach because the lexicon contains sentiment words and phrases, which are the main linguistic units for expressing sentiment. More specifically, we work with domain-oriented lexicons, as they are more efficient than general ones because polarity is heavily driven by domain.
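
As a toy illustration of the lexicon-based idea (not the dissertation's actual lexicon or scoring rules), a text can be scored by counting matches against small positive and negative word lists; the R sketch below uses hypothetical lexicon entries.

    # Toy lexicon-based scoring: count hits against hypothetical word lists.
    positive <- c("رائع", "ممتاز", "جميل")   # hypothetical positive entries
    negative <- c("سيء", "رديء", "ممل")       # hypothetical negative entries

    score_text <- function(text) {
      tokens <- unlist(strsplit(text, "\\s+"))          # whitespace tokenization
      sum(tokens %in% positive) - sum(tokens %in% negative)
    }

    score_text("الفيلم رائع و ممتاز لكن النهاية ممل")   # score > 0 => positive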

The Arabic language has a degree of uniqueness that makes it hard to process with the available cross-language tools or with direct translation from English. Arabic has 28 letters, and with letter variations and vocalizations, each letter may take nine or more different shapes. Arabic is a highly inflectional and morphologically rich language, which makes it harder than English from the perspectives of feature detection and dimension reduction. So, to achieve higher accuracy, statistical learning methods have to be supported by language-specific knowledge.

In this research, we propose an approach called Polarity Latent Dirichlet Allocation (pLDA) to construct a domain-oriented lexicon for an Arabic-language domain. We first created our own training data and built our sentiment lexicon manually. After that, the process was automated using the statistical model Latent Dirichlet Allocation (LDA), and the manual and automated results were compared. Two weaknesses of LDA were alleviated in our model by resolving the hyper-parameter problem and by enriching the corpora with more features from the overall rated corpus. The lexicon was tested and validated on classification tasks with a variety of data set sizes, numbers of classes, and imbalance ratios. We designed a rule-based fuzzy system specifically to test our lexicon, and our approach showed excellent results, with accuracy between 81% and 92% (depending on text length and lexicon size).
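
For context, the sketch below fits a plain LDA topic model in R with the topicmodels package on a toy two-document corpus; the pLDA model described above adds polarity-aware extensions and hyper-parameter handling that are not shown here.

    # Minimal sketch of plain LDA (not the dissertation's pLDA variant).
    library(tm)
    library(topicmodels)

    docs <- c("خدمة ممتازة وسعر مناسب", "خدمة سيئة وتأخير كبير")  # toy corpus
    corp <- VCorpus(VectorSource(docs))
    dtm  <- DocumentTermMatrix(corp)

    lda <- LDA(dtm, k = 2, control = list(seed = 1234))  # fit k topics
    terms(lda, 3)   # top terms per topic, candidate lexicon seeds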

The next step was a dimension reduction system for the Arabic language. We developed a new stemmer for Arabic and released it as an R package called arStemmer1. We compared our stemmer with the well-known Khoja stemmer, one of the best-performing stemmers; arStemmer1 outperformed Khoja in six out of seven experiments. We also employed deep learning (the skip-gram model) to build stop-word lists, with some manual filtration. The R package arStemmer1 is available for researchers to use and test.
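
As a rough illustration of light stemming by affix stripping (this is not the arStemmer1 algorithm or API, just the general idea), the R sketch below removes a few common Arabic prefixes and suffixes.

    # Toy light stemmer: strip common Arabic prefixes (e.g. the definite
    # article) and suffixes; purely illustrative, not arStemmer1's method.
    light_stem <- function(word) {
      word <- sub("^(ال|وال|بال|كال|فال)", "", word)   # strip common prefixes
      word <- sub("(ات|ون|ين|ها|هم)$", "", word)       # strip common suffixes
      word
    }

    sapply(c("الكتاب", "والمدرسة", "كتابها"), light_stem)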

Access Setting

Dissertation-Open Access
