@nahrainuniv.edu.iq
Computer Science
Al-Nahrain University
Ph.D. Computer Science - Western Michigan University
M.Sc. Computer Science - Al-Nahrain University
Machine Learning - Speech Processing - Natural Language Processing
Scopus Publications
Tiba Zaki Abdulhameed, Suhad A. Yousif, Venus W. Samawi, and Hasnaa I. Al-Shaikhli
Institute of Electrical and Electronics Engineers (IEEE)
Rabia Emhamed Al Mamlook, Tiba Zaki Abdulhameed, Raed Hasan, Hasnaa Imad Al-Shaikhli, Ihab Mohammed, and Shadha Tabatabai
IEEE
Car crashes can cause serious and severe injuries that impact people every day, and these injuries can be especially damaging for elderly drivers aged 60 or older. The goal of this research is to investigate the risk factors that contribute to crash injury severity among elderly drivers. This is accomplished by designing accurate machine learning-based predictive models: Naïve Bayes (NB), Decision Tree (DT), Logistic Regression (LR), LightGBM, and Random Forest (RF) models are proposed. A set of influential factors is selected to build the five predictive models, which classify injuries as severe or non-severe. Michigan traffic data for the elderly population is used in this paper. Data normalization and the Synthetic Minority Oversampling Technique (SMOTE), used to balance the injury classes, are applied in the pre-processing phase. Results show that LightGBM achieved the highest accuracy of the five tested models, at 87%. According to the LightGBM model, the three most important factors affecting injury severity are the driver's age, the traffic volume, and the vehicle's age.
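The pipeline described in the abstract maps onto standard Python tooling. Below is a minimal sketch assuming scikit-learn, imbalanced-learn, and LightGBM; the CSV path, column names, and split parameters are illustrative placeholders, not details from the study.

```python
# Hedged sketch of the described pipeline: normalization, SMOTE class
# balancing on the training split, and a LightGBM classifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# Hypothetical file and label column; the paper's actual schema is not public here.
df = pd.read_csv("michigan_elderly_crashes.csv")
X = df.drop(columns=["severe_injury"])
y = df["severe_injury"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Normalize features, then oversample the minority (severe) class
# on the training data only, so the test set stays untouched.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = LGBMClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Feature importances indicate which risk factors drive predictions.
for name, imp in sorted(zip(X.columns, model.feature_importances_),
                        key=lambda t: -t[1])[:3]:
    print(name, imp)
```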
Tiba Zaki Abdulhameed, Imed Zitouni, and Ikhlas Abdel-Qader
Association for Computing Machinery (ACM)
Word clustering is a serious challenge in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use a feature vector representation generated by a distributional-theory-based word embedding method. The goal of this work is to utilize Modern Standard Arabic (MSA) for better clustering of the low-resource Iraqi vocabulary. We began with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes the MSA data; the proposed algorithm achieved an accuracy of 0.85 as measured by the F1 score. Then, the distributional-theory-based word embedding method and a new simple, yet effective, feature vector named Wasf-Vec are tested. Wasf-Vec utilizes a word's topological features; the difference between Wasf-Vec and distributional embeddings is that Wasf-Vec captures relations that are not contextually based. The embedding is followed by an analysis of how the dialect words are clustered among other MSA words. The analysis is based on word semantic relations that are well supported by solid linguistic theories, shedding light on the strong and weak word-relation representations identified by each embedding method. It is handled by visualizing the feature vectors in two-dimensional (2D) space: the vectors of the distributional method are plotted in 2D using the t-SNE algorithm, while the Wasf-Vec vectors are plotted directly in 2D. Each word's nearest neighbors and the distance histograms of the plotted words are examined. To validate the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). Wasf-Vec CBLM achieved a 7% lower perplexity (pp) than the CBLM built on the distributional embedding, a significant result when working with low-resource languages.
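The visualization step described above, projecting distributional vectors to 2D with t-SNE and inspecting nearest neighbors, can be sketched as follows. This is a minimal illustration assuming gensim and scikit-learn; the toy English corpus stands in for the MSA and Iraqi data, which is not reproduced here, and is not the paper's code.

```python
# Hedged sketch: train word2vec on a placeholder corpus, project the
# vectors to 2D with t-SNE, plot them, and inspect nearest neighbors.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

sentences = [
    ["the", "book", "is", "new"],
    ["the", "book", "is", "old"],
    ["a", "new", "car", "is", "fast"],
]  # placeholder corpus; the study uses MSA text and Iraqi conversations
model = Word2Vec(sentences, vector_size=100, min_count=1, seed=1)

words = list(model.wv.index_to_key)
vectors = model.wv[words]

# t-SNE maps the high-dimensional distributional vectors into 2D;
# Wasf-Vec vectors, per the abstract, would be plotted directly.
coords = TSNE(n_components=2, perplexity=3, random_state=1).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for w, (x, y) in zip(words, coords):
    plt.annotate(w, (x, y))
plt.show()

# Nearest neighbors hint at which semantic relations the embedding captures.
print(model.wv.most_similar("book", topn=3))
```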
Tiba Zaki Abdulhameed, Imed Zitouni, and Ikhlas Abdel-Qader
IEEE
Neural word embeddings, such as word2vec, produce very large feature vectors. In this paper, we investigate the length of the feature vector, aiming to optimize the word representation and to speed up the algorithm by addressing the impact of noise. Principal Component Analysis (PCA) has a proven record in dimensionality reduction, so we selected it to achieve our objectives. We also selected class-based Language Modeling as an extrinsic evaluation of the feature vectors, using perplexity (pp) as our metric, and K-means clustering for word classification. The execution time of the classification is also computed. We conclude that, for a given test set, if the training data is from the same domain, then a large vector size can increase the precision with which word relations are described. In contrast, if the training data is from a different domain and contains a large number of contexts not expected to occur in the test data, then a small vector size gives a better description and helps reduce the effect of noise on clustering decisions. Two training-data domains were used in this analysis: Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi phone conversations, with test data from the same Iraqi domain. For same-domain training and test data, execution time is reduced by 61% while keeping the same representation efficiency. In addition, for different-domain training data, i.e., MSA, a pp reduction of 6.7% is achieved with time reduced by 92%. This underlines the importance of carefully choosing the feature vector size for overall performance.
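The core experiment, PCA reduction of word2vec vectors followed by timed K-means clustering, can be sketched as below. Random vectors stand in for trained embeddings, and the vector sizes and cluster count are assumptions chosen for illustration, not the paper's settings.

```python
# Hedged sketch: compare K-means clustering time on full-size vectors
# versus PCA-reduced vectors, mirroring the trade-off discussed above.
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 300))  # stand-in for word2vec output

def cluster_time(X, k=100):
    """Cluster X into k classes and return the elapsed time."""
    start = time.perf_counter()
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return time.perf_counter() - start

t_full = cluster_time(vectors)

# Project onto the top principal components to discard noisy directions.
reduced = PCA(n_components=50, random_state=0).fit_transform(vectors)
t_reduced = cluster_time(reduced)

print(f"full: {t_full:.2f}s  reduced: {t_reduced:.2f}s  "
      f"time saved: {1 - t_reduced / t_full:.0%}")
```

In a full replication, the resulting clusters would feed a class-based language model whose perplexity serves as the extrinsic quality metric.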