Multimedia Research, ISSN: 2582-547X

BERT Representation for Arabic Information Retrieval: A Comparative Study

Abstract

Information in online documents and on social media is growing rapidly in all languages, and retrieving it is a demanding task, so Information Retrieval (IR) has become increasingly important in both research and commercial development. At present, only a few retrieval tools are available. Each language has its own pronunciation and structure, and Arabic in particular has a complex morphology, which has slowed progress in this field; a typical IR model must recognize similar word forms during matching. In this paper, we present a comparative study of recent approaches to Arabic Information Retrieval, implementing and comparing the existing approaches on Arabic datasets. We also introduce a dictionary-based Arabic lemmatizer, whose dictionary contains Arabic words collected from several Arabic books and websites, and we compare the performance of different lemmatization techniques. We then conduct a series of experiments comparing the different approaches to Arabic IR, and we examine how Arabic BERT performs against the existing approaches. The experimental results show that BM25 and multilingual BERT rank highest across the tasks, and retrieval on the Large Arabic Dataset achieves an accuracy of 89%.
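To make the comparison described above concrete, the sketch below contrasts the two families of rankers the study evaluates: a lexical Okapi BM25 scorer and a dense ranker built on multilingual BERT sentence embeddings. This is a minimal illustration, not the paper's implementation; the toy corpus, the query, and the BM25 hyperparameters (k1 = 1.5, b = 0.75) are our own illustrative assumptions.

```python
# Minimal sketch: scoring Arabic documents against a query with Okapi BM25.
# Corpus and query are illustrative stand-ins, not the paper's datasets.
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = ["النص العربي الأول",
        "مثال عن استرجاع المعلومات",
        "نموذج بيرت للغة العربية"]
query = "استرجاع المعلومات العربية"
# Plain whitespace tokens: exact matching misses morphological variants
# (e.g. العربية vs. العربي), which is why lemmatization matters for Arabic IR.
print(bm25_scores(query.split(), [d.split() for d in docs]))
```

The dense counterpart encodes the query and the documents with a BERT-based sentence encoder and ranks by cosine similarity, as sketched below; it requires the sentence-transformers package, and the public multilingual checkpoint named here is an illustrative assumption, not the model used in the paper.

```python
# Dense ranking sketch with multilingual sentence embeddings (assumed setup).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# Cosine similarity between the query embedding and each document embedding.
print(util.cos_sim(model.encode(query), model.encode(docs)))
```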
