

Dr. Mourad Abbas

The Arabic Corpus, compiled by Dr. Mourad Abbas, is composed of Arabic texts for text categorization. The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the main references:

(1) For Watan-2004 corpus

  • M. Abbas, K. Smaili, D. Berkani (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, vol. 9, no. 5, pp. 185-192.

  • M. Abbas, K. Smaili, and D. Berkani. "Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan." Arab Gulf Journal of Scientific Research 29.3-4 (2011): 183-191.

  • Abbas, M., Smaili, K., & Berkani, D. (2010, August). Efficiency of TR-Classifier versus TFIDF. In Integrated Intelligent Computing (ICIIC), 2010 First International Conference on (pp. 233-237). IEEE.

  • M. Abbas, K. Smaili, D. Berkani (2009). Multi-Category Support Vector Machines for Identifying Arabic Topics. Advances in Computational Linguistics, Special issue of Journal of Research in Computing Science, vol. 41, pp. 217-226, Volume Editor Alexander Gelbukh.

(2) For Khaleej-2004 corpus

  • M. Abbas, K. Smaili (2005). Comparison of Topic Identification Methods for Arabic Language. RANLP05: Recent Advances in Natural Language Processing, pp. 14-17, 21-23 September 2005, Borovets, Bulgaria.

Other references:

  • P.S.: Any use of this corpus to create other resources or software requires the authorization
    of Mourad Abbas.

Project Admins: