Arabic Corpus Wiki

Text categorization, Arabic language processing, language modeling

Status: Planning

Brought to you by: mouradabbas

Home

Authors:

The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora), is composed of Arabic texts for text categorization. The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora should cite the main references:

(1) For Watan-2004 corpus

M. Abbas, K. Smaili, D. Berkani (2011). Evaluation of Topic Identification Methods on Arabic Corpora, Journal of Digital Information Management, vol. 9, no. 5, pp. 185-192.
M. Abbas, K. Smaili, and D. Berkani. "Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan." Arab Gulf Journal of Scientific Research 29.3-4 (2011): 183-191.
Abbas, M., Smaili, K., & Berkani, D. (2010, August). Efficiency of TR-Classifier versus TFIDF. In Integrated Intelligent Computing (ICIIC), 2010 First International Conference on (pp. 233-237). IEEE.
M. Abbas, K. Smaili, D. Berkani (2009). Multi-Category Support Vector Machines for Identifying Arabic Topics. Advances in Computational Linguistics, Special issue of Journal of Research in Computing Science, vol. 41, pp. 217-226, Volume Editor Alexander Gelbukh.

(2) For Khaleej-2004 corpus

M. Abbas, K. Smaili (2005). Comparison of Topic Identification Methods for Arabic Language, RANLP05: Recent Advances in Natural Language Processing, pp. 14-17, 21-23 September 2005, Borovets, Bulgaria.

Other references:
http://sites.google.com/site/mouradabbas9/publications/international-conferences
http://sites.google.com/site/mouradabbas9/publications/pub

P.S.: Any use of this corpus to create other resources or software requires the authorization of Mourad Abbas.

Project Admins:

Dr. Mourad Abbas