The Arabic Corpus, compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ), is composed of Arabic texts for text categorization. The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora should cite the main references:
(1) For the Watan-2004 corpus
M. Abbas, K. Smaili, D. Berkani. (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, Vol. 9, No. 5, pp. 185-192.
M. Abbas, K. Smaili, and D. Berkani. "Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan." Arab Gulf Journal of Scientific Research 29.3-4 (2011): 183-191.
Abbas, M., Smaili, K., & Berkani, D. (2010, August). Efficiency of TR-Classifier versus TFIDF. In Integrated Intelligent Computing (ICIIC), 2010 First International Conference on (pp. 233-237). IEEE.
M. Abbas, K. Smaili, D. Berkani. (2009). Multi-Category Support Vector Machines for Identifying Arabic Topics. Advances in Computational Linguistics, Special issue of Journal of Research in Computing Science. Vol. 41, pp. 217-226, Volume Editor Alexander Gelbukh.
(2) For the Khaleej-2004 corpus
Other references:
http://sites.google.com/site/mouradabbas9/publications/international-conferences
http://sites.google.com/site/mouradabbas9/publications/pub