Arabic Corpus


Text categorization, Arabic language processing, language modeling

Windows, Linux


The Arabic Corpus (compiled by Dr. Mourad Abbas) is composed of Arabic texts for text categorization. The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references:
(1) For the Watan-2004 corpus:
M. Abbas, K. Smaili, D. Berkani (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, vol. 9, no. 5, pp. 185-192.
(2) For the Khaleej-2004 corpus:
M. Abbas, K. Smaili (2005). Comparison of Topic Identification Methods for Arabic Language. RANLP05: Recent Advances in Natural Language Processing, pp. 14-17, 21-23 September 2005, Borovets, Bulgaria.



Additional Project Details


Languages

Arabic, Dutch, English, French

Intended Audience

Advanced End Users, Developers, Engineering, Information Technology, Quality Engineers, Science/Research

User Interface

KDE, Win32 (MS Windows)

Programming Language

C++, JavaScript, Python

