Classical Arabic Corpus
===========================
The Classical Arabic Corpus was developed as part of a master's thesis titled "Edit Distance Adapted to Natural Language Words". The project consists of three parts: corpus, files and index. The database and text files are open source and available for free under the GNU General Public License (GPL) 2.0.



1. Corpus
------------
	The corpus gathers Arabic texts dating from 431 to 1104 (in the Hijri calendar, between 130 BH and 498 H). It contains 121,799,416 words in total, of which 1,138,616 are distinct, with an average length of about six (6.220) letters per word. These words consist only of Arabic letters, without diacritics, numbers or symbols.
	The database is available in two formats (.sql and .csv) for use in different applications. The table pairs each word with its number of occurrences in the texts. The total table size is 41.56 MB. The table sets CHARSET to cp1256 and COLLATE to cp1256_bin.
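
As a rough illustration of working with the .csv export, the sketch below reads word/occurrence pairs into a dictionary using the cp1256 encoding mentioned above. The column layout (word first, count second) and the function name load_word_counts are assumptions made for this example, not something this README specifies, so check the actual file before relying on it.

```python
import csv


def load_word_counts(path, encoding="cp1256"):
    """Read (word, count) rows from a CSV file into a dict.

    Assumption: each row holds the word in the first column and its
    occurrence count in the second. Rows that do not fit (for example
    a header row or a blank line) are skipped.
    """
    counts = {}
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # blank or malformed row
            try:
                counts[row[0]] = int(row[1])
            except ValueError:
                continue  # non-numeric count, e.g. a header row
    return counts
```

If the export turns out to use a different delimiter or column order, only the csv.reader call and the row indexing would need to change.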


	
2. Files
------------
	The text files are the Arabic source texts. They are in txt format and are provided in two encodings: UTF and CP1256. Each file name (*.*.txt) has the form (#author.#resource.txt), identifying the author and that author's resource.
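
The naming pattern above can be taken apart with a few lines of code. This is a minimal sketch that assumes the author part and the resource part contain no dots themselves; the function name split_corpus_filename and the sample name in the usage note are hypothetical.

```python
def split_corpus_filename(name):
    """Split 'author.resource.txt' into (author, resource) strings.

    Assumption: the author part contains no dot, so the first dot
    separates the author from the resource.
    """
    stem, _, ext = name.rpartition(".")
    if ext.lower() != "txt" or not stem:
        raise ValueError(f"not a .txt corpus file name: {name!r}")
    author, sep, resource = stem.partition(".")
    if not sep or not author or not resource:
        raise ValueError(f"expected author.resource.txt, got {name!r}")
    return author, resource
```

For example, split_corpus_filename("12.3.txt") returns the pair ("12", "3"), while a name with only one dot raises ValueError.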
	
	
	
3. Index
------------
	The index sheet is a table of historical information about the legacy texts. Each record in the index presents some or all of the following information, depending on availability: {name of resource - author - date of death - place of (birth, death, living) - number of words - reference}.
	The index is written in Arabic. It orders the files by author. It covers around 364 authors, who wrote 1,060 Arabic texts in total.

	

Contact
-------------
Questions or feedback?
Email: abeer.alsheddi@gmail.com
Source: README.txt, updated 2016-01-18