A version of the Tashkeela dataset of diacritized Arabic text, with non-Arabic content and undiacritized text removed, then divided into training, development, and test sets.
The cleaning process removes XML tags and stray symbols and fixes diacritic errors. Tokenization follows, with a focus on extracting the Arabic words. The result is a file of space-separated tokens in which words and numbers are split apart, but sequences of punctuation marks are kept together (e.g., a closing parenthesis followed by a period). Sentences are segmented at common punctuation marks such as periods, commas, and question/exclamation marks, as well as at line ends.
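The tokenization and sentence segmentation described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the character ranges, the set of sentence-ending marks, and the names `strip_tags` and `segment` are assumptions made for the example, and the diacritic-error fixing step is not covered.

```python
import re

# Assumed character classes; the project's real rules may differ.
# Arabic letters (U+0621-U+063A, U+0641-U+064A) plus diacritics (U+064B-U+0652).
ARABIC_WORD = r"[\u0621-\u063A\u0641-\u064A\u064B-\u0652]+"
NUMBER = r"\d+"
# A run of punctuation is kept as ONE token, e.g. ")." stays together.
PUNCT_RUN = r"[^\w\s\u064B-\u0652]+"
TOKEN = re.compile(f"{ARABIC_WORD}|{NUMBER}|{PUNCT_RUN}")

# Assumed sentence-ending marks (Latin and Arabic forms).
SENTENCE_END = set(".,!?\u060C\u061B\u061F")


def strip_tags(text):
    """Replace XML/HTML-like tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def segment(text):
    """Return sentences as strings of space-separated tokens."""
    sentences = []
    for line in strip_tags(text).splitlines():
        current = []
        for tok in TOKEN.findall(line):
            current.append(tok)
            # Close the sentence when the token contains an ending mark.
            if any(ch in SENTENCE_END for ch in tok):
                sentences.append(" ".join(current))
                current = []
        if current:  # a line end also closes the sentence
            sentences.append(" ".join(current))
    return sentences
```

Under these assumptions, a number glued to a word (such as `3` directly followed by an Arabic word) comes out as two tokens, while `).` remains a single token that also ends the sentence.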
The partitioning is done by shuffling groups of sentences, dividing each group into three parts (Train/Val/Test), and storing the parts in individual files.
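A minimal sketch of such a grouped split is shown below. The group size, the fixed seed, and the function name `partition` are hypothetical, and the choice to shuffle the order of the groups (rather than the sentences inside each group) is one possible reading of the description. The 90/5/5 ratios come from the feature list.

```python
import random


def partition(sentences, group_size=100, ratios=(0.90, 0.05, 0.05), seed=0):
    """Split sentences into train/val/test lists, group by group.

    group_size and seed are illustrative assumptions, not project values.
    """
    rng = random.Random(seed)
    # Cut the corpus into consecutive groups of sentences.
    groups = [sentences[i:i + group_size]
              for i in range(0, len(sentences), group_size)]
    rng.shuffle(groups)  # shuffle the order of the groups

    train, val, test = [], [], []
    for group in groups:
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Each returned list could then be written to its own file, one sentence per line. Splitting inside every group, instead of once over the whole corpus, keeps the three sets drawn from the same parts of the source texts.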

The original Tashkeela dataset is available at https://sourceforge.net/projects/tashkeela/

Features

  • Raw fully-diacritized Arabic texts.
  • Over 3 million sentences of varying length.
  • Mostly Classical Arabic.
  • Space-separated tokens.
  • 90% training, 5% validation, and 5% testing data.

License

GNU General Public License version 2.0 (GPLv2)

Additional Project Details

Languages

Arabic

Intended Audience

Developers, Science/Research

Programming Language

Python

Registered

2019-12-05