Tashkeela processed

Tashkeela dataset cleaned and normalized.

Brought to you by: hamza5

As of 2021-01-10, this project can be found here.

Last Update: 2020-12-21

A version of the Tashkeela Arabic diacritized text dataset with the non-Arabic content and the undiacritized text removed, then divided into training, development, and testing sets.
The cleaning process removes XML tags and strange symbols and fixes diacritic errors. Tokenization is then performed, with a focus on extracting the Arabic words. The result is a file of space-separated tokens in which words and numbers are separated from each other, but sequences of punctuation are kept together (e.g., a closing parenthesis followed by a period). Sentences are segmented at common punctuation marks such as periods, commas, and question or exclamation marks, as well as at line ends.
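The tokenization and segmentation rules described above can be illustrated with a minimal Python sketch. This is not the project's actual code: the character ranges, the punctuation set, and the names `tokenize` and `split_sentences` are assumptions chosen for illustration.

```python
import re

# Assumed character classes (illustrative, not taken from the project):
ARABIC_WORD = r"[\u0621-\u0652\u0670]+"   # Arabic letters and diacritics
NUMBER = r"[0-9\u0660-\u0669]+"           # Western and Arabic-Indic digits
PUNCT_RUN = r"[.,;:!?()\[\]\u060C\u061B\u061F-]+"  # runs of punctuation stay together

TOKEN_RE = re.compile("|".join([ARABIC_WORD, NUMBER, PUNCT_RUN]))

# Punctuation that ends a sentence: dot, comma, ! and ?, plus Arabic comma and question mark.
SENTENCE_END = set(".,!?\u060C\u061F")


def tokenize(text):
    """Return Arabic words, numbers, and punctuation runs; anything else is skipped."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text into space-separated token sentences at punctuation and line ends."""
    sentences = []
    for line in text.splitlines():
        current = []
        for token in tokenize(line):
            current.append(token)
            # A punctuation run containing a sentence-ending mark closes the sentence.
            if any(ch in SENTENCE_END for ch in token):
                sentences.append(" ".join(current))
                current = []
        if current:  # the line end also closes a sentence
            sentences.append(" ".join(current))
    return sentences
```

With these assumed rules, a word directly followed by digits yields two tokens, while `).` remains a single token.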
Partitioning is done by shuffling groups of sentences, then dividing each group into three parts (Train/Val/Test) and storing them in separate files.
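One possible reading of this partitioning step is sketched below in Python. It assumes that the sentences within each group are shuffled and that each group is then cut 90/5/5, matching the proportions listed under Features. The function name `partition`, its parameters, and the fixed seed are hypothetical, and writing the three parts to files is left out.

```python
import random


def partition(groups, train_pct=90, val_pct=5, seed=0):
    """Shuffle each group of sentences and split it into train/val/test parts.

    `groups` is a list of lists of sentences. The percentages are integers;
    whatever remains after the train and val cuts goes to the test part.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for group in groups:
        shuffled = list(group)  # copy so the caller's list is not modified
        rng.shuffle(shuffled)
        n_train = len(shuffled) * train_pct // 100
        n_val = len(shuffled) * val_pct // 100
        train.extend(shuffled[:n_train])
        val.extend(shuffled[n_train:n_train + n_val])
        test.extend(shuffled[n_train + n_val:])
    return train, val, test
```

Splitting each group separately, rather than the whole corpus at once, keeps every group represented in all three parts in roughly the same proportion.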

The original Tashkeela dataset is available at https://sourceforge.net/projects/tashkeela/

Features

Raw fully-diacritized Arabic texts.
Over 3 million sentences of varying length.
Mostly Classical Arabic.
Space-separated tokens.
90% training, 5% validation, and 5% testing data.
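
Because the sentences are fully diacritized and space-separated, a common way to use such data is to strip the diacritics to obtain an input text and keep the original as the target. The Python sketch below is an illustration under that assumption, not part of the project. The range U+064B to U+0652 covers the standard Arabic diacritics, and the names `strip_diacritics` and `make_pair` are hypothetical.

```python
import re

# Standard Arabic diacritics: tanween, fatha, damma, kasra, shadda, sukun.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove the diacritic marks, leaving letters, digits, and punctuation."""
    return DIACRITICS_RE.sub("", text)


def make_pair(sentence):
    """Return (undiacritized input, diacritized target) for one sentence."""
    return strip_diacritics(sentence), sentence
```

Since only combining marks are deleted, the spaces between tokens are preserved and the two strings in each pair have the same number of tokens.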

License

GNU General Public License version 2.0 (GPLv2)

Additional Project Details

Languages

Arabic

Intended Audience

Developers, Science/Research

Programming Language

Registered

2019-12-05
