Tashkeela processed Code

Tashkeela dataset cleaned and normalized.

Brought to you by: hamza5

File	Date	Author	Commit
README.txt	2019-12-12	Hamza Abbad	[c15766] Added the README.txt file.
dataset_processing.py	2019-12-12	Hamza Abbad	[28058c] Change SENTENCE_TOKENIZATION_REGEXP to not affe...

Read Me

Tashkeela processed: Tashkeela dataset cleaned and normalized.

A version of the Tashkeela Arabic diacritized text dataset with the non-Arabic content and the undiacritized text
removed, then divided into training, development, and testing sets.

The cleaning process removes XML tags and unusual symbols and fixes diacritic errors. Tokenization is then performed,
focusing on the extraction of Arabic words. The result is a file of space-separated tokens in which words and numbers
are separated from each other, but sequences of punctuation are kept together (e.g., a closing parenthesis followed
by a period). Sentences are segmented at common punctuation marks such as periods, commas, and question/exclamation
marks, as well as at line ends.
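
The steps above can be illustrated with a minimal Python sketch. It is not the code from dataset_processing.py: the
regular expressions, the character ranges, the set of sentence-ending marks, and the function names are all assumptions
chosen for illustration, and the diacritic error fixing step is not attempted here.

```python
import re

# All patterns below are illustrative assumptions, not taken from dataset_processing.py.
XML_TAG = re.compile(r"<[^>]+>")
ARABIC_LETTERS = "\u0621-\u064A"   # basic Arabic letters (assumed range)
DIACRITICS = "\u064B-\u0652"       # fathatan .. sukun (assumed range)

# A token is an Arabic word (letters plus diacritics), a run of digits,
# or a run of punctuation kept together, e.g. ")." stays a single token.
TOKEN = re.compile(rf"[{ARABIC_LETTERS}{DIACRITICS}]+|\d+|[^\s\w{DIACRITICS}]+")

# Assumed sentence-ending marks (Latin and Arabic forms).
SENTENCE_MARKS = set(".,!?\u060C\u061B\u061F")


def tokenize(line):
    """Strip XML tags, then extract Arabic words, numbers, and punctuation runs."""
    return TOKEN.findall(XML_TAG.sub(" ", line))


def split_sentences(text):
    """Return space-separated token strings, one per sentence.

    A sentence ends at a token containing a sentence-ending mark, or at a line end.
    """
    sentences = []
    for line in text.splitlines():
        current = []
        for token in tokenize(line):
            current.append(token)
            if any(ch in SENTENCE_MARKS for ch in token):
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
    return sentences
```

In this sketch, non-Arabic words are dropped simply because no alternative in TOKEN matches them, and a number written
directly next to a word is emitted as its own token.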

The partitioning is done by shuffling groups of sentences, then dividing each group into three parts
(Train/Val/Test) and storing each part in its own file.
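
One possible reading of this partitioning is sketched below in Python. It is an assumption-laden illustration rather
than the script's actual logic: the function name, the split fractions (0.8/0.1/0.1), the fixed seed, and the choice to
shuffle the sentences inside each group are all hypothetical.

```python
import random


def partition(groups, train_frac=0.8, val_frac=0.1, seed=0):
    """Split each group of sentences into train/val/test parts.

    `groups` is a list of lists of sentences. The fractions and the seed are
    illustrative assumptions, not the defaults of dataset_processing.py.
    """
    rng = random.Random(seed)  # local generator so the split is reproducible
    train, val, test = [], [], []
    for sentences in groups:
        shuffled = list(sentences)  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_frac)
        n_val = int(len(shuffled) * val_frac)
        train.extend(shuffled[:n_train])
        val.extend(shuffled[n_train:n_train + n_val])
        test.extend(shuffled[n_train + n_val:])  # remainder goes to the test part
    return train, val, test
```

Each returned list could then be written to its own file, one sentence per line. Splitting every group separately
keeps each group represented in all three parts.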

The original Tashkeela dataset is available at:
https://sourceforge.net/projects/tashkeela/

The script used for the processing is freely available at:
https://sourceforge.net/p/tashkeela-processed/code/
The script was run with its default parameters.
