Meriç ARGAÇ

The TTC-3600 data set is a collection of Turkish news stories and articles, comprising 3,600 categorized documents from 6 well-known news portals in Turkey. The RSS feeds of the portals are as follows:
1. http://dosyalar.hurriyet.com.tr/rss
2. http://www.posta.com.tr/rss
3. http://www.iha.com.tr/rss.html
4. http://www.haberturk.com/rss
5. http://www.radikal.com.tr/rss/
6. http://www.zaman.com.tr/rss_rssMainPage.action?sectionId=341

This data set was created to support text mining on Turkish and to make experimental results reproducible. The TTC-3600 data set is available in 4 forms that differ in their pre-processing:

  1. Original: No pre-processing step is applied.
  2. FPS-5: The first five characters of each term are taken as its stem, and stop-word elimination is performed.
  3. FPS-7: The first seven characters of each term are taken as its stem, and stop-word elimination is performed.
  4. Zemberek-Stemmed: The Zemberek NLP toolkit is used for stemming, and stop-word elimination is performed.
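
The FPS-5 and FPS-7 forms can be illustrated with a minimal Python sketch. It is not the code used to build TTC-3600. The stop-word list, the tokenization rule, and the function names `fps_stem` and `preprocess` are hypothetical, and removing stop-words before stemming is an assumption, since the order is not stated here.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the list actually
# used to build TTC-3600 is not given in this description.
STOP_WORDS = {"ve", "bir", "bu", "da", "de", "ile"}


def fps_stem(term: str, prefix_len: int) -> str:
    """Return the first `prefix_len` characters of a term as its stem."""
    return term[:prefix_len]


def preprocess(text: str, prefix_len: int) -> list:
    """Tokenize, drop stop-words, then apply fixed-prefix stemming.

    Assumption: stop-words are removed before stemming.
    Note: str.lower() does not follow Turkish casing rules for I and İ,
    so a real pipeline would need locale-aware lowercasing.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [fps_stem(t, prefix_len) for t in tokens if t not in STOP_WORDS]


sentence = "Ekonomi ve siyaset haberleri bu hafta yoğundu"
fps5 = preprocess(sentence, 5)
fps7 = preprocess(sentence, 7)
print(fps5)
print(fps7)
```

Terms shorter than the prefix length, such as "hafta" under FPS-7, are kept unchanged by the slice.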

Each form of the TTC-3600 data set includes two types of files. The first file, with a ".txt" extension, contains the names and contents of the features, whereas the second file is in Weka's ARFF (Attribute-Relation File Format), which describes a list of instances sharing a set of features.
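
To show what an ARFF file looks like, here is a minimal sketch of a reader for the dense variant of the format. The sample content, the attribute names, and the function `read_arff` are invented for illustration and are not taken from TTC-3600. The sketch does not handle sparse rows written in braces or quoted values containing commas, either of which the real files may use.

```python
def read_arff(lines):
    """Parse a dense ARFF document into (attribute names, data rows).

    Simplified sketch: ignores attribute types, sparse rows, and quoting.
    """
    attributes, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lowered = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            attributes.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, rows


# Toy example, not taken from TTC-3600.
SAMPLE = """\
% toy example
@relation toy
@attribute ekono numeric
@attribute haber numeric
@attribute class {ekonomi,spor}
@data
2,1,ekonomi
0,3,spor
"""

attributes, rows = read_arff(SAMPLE.splitlines())
print(attributes)
print(rows)
```

Each row lists one value per declared attribute, in declaration order, which is how the instances share a common set of features.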