The TTC-3600 data set is a collection of Turkish news and articles consisting of 3,600 categorized documents from 6 well-known portals in Turkey. The portals are as follows:
1. http://dosyalar.hurriyet.com.tr/rss
2. http://www.posta.com.tr/rss
3. http://www.iha.com.tr/rss.html
4. http://www.haberturk.com/rss
5. http://www.radikal.com.tr/rss/
6. http://www.zaman.com.tr/rss_rssMainPage.action?sectionId=341
This data set was created to support text mining on Turkish and to make experimental results reproducible. The TTC-3600 data set is available in 4 forms that differ in pre-processing:
1. Original: No pre-processing step is applied.
2. FPS-5: The first five characters of each term are taken as its stem, and stop-word elimination is performed.
3. FPS-7: The first seven characters of each term are taken as its stem, and stop-word elimination is performed.
4. Zemberek-Stemmed: The Zemberek NLP toolkit is used for stemming, and stop-word elimination is performed.
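The fixed-prefix forms (FPS-5 and FPS-7) can be illustrated with a minimal Python sketch. It is not the code used to build TTC-3600. The stop-word list, the whitespace tokenization, the function names, and the choice to remove stop-words before stemming are all assumptions made for illustration only.

```python
# Hypothetical stop-word list; the list actually used for TTC-3600 is not given here.
STOP_WORDS = {"ve", "bir", "bu", "ile", "için"}


def fps_stem(term, prefix_len):
    """Return the first `prefix_len` characters of `term` (the whole term if shorter)."""
    return term[:prefix_len]


def preprocess(text, prefix_len):
    """Split on whitespace, drop stop-words, then apply fixed-prefix stemming.

    The order (stop-word removal first, stemming second) is an assumption.
    """
    tokens = text.split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [fps_stem(t, prefix_len) for t in kept]


sample = "ekonomi ve spor haberleri için kaynak"
fps5 = preprocess(sample, 5)  # FPS-5 style stems
fps7 = preprocess(sample, 7)  # FPS-7 style stems
print(fps5)
print(fps7)
```

Terms shorter than the prefix length are kept whole, so a short word such as "spor" is the same in both forms, while longer words are cut at five or seven characters.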
TurkishTextCategorizationProject
Brought to you by: mericargac