I am a PhD student in text mining. I need to index an Arabic collection and then browse the result to print out all entries in the term-to-document index and the document-to-term index of the collection.
I have successfully built my .key index with the "BuildIndex" command (from the Lemur Toolkit). I set "arabicStemFunc" to arabic_light10_stop and docFormat to arabic, with the documents in TREC format and Windows CP1256 encoding.
The problem is that when I run Dumpdoc to view the index contents, the output is not readable, even if I store the result of DumpIndex in a file.
I also tried to browse the index with the Lemur API, but the output is still not readable.
So I was wondering whether this is an encoding problem. Could I use IndriBuildIndex on Arabic UTF-8 text instead of BuildIndex? In that case, would the resulting index be different?
I would appreciate any help with this.
I mean, even if I store the result of Dumpdoc in a file, it is still not readable.
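One way to test the encoding hypothesis is to re-decode the saved dump as Windows CP1256 and write it back out as UTF-8, then open the new file in a UTF-8 aware editor. The following is only a minimal sketch: it assumes the dump file contains raw CP1256 bytes, and the function name and file names are hypothetical, not part of the Lemur Toolkit.

```python
from pathlib import Path


def reencode_cp1256_to_utf8(src, dst):
    """Decode the bytes in `src` as Windows-1256 and write them to `dst` as UTF-8.

    Assumption: `src` is the saved dump output and its bytes are CP1256.
    Bytes that cannot be decoded are replaced instead of raising an error.
    """
    raw = Path(src).read_bytes()
    text = raw.decode("cp1256", errors="replace")
    # Write bytes directly so existing line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
    return text


# Hypothetical usage (file names are placeholders):
# reencode_cp1256_to_utf8("dump_cp1256.txt", "dump_utf8.txt")
```

If the converted file shows readable Arabic, the index content is probably fine and only the console or editor encoding was wrong. If it is still unreadable, the cause is likely something other than the display encoding.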
The console output looks like this: