From: Ravi S. <rav...@ya...> - 2007-12-19 17:55:12
Hello,

My name is Ravi, and I am a graduate student at the University of North Texas. We have been trying to use the InfoMap software to train models in Spanish. We know that it works flawlessly for English, and we found that the 'valid characters' and 'stop list' files need to be changed for it to work with any other language.

However, we appear to be having character encoding issues: the built model has no vectors for any Spanish word that contains diacritics. I noticed on the users' mailing list that there have been some issues with UTF-8 characters. In particular, the following thread mentions that several functions in the C code would need to be changed to accommodate UTF-8 characters:

http://sourceforge.net/mailarchive/message.php?msg_id=20040311202335.GB22850%40Turing.stanford.edu

The thread doesn't discuss the availability of UTF-8 libraries for C, nor does it list ~all~ the functions that would need to be changed. I was wondering whether you have heard of similar issues specific to Spanish text (which is specifically Latin-1 encoded), and how to resolve them.

In the meantime, one workaround I've been considering is to replace every character with a diacritic by my own 'special notation', both in the training corpus and in the stop list (for example, replacing N(tilde) with something like NXX). This should work, but if you know of another solution to this issue, please let us know. That would save us the overhead of conversions at various levels of our project.

Thanking you in anticipation,

Sincerely,
Ravi
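The substitution workaround described in the message above can be sketched roughly as follows. This is only an illustration in Python, not part of InfoMap: the placeholder codes (aXX, nXX, uYY, and so on), the function names, and the file handling are all assumptions made for the example. It also assumes that the corpus is Latin-1 encoded and that the letters used in the codes are listed in the 'valid characters' file.

```python
import re

# Hypothetical placeholder codes; nothing here is defined by InfoMap.
# Each code should use only letters accepted by the 'valid characters'
# file and should not occur in ordinary Spanish text.
LOWER = {
    "á": "aXX", "é": "eXX", "í": "iXX", "ó": "oXX", "ú": "uXX",
    "ü": "uYY", "ñ": "nXX",
}
# Upper-case variants: "Ñ" -> "NXX", "Á" -> "AXX", ...
UPPER = {ch.upper(): code[0].upper() + code[1:] for ch, code in LOWER.items()}
PLACEHOLDERS = {**LOWER, **UPPER}

# Reverse table and pattern for turning placeholders back into letters.
REVERSE = {code: ch for ch, code in PLACEHOLDERS.items()}
CODE_PATTERN = re.compile("|".join(re.escape(code) for code in REVERSE))


def encode(text):
    """Replace each accented letter with its ASCII placeholder."""
    return text.translate(str.maketrans(PLACEHOLDERS))


def decode(text):
    """Restore accented letters from their placeholders."""
    return CODE_PATTERN.sub(lambda m: REVERSE[m.group(0)], text)


def convert_file(src_path, dst_path):
    """Encode a Latin-1 file line by line (corpus or stop list)."""
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(encode(line))
```

The same encode step would have to be applied to the corpus, the stop list, and any query words, and decode applied to whatever the model returns. If the tool folds case, the upper-case X and Y in the codes would not survive, so the decode table would need adjusting accordingly.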