From: Sjur N. M. <sju...@ko...> - 2010-02-03 09:15:25
Hello,

This sounds like a great idea. I have only one point:

On 3 Feb 2010, at 06:59, Jacob Myers wrote:

> For searching I figured on having a module/namespace to handle that,
> perhaps like:
> phonetic:search(qnames, "some string") to handle searching - assuming
> it can figure out what the options on the index were it needs to use
> the same tokengroups and method/mode (I might also expose some other
> things like phonetic:metaphone() in here in case someone wants to just
> access the library functions). I will also try to figure out a
> phonetic:score that can score a match so it is possible to rank sort
> the results based on "how many matches" it finds instead of just
> finding all documents.

The Metaphone algorithm (like its relatives) is specifically targeted at English. I would suggest that you make this clear in the namespace, documentation, etc., for two reasons:

- to make people aware that using the algorithm on other languages might give less desirable results
- to avoid "grabbing" a seemingly language-independent namespace like "phonetic" for a single language

There are also other ways to make a "phonetic" index, depending on the language and technology used. I would suggest that the namespace be called "metaphone", but anything descriptive and not too generic would be fine :)

A short note on future indexing possibilities for other languages: At the University of Helsinki, Finland, there is a language technology group working on open-source transducer technology, HFST (hfst.sf.net). As part of the project they have also made a Java runtime for the transducers. This opens the door to fast and effective indexing of morphologically rich languages (e.g. Finnish and most other Uralic languages, many of the Germanic languages, etc.) in eXist, where for example only the base form of each word is indexed. It also makes it easier to build better tokenizers for these languages, handling e.g. punctuation and inflected acronyms and abbreviations in a proper, language-specific way.
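To illustrate what indexing only the base form of each word could mean, here is a minimal Python sketch. It is not HFST or eXist code: the LEMMAS table is a hypothetical stand-in for a real morphological analyser, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical stand-in for a morphological analyser (e.g. a transducer):
# maps a few inflected Finnish forms of "talo" (house) to the base form.
LEMMAS = {
    "talossa": "talo",   # in the house
    "talot": "talo",     # houses
    "taloissa": "talo",  # in the houses
}

def lemmatize(token):
    """Return the base form if known, otherwise the lowercased token."""
    token = token.lower()
    return LEMMAS.get(token, token)

def build_index(docs):
    """Build an inverted index keyed on base forms only."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token)].add(doc_id)
    return index

def search(index, word):
    """Look up a word by its base form, so any inflected form matches."""
    return sorted(index.get(lemmatize(word), set()))

docs = {"a": "Talossa on kissa", "b": "talot ovat vanhoja"}
index = build_index(docs)
```

With this sketch, searching for "talo" or "taloissa" finds both documents, because the query and the indexed text are reduced to the same base form before comparison.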
We have started to use the HFST tools for the languages we work on (Sámi), and I hope we will find the time to start integrating these tools with our eXist-based services this year.

Best regards,
Sjur N. Moshagen
Samediggi · Sametinget
Project Manager for the Divvun project
http://www.divvun.no/
http://www.samediggi.no/
+358-9-49 75 29 (w)
+358-505 634 319 (m)