I just imported a new CVS module, "abbrevinducer." This is a (very) simple C program that induces abbreviations from English-language text in ASCII-encoded files. The module includes the GNU "autobuild" files, so you should be able to just run ./configure; make; make install to get a copy.
The README has some notes on how to use the tool's output to produce a unique list of abbreviations above some cutoff. That list can then be turned into a java.util.Set and given to the sentence-detecting ContextGenerator.
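Here is a minimal sketch of that last step. It assumes the unique list ends up as plain text with one abbreviation per line, which may not match what the README actually produces. The class name AbbreviationSetLoader is made up for illustration, and handing the Set to the ContextGenerator is left out because that depends on its constructor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the abbrevinducer module.
public class AbbreviationSetLoader {

    // Reads one abbreviation per line (an assumed format), skipping blank
    // lines and trimming surrounding whitespace. Duplicates collapse
    // naturally because the result is a Set.
    public static Set<String> load(Reader in) throws IOException {
        Set<String> abbreviations = new HashSet<String>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String token = line.trim();
            if (token.length() > 0) {
                abbreviations.add(token);
            }
        }
        return abbreviations;
    }

    public static void main(String[] args) throws IOException {
        // In-memory input stands in for the real list file.
        Set<String> abbreviations = load(new StringReader("Mr.\nDr.\n\nMr.\n"));
        System.out.println(abbreviations.size()); // prints 2
    }
}
```

For a real list, passing a java.io.FileReader for the file in place of the StringReader should be enough.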
If anyone has experience writing UTF-8-handling C code, I'd be glad for some pointers. I wrote this in C because I wanted to (smile) and because I was working with a large data file that just took too long to process with Java. It should work OK on non-ASCII input that's 8-bit encoded (ISO-8859-1/-15, for example) as long as `.' indicates the presence of an abbreviation. Somehow I suspect that this isn't the case for ideographic languages...