Eric Friedman - 2001-12-27

I just imported a new CVS module, "abbrevinducer."  This is a (very) simple C program that induces abbreviations from English-language text in ASCII-encoded files.  The module includes the GNU "autobuild" files, so you should be able to just run ./configure; make; make install to get a copy.

The README has some notes on how to use the tool's output to produce a unique list of abbreviations above some cutoff.  That list can then be turned into a java.util.Set and given to the sentence-detecting ContextGenerator.
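
Here is a minimal sketch of that last step.  It assumes the unique list ends up as plain text with one abbreviation per line; that layout and the class name AbbreviationSetLoader are made up for illustration and are not something the README specifies.  How the resulting Set is handed to the ContextGenerator is left out, since that depends on the version of the class you are using.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the abbrevinducer module.
public class AbbreviationSetLoader {

    // Reads one abbreviation per line (an assumed format), trimming
    // surrounding whitespace and skipping blank lines. Duplicates
    // collapse automatically because the result is a Set.
    public static Set<String> load(Reader source) throws IOException {
        Set<String> abbreviations = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String token = line.trim();
                if (!token.isEmpty()) {
                    abbreviations.add(token);
                }
            }
        }
        return abbreviations;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the list file produced from the tool's output.
        Set<String> abbreviations = load(new StringReader("Mr.\nDr.\n\nMr.\n"));
        System.out.println(abbreviations.size() + " abbreviations loaded");
    }
}
```

To read a real file, pass a java.io.FileReader (or an InputStreamReader with an explicit charset) in place of the StringReader.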

If anyone has experience writing C code that handles UTF-8, I'd be glad for some pointers.  I wrote this in C because I wanted to (smile) and because I was working with a large datafile that just took too long to process with Java.  It should work OK on non-ASCII input that's 8-bit encoded (ISO-8859-1/-15, for example) as long as `.' indicates the presence of an abbreviation.  Somehow I suspect that this isn't the case for ideographic languages....