new CVS module "abbrevinducer"

Status: Inactive

Brought to you by: gann, jasonbaldridge, mwhite14850

Forum: Open Discussion

Creator: Eric Friedman

Created: 2001-12-27

Updated: 2001-12-27

Eric Friedman - 2001-12-27

I just imported a new CVS module, "abbrevinducer." This is a (very) simple C program that induces abbreviations from English-language text in ASCII-encoded files. The module includes the GNU "autobuild" files, so you should be able to just run ./configure; make; make install to get a copy.

The README has some notes on how to use the tool's output to produce a unique list of abbreviations above some cutoff. That list can then be turned into a java.util.Set and given to the sentence-detecting ContextGenerator.
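For the java.util.Set step, here is a rough sketch of what I have in mind. It assumes the filtered list ends up as a plain text file with one abbreviation per line; that format, the class name AbbrevSetLoader, and the method fromLines are only illustrative and not part of the module. How the resulting Set gets handed to the ContextGenerator is left out, since that depends on the constructor you use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of abbrevinducer or the sentence detector.
public class AbbrevSetLoader {

    // Builds a Set from lines holding one abbreviation each (assumed format).
    // Surrounding whitespace is trimmed and blank lines are skipped; the Set
    // itself takes care of dropping duplicates.
    public static Set<String> fromLines(Iterable<String> lines) {
        Set<String> abbrevs = new HashSet<>();
        for (String line : lines) {
            String token = line.trim();
            if (!token.isEmpty()) {
                abbrevs.add(token);
            }
        }
        return abbrevs;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, read that file; otherwise use a tiny demo list.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("Mr.", "Dr.", "", "Mr.", " etc. ");
        Set<String> abbrevs = fromLines(lines);
        System.out.println(abbrevs.size() + " unique abbreviations loaded");
    }
}
```

A HashSet seems like the natural choice here because the detector presumably only needs fast membership checks, not ordering.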

If anyone has experience writing UTF-8-handling C code, I'd be glad for some pointers. I wrote this in C because I wanted to (smile) and because I was working with a large data file that just took too long to process with Java. It should work OK on non-ASCII input that's 8-bit encoded (ISO-8859-1/-15, for example) as long as `.' indicates the presence of an abbreviation. Somehow I suspect that this isn't the case for ideographic languages...
