Java suffix array library for phrase discovery. Initially inspired by the classic paper of Yamamoto & Church, with newer ideas from Abouelhoda et al. and Kim et al. Adapted for a large alphabet, so that each word can be tokenized as a single alphabet character.
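As a rough illustration of the large-alphabet idea, each distinct word can be mapped to an integer id and the suffix array built over the resulting int sequence instead of over characters. This is a minimal sketch, not this library's API: the class `TokenSuffixArraySketch` and its methods are hypothetical, whitespace splitting stands in for a real tokenizer, and a naive comparison sort is used only for clarity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TokenSuffixArraySketch {

    /** Hypothetical symbol table step: map each distinct word to a dense integer id. */
    public static int[] encode(String text, Map<String, Integer> symbols) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = symbols.get(words[i]);
            if (id == null) {
                id = symbols.size();          // next unused id
                symbols.put(words[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    /** Naive suffix array: sort suffix start positions by comparing token ids. */
    public static Integer[] suffixArray(int[] tokens) {
        Integer[] sa = new Integer[tokens.length];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> compareSuffixes(tokens, a, b));
        return sa;
    }

    private static int compareSuffixes(int[] t, int a, int b) {
        while (a < t.length && b < t.length) {
            if (t[a] != t[b]) return Integer.compare(t[a], t[b]);
            a++;
            b++;
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return Integer.compare(t.length - a, t.length - b);
    }

    public static void main(String[] args) {
        Map<String, Integer> symbols = new HashMap<>();
        int[] ids = encode("to be or not to be", symbols);
        System.out.println(Arrays.toString(ids));              // [0, 1, 2, 3, 0, 1]
        System.out.println(Arrays.toString(suffixArray(ids))); // [4, 0, 5, 1, 2, 3]
    }
}
```

Because the alphabet is now the vocabulary rather than 256 byte values, sorting strategies that bucket on every possible symbol become impractical, which is why sorting and the symbol table dominate the cost in a design like this.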
- Adapted to a large alphabet for NLP
- Includes tokenizers, normalizers and a symbol table
- Calculates term frequency and document frequency, plus the full distribution of each phrase across texts
- Modular design: users can apply various statistics to phrases
- Includes an Aho-Corasick automaton, also adapted to a large alphabet
- Needs: Better sorting, though the radix quicksort usually works well enough
- Needs: A faster symbol table; this is the slowest part
- Needs: Tokenizers and normalizers for more languages
- Needs: Some links to foma (foma.sourceforge.net/)
Works, very small, no complaints.
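
To make the term and document frequency feature concrete, here is a minimal sketch of how both counts can be read off a suffix array over word ids. It is not this library's implementation: the class `PhraseFrequencySketch` and its methods are hypothetical, whitespace splitting stands in for the tokenizers, the suffix array is built with a naive sort, and a unique negative separator id after each document (an assumption of this sketch) keeps phrases from matching across document boundaries.

```java
import java.util.*;

public class PhraseFrequencySketch {
    private final Map<String, Integer> symbols = new HashMap<>();
    private final int[] tokens;   // all documents concatenated as word ids
    private final int[] docOf;    // docOf[i] = document that position i belongs to
    private final Integer[] sa;   // suffix start positions in sorted order

    public PhraseFrequencySketch(List<String> docs) {
        List<Integer> toks = new ArrayList<>();
        List<Integer> owner = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String w : docs.get(d).trim().toLowerCase().split("\\s+")) {
                if (w.isEmpty()) continue;
                Integer id = symbols.get(w);
                if (id == null) { id = symbols.size(); symbols.put(w, id); }
                toks.add(id);
                owner.add(d);
            }
            toks.add(-(d + 1));   // unique separator id, never equal to a word id
            owner.add(d);
        }
        tokens = new int[toks.size()];
        docOf = new int[toks.size()];
        sa = new Integer[toks.size()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = toks.get(i);
            docOf[i] = owner.get(i);
            sa[i] = i;
        }
        Arrays.sort(sa, this::compareSuffixes);
    }

    private int compareSuffixes(int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            if (tokens[a] != tokens[b]) return Integer.compare(tokens[a], tokens[b]);
            a++;
            b++;
        }
        return Integer.compare(tokens.length - a, tokens.length - b);
    }

    /** Negative if the suffix sorts before the phrase, 0 if the phrase is a prefix of it. */
    private int comparePrefix(int pos, int[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (pos + k >= tokens.length) return -1;
            int c = Integer.compare(tokens[pos + k], phrase[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    /** Binary search for the start (upper = false) or end (upper = true) of the matching range. */
    private int bound(int[] phrase, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePrefix(sa[mid], phrase);
            if (c < 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Returns {term frequency, document frequency} for a whitespace-separated phrase. */
    public int[] frequencies(String phrase) {
        String[] words = phrase.trim().toLowerCase().split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = symbols.get(words[i]);
            if (id == null) return new int[] {0, 0};   // unseen word: phrase cannot occur
            ids[i] = id;
        }
        int lo = bound(ids, false), hi = bound(ids, true);
        Set<Integer> docs = new HashSet<>();
        for (int i = lo; i < hi; i++) docs.add(docOf[sa[i]]);
        return new int[] {hi - lo, docs.size()};
    }

    public static void main(String[] args) {
        PhraseFrequencySketch idx = new PhraseFrequencySketch(
                Arrays.asList("the cat sat on the mat", "the cat ran", "a dog sat"));
        System.out.println(Arrays.toString(idx.frequencies("the cat"))); // [2, 2]
        System.out.println(Arrays.toString(idx.frequencies("the")));     // [3, 2]
    }
}
```

All suffixes that begin with a given phrase sit in one contiguous range of the suffix array, so term frequency is the size of that range and document frequency is the number of distinct documents its positions fall in. Collecting the document ids per range, rather than only counting them, would give the full distribution across texts.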