A Java suffix array library for phrase discovery. It was initially inspired by the classic paper of Yamamoto & Church, with newer ideas from Abouelhoda et al. and Kim et al. It is adapted for a large alphabet, so that each word token can be treated as a single alphabet character.
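To illustrate the large-alphabet idea, here is a minimal sketch of a word-level suffix array. It is not this library's API: the class `WordSuffixArraySketch` and its methods `encode` and `buildSuffixArray` are hypothetical names, the tokenizer is a plain lower-case-and-split stand-in, and construction uses a naive comparison sort rather than the radix qsort the project mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; names are illustrative and not taken from the library. */
class WordSuffixArraySketch {
    /** Symbol table: maps each distinct word to a small non-negative integer id. */
    private final Map<String, Integer> symbols = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    /** Lower-cases the text, splits on non-letters, and returns one id per word. */
    int[] encode(String text) {
        List<Integer> ids = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (tok.isEmpty()) continue; // split() can yield a leading empty token
            Integer id = symbols.get(tok);
            if (id == null) {
                id = words.size();
                symbols.put(tok, id);
                words.add(tok);
            }
            ids.add(id);
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Reverse lookup from id to word. */
    String word(int id) {
        return words.get(id);
    }

    /** Lexicographic comparison of two suffixes; on a tie the shorter suffix sorts first. */
    static int compareSuffixes(int[] text, int i, int j) {
        while (i < text.length && j < text.length) {
            if (text[i] != text[j]) return Integer.compare(text[i], text[j]);
            i++;
            j++;
        }
        return Integer.compare(text.length - i, text.length - j);
    }

    /** Naive construction: sort every suffix start position with a comparison sort. */
    static int[] buildSuffixArray(int[] text) {
        Integer[] order = new Integer[text.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }
}
```

For the input `To be, or not to be`, `encode` yields the ids `[0, 1, 2, 3, 0, 1]` and `buildSuffixArray` yields `[4, 0, 5, 1, 2, 3]`, so the two suffixes that begin with "to be" end up adjacent. This naive sort is only meant to show the data layout; it is far too slow for large corpora.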
Features
- Adapted to a large alphabet for NLP
- Includes tokenizers, normalizers and a symbol table
- Calculates term frequency and document frequency, plus the full distribution across texts
- Modular design; users can apply various statistics to phrases
- Includes an Aho-Corasick automaton, also adapted for a large alphabet
- Needs: better sorting, though radix qsort usually works okay
- Needs: improvements to the symbol table, which is the slowest part
- Needs: tokenizers and normalizers for more languages
- Needs: some links to foma (foma.sourceforge.net/)
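As a rough illustration of how term and document frequency can be read off a suffix array, the sketch below looks up one phrase at a time by binary search. It is an assumption-laden example, not the library's implementation: the class `PhraseFrequencySketch` and its methods are hypothetical, documents are assumed to be already encoded as non-negative word ids, phrases are assumed non-empty, and the per-phrase lookup is simpler than the interval-based methods in the papers cited above.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch; names are illustrative and not taken from the library. */
class PhraseFrequencySketch {
    private final int[] text;  // all documents concatenated, each followed by a unique negative sentinel
    private final int[] docOf; // docOf[i] = index of the document that contains position i
    private final int[] sa;    // suffix start positions in sorted order

    PhraseFrequencySketch(int[][] docs) {
        int n = 0;
        for (int[] d : docs) n += d.length + 1;
        text = new int[n];
        docOf = new int[n];
        int p = 0;
        for (int d = 0; d < docs.length; d++) {
            for (int w : docs[d]) { text[p] = w; docOf[p++] = d; }
            text[p] = -(d + 1); // sentinel keeps phrases from spanning two documents
            docOf[p++] = d;
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(a, b)); // naive construction
        sa = new int[n];
        for (int i = 0; i < n; i++) sa[i] = order[i];
    }

    /** Lexicographic comparison of two suffixes; on a tie the shorter suffix sorts first. */
    private int compareSuffixes(int i, int j) {
        while (i < text.length && j < text.length) {
            if (text[i] != text[j]) return Integer.compare(text[i], text[j]);
            i++;
            j++;
        }
        return Integer.compare(text.length - i, text.length - j);
    }

    /** Compares the first phrase.length words of the suffix at pos with phrase; 0 means phrase is a prefix. */
    private int compareToPhrase(int pos, int[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (pos + k >= text.length) return -1; // suffix ended early, so it sorts before phrase
            if (text[pos + k] != phrase[k]) return Integer.compare(text[pos + k], phrase[k]);
        }
        return 0;
    }

    /** First index in sa whose suffix is >= phrase (strict = false) or > phrase (strict = true). */
    private int bound(int[] phrase, boolean strict) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToPhrase(sa[mid], phrase);
            if (c < 0 || (strict && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Number of occurrences of phrase across all documents. */
    int termFrequency(int[] phrase) {
        return bound(phrase, true) - bound(phrase, false);
    }

    /** Number of distinct documents that contain phrase at least once. */
    int documentFrequency(int[] phrase) {
        BitSet seen = new BitSet();
        int end = bound(phrase, true);
        for (int i = bound(phrase, false); i < end; i++) seen.set(docOf[sa[i]]);
        return seen.cardinality();
    }
}
```

All suffixes that start with a given phrase form one contiguous block of the suffix array, so the block's size is the term frequency and the set of documents it touches gives the document frequency. With the three documents `{0, 1, 2, 0, 1}`, `{0, 1}` and `{2, 3}`, the phrase `{0, 1}` has term frequency 3 and document frequency 2.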
License
Apache Software License