Java suffix array library for phrase discovery. Initially inspired by the classic paper of Yamamoto & Church, with newer ideas from Abouelhoda et al. and Kim et al. Adapted for a large alphabet so that each word token can be treated as a single alphabet symbol.
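
To illustrate what treating words as alphabet symbols can look like, here is a minimal sketch and not the library's actual API. The class name `WordSuffixArraySketch` and its methods are hypothetical. It assumes a plain `HashMap` as the symbol table and a naive comparison sort in place of the radix sort the project mentions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a suffix array over word ids instead of characters.
final class WordSuffixArraySketch {

    // Assumed symbol table: each distinct word gets the next free integer id.
    static int[] encode(String[] words, Map<String, Integer> symbols) {
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = symbols.get(words[i]);
            if (id == null) {
                id = symbols.size();
                symbols.put(words[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Compares the suffixes starting at a and b symbol by symbol.
    // If one suffix is a prefix of the other, the shorter one sorts first.
    static int compareSuffixes(int[] text, int a, int b) {
        while (a < text.length && b < text.length) {
            if (text[a] != text[b]) {
                return Integer.compare(text[a], text[b]);
            }
            a++;
            b++;
        }
        // The suffix that ran out first now has the larger index.
        return Integer.compare(b, a);
    }

    // Naive construction: sort all suffix start positions with a comparator.
    // Simple to read, but slow on long texts with many repeats.
    static int[] buildSuffixArray(int[] text) {
        Integer[] order = new Integer[text.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> compareSuffixes(text, x, y));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = order[i];
        }
        return sa;
    }

    public static void main(String[] args) {
        Map<String, Integer> symbols = new HashMap<>();
        int[] ids = encode("to be or not to be".split(" "), symbols);
        System.out.println(Arrays.toString(buildSuffixArray(ids))); // prints [4, 0, 5, 1, 2, 3]
    }
}
```

In this sketch the sort order follows the order in which words were first seen, not alphabetical order. That is enough for phrase discovery, because only the grouping of equal prefixes matters.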

Features

  • Adapted to a large alphabet for NLP
  • Includes tokenizers, normalizers, and a symbol table
  • Calculates term and document frequency, including the full distribution across texts
  • Modular design: users can apply various statistics to phrases
  • Includes an Aho-Corasick automaton, also adapted to a large alphabet
  • Needs: better sorting, though radix quicksort usually works okay
  • Needs: improvements to the symbol table, which is the slowest part
  • Needs: tokenizers and normalizers for more languages
  • Needs: some links to foma (foma.sourceforge.net/)
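
The term and document frequency feature above can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's implementation: the class `PhraseFrequencySketch`, its method names, and the `docOf` array (mapping each token position to a document number) are all hypothetical. It uses plain binary search over an existing suffix array, and it does not exclude phrases that span a document boundary. Inserting a unique separator symbol between documents would be one way to handle that.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: phrase statistics from a suffix array over word ids.
final class PhraseFrequencySketch {

    // Compares the first phrase.length symbols of the suffix at 'start' with the phrase.
    // Returns 0 when the phrase is a prefix of that suffix.
    static int comparePrefix(int[] text, int start, int[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (start + k >= text.length) {
                return -1; // the suffix ended early, so it sorts before the phrase
            }
            if (text[start + k] != phrase[k]) {
                return Integer.compare(text[start + k], phrase[k]);
            }
        }
        return 0;
    }

    // Binary search for the first rank whose suffix is >= phrase (strict = false)
    // or > phrase (strict = true), comparing only the first phrase.length symbols.
    static int bound(int[] text, int[] sa, int[] phrase, boolean strict) {
        int lo = 0;
        int hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePrefix(text, sa[mid], phrase);
            if (c < 0 || (strict && c == 0)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Term frequency: number of suffixes that start with the phrase.
    static int termFrequency(int[] text, int[] sa, int[] phrase) {
        return bound(text, sa, phrase, true) - bound(text, sa, phrase, false);
    }

    // Document frequency: number of distinct documents among those occurrences.
    static int documentFrequency(int[] text, int[] sa, int[] docOf, int[] phrase) {
        Set<Integer> docs = new HashSet<>();
        int from = bound(text, sa, phrase, false);
        int to = bound(text, sa, phrase, true);
        for (int rank = from; rank < to; rank++) {
            docs.add(docOf[sa[rank]]);
        }
        return docs.size();
    }

    public static void main(String[] args) {
        // "to be or" (document 0) followed by "not to be" (document 1),
        // encoded as to=0, be=1, or=2, not=3.
        int[] text = {0, 1, 2, 3, 0, 1};
        int[] sa = {4, 0, 5, 1, 2, 3}; // suffix array of 'text', shorter suffix first on ties
        int[] docOf = {0, 0, 0, 1, 1, 1};
        int[] toBe = {0, 1};
        System.out.println(termFrequency(text, sa, toBe));            // prints 2
        System.out.println(documentFrequency(text, sa, docOf, toBe)); // prints 2
    }
}
```

Collecting document numbers in a set costs time proportional to the number of occurrences. The Yamamoto & Church approach mentioned in the project description computes these counts for whole classes of substrings at once, which this sketch does not attempt.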

License

Apache Software License

Additional Project Details

Intended Audience

Information Technology

Programming Language

Java

Related Categories

Java Linguistics Software, Java Natural Language Processing (NLP) Tool

Registered

2010-05-02