suffix arrays for phrase extraction

alpha
5.0 Stars (1)
0 Downloads (This Week)

Description

A Java suffix array library for phrase discovery. Initially inspired by the classic paper of Yamamoto & Church, with newer ideas from Abouelhoda et al. and Kim et al. Adapted for large alphabets, so that each word token can be treated as a single alphabet symbol.
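
As a rough illustration of the large-alphabet idea, the sketch below maps each word to an integer id and sorts the suffixes of the resulting id sequence. It is not taken from this library: the class name WordSuffixArraySketch and its methods are hypothetical, and the naive comparison sort stands in for whatever sorting the project actually uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not this library's API: a suffix array over word ids.
public class WordSuffixArraySketch {

    /** Maps each distinct word to an integer id, in order of first appearance. */
    static int[] encode(String[] tokens, Map<String, Integer> symbols) {
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Integer id = symbols.get(tokens[i]);
            if (id == null) {
                id = symbols.size();
                symbols.put(tokens[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    /** Compares two suffixes symbol by symbol; a suffix that runs out first sorts first. */
    static int compareSuffixes(int[] text, int a, int b) {
        while (a < text.length && b < text.length) {
            if (text[a] != text[b]) return Integer.compare(text[a], text[b]);
            a++;
            b++;
        }
        return Integer.compare(b, a);
    }

    /** Builds the suffix array with a plain comparison sort (simple, not fast). */
    static int[] buildSuffixArray(final int[] text) {
        Integer[] order = new Integer[text.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        Map<String, Integer> symbols = new HashMap<>();
        int[] ids = encode(tokens, symbols);   // to=0, be=1, or=2, not=3
        int[] sa = buildSuffixArray(ids);
        System.out.println(Arrays.toString(sa)); // prints [4, 0, 5, 1, 2, 3]
    }
}
```

Because the alphabet is the vocabulary rather than a fixed character set, the symbols are plain ints and the symbol table can grow as new words appear.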

Features

  • Adapted to large alphabets for NLP
  • Includes tokenizers, normalizers, and a symbol table
  • Calculates term and document frequency, as well as the full distribution across texts
  • Modular design: users can apply various statistics to phrases
  • Includes an Aho-Corasick automaton, also adapted to large alphabets
  • Needs: better sorting, though radix qsort usually works okay
  • Needs: improvements to the symbol table, which is the slowest part
  • Needs: tokenizers and normalizers for more languages
  • Needs: some links to foma (foma.sourceforge.net/)
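
To make the term and document frequency feature more concrete, here is a minimal sketch of how both counts can be read off a suffix array over several texts. It assumes the texts are already encoded as non-negative word ids; the class PhraseFrequencySketch, its fields, and the negative separator symbols are illustrative assumptions and not this project's actual design.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not this library's API: phrase counts from a suffix array.
public class PhraseFrequencySketch {
    final int[] text;   // all documents concatenated, each followed by a unique negative separator
    final int[] docOf;  // document index for every position in text
    final int[] sa;     // suffix start positions in sorted order

    PhraseFrequencySketch(int[][] docs) {
        int n = 0;
        for (int[] d : docs) n += d.length + 1;
        text = new int[n];
        docOf = new int[n];
        int pos = 0;
        for (int d = 0; d < docs.length; d++) {
            for (int sym : docs[d]) { text[pos] = sym; docOf[pos++] = d; }
            text[pos] = -1 - d;        // separator: never equal to a word id
            docOf[pos++] = d;
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(a, b)); // naive sort, for clarity only
        sa = new int[n];
        for (int i = 0; i < n; i++) sa[i] = order[i];
    }

    private int compareSuffixes(int a, int b) {
        while (a < text.length && b < text.length) {
            if (text[a] != text[b]) return Integer.compare(text[a], text[b]);
            a++;
            b++;
        }
        return Integer.compare(b, a);  // the suffix that ran out first sorts first
    }

    /** Returns 0 when the phrase is a prefix of the suffix starting at start. */
    private int compareToPhrase(int start, int[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (start + k >= text.length) return -1;
            int c = Integer.compare(text[start + k], phrase[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    /** Returns {termFrequency, documentFrequency} for a phrase of word ids. */
    int[] frequencies(int[] phrase) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {              // first suffix not smaller than the phrase
            int mid = (lo + hi) >>> 1;
            if (compareToPhrase(sa[mid], phrase) < 0) lo = mid + 1; else hi = mid;
        }
        int first = lo;
        hi = sa.length;
        while (lo < hi) {              // first suffix that no longer starts with the phrase
            int mid = (lo + hi) >>> 1;
            if (compareToPhrase(sa[mid], phrase) <= 0) lo = mid + 1; else hi = mid;
        }
        Set<Integer> docs = new HashSet<>();
        for (int i = first; i < lo; i++) docs.add(docOf[sa[i]]);
        return new int[] { lo - first, docs.size() };
    }
}
```

All suffixes that begin with a phrase sit in one contiguous block of the suffix array, so the block's size is the term frequency and the number of distinct documents inside it is the document frequency. The separators keep a phrase from matching across a document boundary.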

User Reviews

  • masonbarnes

    Works, very small, no complaints.

    Posted 09/18/2012

Additional Project Details

Intended Audience

Information Technology

Programming Language

Java

Registered

2010-05-02