This project lists the top R ranked terms whose length is between M and N. It extracts these phrases from a corpus stored in an input folder.
We use a large archive of newspaper stories (GigaWordCorpus) as input to a parallel MPI program, which produces a list of the top R especially interesting terms of lengths M through N.
The program is written in C using MPI.
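
As a rough illustration of the counting step, here is a minimal serial sketch in plain C. It is not the project's actual code. It assumes that term length is measured in words and that "interesting" simply means most frequent; the description above does not specify either. The names Term, add_term, by_count_desc and count_terms, and the limits MAX_PHRASE and MAX_TERMS, are hypothetical. MPI is left out on purpose: in a parallel version each process would presumably run something like this on its share of the corpus before the counts are merged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PHRASE 256   /* assumed cap on phrase length in bytes */
#define MAX_TERMS  1024  /* assumed cap on distinct phrases in this sketch */

/* One candidate phrase and how often it was seen. */
typedef struct {
    char text[MAX_PHRASE];
    int count;
} Term;

/* Record one occurrence of phrase. A linear search keeps the sketch short;
   a real corpus would need a hash table. New phrases are dropped once the
   table is full. */
static void add_term(Term *table, int *n_terms, const char *phrase) {
    for (int i = 0; i < *n_terms; i++) {
        if (strcmp(table[i].text, phrase) == 0) {
            table[i].count++;
            return;
        }
    }
    if (*n_terms < MAX_TERMS) {
        strncpy(table[*n_terms].text, phrase, MAX_PHRASE - 1);
        table[*n_terms].text[MAX_PHRASE - 1] = '\0';
        table[*n_terms].count = 1;
        (*n_terms)++;
    }
}

/* Order by count, highest first; ties are broken alphabetically. */
static int by_count_desc(const void *a, const void *b) {
    const Term *x = a;
    const Term *y = b;
    if (x->count != y->count) {
        return y->count - x->count;
    }
    return strcmp(x->text, y->text);
}

/* Count every phrase of m to n consecutive words in words[0..n_words-1],
   then sort the table so the top R terms are its first R entries.
   table must have room for MAX_TERMS entries.
   Returns the number of distinct phrases stored. */
int count_terms(const char **words, int n_words, int m, int n, Term *table) {
    assert(m >= 1 && m <= n);
    int n_terms = 0;
    for (int len = m; len <= n; len++) {
        for (int start = 0; start + len <= n_words; start++) {
            char phrase[MAX_PHRASE] = "";
            size_t used = 0;
            int fits = 1;
            for (int k = 0; k < len; k++) {
                const char *w = words[start + k];
                size_t wlen = strlen(w);
                size_t need = wlen + (k > 0 ? 1 : 0);
                if (used + need >= MAX_PHRASE) {
                    fits = 0;  /* phrase too long for the buffer: skip it */
                    break;
                }
                if (k > 0) {
                    phrase[used++] = ' ';
                }
                memcpy(phrase + used, w, wlen);
                used += wlen;
                phrase[used] = '\0';
            }
            if (fits) {
                add_term(table, &n_terms, phrase);
            }
        }
    }
    qsort(table, (size_t)n_terms, sizeof(Term), by_count_desc);
    return n_terms;
}
```

To get the top R terms, call count_terms with the tokenised text, M and N, then read the first R entries of the table (or fewer, if the return value is smaller than R).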