Parser Applications

David Fisher

The toolkit provides three parser applications. ParseToFile parses documents and writes the parsed text to a file. ParseQuery parses queries and writes the output to a file. ParseInQueryOp parses queries written in the InQuery structured query language and writes the output to a file.

ParseToFile

ParseToFile parses documents and writes output in BasicDoc format. The program uses one of the toolkit's Parser classes to parse.

Usage: ParseToFile paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

  • outputFile Name of file to output parsed documents to.
  • stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are output to the file.
  • acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA, U.S.A., USAs, USA's) are left uppercase as USA if USA is in the acronym list. If no acronym list is specified, acronyms will not be recognized.
  • docFormat:
    • "trec" for standard TREC formatted documents
    • "web" for web TREC formatted documents
    • "chinese" for segmented Chinese text (TREC format, GB encoding)
    • "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
    • "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
  • stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer.
    • "arabic" Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
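
As an illustration, the parameters above could be collected into a parameter file and passed to ParseToFile on the command line. The Python sketch below writes such a file and builds the argument list. The XML-style layout with a `<parameters>` root element is an assumption about the parameter file syntax (other toolkit versions may expect a different syntax, so check the one you have installed), and every file name is hypothetical.

```python
from xml.sax.saxutils import escape


def build_param_file(params):
    """Render a dict of parameter names and values as parameter-file text.

    Assumption: one XML element per parameter inside a <parameters> root.
    """
    lines = ["<parameters>"]
    for name, value in params.items():
        lines.append(f"  <{name}>{escape(str(value))}</{name}>")
    lines.append("</parameters>")
    return "\n".join(lines) + "\n"


def build_command(program, paramfile, datfiles):
    """Build the argument list: program paramfile datfile1 datfile2 ..."""
    return [program, paramfile, *datfiles]


# Hypothetical file names, for illustration only.
params = {
    "outputFile": "parsed.out",
    "stopwords": "stopwords.txt",
    "acronyms": "acronyms.txt",
    "docFormat": "trec",
    "stemmer": "porter",
}
param_text = build_param_file(params)
command = build_command("ParseToFile", "parse.param", ["docs1.dat", "docs2.dat"])
```

Writing `param_text` to `parse.param` and running `command` (for example with `subprocess.run`) would then follow the usage line shown above.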

ParseQuery

ParseQuery parses queries using one of the toolkit's Parser classes and an Index.

Usage: ParseQuery paramfile datfile1 datfile2 ...

Summary of parameters in paramfile:

  • qryOutFile The name of the file to write the parsed queries to.
  • index Name of the index (with the extension).
  • stopwords Name of file containing stopword list. Words in this file should be one per line. If this parameter is not specified, all words are left in the query.
  • acronyms Name of file containing acronym list (one word per line). Uppercase words recognized as acronyms (e.g. USA, U.S.A., USAs, USA's) are left uppercase as USA if USA is in the acronym list. If no acronym list is specified, acronyms will not be recognized.
  • docFormat:
    • "trec" for standard TREC formatted documents
    • "web" for web TREC formatted documents
    • "chinese" for segmented Chinese text (TREC format, GB encoding)
    • "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
    • "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
  • stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer.
    • "arabic" Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
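
The effect of the stopwords and acronyms parameters can be pictured with a small Python sketch. This is not the toolkit's parser code. It is a simplified illustration that assumes whitespace tokenization, assumes that all tokens other than recognized acronyms are lowercased, and uses hypothetical function names.

```python
import re


def normalize_token(token, acronyms, stopwords):
    """Return the normalized token, or None if it is a stopword.

    Illustrative only: forms such as USA, U.S.A., USAs and USA's collapse
    to the bare acronym when it is in the acronym list; every other token
    is lowercased (an assumption, not documented behavior).
    """
    # Drop periods and a trailing plural or possessive marker: U.S.A. -> USA
    candidate = token.replace(".", "")
    candidate = re.sub(r"'?s$", "", candidate)
    if candidate.isupper() and candidate in acronyms:
        return candidate
    word = token.lower()
    return None if word in stopwords else word


def parse_query(text, acronyms, stopwords):
    """Split on whitespace and keep the tokens that survive stopping."""
    normalized = (normalize_token(t, acronyms, stopwords) for t in text.split())
    return [t for t in normalized if t is not None]


terms = parse_query("the USA's policy on U.S.A. trade", {"USA"}, {"the", "on"})
```

With an empty acronym set the same query would lose its uppercase forms, which mirrors the statement that acronyms are not recognized when no acronym list is given.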

ParseInQueryOp

ParseInQueryOp parses queries using the InQueryOpParser class.

Usage: ParseInQueryOp paramfile datfile1 datfile2 ...

The parameters are:

  • stopwords: name of file containing the stopword list.
  • acronyms: name of file containing the acronym list.
  • docFormat:
    • "trec" for standard TREC formatted documents
    • "web" for web TREC formatted documents
    • "chinese" for segmented Chinese text (TREC format, GB encoding)
    • "chinesechar" for unsegmented Chinese text (TREC format, GB encoding)
    • "arabic" for Arabic text (TREC format, Windows CP1256 encoding)
  • stemmer:
    • "porter" Porter stemmer.
    • "krovetz" Krovetz stemmer.
    • "arabic" Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  • outputFile: name of the output file.
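
Because the three applications share the same docFormat and stemmer values, a parameter set can be sanity-checked before it is written to a file. The Python sketch below is a hypothetical helper, not part of the toolkit. It checks only the values listed on this page and assumes that docFormat and stemmer may be omitted.

```python
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz", "arabic"}
ARABIC_STEM_FUNCS = {
    "arabic_stop",
    "arabic_norm2",
    "arabic_norm2_stop",
    "arabic_light10",
    "arabic_light10_stop",
}


def check_params(params):
    """Return a list of problems found in a dict of parameter values."""
    problems = []
    doc_format = params.get("docFormat")
    if doc_format is not None and doc_format not in DOC_FORMATS:
        problems.append(f"unknown docFormat: {doc_format}")
    stemmer = params.get("stemmer")
    if stemmer is not None and stemmer not in STEMMERS:
        problems.append(f"unknown stemmer: {stemmer}")
    if stemmer == "arabic":
        # The Arabic stemmer needs both additional parameters.
        if "arabicStemDir" not in params:
            problems.append("arabic stemmer requires arabicStemDir")
        func = params.get("arabicStemFunc")
        if func is None:
            problems.append("arabic stemmer requires arabicStemFunc")
        elif func not in ARABIC_STEM_FUNCS:
            problems.append(f"unknown arabicStemFunc: {func}")
    return problems


ok = check_params({"docFormat": "trec", "stemmer": "krovetz"})
bad = check_params({"docFormat": "arabic", "stemmer": "arabic"})
```

An empty list means no problem was found among the checked values. It does not guarantee that the application will accept the file.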
