The GermanLanguageProcessing4Lucene (glp4lucene) package is an easy-to-use extension that adds basic language-processing capabilities for German to Lucene.
While some of these capabilities are already contained in Lucene for English, there seems to be no ready-to-use package for German (yet!).
glp4lucene contains the following features (all of them used in the example code below):
- lemmatization
- POS tagging and POS-based token filtering/boosting
- compound splitting (decompounding)
- synonym expansion via GermaNet or via distributional similarity
Actually, since lemmatization and POS tagging are language-independent, i.e., if you provide the right model files for the desired language, the package will work for languages other than German as well. Only the use of GermaNet is bound to German.
Besides the compiled classes, the .jar file also contains the source files (in case you want to make adjustments) as well as an example package, which should be easy to understand and reproduce.
I strongly recommend using Maven to deal with the dependencies. Maven will take care of most of them; there are still two packages which you have to download yourself, because they are not available via Maven. The .jar file also contains an exemplary pom.xml file which you can use in your own project:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>bastjan.de.lucene</groupId>
  <artifactId>bastjan.de.lucene</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.3.1</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0.1</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>maven2</id>
      <url>http://repo1.maven.org/maven2</url>
    </repository>
  </repositories>
</project>
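If you want Maven to resolve the two manually downloaded packages like any other dependency, one option (not part of glp4lucene itself, just standard Maven usage) is to install their jars into your local repository. The file name and coordinates below are placeholders; replace them with those of the actual jar you downloaded:

mvn install:install-file -Dfile=/<path-to>/downloaded-package.jar \
    -DgroupId=some.group -DartifactId=some-artifact \
    -Dversion=1.0 -Dpackaging=jar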
For Indexing:
import java.io.File;
import java.io.IOException;
import java.sql.SQLException;
import javax.xml.stream.XMLStreamException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;
// (the imports for ParseException and TreeTaggerException depend on the
// query parser and TreeTagger wrapper on your classpath and are omitted here)

public class Main {

    private static SynonymAnalyzerExample synonymAnalyzer = new SynonymAnalyzerExample(
            new TestSynonymEngine().start("/<path-to>/Desktop/Software/GermaNet/GN_V90_XML"),
            "/<path-to>/Downloads/ger-tagger+lemmatizer+morphology+graph-based-3.6/lemma-ger-3.6.model");

    // for the distributional similarity, see TestSynonymEngine.java for
    // the implementation and file format!
    /*
     * private static SynonymAnalyzerExample synonymDistSimAnalyzer = new
     * SynonymAnalyzerExample(new TestSynonymEngine().startDisSim(
     * "/<path-to>/Desktop/Software/de_news70M_pruned/LMI_p1000_l_200"));
     */

    @SuppressWarnings({ "deprecation" })
    public static void main(String[] args) throws IOException, InstantiationException,
            IllegalAccessException, ClassNotFoundException, SQLException,
            XMLStreamException, ParseException, TreeTaggerException {

        Version matchVersion = Version.LUCENE_46;
        Directory index = new SimpleFSDirectory(new File("index"));

        // load the Stanford POS tagger used to tag the documents before indexing
        MaxentTagger tagger = new MaxentTagger(
                "/<path-to>/Downloads/stanford-postagger-full-2014-01-04/models/german-dewac.tagger",
                new TaggerConfig("-model",
                        "/<path-to>/Downloads/stanford-postagger-full-2014-01-04/models/german-dewac.tagger"),
                false);

        // configure the IndexWriter to use the synonymAnalyzer!
        IndexWriterConfig config = new IndexWriterConfig(matchVersion, synonymAnalyzer);
        IndexWriter w = new IndexWriter(index, config);

        // clear the index, just in case
        w.deleteAll();

        // new document
        Document doc = new Document();
        // add a field to the document; the text is POS-tagged before indexing
        doc.add(new Field("content", tagger.tagString("Schönes, altes Landesgefängnis."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        Document doc2 = new Document();
        doc2.add(new Field("content", tagger.tagString("Ein absolutes Traumkittchen."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        Document doc3 = new Document();
        doc3.add(new Field("content", tagger.tagString("Ein absoluter Bunker."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        // add the documents to the writer
        w.addDocument(doc);
        w.addDocument(doc2);
        w.addDocument(doc3);

        // save changes to the writer
        w.commit();
        // close the writer
        w.close();
        System.err.println("done analyzing and writing index.");

        /*
         * CreateDistSimIndex distSim = new CreateDistSimIndex();
         * distSim.create();
         */
    }
}
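The example above only builds the index. As a complement, here is a minimal sketch of the search side; it is not part of the original example, and it assumes the index directory ("index"), the field name ("content"), and the same SynonymAnalyzerExample instance from the indexing code. Note that, depending on the filter chain, the query text may need to be POS-tagged first (e.g. via tagger.tagString(...)), just like the indexed text:

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class SearchExample {

    // pass in the same analyzer instance that was used for indexing
    public static void search(Analyzer synonymAnalyzer, String queryString) throws Exception {
        DirectoryReader reader = DirectoryReader.open(new SimpleFSDirectory(new File("index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // analyzing the query with the same analyzer ensures query terms are
        // lemmatized, decompounded, and expanded like the indexed text
        QueryParser parser = new QueryParser(Version.LUCENE_46, "content", synonymAnalyzer);
        Query query = parser.parse(queryString);
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content") + " (score: " + sd.score + ")");
        }
        reader.close();
    }
}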
The Analyzer for the glp4lucene package:
import java.io.Reader;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
// (the imports for the glp4lucene classes - SynonymEngine, GermanLemmatizer,
// DelimitedPartOfSpeechFilter, etc. - depend on the package layout and are omitted)

class SynonymAnalyzerExample extends Analyzer {

    private SynonymEngine engine;
    private static GermanLemmatizer gl;
    private Version version = Version.LUCENE_46;
    private final GermanLemmatizerProgram glp;
    private HashMap<String, Float> excludePOS;

    public SynonymAnalyzerExample(SynonymEngine engine, String lemmatizerModelString) {
        this.engine = engine;
        // defining these variables here and reusing them saves some resources
        // when using an AnalyzerWrapper
        SynonymAnalyzerExample.gl = new GermanLemmatizer(lemmatizerModelString);
        this.glp = new GermanLemmatizerProgram();
        this.excludePOS = new HashMap<String, Float>();
        // add the POS tags to be ignored (negative weight) or boosted
        // (positive weight), matching the POS tag set in use!
        // Regular expressions can be used.
        excludePOS.put("P.*", -5.0f);
        excludePOS.put("K.*", -5.0f);
        excludePOS.put("AP.*", -5.0f);
        excludePOS.put("CARD", -5.0f);
        excludePOS.put("ART", -5.0f);
        excludePOS.put("N.*", 10.0f);
        excludePOS.put("V.*", 5.0f);
        excludePOS.put("ADJ.*", 8.0f);
        excludePOS.put("ADV", 2.0f);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream filter = new StandardFilter(version, source);
        filter = new DelimitedPartOfSpeechFilter(filter, '_');
        DecompoundDictionaryLoader dcdl = new DecompoundDictionaryLoader();
        filter = new DictionaryCompoundWordTokenFilter(version, filter,
                dcdl.loadFromMap(engine.getMap(), version), 6, 4, 12, true);
        try {
            filter = new GermanLemmatizerFilter(filter, gl, glp);
        } catch (Exception e2) {
            e2.printStackTrace();
        }
        filter = new LowerCaseFilter(version, filter);
        try {
            filter = new SynonymFilter(filter, engine);
        } catch (Exception e) {
            e.printStackTrace();
        }
        filter = new POSFilterOut(version, filter, excludePOS);
        // if stemming is applied, it has to be applied after the steps above
        // filter = new GermanMinimalStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
}
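To check what the analysis chain actually emits (lemmas, split compounds, synonyms), you can run a POS-tagged sample sentence through the analyzer and print the tokens. This helper is not part of glp4lucene, just the standard Lucene TokenStream consumption idiom:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {

    // prints the tokens the analyzer emits for a POS-tagged input string
    public static void print(Analyzer analyzer, String taggedText) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(taggedText));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}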
The implementation is described here.