The GermanLanguageProcessing4Lucene (glp4lucene) package is an easy-to-use extension that adds basic language-processing capabilities for German to Lucene.
While some of these capabilities are already contained in Lucene for English, there seems to be no ready-to-use package for German (yet!).
glp4lucene contains the following features (all of them used in the example code below):
- lemmatization
- POS tagging and POS-based token filtering/boosting
- compound splitting (decompounding)
- synonym expansion via GermaNet or via distributional similarity
Actually, since lemmatization and POS tagging are language-independent, i.e., if you provide the right model files for the desired language, the package will work for languages other than German as well. Only the use of GermaNet is bound to German.
Besides the compiled classes, the .jar file also contains the source files (in case you want to make adjustments) as well as an example package, which should be easy to understand and reproduce.
I strongly recommend using Maven to deal with the dependencies. Maven will take care of most of them; there are still two packages which you have to download yourself, because they are not available via Maven. The .jar file also contains an exemplary pom.xml file which you can use in your own project:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>bastjan.de.lucene</groupId>
  <artifactId>bastjan.de.lucene</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.6.0</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.3.1</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0.1</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>maven2</id>
      <url>http://repo1.maven.org/maven2</url>
    </repository>
  </repositories>
</project>
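If you want Maven to resolve the two manually downloaded packages like any other dependency, one option (not part of glp4lucene itself, just standard Maven usage) is to install their jars into your local repository. The file name and coordinates below are placeholders; replace them with those of the actual jar you downloaded:

mvn install:install-file -Dfile=/<path-to>/downloaded-package.jar \
    -DgroupId=some.group -DartifactId=some-artifact \
    -Dversion=1.0 -Dpackaging=jar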
For Indexing:
import java.io.File;
import java.io.IOException;
import java.sql.SQLException;
import javax.xml.stream.XMLStreamException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;
// (the imports for ParseException and TreeTaggerException depend on the
// query parser and TreeTagger wrapper on your classpath and are omitted here)

public class Main {

    private static SynonymAnalyzerExample synonymAnalyzer = new SynonymAnalyzerExample(
            new TestSynonymEngine().start("/<path-to>/Desktop/Software/GermaNet/GN_V90_XML"),
            "/<path-to>/Downloads/ger-tagger+lemmatizer+morphology+graph-based-3.6/lemma-ger-3.6.model");

    // for the distributional similarity, see TestSynonymEngine.java for
    // the implementation and file format!
    /*
     * private static SynonymAnalyzerExample synonymDistSimAnalyzer = new
     * SynonymAnalyzerExample(new TestSynonymEngine().startDisSim(
     * "/<path-to>/Desktop/Software/de_news70M_pruned/LMI_p1000_l_200"));
     */

    @SuppressWarnings({ "deprecation" })
    public static void main(String[] args) throws IOException, InstantiationException,
            IllegalAccessException, ClassNotFoundException, SQLException,
            XMLStreamException, ParseException, TreeTaggerException {

        Version matchVersion = Version.LUCENE_46;
        Directory index = new SimpleFSDirectory(new File("index"));

        // load the Stanford POS tagger used to tag the documents before indexing
        MaxentTagger tagger = new MaxentTagger(
                "/<path-to>/Downloads/stanford-postagger-full-2014-01-04/models/german-dewac.tagger",
                new TaggerConfig("-model",
                        "/<path-to>/Downloads/stanford-postagger-full-2014-01-04/models/german-dewac.tagger"),
                false);

        // configure the IndexWriter to use the synonymAnalyzer!
        IndexWriterConfig config = new IndexWriterConfig(matchVersion, synonymAnalyzer);
        IndexWriter w = new IndexWriter(index, config);

        // clear the index, just in case
        w.deleteAll();

        // new document
        Document doc = new Document();
        // add a field to the document; the text is POS-tagged before indexing
        doc.add(new Field("content", tagger.tagString("Schönes, altes Landesgefängnis."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        Document doc2 = new Document();
        doc2.add(new Field("content", tagger.tagString("Ein absolutes Traumkittchen."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        Document doc3 = new Document();
        doc3.add(new Field("content", tagger.tagString("Ein absoluter Bunker."),
                Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        // add the documents to the writer
        w.addDocument(doc);
        w.addDocument(doc2);
        w.addDocument(doc3);

        // save changes to the writer
        w.commit();
        // close the writer
        w.close();
        System.err.println("done analyzing and writing index.");

        /*
         * CreateDistSimIndex distSim = new CreateDistSimIndex();
         * distSim.create();
         */
    }
}
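The example above only builds the index. As a complement, here is a minimal sketch of the search side; it is not part of the original example, and it assumes the index directory ("index"), the field name ("content"), and the same SynonymAnalyzerExample instance from the indexing code. Note that, depending on the filter chain, the query text may need to be POS-tagged first (e.g. via tagger.tagString(...)), just like the indexed text:

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class SearchExample {

    // pass in the same analyzer instance that was used for indexing
    public static void search(Analyzer synonymAnalyzer, String queryString) throws Exception {
        DirectoryReader reader = DirectoryReader.open(new SimpleFSDirectory(new File("index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // analyzing the query with the same analyzer ensures query terms are
        // lemmatized, decompounded, and expanded like the indexed text
        QueryParser parser = new QueryParser(Version.LUCENE_46, "content", synonymAnalyzer);
        Query query = parser.parse(queryString);
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content") + " (score: " + sd.score + ")");
        }
        reader.close();
    }
}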
The Analyzer for the glp4lucene package:
import java.io.Reader;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
// (the imports for the glp4lucene classes - SynonymEngine, GermanLemmatizer,
// DelimitedPartOfSpeechFilter, etc. - depend on the package layout and are omitted)

class SynonymAnalyzerExample extends Analyzer {

    private SynonymEngine engine;
    private static GermanLemmatizer gl;
    private Version version = Version.LUCENE_46;
    private final GermanLemmatizerProgram glp;
    private HashMap<String, Float> excludePOS;

    public SynonymAnalyzerExample(SynonymEngine engine, String lemmatizerModelString) {
        this.engine = engine;
        // defining these variables here and reusing them saves some resources
        // when using an AnalyzerWrapper
        SynonymAnalyzerExample.gl = new GermanLemmatizer(lemmatizerModelString);
        this.glp = new GermanLemmatizerProgram();
        this.excludePOS = new HashMap<String, Float>();
        // add the POS tags to be ignored (negative weight) or boosted
        // (positive weight), matching the POS tag set in use!
        // Regular expressions can be used.
        excludePOS.put("P.*", -5.0f);
        excludePOS.put("K.*", -5.0f);
        excludePOS.put("AP.*", -5.0f);
        excludePOS.put("CARD", -5.0f);
        excludePOS.put("ART", -5.0f);
        excludePOS.put("N.*", 10.0f);
        excludePOS.put("V.*", 5.0f);
        excludePOS.put("ADJ.*", 8.0f);
        excludePOS.put("ADV", 2.0f);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream filter = new StandardFilter(version, source);
        filter = new DelimitedPartOfSpeechFilter(filter, '_');
        DecompoundDictionaryLoader dcdl = new DecompoundDictionaryLoader();
        filter = new DictionaryCompoundWordTokenFilter(version, filter,
                dcdl.loadFromMap(engine.getMap(), version), 6, 4, 12, true);
        try {
            filter = new GermanLemmatizerFilter(filter, gl, glp);
        } catch (Exception e2) {
            e2.printStackTrace();
        }
        filter = new LowerCaseFilter(version, filter);
        try {
            filter = new SynonymFilter(filter, engine);
        } catch (Exception e) {
            e.printStackTrace();
        }
        filter = new POSFilterOut(version, filter, excludePOS);
        // if stemming is applied, it has to be applied after the steps above
        // filter = new GermanMinimalStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
}
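To check what the analysis chain actually emits (lemmas, split compounds, synonyms), you can run a POS-tagged sample sentence through the analyzer and print the tokens. This helper is not part of glp4lucene, just the standard Lucene TokenStream consumption idiom:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {

    // prints the tokens the analyzer emits for a POS-tagged input string
    public static void print(Analyzer analyzer, String taggedText) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(taggedText));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}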
The implementation is described here.