How to install a third-party Lucene analyzer into Ldi?

Help
2014-03-06
2014-03-11
  • Igor Kovalyov

    Igor Kovalyov - 2014-03-06

    I tested org.apache.lucene.analysis.ru.RussianAnalyzer.
    It supports searching for word forms whose base form is simply a truncated version of the word.
    But in Russian the base form is not always a truncation.
    For example, the base form of the word "vetra" (in Latin transliteration) is "veter", but when I search for "veter", Ldi does not find "vetra".
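
To make the "vetra"/"veter" problem concrete, here is a toy Java sketch contrasting suffix truncation with a dictionary lookup. It is not the code of either analyzer: the suffix list, the lemma table, and the class and method names are all hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: NOT how RussianAnalyzer or russianmorphology work internally.
final class StemVsLemma {
    // Hypothetical endings to strip, longest first.
    private static final String[] SUFFIXES = {"om", "a", "u"};

    // Hypothetical word-form dictionary mapping each form to its base form.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("veter", "veter");
        LEMMAS.put("vetra", "veter");
        LEMMAS.put("vetru", "veter");
        LEMMAS.put("vetrom", "veter");
    }

    // Truncation: drop the first matching ending, keeping at least three letters.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Dictionary lookup: return the stored base form, or the word itself if unknown.
    static String lemma(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        for (String form : new String[] {"veter", "vetra"}) {
            System.out.println(form + ": stem=" + stem(form) + ", lemma=" + lemma(form));
        }
    }
}
```

In this sketch truncation maps "vetra" to "vetr" but leaves "veter" unchanged, so the two forms never meet in the index, while the dictionary maps both to "veter".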

    It seems that the third-party product http://code.google.com/p/russianmorphology/
    supports Russian morphology better and is built for the Lucene 3.0 framework.

    Is it possible to install this product into Ldi?
    If so, can you tell me how to do it?
    If not, is it possible to make org.apache.lucene.analysis.ru.RussianAnalyzer smarter?
    Maybe it contains a dictionary of word forms that can be changed?

     
  • Marcelo F. Ochoa

    Igor:
    If you have a .jar file with the new analyzers compiled with Java 1.5 (preferably the JDK in the 11g/12c $ORACLE_HOME/jdk), you can simply load it with the loadjava tool.
    If you look at the build.xml file, all targets load their .jar files through an Ant macro that uses the loadjava tool.
    For example:

    oracle@pocho:/tmp$ loadjava -u LUCENE/LUCENE@test -v -g PUBLIC -s -r morph-1.0.jar
    arguments: '-u' 'LUCENE/***@test' '-v' '-g' 'PUBLIC' '-s' '-r' 'morph-1.0.jar'
    ....
    Classes Loaded: 10
    Resources Loaded: 3
    Sources Loaded: 0
    Published Interfaces: 0
    Classes generated: 0
    Classes skipped: 0
    Synonyms Created: 10
    Errors: 0

    oracle@pocho:/tmp$ loadjava -u LUCENE/LUCENE@test -v -g PUBLIC -s -r russian-1.0.jar
    arguments: '-u' 'LUCENE/***@test' '-v' '-g' 'PUBLIC' '-s' '-r' 'russian-1.0.jar'
    ...
    Classes Loaded: 4
    Resources Loaded: 4
    Sources Loaded: 0
    Published Interfaces: 0
    Classes generated: 0
    Classes skipped: 0
    Synonyms Created: 4
    Errors: 0
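
If you are unsure which Java version a third-party jar was compiled for, one way to check is to read the class-file header of one of its classes. The sketch below assumes you have extracted a .class file from the jar; the class name ClassVersionCheck and its method are hypothetical helpers, not part of Ldi or loadjava. In the class-file format, major version 49 corresponds to Java 5 (1.5) and 50 to Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reports the class-file major version of a compiled class.
final class ClassVersionCheck {
    // A class file starts with the magic number 0xCAFEBABE, followed by
    // a 2-byte minor version and a 2-byte major version (big-endian).
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default
        if (header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing class-file magic number");
        }
        header.getShort();                 // skip the minor version
        return header.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes;
        if (args.length > 0) {
            // Path to a .class file extracted from the jar.
            bytes = Files.readAllBytes(Paths.get(args[0]));
        } else {
            // Demo header for a class compiled for Java 5 (major version 49).
            bytes = new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49};
        }
        System.out.println("major version: " + majorVersion(bytes));
    }
}
```

A major version higher than the database JVM supports would be a likely reason for loadjava to report resolution errors, though this is only a quick check and not a substitute for the loadjava output itself.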

    Then, when you create the index, you select the analyzer with the Analyzer parameter, for example:

    create table test_ru (comments clob);
    insert into test_ru values ('Россия - самая большая страна в мире');
    insert into test_ru values ('Её общая площадь составляет около семнадцати миллионов квадратных километров');
    insert into test_ru values ('Она охватывает восточную часть Европы и северную часть Азии');
    insert into test_ru values ('Россия омывается тремя океанами: Атлантическим, Тихим и Северным Ледовитым');
    CREATE INDEX test_ru_idx ON test_ru(comments) INDEXTYPE IS "LUCENE"."LUCENEINDEX"
    PARAMETERS (' Analyzer:org.apache.lucene.morphology.russian.RussianAnalyzer:LogLevel:ALL');
    select lscore(1),comments from test_ru where lcontains(comments,'Россия',1) > 0;
    0.4721330106258392333984375 Россия - самая большая страна в мире
    0.4721330106258392333984375 Россия омывается тремя океанами: Атлантическим, Тихим и Северным Ледовитым

    Best regards, Marcelo.

     
  • Igor Kovalyov

    Igor Kovalyov - 2014-03-11

    Thank you, Marcelo.

    This third-party analyzer really does support Russian morphology better than org.apache.lucene.analysis.ru.RussianAnalyzer.

     
