JoBimText Wiki

Linking Language to Knowledge with Distributional Semantics

Status: Beta

Brought to you by: apanchenko, biem-tuda, coppolab, eugenso, and 4 others

jobimtext_programming

This page describes how to integrate the system at the source-code level. The examples use components from DKPro Core ASL, uimaFIT and OpenNLP for reading files, running the pipeline, tokenizing and dependency parsing.

Get the source code
Build an executable
Use the framework within Maven
The Holing System
- Write a new Holing System
- Transform Annotations to String
Get distributional similarities
Annotate target words using a UIMA pipeline
Contextualized Similarities
More Examples

Get the source code

1) Check out all projects from the trunk in SVN.
2) If you use Eclipse, the dependencies between the projects should work out of the box (the projects were created with Eclipse 4.2).

Build an executable

Ant is used to build all executables needed for a standalone version of the project. This step also packages each project into a separate jar file.

cd examples
ant dist.dependencies

This creates a folder jobimtext_pipeline_vXXXX in the examples project, which contains the scripts and a lib folder with the jars of all projects. To use this output, follow the description on the page [jobimtext_pipeline].

Use the framework within Maven

To use the jars within a Maven project, first compile all projects (run ant dist.dependencies within the examples project) and then execute the shell script addJarsToMaven.sh within the jobimtext_pipeline_..* folder. All projects are then added to the local repository and can be included in the POM. The artifact ids start with jobimtext, which makes the packages easier to find.

The Holing System

The holing system operates on the JoBim annotation. A JoBim annotation covers the key (the Jo, see [Holing_System]) it belongs to. Its central fields are the key (Jo) and an FSArray of values (the Bims).

Write a new Holing System

To write a new holing system, e.g. for trigrams where the first word is always kept as key (Jo) and the following words as values (Bims), we can simply write a UIMA annotator:

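import static org.apache.uima.fit.util.JCasUtil.select;

import java.util.LinkedList;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.cas.FSArray;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
// The uimaFIT imports use the Apache uimaFIT 2.x packages; older releases use org.uimafit.* instead.
// The import of the JoBim annotation type is omitted, as its package is not stated on this page.

public class TrigramAnnotator extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        LinkedList<Token> list = new LinkedList<Token>();
        for (Token t : select(jcas, Token.class)) {
            list.add(t);
            if (list.size() == 3) {
                addJobimTrigram(jcas, list);
                list.removeFirst();
            }
        }

        // Add two pseudo tokens so that the last two tokens also become keys
        for (int i = 0; i < 2; i++) {
            list.add(new Token(jcas));
            if (list.size() == 3) {
                addJobimTrigram(jcas, list);
                list.removeFirst();
            }
        }
    }

    // Creates a JoBim annotation that covers only the key (the first token of the trigram)
    public void addJobimTrigram(JCas jcas, LinkedList<Token> list) {
        Token key = list.getFirst();
        JoBim jb = new JoBim(jcas, key.getBegin(), key.getEnd());
        FSArray array = new FSArray(jcas, 2);
        for (int i = 1; i < list.size(); i++) {
            array.set(i - 1, list.get(i));
        }
        jb.setRelation("trigram");
        jb.setKey(key);
        jb.setHole(0);
        jb.setValues(array);
        jb.addToIndexes();
    }
}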

For the contextualization it is important that the JoBim annotation covers only the key. This has the advantage that the contextualization step can iterate over all relations of one key (Jo), e.g. with the uimaFIT method JCasUtil.selectCovered(JoBim.class, someToken).
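To make the effect of such an annotator concrete, here is a minimal plain-Java sketch of the same sliding-window logic on strings instead of UIMA annotations. The class TrigramHolingSketch, its Pair type and the <PAD> placeholder are hypothetical names used only for illustration; they are not part of JoBimText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Plain-Java sketch of the trigram holing operation (no UIMA types involved). */
class TrigramHolingSketch {

    /** Hypothetical stand-in for a JoBim annotation: one key (Jo) and its values (Bims). */
    static final class Pair {
        final String key;
        final List<String> values;

        Pair(String key, List<String> values) {
            this.key = key;
            this.values = values;
        }

        @Override
        public String toString() {
            return key + " -> trigram" + values;
        }
    }

    /** Placeholder for positions past the end of the text (an assumption of this sketch). */
    static final String PAD = "<PAD>";

    static List<Pair> hole(List<String> tokens) {
        List<Pair> result = new ArrayList<>();
        LinkedList<String> window = new LinkedList<>();
        for (String token : tokens) {
            window.add(token);
            if (window.size() == 3) {
                result.add(toPair(window));
                window.removeFirst();
            }
        }
        // Pad twice so that the last two tokens also become keys.
        for (int i = 0; i < 2; i++) {
            window.add(PAD);
            if (window.size() == 3) {
                result.add(toPair(window));
                window.removeFirst();
            }
        }
        return result;
    }

    private static Pair toPair(LinkedList<String> window) {
        // The first element is the key, the remaining two are the values.
        return new Pair(window.getFirst(), new ArrayList<>(window.subList(1, 3)));
    }

    public static void main(String[] args) {
        for (Pair p : hole(Arrays.asList("I", "gave", "a", "book"))) {
            System.out.println(p);
        }
    }
}
```

In this sketch every token becomes the key of exactly one pair, which mirrors the requirement that a JoBim annotation covers only its key.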

Transform Annotations to String

To turn annotations such as Tokens into strings, write a new class that extends the JobimAnnotationExtractor class from the jobimtext.holing project.

package jobimtext.holing.extractor;

import org.apache.uima.jcas.tcas.Annotation;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;

public class LemmaPos extends JobimAnnotationExtractor {
    public LemmaPos(JobimExtractorConfiguration configuration) {
        super(configuration);
    }

    public String extract(Annotation a) {
        Token t = (Token) a;
        String word;
        if (t.getLemma() != null) {
            word = t.getLemma().getValue();
        } else {
            word = t.getCoveredText();
        }
        if (t.getPos() != null) {
            return word + conf.attributeDelimiter + t.getPos().getPosValue();
        } else {
            return word + conf.attributeDelimiter + "UNK";
        }
    }
}

When extending this class, the method extract(Annotation) has to be overridden. It returns a string that contains all items necessary for calculating the distributional similarities. The example class LemmaPos returns the lemma of the token concatenated with its POS tag; it falls back to the covered text if no lemma is available and to UNK if no POS tag is available. The separators between attributes are specified in a configuration file, which corresponds to the class JobimExtractorConfiguration. The extractor class itself is also named in this configuration file, which could then look as shown below:

<jobimtext.holing.extractor.JobimExtractorConfiguration>
  <keyValuesDelimiter>  </keyValuesDelimiter>
  <attributeDelimiter>#</attributeDelimiter>
  <valueDelimiter>_</valueDelimiter>
  <extractorClassName>jobimtext.holing.extractor.LemmaPos</extractorClassName>
  <valueRelationPattern>$relation($values)</valueRelationPattern>
  <holeSymbol>@</holeSymbol>
</jobimtext.holing.extractor.JobimExtractorConfiguration>
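As a rough illustration of what these settings might produce, the following plain-Java sketch rebuilds the LemmaPos logic on strings and fills in the valueRelationPattern. How JoBimText actually interprets the pattern is not documented on this page, so formatValues is an assumption, and the class ExtractorFormatSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of how the configured delimiters could shape the extracted strings. */
class ExtractorFormatSketch {
    // Values copied from the example configuration.
    static final String ATTRIBUTE_DELIMITER = "#";
    static final String VALUE_DELIMITER = "_";
    static final String VALUE_RELATION_PATTERN = "$relation($values)";

    /** Mirrors LemmaPos.extract: lemma (or surface form) plus POS, or "UNK" if no POS is set. */
    static String lemmaPos(String lemma, String coveredText, String pos) {
        String word = (lemma != null) ? lemma : coveredText;
        return word + ATTRIBUTE_DELIMITER + (pos != null ? pos : "UNK");
    }

    /** Assumed reading of valueRelationPattern: insert the relation name and the joined values. */
    static String formatValues(String relation, List<String> values) {
        // String.replace works on literal text, so the '$' needs no escaping.
        return VALUE_RELATION_PATTERN
                .replace("$relation", relation)
                .replace("$values", String.join(VALUE_DELIMITER, values));
    }

    public static void main(String[] args) {
        String key = lemmaPos("give", "gave", "VBD");
        String values = formatValues("trigram",
                Arrays.asList(lemmaPos("a", "a", "DT"), lemmaPos("book", "book", "NN")));
        System.out.println(key + "\t" + values);
    }
}
```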

To run the holing operation that extracts dependencies from the MaltParser for the Hadoop step, we can use the following pipeline:

String path = "dir";
String pattern = "*.txt";
String outputFile = "output.txt";
String extractorConfigurationFile = "extractor.xml";
CollectionReader reader = createCollectionReader(TextReader.class,
        TextReader.PARAM_PATH, path,
        TextReader.PARAM_PATTERNS, new String[] { "[+]" + pattern },
        TextReader.PARAM_LANGUAGE, "en");
AnalysisEngine tokenizer = createPrimitive(OpenNlpSegmenter.class);
AnalysisEngine posTagger = createPrimitive(OpenNlpPosTagger.class);
AnalysisEngine lemma = createPrimitive(LemmatizeAnnotator.class);
AnalysisEngine parser = createPrimitive(MaltParser.class);
AnalysisEngine dep = createPrimitive(DependencyAnnotator.class);
AnalysisEngine out = createPrimitive(HadoopOutputRelations.class,
        HadoopOutputRelations.PARAM_EXTRACTOR_CONFIGURATION_FILE,
        extractorConfigurationFile,
        HadoopOutputRelations.PARAM_OUTPUT_FILE, outputFile);
SimplePipeline.runPipeline(reader, tokenizer, posTagger, lemma, parser, dep, out);

Get distributional similarities

To get lexical expansions without any UIMA components, one can use the following classes:

DCAThesaurus: uses the DCA memory-cached database server and returns the types of this server
DCAThesaurusDatastructure: uses the DCA memory-cached database server, but returns lists of entry objects, each holding a similar word and its score
DatabaseThesaurusDatastructure: can be used in combination with a MySQL server; to use it, a MySQL Java connector has to be added to the project

All these classes implement, in principle, the IThesaurus interface, which defines the following methods:

ORDER2LIST getExpansions(KEY key): get lexical expansions, according to a key (Jo)
ORDER2LIST getExpansions(KEY key, int N): get N lexical expansions, according to a key (Jo)
ORDER2LIST getExpansions(KEY key, double threshold): get lexical expansions with a score higher than the threshold
Long getKeyCount(KEY key): returns how often key (Jo) occurs
Long getValuesCount(VALUES value): returns how often value (Bim) occurs
Double getKeyValueScore(KEY key, VALUES val): returns the significance score between the key (Jo) and the value (Bim)
ORDER1LIST getKeyValuesScores(KEY key): returns a list with all values according to one key with their significance score
ORDER1LIST getKeyValuesScores(KEY key, int N): returns a list with the top N values according to one key with their significance score
ORDER1LIST getKeyValuesScores(KEY key, double threshold): returns a list with all values according to one key with their significance score filtered by a threshold
boolean connect(): connect to the resource
void destroy(): release all used resources
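The three getExpansions variants differ only in how they cut the ranked list. The following in-memory sketch illustrates that difference; ToyThesaurusSketch, its Entry type and the scores are made up for illustration and do not reflect the real implementations or their return types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory thesaurus that mimics the three getExpansions variants. */
class ToyThesaurusSketch {

    /** Hypothetical entry type: a similar word and its score. */
    static final class Entry {
        final String key;
        final double score;

        Entry(String key, double score) {
            this.key = key;
            this.score = score;
        }
    }

    // Expansions per key, stored in descending score order.
    private final Map<String, List<Entry>> expansions = new HashMap<>();

    void put(String key, Entry... sortedEntries) {
        expansions.put(key, Arrays.asList(sortedEntries));
    }

    /** All expansions of the key. */
    List<Entry> getExpansions(String key) {
        return expansions.getOrDefault(key, Collections.<Entry>emptyList());
    }

    /** The top n expansions of the key. */
    List<Entry> getExpansions(String key, int n) {
        List<Entry> all = getExpansions(key);
        return new ArrayList<>(all.subList(0, Math.max(0, Math.min(n, all.size()))));
    }

    /** All expansions with a score strictly higher than the threshold. */
    List<Entry> getExpansions(String key, double threshold) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : getExpansions(key)) {
            if (e.score > threshold) {
                result.add(e);
            }
        }
        return result;
    }

    /** Builds a thesaurus with invented example data. */
    static ToyThesaurusSketch demo() {
        ToyThesaurusSketch t = new ToyThesaurusSketch();
        t.put("give#VB", new Entry("offer#VB", 0.8), new Entry("provide#VB", 0.5),
                new Entry("hand#VB", 0.2));
        return t;
    }

    public static void main(String[] args) {
        for (Entry e : demo().getExpansions("give#VB", 2)) {
            System.out.println(e.key + "\t" + e.score);
        }
    }
}
```

Note that an int argument selects the top-N variant while a double argument selects the threshold variant, so getExpansions(key, 2) and getExpansions(key, 2.0) behave differently in this sketch.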

Both DCA thesauri expect a configuration file, as specified by the DCA server, and an XML file that names the tables for the JoBimText mapping. An example of the table XML file is shown on [jobimtext_pipeline]. As a simpler alternative to IThesaurus, the interface IThesaurusDatastructure returns lists of Order1/Order2 objects. To use this interface, only the types of Jo and Bim have to be specified.

DCAThesaurusOrder2

String dbConfigurationFile= "dcaserver_config";
String dbTablesFile= "dbTables.xml";
IThesaurusDatastructure<String, String> thesaurus = new DCAThesaurusOrder2(dbConfigurationFile, dbTablesFile);
thesaurus.connect();
List<Order2>  exps= thesaurus.getExpansions("give#NN");
for(Order2 exp:exps){
    System.out.println(exp.key+"\t"+exp.score);
}
thesaurus.destroy();

DCAThesaurus

String dbConfigurationFile= "dcaserver_config";
String dbTablesFile= "dbTables.xml";
DCAThesaurus dcaThesaurus = new DCAThesaurus(dbConfigurationFile, dbTablesFile);
dcaThesaurus.connect();
ContentValue_Table  exps= dcaThesaurus.getExpansions("give#NN");
for(String key:exps.keySet()){
    System.out.println(key+"\t"+exps.get(key));
}
dcaThesaurus.destroy();

DatabaseThesaurus

The database thesaurus is optimized for MySQL databases.
First, the data needs to be imported into the database. We use the following data structure:

CREATE DATABASE database;
USE database;

CREATE TABLE `word_count` (
  `word` varchar(150) DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  KEY `f` (`word`)
);

CREATE TABLE `feature_count` (
  `feature` varchar(150) DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  KEY `f` (`feature`)
);

CREATE TABLE `LMI_s_0_t_0` (
  `word` varchar(150) DEFAULT NULL,
  `feature` varchar(150) DEFAULT NULL,
  `sig` double DEFAULT NULL,
  KEY `w` (`word`),
  KEY `f` (`feature`)
);

CREATE TABLE `LMI_s_0_t_0_p_1000_l_200_simsort` (
  `word1` varchar(150) DEFAULT NULL,
  `word2` varchar(150) DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  KEY `w1` (`word1`),
  KEY `w2` (`word2`)
);

The data can then be imported using the LOAD DATA INFILE command:

LOAD DATA INFILE 'data.txt' INTO TABLE database.my_table;
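By default, LOAD DATA INFILE reads one row per line with tab-separated columns. The following sketch writes a file in that layout for the word_count table. The class name, the sample counts and the temporary file are made up for illustration, and words containing tabs, newlines or backslashes would need escaping, which is not handled here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Writes word counts as tab-separated lines, the default layout read by LOAD DATA INFILE. */
class WordCountExportSketch {

    /** One row of the word_count table: word, tab, count. */
    static String toRow(String word, int count) {
        return word + "\t" + count;
    }

    static List<String> toRows(Map<String, Integer> counts) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows.add(toRow(e.getKey(), e.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample data.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("give#VB", 42);
        counts.put("book#NN", 17);

        Path out = Files.createTempFile("word_count", ".txt");
        Files.write(out, toRows(counts), StandardCharsets.UTF_8);
        System.out.println("Wrote " + out);
    }
}
```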

The settings of the server and the tables are specified in a configuration file:

<jobimtext.util.db.conf.SqlThesaurusConfiguration>
  <data>
    <tableOrder2>LMI_s_0_t_0_p_1000_l_200_simsort</tableOrder2>
    <tableOrder1>LMI_s_0_t_0</tableOrder1>
    <tableValues>feature_count</tableValues>
    <tableKey>word_count</tableKey>
  </data>
  <dbUser>user</dbUser>
  <dbPassword>password</dbPassword>
  <dbUrl>jdbc:mysql://server:3306/database</dbUrl>
  <jdbcString>com.mysql.jdbc.Driver</jdbcString>
</jobimtext.util.db.conf.SqlThesaurusConfiguration>

It can then be used similarly to the DCAThesaurusOrder2:

String dbConf= "db_conf.xml";
IThesaurusDatastructure<String, String> dbThesaurus = new DatabaseThesaurus(dbConf);
dbThesaurus.connect();
List<Order2>  exps= dbThesaurus.getExpansions("give#NN");
for(Order2 exp:exps){
    System.out.println(exp.key+"\t"+exp.score);
}
dbThesaurus.destroy();

Annotate target words using a UIMA pipeline

In the last section, the key had to be written in the form defined by the extractor method. The DistributionalThesaurusAnnotator can instead be used to annotate Tokens/Annotations with the expansions retrieved from the distributional thesaurus. To instantiate it, we first need to instantiate two external resources: one for the connection to the database and one for the mapping between the annotation and the string stored in the database:

String extractor = "extractor.xml";
String dbConfigurationFile= "dcaserver_config";
String dbTablesFile= "dbTables.xml";

ExternalResourceDescription extDTSimple = ExternalResourceFactory
        .createExternalResourceDescription(DCAThesaurusDatastructure.class,
                DCAThesaurusDatastructure.PARAM_DB_CONFIGURATION_FILE, dbConfigurationFile,
                DCAThesaurusDatastructure.PARAM_DB_TABLES_FILE,dbTablesFile
                );

ExternalResourceDescription extAnnotationThesaurus = ExternalResourceFactory
        .createExternalResourceDescription(
                UimaAnnotationThesaurus.class,
                UimaAnnotationThesaurus.PARAM_STRING_THESAURUS,
                extDTSimple,
                UimaAnnotationThesaurus.PARAM_EXTRACTOR_CONFIGURATION_FILE,
                extractor);

We can then instantiate the DistributionalThesaurusAnnotator, that can be used in a pipeline:

AnalysisEngineDescription dtAnnotator = AnalysisEngineFactory.createPrimitiveDescription(
        DistributionalThesaurusAnnotator.class,
        DistributionalThesaurusAnnotator.PARAM_TARGET_ANNOTATION,
        Target.class,
        DistributionalThesaurusAnnotator.PARAM_KEY_ANNOTATION,
        Token.class,
        DistributionalThesaurusAnnotator.PARAM_MODEL_THESAURUS,
        extAnnotationThesaurus,
        DistributionalThesaurusAnnotator.PARAM_NUMBER_OF_TOP_ENTRIES,
        100);

In this example snippet, we want to annotate only selected Tokens. This can be done by marking the target tokens with a new annotation, here named Target. The DistributionalThesaurusAnnotator then searches for the Tokens (PARAM_KEY_ANNOTATION) covered by each Target and expands them. Additionally, the thesaurus has to be given via the parameter PARAM_MODEL_THESAURUS.
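The notion of "covered" tokens can be sketched without UIMA as a simple offset comparison. The class CoveredTokenSketch and its Span type are hypothetical stand-ins for annotations with begin and end offsets; the real annotator works on the CAS indexes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of selecting the key tokens that lie inside a target annotation. */
class CoveredTokenSketch {

    /** Hypothetical stand-in for an annotation with character offsets. */
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Returns the tokens whose offsets lie completely inside the target span. */
    static List<Span> selectCovered(List<Span> tokens, Span target) {
        List<Span> covered = new ArrayList<>();
        for (Span token : tokens) {
            if (token.begin >= target.begin && token.end <= target.end) {
                covered.add(token);
            }
        }
        return covered;
    }

    /** Tokens of the sentence "I gave a book" with their character offsets. */
    static List<Span> demoTokens() {
        return Arrays.asList(new Span(0, 1, "I"), new Span(2, 6, "gave"),
                new Span(7, 8, "a"), new Span(9, 13, "book"));
    }

    public static void main(String[] args) {
        Span target = new Span(2, 6, "gave");
        for (Span token : selectCovered(demoTokens(), target)) {
            // Only these tokens would be looked up in the thesaurus.
            System.out.println("expand: " + token.text);
        }
    }
}
```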

Contextualized Similarities

The annotator for contextualized similarities works similarly to the UIMA annotator for the distributional similarities. First, the external resources for the distributional thesaurus and the contextualizer have to be instantiated:

String dbConf = "src/test/resources/conf_db.xml";
String extractor = "src/test/resources/extractor.xml";
String folder = "src/test/resources/";

// dbConfigurationFile and dbTablesFile are defined as in the previous section
ExternalResourceDescription extThesaurus = ExternalResourceFactory
        .createExternalResourceDescription(DCAThesaurusDatastructure.class,
                DCAThesaurusDatastructure.PARAM_DB_CONFIGURATION_FILE, dbConfigurationFile,
                DCAThesaurusDatastructure.PARAM_DB_TABLES_FILE, dbTablesFile);

ExternalResourceDescription extCT = ExternalResourceFactory
        .createExternalResourceDescription(
                SimpleContextualizer.class,
                SimpleContextualizer.PARAM_EXTRACTOR_CONFIGURATION_FILE, extractor,
                SimpleContextualizer.PARAM_MODEL_THESAURUS, extThesaurus);

Then the annotator can be instantiated. For the contextualization (CT), we define the boundaries of the context (e.g. sentence, paragraph) and can also define the target annotation, which is first expanded using the distributional thesaurus and then contextualized. In the example below, we expand all Tokens and use features that lie within the same sentence.

AnalysisEngine context = createPrimitive(ContextAnnotator.class,
        ContextAnnotator.PARAM_CONTEXTUALIZER, extCT,
        ContextAnnotator.PARAM_TARGET_ANNOTATION_CLASS_NAME,
        Token.class,
        ContextAnnotator.PARAM_CONTEXT_ANNOTATION_CLASS_NAME,
        Sentence.class);
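The idea of contextualization can be illustrated with a toy re-ranking: each expansion of the target is scored by the significance of the features it shares with the context, and the expansions are sorted by that score. This is only one plausible scoring and not necessarily what SimpleContextualizer does; the class ContextualizerSketch, the feature strings and the scores are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy re-ranking of expansions by the context features they share with the sentence. */
class ContextualizerSketch {

    /**
     * @param expansions      expansions of the target, in thesaurus order
     * @param featureScores   expansion -> (feature -> significance score)
     * @param contextFeatures features observed for the target within the context boundary
     */
    static List<String> rank(List<String> expansions,
                             Map<String, Map<String, Double>> featureScores,
                             Set<String> contextFeatures) {
        final Map<String, Double> total = new HashMap<>();
        for (String expansion : expansions) {
            Map<String, Double> features =
                    featureScores.getOrDefault(expansion, Collections.<String, Double>emptyMap());
            double sum = 0.0;
            for (String feature : contextFeatures) {
                sum += features.getOrDefault(feature, 0.0);
            }
            total.put(expansion, sum);
        }
        List<String> ranked = new ArrayList<>(expansions);
        // List.sort is stable, so ties keep their original thesaurus order.
        ranked.sort((a, b) -> Double.compare(total.get(b), total.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        // Invented data: expansions of "cold#NN" in the sentence "I caught a cold".
        Map<String, Map<String, Double>> featureScores = new HashMap<>();
        featureScores.put("flu#NN", Collections.singletonMap("dobj(catch#VB,@)", 5.0));
        featureScores.put("winter#NN", Collections.singletonMap("amod(@,harsh#JJ)", 3.0));

        List<String> ranked = rank(Arrays.asList("winter#NN", "flu#NN"), featureScores,
                Collections.singleton("dobj(catch#VB,@)"));
        System.out.println(ranked);
    }
}
```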

More Examples

More examples can be found in the project jobimtext.example.

Wiki: Holing_System
Wiki: Home
Wiki: jobimtext_pipeline