CAUTION: For versions before jobimtext_pipeline_0.0.6 use the documentation found here.
This page describes how to use the JoBimText project to compute distributional or contextualized similarities for your own project. In this example we assume that you work on the same computer the Hadoop server is running on. The components used build on DKPro Core, uimaFIT and OpenNLP.
The files needed for this tutorial can be downloaded from the download section and are contained in the archive jobimtext_pipeline_vXXX.tar.gz. For the feature extraction we also need text files; their format should be plain text. A corpus of Web data and a corpus of sentences extracted from English Wikipedia are available here and start with the prefix dataset_. We advise splitting the files, so UIMA does not have to keep a complete file in memory.
This can be done using the split command on Linux:
split news10M splitted/news10M-part-
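If the resulting chunks are still too large for UIMA, the number of lines per chunk can be given explicitly; the value below is only an illustrative choice and should be adapted to the available memory:
split -l 1000000 news10M splitted/news10M-part-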
Then the holing system can be started to extract the features. This can be done using the shell script holing_operation.sh from the download section, or by executing the jobimtext.example.holing.HolingHadoop class in the Maven SVN project jobimtext/jobimtext.example. The script can be executed as follows:
sh holing_operation.sh path pattern output extractor_configuration holing_system_name
and has the following parameters:
path: path of the input files
pattern: pattern the files match that should be processed (e.g. *.txt for all txt files)
output: name of the output file
extractor_configuration: file that contains all information needed for the output format of Keys and Features
holing_system_name: Ngram[hole_position,ngram] or MaltParser
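As an illustration, a call that processes the split files from above with the MaltParser holing system could look as follows (the extractor file name is the one used in the complete example at the end of this page):
sh holing_operation.sh splitted "news10M-part-*" news10M_hadoop_input extractor_standard.xml MaltParser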
An example of the extractor_configuration file is shown below:
<jobimtext.holing.extractor.JobimExtractorConfiguration>
  <keyValuesDelimiter> </keyValuesDelimiter>
  <extractorClassName>jobimtext.holing.extractor.TokenExtractors$LemmaPos</extractorClassName>
  <attributeDelimiter>#</attributeDelimiter>
  <valueDelimiter>_</valueDelimiter>
  <valueRelationPattern>$relation($values)</valueRelationPattern>
  <holeSymbol>@</holeSymbol>
</jobimtext.holing.extractor.JobimExtractorConfiguration>
With this configuration file, the output of the holing system is a list of key and context feature pairs, separated by the keyValuesDelimiter (here a tab). The element extractorClassName specifies how an entry is built; in this case the lemma and the POS tag of a word are used and concatenated with a hash sign (#), as defined by attributeDelimiter. The name of the relation and the context features are combined according to the valueRelationPattern pattern. Running the holing system with the MaltParser and the extractor file introduced above on the sentence "I gave the book to the girl" leads to the following result:
I#PRP	-nsubj(@_give#VB)
give#VB	nsubj(@_I#PRP)
give#VB	prep(@_to#TO)
give#VB	dobj(@_book#NN)
give#VB	punct(@_.#.)
the#DT	-det(@_book#NN)
book#NN	-dobj(@_give#VB)
book#NN	det(@_the#DT)
to#TO	pobj(@_girl#NN)
to#TO	-prep(@_give#VB)
the#DT	-det(@_girl#NN)
girl#NN	det(@_the#DT)
girl#NN	-pobj(@_to#TO)
.#.	-punct(@_give#VB)
One can observe that the tokens are lemmatized and that the POS tags are concatenated to the lemma of the token using the hash sign.
Afterwards the file should be split again and then transferred to the distributed file system (HDFS) of the MapReduce server:
split -a 5 -d news10M_hadoop_input splitted/news10M_maltdependency_part-
hadoop dfs -copyFromLocal splitted news10M_maltdependency
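To check that the upload succeeded, the files on HDFS can be listed with the standard Hadoop command:
hadoop dfs -ls news10M_maltdependency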
The execution pipeline for the MapReduce jobs can be generated using the script generateHadoopScript.py with the following parameters:
generateHadoopScript.py dataset wc s t p significance simsort_count [computer file_prefix]
with
dataset: name of the holing output on HDFS (e.g. news10M_maltdependency)
wc: parameter passed to the PruneFeaturesPerWord step
s: threshold parameter s passed to the significance computation (FreqSig)
t: threshold parameter t passed to the significance computation (FreqSig)
p: pruning parameter passed to the PruneGraph step
significance: significance measure to use (e.g. LL for log-likelihood)
simsort_count: maximal number of similar terms kept per word in the SimSort step
computer, file_prefix: optional; machine and folder prefix to which the results are copied via ssh
For example, the command
python generateHadoopScript.py news10M_maltdependency 1000 0 0 1000 LL 200 desktop_computer dt/
will lead to the output file named news10M_maltdependency_s0_t0_p1000_LL_simsort200 with the following content:
#hadoop dfs -rmr context_out news10M_maltdependency__WordFeatureCount
#hadoop dfs -rmr wordcount_out news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount
#hadoop dfs -rmr featurecount_out news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount
#hadoop dfs -rmr freqsig_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0
#hadoop dfs -rmr context_filter_out news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount
#hadoop dfs -rmr prunegraph_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000
#hadoop dfs -rmr aggregate_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt
#hadoop dfs -rmr simcount_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures
#hadoop dfs -rmr simsort_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.WordFeatureCount news10M_maltdependency news10M_maltdependency__WordFeatureCount True
pig -param contextout=news10M_maltdependency__WordFeatureCount -param out=news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount -param wc=1000 pig/PruneFeaturesPerWord.pig
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.FeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount True
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.WordCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount True
pig -param s=0 -param t=0 -param wordcountout=news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount -param featurecountout=news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount -param contextout=news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount -param freqsigout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 pig/FreqSigLL.pig
pig -param p=1000 -param freqsigout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 -param prunegraphout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000 pig/PruneGraph.pig
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.AggrPerFt news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000 news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt True
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.SimCounts1WithFeatures news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures True
pig -param limit=200 -param IN=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures -param OUT=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 pig/SimSort.pig
ssh desktop_computer 'mkdir -p dt '
hadoop dfs -text news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount/p* | ssh desktop_computer 'cat -> dt/news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount '
hadoop dfs -text news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0/p* | ssh desktop_computer 'cat -> dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 '
hadoop dfs -text news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount/p* | ssh desktop_computer 'cat -> dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount '
hadoop dfs -text news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200/p* | ssh desktop_computer 'cat -> dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 '
The first lines of the script are commented out; they can be used to delete previous output files from the server. After executing the script we have to wait until all Hadoop jobs have finished. The result files are then copied to the specified computer into the folder given by the prefix.
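The generated file is itself a shell script and is executed on the machine that has access to the Hadoop cluster, in this example:
sh news10M_maltdependency_s0_t0_p1000_LL_simsort200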
Currently we support two database backends: MySQL and DCA, a memory-based data server provided within this project. Here we will only describe the DCA server. The configuration files for the DCA are generated using the create_db_dca.sh script:
sh create_db_dca.sh folder prefix database_server
folder: folder where all the files from the Hadoop step are located
prefix: prefix for the files (e.g. news10M_maltparser)
database_server: name of the server where the database runs
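For the news10M example from above, a call could look as follows (the folder and server name are placeholders for your setup):
sh create_db_dca.sh dt/ news10M_maltdependency server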
This command creates two files, PREFIX_dcaserver and PREFIX_dcaserver_tables.xml, with the following content:
<jobimtext.util.db.conf.DatabaseTableConfiguration>
  <tableOrder2>subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200</tableOrder2>
  <tableOrder1>subset_wikipedia-maltparser__FreqSigLL_s_0_t_0</tableOrder1>
  <tableValues>subset_wikipedia-maltparser__FeatureCount</tableValues>
  <tableKey>subset_wikipedia-maltparser__WordCount</tableKey>
</jobimtext.util.db.conf.DatabaseTableConfiguration>
and
# TableID ValType TCPP# TableLines CacheSize MaxValues DataAllocation InputFileNames/Dir FileFilter
subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 TABLE 8080 0 10000 10000 server[0-19228967] /home/user/data/out/dt/subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 NONE
subset_wikipedia-maltparser__FreqSigLL_s_0_t_0 TABLE 8081 0 10000 10000 server[0-19228967] /home/user/data/out/dt/subset_wikipedia-maltparser__FreqSigLL_s_0_t_0 NONE
subset_wikipedia-maltparser__FeatureCount SCORE 8082 0 10000 10000 server[0-19228967] /home/user/data/out/dt/subset_wikipedia-maltparser__FeatureCount NONE
subset_wikipedia-maltparser__WordCount SCORE 8083 0 10000 10000 server[0-19228967] /home/user/data/out/dt/subset_wikipedia-maltparser__WordCount NONE
Further details for the DCA server are given in the README file within the DCA project in the Subversion repository. The server can then be started with the PREFIX_dcaserver configuration file using the following command:
java -Xmx... -Xms.... -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server PREFIX_dcaserver
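A concrete invocation for the news10M example, assuming 3 GB of heap as in the complete example below, would be:
java -Xmx3g -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server news10M_maltdependency_dcaserver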
If the computer and the folder were also specified when generating the Hadoop script in the previous step, the data is already available locally.
Once all data is loaded into the database, we can use the script apply_dt_ct.sh to get expansions of words for new documents.
The script has the following parameters:
--------------------------------------------------------------------------------
sh apply_dt_ct.sh path pattern holing_system_name extractor_configuration database_configuration database_tables
--------------------------------------------------------------------------------
path: path of the files (also zip files can be used, e.g. jar:file:/dir/file.zip!)
pattern: pattern the files match that should be expanded (e.g. *.txt for all txt files)
extractor_configuration: file that contains all information needed for the output format of Keys and Features
holing_system_name: Ngram[hole_position,ngram] or MaltParser (default)
database_configuration: configuration file needed for the DCA server
database_tables: configuration file for the Java software, specifying the table names
targetword: if true, the target word has to be encapsulated as <target>word</target>; otherwise every word will be expanded (default value: true)
--------------------------------------------------------------------------------
The input format of the files can be plain text when expanding all words; in that case the parameter targetword should be set to false. When expanding only selected words, these should be encapsulated as <target>target_word</target>.
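As an illustration, an input file for targeted expansion could contain a line such as the following (hypothetical content), so that only the word "book" is expanded:
I gave the <target>book</target> to the girl.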
Here we show an example that executes all steps, where everything (including the Hadoop server) runs on one system and the MaltParser is used. The number of lines into which the files are split should probably be adjusted to the dataset used.
FILEDIR=/home/user/data
FILE=textfile
OUTPUT=/home/user/data/out
DB_SERVER=server
EXTRACTOR=extractor_standard.xml
HOLINGSYSTEM=MaltParser
HOLINGNAME=maltparser

#Holing Operation
mkdir -p $OUTPUT
mkdir -p $OUTPUT/splitted/
split $FILEDIR/$FILE $OUTPUT/splitted/$FILE
sh holing_operation.sh $OUTPUT/splitted "$FILE*" $OUTPUT/$FILE-$HOLINGNAME $EXTRACTOR $HOLINGSYSTEM
mkdir $OUTPUT/$FILE-$HOLINGNAME-splitted/

#Compute distributional similarity
split -a 5 -l 2500000 -d $OUTPUT/$FILE-$HOLINGNAME $OUTPUT/$FILE-$HOLINGNAME-splitted/part-
hadoop dfs -copyFromLocal $OUTPUT/$FILE-$HOLINGNAME-splitted $FILE-$HOLINGNAME
mkdir $OUTPUT/dt/
python generateHadoopScript.py $FILE-$HOLINGNAME 1000 0 0 1000 LL 200 localhost $OUTPUT/dt/
sh $FILE-$HOLINGNAME"_s0_t0_p1000_LL_simsort200"
#Load and start databaseserver
sh create_db_dca.sh $OUTPUT/dt/ $FILE $DB_SERVER
java -Xmx3g -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server $FILE"_dcaserver"
APPLY_FOLDER=./
APPLY_FILE=test.txt

#start dt and ct on file
sh apply_dt_ct.sh $APPLY_FOLDER $APPLY_FILE $EXTRACTOR $HOLINGSYSTEM $FILE"_dcaserver" $FILE"_dcaserver_tables.xml"
Where to look (which path) for the "holing_operation.sh" file?
Dear Amrit,
the holing_operation.sh script was used in a previous JoBimText version. You can still use it if you download the following archive:
http://sourceforge.net/projects/jobimtextgpl.jobimtext.p/files/jobimtext_demo_stanford-0.0.4.zip/download
Hi,
"sh holing_operation.sh ../splitted/ * output.txt extractor_relation.xml MaltParser"
I am running this command in "jobimtext_demo_stanford-0.0.4"
But I am getting this error :
Mar 23, 2018 12:06:11 PM org.uimafit.util.ExtendedLogger info(255)
INFO: Found [0] resources to be read
Holing System (conf_mysql_np_local.xml) not available. Available systems: Suffix, MaltParser, Ngram,
Hi Amrit,
the problem in the command is the asterisk (*) without quotes. Running the command as follows should work:
sh holing_operation.sh ../splitted/ "*" output.txt extractor_relation.xml MaltParser
Best,
Martin
Hi,
"http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/sense-labelling/sense-labelling-v-0-1-0-0-1-2/"
I am trying to implement sense labelling using the documentation of the above given link.
"java -cp lib/org.jobimtext.pattamaika-0.1.2.jar org.jobimtext.pattamaika.SenseLabeller pattern.txt sense.txt output.txt 0"
I got the following error :
"Mar 23, 2018 5:42:21 PM org.jobimtext.pattamaika.SenseLabeller main
INFO: Performing Sense Labelling..
Mar 23, 2018 5:42:21 PM org.jobimtext.pattamaika.SenseLabeller appendScore
INFO: Pattern file read, applying to Sense Clusters
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at org.jobimtext.pattamaika.SenseLabeller.appendScore(SenseLabeller.java:107)
at org.jobimtext.pattamaika.SenseLabeller.appendScore(SenseLabeller.java:79)
at org.jobimtext.pattamaika.SenseLabeller.main(SenseLabeller.java:41) "
I don't know where I am wrong. Any help would be appreciated.
I am looking for more detailed documentation on sense labelling.
Is there any minimum requirement on the number of lines in the data files "pattern.txt" and "sense.txt"?
I just copied the sentences from the example files present in the documentation (the link provided above).
Hi Amrit,
if you want more recent documentation, you can find it in the slide decks of our tutorial:
https://sites.google.com/site/jobimtexttutorial/resources
There is a full example of all steps (with some Hadoop VM). You can execute most commands if you have a Hadoop cluster, using the most recent source code from SourceForge.
regarding your issues:
there seems to be some issue with your patterns.txt and senses.txt file. Check the following:
senses.txt: the information is separated by tab
pattern.txt: the pattern (e.g. dog ISA animal) is separated by whitespaces and the "pattern" and the score are separated by tab
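To illustrate the two formats, hypothetical lines could look as follows (<TAB> marks a tab character; the score and the column layout of the sense file are only assumptions based on the description above):
pattern.txt: dog ISA animal<TAB>42.0
senses.txt: mouse<TAB>0<TAB>cat, dog, rat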
Regarding the size:
Best is to have various heads in the patterns (e.g. dog for the example above) for each word in the sense file (i.e. for the words that define the sense, e.g. "cat, dog, rat" for sense 0 of the word mouse). Normally, you compute the patterns from large amounts of text. Here you can download some patterns (in a slightly different format):
http://tudarmstadt-lt.github.io/taxi/
Best,
Martin
Hi Martin,
Thanks for your help; now I am able to resolve all my errors.
Now I have got a sense cluster file. And for sense labelling, we require a sense cluster file as well as a pattern file.
"http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/sense-clustering/"
From the above site, I got the output as a sense cluster file.
Now for sense labelling "http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/sense-labelling/"
"java -cp path/to/org.jobimtext.pattamaika-*.jar org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s sense-cluster-file -o output-file [optional parameters]"
we require a pattern file now. And you shared a link in our previous conversation : "http://tudarmstadt-lt.github.io/taxi/"
Will this site help me provide all types of pattern?
Thanks.
that's great news!
For the pattern file I would use one of the English General Domain, e.g.:
http://panchenko.me/data/joint/taxi/res/resources/en_pm.csv.gz
Of course it will not contain ALL types of patterns, but I guess it might contain enough patterns to have a generally good coverage.
Please also check that the format is correct (see post above).
Best,
Martin
Hi Martin,
First of all thanks for all the assistance provided by you.
Dear Amrit,
I assume you are getting this error, as the dt-file is compressed. You need to decompress the wikipedia_stanford*.gz file (gunzip wikipedia...) and then start the command again. This will generate the different senses for each word.
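For example (the exact file name depends on the DT you downloaded):
gunzip wikipedia_stanford*.gz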
What is the purpose of "generating a clustered file" for a normal set of sentences?
If you want to compute the senses for a document collection, you have to compute a DT and then use this DT for the sense computation with Chinese Whispers.
Best,
Martin
Hi Martin,
Thanks. I was looking for how to get the dt file and I got a link "http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/calculate-a-distributional-thesaurus-dt/"
But this documentation requires "bigram_holing.sh".
Dear Amrit,
for the computation, you would require a Hadoop cluster. Furthermore, I would advise using the more recent documentation from the KONVENS tutorial:
https://sites.google.com/site/konvens2016jobimtexttutorial/
Furthermore, for Hadoop computations you do not need the virtual machine as described (that is just for testing), but only the Hadoop cluster, and you might also use the recent JoBimText version:
https://sourceforge.net/projects/jobimtext/files/jobimtext_pipeline_0.1.2.tar.gz/download
Best,
Martin
Hi Martin,
I followed the exact same documentation cited by you to get the dt file.
Documentation link : https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxrb252ZW5zMjAxNmpvYmltdGV4dHR1dG9yaWFsfGd4OjUzOTgzMjlmMThiMDVmNGM
Update 1: /slf4j-api-.jar: I added this file at the corresponding path and that error is not showing now. But the files I got are still blank files.
Thanks.
Amrit
Last edit: AMRIT BHASKAR 2018-05-10
Hi Amrit,
sorry for the late response. Which commands did you execute? And did you try to run the software using the VM, or do you have a Hadoop cluster? And which input data did you use?
Best,
Martin
Last edit: Martin Riedl 2018-05-29