Thank you for this great tool!
I have been using the first version in five languages (fr, de, en, es, ko) and it worked great!
Now I want to update to the new version 1.2, and I have it running for English.
However, the annotation (wikify service) of my local service performs much worse than your online service.
I used the data you provided and the following settings. I used the Weka models provided with version 1.2 and set the text processor to PorterStemmer.
Could you check whether I am using the Weka models correctly?
Should I use the PorterStemmer, use another stemmer, or leave this field blank?
What settings does your online wikify service use?
I am primarily interested in the wikify service and would like to increase its performance.
Do you have any suggestions for how I could improve the annotation performance?
<!-- This configuration file specifies properties for a single wikipedia dump -->
<!-- MANDATORY: The language code of this wikipedia version (e.g. en, de, simple). -->
<!-- MANDATORY: A directory containing a complete berkeley database. -->
<!-- A directory containing csv files extracted from a wikipedia dump. Caching will be faster if these are available. -->
The full path to a text processor (a class that implements org.wikipedia.miner.util.text.TextProcessor)
responsible for resolving minor variations in labels, such as capitalization, punctuation and pluralism.
All labels will be created using this text processor unless otherwise specified. If the 'label' database
is cached to memory, it will be indexed using labels that have been prepared with this text processor.
The minimum number of links a page or label must receive before it will be cached to memory.
This can be used to exclude obscure or unpopular pages and labels.
This param is ignored unless the 'page' or 'label' database is cached to memory.
The probability of a sense is the number of links that use a particular label as an anchor, and
point to the sense concept, over the number of links made with the label as an anchor.
This param specifies the minimum proportion of links each sense must have before it will be cached
to memory and can be used to exclude unlikely senses.
This param is ignored unless the 'label' database is cached to memory.
The link probability of a label is the number of documents in which it is used as a link, over the
number of documents it is mentioned in at all.
This param specifies the minimum link probability a label must have before it will be cached to memory,
and can be used to exclude labels that are unlikely to refer to useful concepts.
This param is ignored unless the 'label' database is cached to memory.
A file containing words (one per line) that should be ignored when detecting concepts within documents
A list of databases that should be cached to memory, to make access to them much more efficient.
The value must correspond to an org.wikipedia.miner.db.WDatabase.DatabaseType.
The priority attribute can be either 'space' (default) or 'speed'.
A list of data dependencies that will be involved in generating relatedness measures between articles.
The more you add, the more accurate your measures will be, but the longer they will take to calculate.
The value must correspond to an org.wikipedia.miner.comparison.ArticleComparer.DataDependency
A file containing a Weka classifier for generating relatedness measures between articles
This must be trained using the same dependencies listed above. If you do not supply one,
then article comparisons will be made without machine learning.
<!-- A file containing a Weka classifier for disambiguating pairs of labels -->
<!-- A file containing a Weka classifier for generating relatedness measures between labels -->
<!-- A file containing a Weka classifier for performing automatic disambiguation of topics in documents -->
<!-- A file containing a Weka classifier for performing automatic link detection -->
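To make the two cache thresholds described in the configuration comments above concrete, here is a small Python sketch of sense probability and link probability as those comments define them. The function names, the sample counts, and the use of `>=` for the cut-off are illustrative assumptions and are not taken from the toolkit's code.

```python
from typing import Dict, List


def sense_probability(links_to_sense: int, links_with_label: int) -> float:
    """Links that use the label as anchor AND point to this sense,
    over all links that use the label as anchor."""
    if links_with_label == 0:
        return 0.0
    return links_to_sense / links_with_label


def link_probability(docs_linked: int, docs_mentioned: int) -> float:
    """Documents in which the label is used as a link,
    over documents in which it is mentioned at all."""
    if docs_mentioned == 0:
        return 0.0
    return docs_linked / docs_mentioned


def senses_kept(sense_link_counts: Dict[str, int],
                min_sense_probability: float) -> List[str]:
    """Senses that would survive a minimum sense probability threshold.
    Assumption: the comparison is inclusive (>=)."""
    total = sum(sense_link_counts.values())
    return [
        sense
        for sense, count in sense_link_counts.items()
        if sense_probability(count, total) >= min_sense_probability
    ]


# Made-up link counts for a single label, for illustration only.
counts = {"Jaguar (animal)": 60, "Jaguar Cars": 38, "Jaguar (band)": 2}
kept = senses_kept(counts, 0.05)
```

With these made-up counts, the rarest sense falls below a 0.05 threshold and would not be cached, which is the kind of loss that a stricter threshold trades for memory.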
Just FYI: using the PorterStemmer was not helpful.
Removing it increased the precision and recall of the annotation service (a subjective, rule-of-thumb impression, not a thorough evaluation).
I now run it with the setting:
However, my local service still does not achieve the performance of the online service.
+1 on this.
I followed the directions to set up my own copy of the Wikipedia Miner Toolkit, and while things do seem to work, the results from the Wikify service are very different from (and worse than) what's on the demonstration website. Is this because of different wiki.xml settings, or is it a matter of retraining? Will you guys release enough details to make the performance reproducible?
Cheers and thanks for creating such a useful toolkit,
I remember getting slightly worse results when caching was enabled in the configuration file. This is because, by default, when caching is enabled, articles below a given threshold (as defined in the configuration XML file) are discarded. You can try either disabling caching or loosening those thresholds in the configuration file to see if your results improve.
Has anyone had any luck? Or is the only way out to subscribe to their paid service?
I observed that the length of the input text is an important factor in the quality of the annotations. The annotation quality (precision and, even more so, recall) for a text of 500 words is a lot lower than when the same text is chunked into paragraphs of about 100-150 words and each chunk is annotated separately.
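A minimal Python sketch of that chunking approach, in case it helps anyone try it. It packs whole sentences into chunks of at most `max_words` words and annotates each chunk separately. The function names are made up for illustration, and `annotate` is a placeholder for whatever call you use to reach your wikify service; nothing here is part of the toolkit.

```python
import re
from typing import Callable, List


def chunk_by_words(text: str, max_words: int = 150) -> List[str]:
    """Greedily pack sentences into chunks of at most max_words words.
    A single sentence longer than max_words is kept whole as its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue  # skip empty fragments (e.g. empty input)
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def annotate_in_chunks(text: str,
                       annotate: Callable[[str], list],
                       max_words: int = 150) -> list:
    """Run the caller-supplied annotate() on each chunk and concatenate
    the results. annotate is a hypothetical stand-in for a wikify call."""
    results: list = []
    for chunk in chunk_by_words(text, max_words):
        results.extend(annotate(chunk))
    return results
```

Splitting on sentence boundaries rather than at a fixed word count keeps each chunk readable as context for disambiguation, at the cost of slightly uneven chunk sizes.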
Below is an example that highlights the difference between the online demo and my installation. My installation detects only one topic, while the online demo detects several. It would be great if anyone could share how to tune this so that the results come close to those of the online demo. Thanks in advance!
Annotation demo at http://wikipedia-miner.cms.waikato.ac.nz
Responsibilities: Provide secretarial support to management team and assist with administrative duties. Maintain proper filing and accounts. Correspond with suppliers and outside parties. Answer customer enquiries professionally. Undertake administrative and clerical duties. Provide secretarial support to management. Requirements: Minimum 3 years' experience in office administration preferred Holder of LCC&I- Level II, /order processing experience Experience in /admin and independent Good command in  and Chinese (including ) Proficient in Chinese  and  office applications, including Word, Excel and  Good communication and interpersonal skills with all levels of staff Independent, committed, proactive, willing to learn, attentive to details, strong organizational skills, customer-oriented and high level of integrity
Annotation demo result on my own installation
Responsibilities: Provide secretarial support to management team and assist with administrative duties. Maintain proper filing and accounts. Correspond with suppliers and outside parties. Answer customer enquiries professionally. Undertake administrative and clerical duties. Provide secretarial support to management. Requirements: Minimum 3 years' experience in office administration preferred Holder of LCC&I- Level II, logistics/order processing experience Experience in accounting/admin and independent correspondence Good command in spoken English and Chinese (including Mandarin) Proficient in Chinese Word Processing and  office applications, including Word, Excel and PowerPoint Good communication and interpersonal skills with all levels of staff Independent, committed, proactive, willing to learn, attentive to details, strong organizational skills, customer-oriented and high level of integrity
Have you checked the "repeat mode" parameter and set it to "all"?
Also, I had to set the "min probability" to 0.1 to get reasonable annotations.
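For anyone calling the service directly rather than through the demo page, a rough Python sketch of building such a request follows. The base URL is a placeholder, and the query parameter names (`source`, `minProbability`, `repeatMode`) and the value `all` are assumptions on my part; please check them against the help output of your own wikify service before relying on them.

```python
from urllib.parse import urlencode


def build_wikify_url(base_url: str,
                     text: str,
                     min_probability: float = 0.1,
                     repeat_mode: str = "all") -> str:
    """Build a GET URL for a wikify request.
    Parameter names are assumed, not confirmed against the toolkit."""
    query = urlencode({
        "source": text,                     # text to annotate (assumed name)
        "minProbability": min_probability,  # lower values keep more topics
        "repeatMode": repeat_mode,          # how repeated topics are marked (assumed name)
    })
    return f"{base_url}?{query}"


# Hypothetical local endpoint, for illustration only.
url = build_wikify_url("http://localhost:8080/services/wikify",
                       "Good command in spoken English")
```

Sending the same URL to both installations with identical parameters at least rules out differences in the request itself.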
Thanks for the suggestion. Repeat mode is set to "mark first in each section" for both tests (online demo and my own installation). I also tried setting minProbability to 0.1, but it yielded the same result (only one topic detected).
Do I need to regenerate the models based on a new Wikipedia dump? Or can I use the ones supplied with toolkit 1.2.0?