I am trying to build the database; however, I run into the following error when running the WikipediaBuilder class:
Exception in thread "main" com.sleepycat.je.DatabaseNotFoundException: (JE 4.0.103) Database label not found.
at com.sleepycat.je.Environment.setupDatabase(Environment.java:790)
at com.sleepycat.je.Environment.openDatabase(Environment.java:536)
at org.wikipedia.miner.db.WDatabase.getDatabase(WDatabase.java:573)
at org.wikipedia.miner.db.WDatabase.getDatabaseSize(WDatabase.java:247)
at org.wikipedia.miner.db.LabelDatabase.prepare(LabelDatabase.java:185)
at org.wikipedia.miner.db.WEnvironment.prepareTextProcessor(WEnvironment.java:755)
at org.wikipedia.miner.db.WikipediaBuilder.main(WikipediaBuilder.java:21)
I made sure that the label csv file exists. I used the template config file, saved it under configs/en.xml, and set the following directives:
<!-- MANDATORY: The language code of this wikipedia version (e.g. en, de, simple). -->
<langCode>en</langCode>
<!-- MANDATORY: A directory containing a complete berkeley database. -->
<databaseDirectory>/usr/local/wordcloud/data</databaseDirectory>
<!-- A directory containing csv files extracted from a wikipedia dump. Caching will be faster if these are available. -->
<dataDirectory>/usr/local/wordcloud/csv</dataDirectory>
Any ideas? Thanks.
Before you reply, let me get the latest code and try that. Sorry!
Yup, latest code will fix that. Also make sure the uncompressed wiki dump is in that /usr/local/wordcloud/csv folder.
Okay now I get this:
root@Wordcloud:/usr/local/wordcloud# java -jar build.jar configs/en.xml
Exception in thread "main" java.io.IOException: Could not locate markup file in /usr/local/wordcloud/csv
at org.wikipedia.miner.db.WEnvironment.getMarkupDataFile(WEnvironment.java:785)
at org.wikipedia.miner.db.WEnvironment.buildEnvironment(WEnvironment.java:678)
at org.wikipedia.miner.db.WikipediaBuilder.main(WikipediaBuilder.java:37)
The uncompressed dump file is in the csv folder as "dump.xml" - should it be named something else?
Thanks
Okay - I found it. The dump file needs to end in "-pages-articles.xml" because of this line in WEnvironment.java:
return name.endsWith("-pages-articles.xml") ;
Might want to add that to the docs for other java rookies like me ;)
It is building the database! Thanks for the support. I will let you know how the web services go.
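For anyone else hitting this, the check quoted above can be reproduced with a tiny self-contained filter. This is an illustration of the quoted line, not the library's actual code:

```java
import java.io.File;
import java.io.FilenameFilter;

// Illustration of the filename check quoted above: the builder only
// recognizes a dump whose name ends in "-pages-articles.xml", so a file
// named "dump.xml" is silently skipped.
public class DumpFileCheck {

    static final FilenameFilter DUMP_FILTER = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            return name.endsWith("-pages-articles.xml");
        }
    };

    public static void main(String[] args) {
        System.out.println(DUMP_FILTER.accept(null, "dump.xml"));                           // false
        System.out.println(DUMP_FILTER.accept(null, "enwiki-20110405-pages-articles.xml")); // true
    }
}
```

Renaming the dump to match the suffix (e.g. `enwiki-20110405-pages-articles.xml`) is enough; no re-extraction is needed.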
I followed the instructions for the deployment and the homepage on Tomcat comes up fine. Link is at http://209.114.39.233/wordcloud/
When I click on the services, for example, Wikify, I get a 404:
type Status report
message /wordcloud/service
description The requested resource (/wordcloud/service) is not available.
I included the packaged jar file in the WEB-INF/lib folder, but it seems that the services are not running.
Any help would be appreciated.
Thanks
It seems they are running, but that you are missing some paths; see http://209.114.39.233/wordcloud/wikify?wrapInXml=false&showTooltips=false&sourceMode=0&repeatMode=2&minProbability=&bannedTopics=&source=bipolar+depression. In particular, you need to specify in your config file the locations of the machine learning models (the Disambiguator, among others) that are included in the source tree.
Okay - I understand. I found all of the models and entered their paths inside my XML config file.
Question about <defaultTextProcessor></defaultTextProcessor> - Is this required to run the web services? If so, I am not sure I understand how to set it up properly based on the comments in the config file.
Thanks
No, I think that one defaults to a standard processor.
Okay - successfully built the database and I added all of the models inside the config file, but I am still getting an error about it not being able to find a weka class. See the error here:
http://209.114.39.233/wordcloud/wikify?wrapInXml=false&showTooltips=false&sourceMode=0&repeatMode=2&minProbability=&bannedTopics=&source=bipolar+depression
Also, if you go to http://209.114.39.233/wordcloud/ and click any of the services, you get a 404.
Any ideas on where to go from here? I made sure weka is inside my WEB-INF/lib folder…
Just in case I need to shut down Tomcat, the error is:
javax.servlet.ServletException: java.lang.ClassNotFoundException: weka.wrapper.TypedAttribute
org.wikipedia.miner.service.ServiceHub.<init>(ServiceHub.java:86)
org.wikipedia.miner.service.ServiceHub.getInstance(ServiceHub.java:95)
org.wikipedia.miner.service.Service.init(Service.java:101)
org.wikipedia.miner.service.WikifyService.init(WikifyService.java:63)
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:563)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:399)
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:317)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:204)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:182)
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:311)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Oh, right. You are missing the weka-wrapper jar file, available here. You need to put this in the web/WEB-INF/lib directory.
About text processors: these just do things like case-folding and stemming when looking up terms. You don't need to use one, but I generally use org.wikipedia.miner.util.text.CaseFolder. If you specify one, you will need to prepare the database in advance by calling the static method WEnvironment.prepareTextProcessor() with the appropriate arguments.
By the way, the arguments to the services have been updated, so what you have there won't work as you expect. Look at the services.html page I talk about in the instructions for documentation.
Also, you should use the search service for small snippets of text like that. The wikify service is designed to take at least sentence-sized, but more like paragraph-sized or article sized chunks of text as input.
Oh, and if you really want to use the wikify service, make sure you cache the label and pageLinksIn databases; otherwise it will run abysmally slowly.
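To make the case-folding idea concrete, here is a toy sketch of what such a processor does to a term before lookup. This only mimics the behavior; org.wikipedia.miner.util.text.CaseFolder is the real implementation:

```java
import java.util.Locale;

// Toy sketch of case-folding: normalize a term before label lookup so
// that differently cased variants resolve to the same entry. This is an
// illustration of the idea, not the library's CaseFolder class.
public class CaseFoldSketch {

    public static String fold(String term) {
        return term.trim().toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(fold("Bipolar Depression")); // bipolar depression
        System.out.println(fold("  The Shins "));       // the shins
    }
}
```

Because the stored labels must be normalized the same way as the queries, switching processors is why the database has to be prepared again with prepareTextProcessor().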
Dave-
Including that Jar did the trick! I seem to be getting closer - just a couple questions.
1) It seems to run abysmally slowly, as you predicted, even though I included pageLinksIn and label in the cache with the following lines:
<databaseToCache priority="space">pageLinksIn</databaseToCache>
<databaseToCache priority="space">label</databaseToCache>
2) I still get 404 errors when I click on any of the services. How do I fix this?
But it seems to be working, for example:
http://209.114.39.233/wordcloud/wikify?minProbability=.01&source=I%20like%20to%20watch%20the%20Lord%20of%20the%20Rings%20trilogy%20and%20my%20favorite%20band%20is%20The%20Shins.
Thanks for your help. I really enjoy this project and I sure am learning a lot!
Okay, I found the http://209.114.39.233/wordcloud/services.html that describes the services. So my real question is just about speed and how to increase performance. Right now I am allowing 3G of ram to the heap. Would increasing it make a large difference?
More mem is always good. My Tomcat running wikipedia-miner has a heap of 10GB, 5 of which is currently in use…
Edgar and Dave-
I have experimented with different memory sizes, even trying 15GB of RAM; however, it is still very slow. Are there any other techniques I should explore in order to improve performance?
Thanks
Just to give a couple of details: the wikify service is what runs extremely slowly, and the new (Berkeley DB-based) version actually runs much slower than my install of the older MySQL-based version.
Oh, sorry, this is due to a recent change in the code. Try configuring the XML to be:
<databaseToCache priority="space">pageLinksInNoSentences</databaseToCache>
<databaseToCache priority="space">label</databaseToCache>
As long as you have enough RAM to cache these databases without running out of Java heap space (a fatal exception would tell you if you had), upping the RAM won't help you. 2-3G should be plenty.
P.S. The wiki page has been updated to talk about what databases to cache to memory, and how much memory to set aside for them.
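For anyone unsure where that heap size is set: with a standard Tomcat install it usually goes in $CATALINA_HOME/bin/setenv.sh via CATALINA_OPTS. The file name and flags below are the standard Tomcat/JVM convention, not something specific to this project:

```shell
# Minimal sketch of $CATALINA_HOME/bin/setenv.sh: give the Tomcat JVM a
# 2-3G heap, enough to cache the label and pageLinksInNoSentences
# databases. Adjust -Xmx to your machine.
CATALINA_OPTS="-Xms2g -Xmx3g"
export CATALINA_OPTS
echo "$CATALINA_OPTS"
```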
Dave-
Thank you for your response. I will give this a try and let you know how it turns out.
One concern - is there supposed to be a pageLinksInNoSentences.csv? I have pageLinksIn.csv but not the one you refer to. Do I need to redo my Hadoop extraction from the dump file?
Thanks
No concern needed - will work without re-running the extraction stuff.
I made the changes in the XML and followed some guides online to tweak Tomcat for better performance. I also set my heap size to a little over 3G. Performance seems to have improved, but it is still painfully slow. Your services run orders of magnitude faster.
For example, try this simple example of the wikify service on my machine - you will see what I mean.
It seems that Tomcat never uses more than 15% of user CPU. I made sure to remove other Tomcat apps, as well as any other service running on my machine that could possibly slow things down.