
New toolkit, fixed extraction

2010-10-20
2013-11-21
<< < 1 2 3 4 5 > >> (Page 3 of 5)
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Once again, very helpful. Thank you for keeping up with the forum. Let us know when those instructions for the web services are up!

    -Scott

     
  • David Milne

    David Milne - 2011-06-27
     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    I am trying to build the database, however I run into the following error when running the WikipediaBuilder class:

    Exception in thread "main" com.sleepycat.je.DatabaseNotFoundException: (JE 4.0.103) Database label not found.
            at com.sleepycat.je.Environment.setupDatabase(Environment.java:790)
            at com.sleepycat.je.Environment.openDatabase(Environment.java:536)
            at org.wikipedia.miner.db.WDatabase.getDatabase(WDatabase.java:573)
            at org.wikipedia.miner.db.WDatabase.getDatabaseSize(WDatabase.java:247)
            at org.wikipedia.miner.db.LabelDatabase.prepare(LabelDatabase.java:185)
            at org.wikipedia.miner.db.WEnvironment.prepareTextProcessor(WEnvironment.java:755)
            at org.wikipedia.miner.db.WikipediaBuilder.main(WikipediaBuilder.java:21)

    I made sure that the label csv file exists. I used the template config file and saved it under configs/en.xml and I set the following directives:

        <!-- MANDATORY: The language code of this wikipedia version (e.g. en, de, simple). -->
        <langCode>en</langCode>

        <!-- MANDATORY: A directory containing a complete berkeley database. -->
        <databaseDirectory>/usr/local/wordcloud/data</databaseDirectory>

        <!-- A directory containing csv files extracted from a wikipedia dump. Caching will be faster if these are available. -->
        <dataDirectory>/usr/local/wordcloud/csv</dataDirectory>

    Any ideas? Thanks.

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Before you reply, let me get the latest code and try that. Sorry!

     
  • David Milne

    David Milne - 2011-06-27

    Yup, latest code will fix that. Also make sure the uncompressed wiki dump is in that /usr/local/wordcloud/csv folder.

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Okay now I get this:

    root@Wordcloud:/usr/local/wordcloud# java -jar build.jar configs/en.xml
    Exception in thread "main" java.io.IOException: Could not locate markup file in /usr/local/wordcloud/csv
            at org.wikipedia.miner.db.WEnvironment.getMarkupDataFile(WEnvironment.java:785)
            at org.wikipedia.miner.db.WEnvironment.buildEnvironment(WEnvironment.java:678)
            at org.wikipedia.miner.db.WikipediaBuilder.main(WikipediaBuilder.java:37)

    The uncompressed dump file is in the csv folder as "dump.xml" - should it be named something else?

    Thanks

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Okay - I found it. The dump file needs to end in "-pages-articles.xml", due to this line in WEnvironment.java:

    return name.endsWith("-pages-articles.xml") ;

    Might want to add that to the docs for other java rookies like me ;)

    It is building the database! Thanks for the support. I will let you know how the web services go.
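[Editor's note] The filename check above can be mimicked in a quick shell sketch; the helper name and the example filenames are illustrative, not from the toolkit:

```shell
# Mirrors the filename test quoted from WEnvironment.getMarkupDataFile:
# only files whose names end in "-pages-articles.xml" are accepted.
# is_markup_file is a hypothetical helper for illustration only.
is_markup_file() {
  case "$1" in
    *-pages-articles.xml) echo yes ;;
    *)                    echo no ;;
  esac
}

is_markup_file dump.xml                          # prints: no
is_markup_file enwiki-latest-pages-articles.xml  # prints: yes
```

So renaming the uncompressed dump to something like enwiki-latest-pages-articles.xml (the conventional name of the official dump files) makes the builder pick it up.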

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    I followed the instructions for the deployment and the homepage on tomcat comes up fine. Link is at http://209.114.39.233/wordcloud/

    When I click on the services, for example, Wikify, I get a 404:

    type Status report

    message /wordcloud/service

    description The requested resource (/wordcloud/service) is not available.

    I included the packaged jar file in the WEB-INF/lib folder, but it seems that the services are not running.

    Any help would be appreciated.

    Thanks

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Okay - I understand. I found all of the models and entered their paths inside my XML config file.

    Question about <defaultTextProcessor></defaultTextProcessor> - Is this required to run the web services? If so, I am not sure I understand how to set it up properly based on the comments in the config file.

    Thanks

     
  • Edgar Meij

    Edgar Meij - 2011-06-27

    No, I think that one defaults to a standard processor.

     
  • Scott Weinert

    Scott Weinert - 2011-06-27

    Just in case I need to shut down Tomcat, the error is:

    javax.servlet.ServletException: java.lang.ClassNotFoundException: weka.wrapper.TypedAttribute
    org.wikipedia.miner.service.ServiceHub.<init>(ServiceHub.java:86)
    org.wikipedia.miner.service.ServiceHub.getInstance(ServiceHub.java:95)
    org.wikipedia.miner.service.Service.init(Service.java:101)
    org.wikipedia.miner.service.WikifyService.init(WikifyService.java:63)
    org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
    org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:563)
    org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:399)
    org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:317)
    org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:204)
    org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:182)
    org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:311)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)

     
  • David Milne

    David Milne - 2011-06-28

    Oh, right. You are missing the weka-wrapper jar file, available here. You need to put this in the web/WEB-INF/lib directory.

     
  • David Milne

    David Milne - 2011-06-28

    About text processors, these just do things like case-folding and stemming when looking up terms. You don't need to use one, but I generally use org.wikipedia.miner.util.text.CaseFolder. If you specify one, you will need to preprepare the database, by calling the static method WEnvironment.prepareTextProcessor() with the appropriate arguments.
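[Editor's note] In config-file terms, that probably amounts to a fragment like the following (a sketch assuming the template's element name; the class name is the one mentioned above):

```xml
<!-- Optional: text processor applied when looking up terms (case-folding here).
     If set, the database must first be prepared by calling
     WEnvironment.prepareTextProcessor() with the appropriate arguments. -->
<defaultTextProcessor>org.wikipedia.miner.util.text.CaseFolder</defaultTextProcessor>
```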

    By the way, the arguments to the services have been updated, so what you have there won't work as you expect. Look at the services.html page I talk about in the instructions for documentation.

    Also, you should use the search service for small snippets of text like that. The wikify service is designed to take at least sentence-sized, but more like paragraph-sized or article sized chunks of text as input.

    Oh, and if you really want to use the wikify service, make sure you cache the label and pageLinksIn databases, otherwise it will run abysmally slowly.

     
  • Scott Weinert

    Scott Weinert - 2011-06-28

    Dave-

    Including that Jar did the trick! I seem to be getting closer - just a couple questions.

    1) It seems to run abysmally slow, like you predicted, even though I included the pageLinksIn and label in the cache with the following lines:

    <databaseToCache priority="space">pageLinksIn</databaseToCache>
    <databaseToCache priority="space">label</databaseToCache>

    2) I still get 404 errors when I click on any of the services. How do I fix this?

    But it seems to be working; for example:
    http://209.114.39.233/wordcloud/wikify?minProbability=.01&source=I%20like%20to%20watch%20the%20Lord%20of%20the%20Rings%20trilogy%20and%20my%20favorite%20band%20is%20The%20Shins.

    Thanks for your help. I really enjoy this project and I sure am learning a lot!

     
  • Scott Weinert

    Scott Weinert - 2011-06-28

    Okay, I found the http://209.114.39.233/wordcloud/services.html page that describes the services. So my real question is just about speed and how to improve performance. Right now I am allocating 3G of RAM to the heap. Would increasing it make a large difference?

     
  • Edgar Meij

    Edgar Meij - 2011-06-28

    More mem is always good. My Tomcat running wikipedia-miner has a heap of 10GB, 5 of which is currently in use…

     
  • Scott Weinert

    Scott Weinert - 2011-06-29

    Edgar and Dave-

    I have experimented with different memory sizes, even trying 15GB of RAM, but it is still just super slow. Are there any other techniques I should explore in order to improve performance?

    Thanks

     
  • Scott Weinert

    Scott Weinert - 2011-06-29

    Just to give a couple of details: the wikify service is what runs extremely slowly, and the new (Berkeley DB-based) version actually runs much slower than my install of the older MySQL-based version.

     
  • David Milne

    David Milne - 2011-06-30

    Oh, sorry, this is due to a recent change in the code. Try configuring the XML as:

    <databaseToCache priority="space">pageLinksInNoSentences</databaseToCache>
    <databaseToCache priority="space">label</databaseToCache>

    As long as you have enough RAM to cache these databases without running out of Java heap space, upping the RAM won't help you (unless you are actually hitting a fatal exception). 2-3G should be plenty.
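[Editor's note] On the Tomcat side, the heap ceiling is conventionally set through CATALINA_OPTS, typically in $CATALINA_HOME/bin/setenv.sh; a minimal sketch with illustrative sizes matching the advice above:

```shell
# Illustrative content for $CATALINA_HOME/bin/setenv.sh:
# give the JVM enough heap to hold the cached databases (2-3G per the
# advice above); going beyond that will not make caching faster.
export CATALINA_OPTS="-Xms2g -Xmx3g"
```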

     
  • David Milne

    David Milne - 2011-06-30

    P.S. The wiki page has been updated to talk about what databases to cache to memory, and how much memory to set aside for them.

     
  • Scott Weinert

    Scott Weinert - 2011-07-01

    Dave-

    Thank you for your response. I will give this a try and let you know how it turns out.

    One concern - is there supposed to be a pageLinksInNoSentences.csv? I have pageLinksIn.csv but not the one you refer to. Do I need to redo my hadoop extraction from the dump file?

    Thanks

     
  • David Milne

    David Milne - 2011-07-01

    No concern needed - it will work without re-running the extraction stuff.

     
  • Scott Weinert

    Scott Weinert - 2011-07-01

    I made the changes in the XML and followed some guides online to tweak Tomcat for better performance. I also set my heap size to a little over 3G. Performance has improved, but it is still painfully slow; your services run far faster.

    For example, try this simple example of the wikify service on my machine - you will see what I mean.

    It seems that Tomcat never uses more than 15% of user CPU. I made sure to remove other Tomcat apps as well as any other service running on my machine that could possibly slow things down.

     
