Iterating for all Articles

Help
Rafa Haro
2011-09-29
2013-06-10
  • Rafa Haro

    Rafa Haro - 2011-09-29

    Hi all,

    In a first test of the toolkit, I'm trying to iterate over all Wikipedia pages. I have downloaded the complete dump and generated the database correctly. This is the code that I'm testing:

    WikipediaConfiguration conf = new WikipediaConfiguration(new File("wikipedia-template.xml"));
    conf.clearDatabasesToCache();

    Wikipedia wikipedia = new Wikipedia(conf, false);

    PageIterator it = wikipedia.getPageIterator();

    // Print every page title, followed by hasNext() and a running count.
    int i = 0;
    while (it.hasNext()) {
        Page next = it.next();
        System.out.println(next.getTitle());
        System.out.println(it.hasNext() + " " + (++i));
    }
    it.close();

    wikipedia.close();
    

    I expected this to print every page title. Surprisingly, the loop always stops after 65198 pages, at the page titled "Porter County, Indiana".

    Also, I get a NullPointerException when I filter the iterator by article type like this:

    PageIterator it = wikipedia.getPageIterator(Page.PageType.article);
    

    Exception in thread "main" java.lang.NullPointerException
    at org.wikipedia.miner.util.PageIterator.queueNext(Unknown Source)
    at org.wikipedia.miner.util.PageIterator.next(Unknown Source)

    It stops at the penultimate page, titled "Posey County, Indiana". There seems to be a problem in the PageIterator.next() method with the last element when you filter by PageType.
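
    To illustrate what I suspect, here is a minimal, self-contained sketch of a filtered look-ahead iterator. It is only an assumption about how this kind of failure can arise, not Wikipedia Miner's actual PageIterator code: the names FilteredIterator and collect, and the "type:title" strings, are hypothetical. The point is that the look-ahead step has to record exhaustion explicitly, so that the last match is returned even when the elements after it are rejected by the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for a type-filtered page iterator (not the toolkit's code).
class FilteredIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> filter;
    private T nextMatch;        // look-ahead element, valid only when hasMatch is true
    private boolean hasMatch;   // false once the source has no more matching elements

    FilteredIterator(Iterator<T> source, Predicate<T> filter) {
        this.source = source;
        this.filter = filter;
        queueNext();
    }

    // Advance to the next element accepted by the filter, or mark exhaustion.
    private void queueNext() {
        hasMatch = false;
        nextMatch = null;
        while (source.hasNext()) {
            T candidate = source.next();
            // Guard against null elements before applying the filter.
            if (candidate != null && filter.test(candidate)) {
                nextMatch = candidate;
                hasMatch = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasMatch;
    }

    @Override
    public T next() {
        if (!hasMatch) {
            throw new NoSuchElementException();
        }
        T result = nextMatch;
        queueNext(); // safe even when nothing is left: it only clears the look-ahead
        return result;
    }

    // Convenience helper: drain all matching elements into a list.
    static <T> List<T> collect(Iterator<T> source, Predicate<T> filter) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = new FilteredIterator<>(source, filter);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data; the trailing non-article entry is the tricky case.
        List<String> pages = Arrays.asList(
                "article:Porter County, Indiana",
                "redirect:Porter Co.",
                "article:Posey County, Indiana",
                "category:Counties of Indiana");
        for (String page : collect(pages.iterator(), p -> p.startsWith("article:"))) {
            System.out.println(page);
        }
    }
}
```

    With this sample list, both article entries are returned even though the final element is filtered out, which is the case that seems to fail for me.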

    Anyway, I was wondering why I'm getting so few articles. I need to crawl the whole of Wikipedia.

    Thanks a lot in advance

     
  • David Milne

    David Milne - 2011-09-29

    Hi,

    I'm not sure about the iterator stopping early (I will look into this today), but the NullPointerException is a bug that has been fixed in the project trunk. I'm working on a new release that will include this fix, but it is at least a few days away. In the meantime, you can grab the latest code using svn:

    svn co https://wikipedia-miner.svn.sourceforge.net/svnroot/wikipedia-miner/trunk wikipedia-miner
    

    Unfortunately, you will need to rerun the build-database target.

     
    • kanika

      kanika - 2013-06-10

      Hi,

      I am getting the same error with version 1.2 of Wikipedia Miner. Is there a more recent version in which the error has been fixed?

      Thanks a lot in advance.

       
      Last edit: kanika 2013-06-10
  • Rafa Haro

    Rafa Haro - 2011-09-30

    Hi,

    I just grabbed the current Subversion code and rebuilt the database, and everything is working fine now. Thank you very much.

    I'll let you know how my work with the toolkit goes. I think I will need to use it a lot.

    Best Regards

     
  • Anonymous - 2013-03-21

    Hello there,
    Same problem here. I downloaded the code from svn and rebuilt the database via ant, with no luck. How did you solve the problem?

    Best!

     
  • Anonymous - 2013-03-22

    Hello,
    Just writing to confirm that it works with the latest svn code. Maybe I mixed up the databases yesterday or something.

    Thanks!

     
