In a first test of the toolkit, I'm trying to iterate over all Wikipedia pages. I have downloaded the complete dump and generated the database correctly. This is the code that I'm testing:
WikipediaConfiguration conf = new WikipediaConfiguration(new File("wikipedia-template.xml"));
conf.clearDatabasesToCache();
Wikipedia wikipedia = new Wikipedia(conf, false);
PageIterator it = wikipedia.getPageIterator();
int i = 0;
while (it.hasNext()) {
    Page next = it.next();
    System.out.println(next.getTitle());
    System.out.println(it.hasNext() + " " + (++i));
}
it.close();
wikipedia.close();
I was expecting it to print every page's title. But, surprisingly, the loop always stops after 65198 pages, at the page titled "Porter County, Indiana".
Also, I'm getting a NullPointerException when filtering the iterator by article type in this way:
PageIterator it = wikipedia.getPageIterator(Page.PageType.article);
Exception in thread "main" java.lang.NullPointerException
at org.wikipedia.miner.util.PageIterator.queueNext(Unknown Source)
at org.wikipedia.miner.util.PageIterator.next(Unknown Source)
It stops at the penultimate page, titled "Posey County, Indiana". It seems there is a problem in the PageIterator.next() method with the last element when you filter by PageType.
Anyway, I was wondering why I'm getting so few articles. I need to crawl the whole of Wikipedia.
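Until a fixed release is out, one possible workaround is to wrap the iterator so that a failure while fetching the final element simply ends the iteration. This is a generic sketch over java.util.Iterator, not part of the Wikipedia Miner API, and whether silently swallowing the exception at the tail is acceptable depends on your use case:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Generic guard around java.util.Iterator: pre-fetches the next element so
// that an exception thrown while advancing the underlying iterator (such as
// the NullPointerException in PageIterator.queueNext at the last element)
// ends the iteration cleanly instead of escaping to the caller.
public class TailGuardIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;
    private T next;
    private boolean hasNext;

    public TailGuardIterator(Iterator<T> inner) {
        this.inner = inner;
        advance();
    }

    private void advance() {
        try {
            hasNext = inner.hasNext();
            next = hasNext ? inner.next() : null;
        } catch (RuntimeException e) {
            hasNext = false;   // treat a failing prefetch as end of data
            next = null;
        }
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T current = next;
        advance();
        return current;
    }

    // Count how many elements an iterator yields; used in the demo below.
    public static int drain(Iterator<?> it) {
        int n = 0;
        while (it.hasNext()) { it.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        // Simulated underlying iterator that, like the reported bug, still
        // claims hasNext() but then fails after its last real element.
        Iterator<String> flaky = new Iterator<String>() {
            private final Iterator<String> delegate = Arrays
                .asList("Posey County, Indiana", "Porter County, Indiana")
                .iterator();
            @Override public boolean hasNext() { return true; } // lies at the tail
            @Override public String next() {
                if (!delegate.hasNext()) throw new NullPointerException();
                return delegate.next();
            }
        };
        System.out.println(drain(new TailGuardIterator<String>(flaky))); // prints 2
    }
}
```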
Thanks a lot in advance
I'm not sure about the iterator stopping early (I will look into this today), but the NullPointerException is a bug that has been fixed in the project trunk. I'm working on a new release that will include this fix, but it is at least a few days away. In the meantime you can grab the latest code using svn:
svn co https://wikipedia-miner.svn.sourceforge.net/svnroot/wikipedia-miner/trunk wikipedia-miner
Unfortunately, you will need to rerun the build-database target.
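Putting the two steps together, the workaround would look roughly like this. The build-database Ant target is the one mentioned above; the assumption that the Ant build file sits at the checkout root is mine:

```shell
# Check out the current trunk, which contains the NullPointerException fix.
svn co https://wikipedia-miner.svn.sourceforge.net/svnroot/wikipedia-miner/trunk wikipedia-miner

# Rebuild the database against your dump using the project's Ant target.
cd wikipedia-miner
ant build-database
```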
Hi,
I am also getting the same error. I am using version 1.2 of Wikipedia Miner. Is there a recent version in which this has been fixed?
Thanks a lot in advance.
Last edit: kanika 2013-06-10
Hi,
I just grabbed the current Subversion code, rebuilt the database, and everything is working fine now. Thank you very much.
I'll let you know how my work with the toolkit goes. I think I will need to use it a lot.
Best Regards
Hello there,
Same problem here. I downloaded the code from the svn and rebuilt the database via ant, with no luck. How did you solve the problem?
Best!
Hello,
Just writing to confirm that it works with the latest svn code. Maybe yesterday I mixed up the databases or something.
Thanks!