New toolkit, fixed extraction

(Page 1 of 5)
  • David Milne

    David Milne - 2010-10-20

    Hi All,

    I am very sorry for the many people who have been having trouble using the Perl scripts to extract data from the latest Wikipedia dumps. I've been neglecting this part of the toolkit for a long while, and it now falls apart in several places as Wikipedia's dumps have grown and their syntax has changed. Perl is great for quickly making run-once scripts, but I have found it to be horrible for writing understandable/maintainable code that other people have to use.

    In the svn repository there is a new branch called hadoopAndBerkeleyDb. This is a very different version of Wikipedia Miner that I encourage you guys to check out. The entire thing is Java-based now; no more Perl, and no more MySQL. The extraction process is now implemented using Hadoop, which is the toolkit people like Yahoo, Facebook and Twitter use to deal with "Big Data". It's definitely a future-proof solution that will be able to scale as Wikipedia does. I have processed the en dump from September 2010 using this code without any problems.

    MySQL has been replaced with Berkeley DB Java Edition, which is an embedded database built by Oracle. It performs much better than MySQL, because it runs within your WikipediaMiner application; there is none of the overhead involved with communicating with a separately running program. Additionally, we no longer have the overhead of running a full relational database, whose features (ad-hoc queries, joins, etc.) we weren't using anyway.

    I thought I should let you all know about this version of the toolkit now, since many of you are stuck. However, it is not yet ready for release, and I was not planning to migrate the public web server over to use it for another few weeks. There are some pros and cons if you decide to use it:

    Pros:

    • You can start using the latest Wikipedia dumps again.

    • Faster performance when working with the databases directly.

    • Sentence level indexing of links: it is now very efficient to extract sentences of markup that mention different topics. For example, you can ask for all sentences that contain links to both "Yahoo" and "Hadoop"; this is very useful for finding human-readable explanations for relations.

    • More options for caching to memory - you can now cache any database table to memory, and choose to prioritize speed (cache data directly) or space (compress data before caching).

    • Better organized web services, with more options and parameters.

    • One hub of web services can now service multiple dumps of Wikipedia (e.g. you could have one web server to handle both en and de dumps).
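To make the sentence-level link indexing idea concrete, here is a minimal, self-contained sketch of how such a lookup could work in principle. It is not the toolkit's implementation or API: the class `SentenceLinkIndex` and its methods are hypothetical names invented for illustration, and the data is held in plain in-memory collections rather than in Berkeley DB.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not a Wikipedia Miner class.
final class SentenceLinkIndex {
    private final List<String> sentences = new ArrayList<>();
    // Link target -> sorted ids of the sentences that link to it.
    private final Map<String, TreeSet<Integer>> sentencesByTarget = new HashMap<>();

    /** Stores a sentence of markup and records which link targets it mentions. */
    int addSentence(String markup, String... linkTargets) {
        int id = sentences.size();
        sentences.add(markup);
        for (String target : linkTargets) {
            sentencesByTarget.computeIfAbsent(target, t -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Returns the sentences that link to every one of the given targets. */
    List<String> sentencesMentioningAll(String... targets) {
        List<String> result = new ArrayList<>();
        if (targets.length == 0) {
            return result;
        }
        // Start from the first target's sentence ids, then intersect with the rest.
        TreeSet<Integer> common = new TreeSet<>(
                sentencesByTarget.getOrDefault(targets[0], new TreeSet<>()));
        for (int i = 1; i < targets.length; i++) {
            common.retainAll(sentencesByTarget.getOrDefault(targets[i], new TreeSet<>()));
        }
        for (int id : common) {
            result.add(sentences.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        SentenceLinkIndex index = new SentenceLinkIndex();
        index.addSentence("[[Yahoo]] is a major contributor to [[Hadoop]].", "Yahoo", "Hadoop");
        index.addSentence("[[Hadoop]] is written in [[Java]].", "Hadoop", "Java");
        // Only the first sentence links to both topics.
        System.out.println(index.sentencesMentioningAll("Yahoo", "Hadoop"));
    }
}
```

The point of the design is that each topic keeps a list of sentence ids, so a query for several topics is just an intersection of those lists rather than a scan over article text.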


    Cons:

    • Patchy documentation; I'm still working through the JavaDoc.

    • Significant code changes; classes and methods have been shuffled around a fair bit.

    • More shuffling is likely in the near future.

    • File incompatibility; you can't use your old CSV files with this new version, or vice versa.

    • No human-readable web demos; these need to be rebuilt to fit the new web services.

    Let me know if you check it out, and if you run into any problems. I expect there will be a few teething troubles from the lack of documentation, and I'll commit myself to checking this thread daily until these are resolved and the release is made.

  • bella

    bella - 2010-10-21

    When I import hadoopAndBerkeleyDb into my project, it gives me an error on the line:
    package org.wikipedia.miner.util;
    "declared package does not match the expected package ."
    When I change the declaration to , it gives me errors in the other declarations.

    Could you tell me how I can extract the dump files?

  • David Milne

    David Milne - 2010-10-21

    This is a problem with whatever IDE you are importing the project into. There must be some way to tell it that the source files are all located within the src folder, and that packages should start from there. It is a very common convention. Try to fix it by changing the configuration of your IDE, not by changing the source code.
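As an illustration of why the source root matters, the sketch below derives the package a file is expected to declare from its location relative to the configured source folder. It is only a model of the convention, not how any particular IDE implements the check; `PackageLayout`, the `project` directory, and `Example.java` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not part of Wikipedia Miner or any IDE.
final class PackageLayout {
    /** Derives the package a source file is expected to declare, given the source root. */
    static String expectedPackage(Path sourceRoot, Path sourceFile) {
        Path relativeDir = sourceRoot.relativize(sourceFile).getParent();
        if (relativeDir == null) {
            return ""; // File sits directly in the source root: default package.
        }
        StringBuilder pkg = new StringBuilder();
        for (Path part : relativeDir) {
            if (pkg.length() > 0) {
                pkg.append('.');
            }
            pkg.append(part.toString());
        }
        return pkg.toString();
    }

    public static void main(String[] args) {
        Path file = Paths.get("project", "src", "org", "wikipedia", "miner", "util", "Example.java");
        // Source root set to project/src: matches the declared package.
        System.out.println(expectedPackage(Paths.get("project", "src"), file));
        // Source root set to project: an extra "src." prefix appears, hence the mismatch error.
        System.out.println(expectedPackage(Paths.get("project"), file));
    }
}
```

With the source root pointing at the src folder, the expected package is org.wikipedia.miner.util; with the root one level too high, it becomes src.org.wikipedia.miner.util, which no longer matches the declaration in the file.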

    I'll follow up tomorrow with some step-by-step instructions about how to extract the Wikipedia data. For now, please look at the Hadoop single node and cluster setup guides: you need to get Hadoop up and running in order to run the new extraction stuff.

  • Tassadar

    Tassadar - 2010-10-23

    I've imported hadoopAndBerkeleyDb into my project, and I'm looking forward to your instructions about how to extract the Wikipedia data.
    Thank you very much.

  • Rafael Odon Alencar

    I have the older version working perfectly for the pt-wiki, and now I'm trying to run the new version… According to the instructions for version 1.1, the first thing to do after downloading and extracting the XML dump is to perform an extraction step. As I see here, there is an extraction package in the new version that somehow tries to populate the Berkeley DB with data extracted from the dump. Some questions:
    1. Which class should I run first?
    2. I tried to run the DumpExtractor main(), but I noticed that it needs a language.xml file. What's the expected schema for this file?


  • David Milne

    David Milne - 2010-11-14

    Hi All,

    Sorry for the delays again - I've recently shifted cities to work on a short-term project, so things have been a bit hectic.

    Full instructions for processing the Wikipedia dump using the new code are up.

    Apologies, I left the language.xml file out accidentally. Just run svn update and look in the config to get it. Did you have to modify anything other than the language-dependent variables at the top of the old extraction script to get the old toolkit working for Portuguese? These variables are exactly what the language.xml file is for. You will have to modify it, but this should be pretty self-explanatory. Please have a look at this wiki page: I'd really appreciate it if you added an entry to it for pt-wiki.

  • Tassadar

    Tassadar - 2010-11-15

    I've tried the new toolkit but I encountered this problem:

    10/11/15 16:30:44 ERROR extraction.PageStep$Step1Mapper: Caught exception
    at org.wikipedia.miner.extraction.PageStep$
    at org.wikipedia.miner.extraction.PageStep$
    at org.apache.hadoop.mapred.MapTask.runOldMapper(
    at org.apache.hadoop.mapred.LocalJobRunner$
    10/11/15 16:30:44 WARN mapred.LocalJobRunner: job_local_0001
    at org.wikipedia.miner.extraction.PageStep$Step1Mapper.close(
    at org.apache.hadoop.mapred.MapTask.runOldMapper(
    at org.apache.hadoop.mapred.LocalJobRunner$
    10/11/15 16:30:45 INFO mapred.JobClient: Job complete: job_local_0001
    10/11/15 16:30:45 INFO mapred.JobClient: Counters: 0
    Exception in thread "main" Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(
    at org.wikipedia.miner.extraction.DumpExtractor.main(
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.apache.hadoop.util.RunJar.main(

    A lot of NullPointerExceptions, and then job failed. Is it because I didn't configure something correctly?
    Thank you very much

  • David Milne

    David Milne - 2010-11-15

    Did you follow the Hadoop single node tutorial the instructions link to? Were you able to run the example at the end of the tutorial?

  • David Milne

    David Milne - 2010-11-15

    I've looked through the code, and it is likely failing because of a problem with the language configuration file. Is there an entry in this file for the language version you are trying to process?

  • Tassadar

    Tassadar - 2010-11-16

    Thank you for the reply.
    One of the XML files in /conf wasn't correctly configured.
    I've corrected it, so that problem is solved.
    But I have another problem now:
    WARN Mapred.jobclient: Error reading task outputhttp://localhost.administrator_desktop:50060….
    and similar messages.
    My current user is root.
    I believe this is again a configuration problem. Any ideas? Thank you very much.

  • Tassadar

    Tassadar - 2010-11-16

    By the way, if I stop Hadoop and restart it, then run the jar file, I get errors like these:
    10/11/16 14:50:10 INFO hdfs.DFSClient: No node available for block: blk_-2729364633326125133_1005 file=/user/root/input/languages.xml
    10/11/16 14:50:10 INFO hdfs.DFSClient: Could not obtain block blk_-2729364633326125133_1005 from any node: No live nodes contain current block
    I have to format a new file system and put all the files into it again to avoid these errors.
    Is that necessary? Or should I do something other than just running the command "bin/" to restart Hadoop?

  • Tassadar

    Tassadar - 2010-11-16

    Well, I used a cluster to run the jar file, and I didn't encounter the previous problem.
    But I have a "can't allocate memory" error now…
    It seems that the jar file needs a lot of memory. How much memory does it need?

  • Tassadar

    Tassadar - 2010-11-16

    The computers in the cluster seem to be down now…
    All the memory is taken up.
    I don't know why…
    I think I'd like to try the single node setup again, but it still has the "Error reading task output" and "No node available for block" problems…

  • David Milne

    David Milne - 2010-11-16


    1) The DumpExtractor can't be run on a single node cluster  - I just realized this yesterday, and have updated the wiki.

    2) Once you have Hadoop running properly, you'll be able to keep it running, at least until you have finished what you are trying to do. You don't have to restart it for every job. The problem right now is that you keep restarting with different configuration settings, in which case you can't expect the distributed filesystem to persist across restarts. I recommend you get Hadoop running for small tasks first (like the ones in the tutorials) before you try the DumpExtractor, which involves copying gigabytes of data onto a file system that you aren't sure you have configured correctly yet. Again, have you successfully run the examples at the end of the Hadoop getting started tutorials?

    3) Currently I am processing the full English dump, and allowing Hadoop to use 3G of memory (you can adjust this using the HADOOP_HEAPSIZE option in conf/ file in your Hadoop installation). The default is 1G, and it sounds like Hadoop is struggling to find that much on the machines in your cluster. How much memory is on these machines, and what else is running on them?
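As a rough sanity check on memory, the small program below reports how much heap a JVM on a given machine is actually granted and compares it with a required amount. It is a hypothetical helper, not part of Hadoop or Wikipedia Miner, and it only measures the JVM it runs in; it does not read Hadoop's configuration. The default threshold of 1000 MB is an assumption chosen to roughly match the 1G default mentioned above.

```java
// Hypothetical helper, not part of Hadoop or Wikipedia Miner.
final class HeapCheck {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    /** Converts a byte count to whole megabytes (rounding down). */
    static long toMegabytes(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    /** True if the given maximum heap size covers the required number of megabytes. */
    static boolean hasAtLeast(long maxHeapBytes, long requiredMegabytes) {
        return toMegabytes(maxHeapBytes) >= requiredMegabytes;
    }

    public static void main(String[] args) {
        // Required heap in MB; defaults to 1000 (an assumed stand-in for the 1G default).
        long requiredMb = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: " + toMegabytes(maxHeap) + " MB");
        System.out.println(hasAtLeast(maxHeap, requiredMb)
                ? "At least " + requiredMb + " MB is available."
                : "Less than " + requiredMb + " MB is available.");
    }
}
```

Running it with an explicit limit, for example `java -Xmx3g HeapCheck 3000`, shows whether the machine can actually grant a heap of that size; the reported figure may be somewhat below the requested limit depending on the JVM and garbage collector.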

  • Tassadar

    Tassadar - 2010-11-17

    Hi, Dave,
    thank you for your reply
    (1) So I have to run the DumpExtractor on a cluster?
    (2) I could run the examples the day before yesterday, when the configuration was certainly not correct.
    But now they seem to fail because of the following errors.
    When I put conf into input, I get
    10/11/17 11:55:53 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: File /user/root/input/conf/ could only be replicated to 0 nodes, instead of 1
    10/11/17 11:55:53 WARN hdfs.DFSClient: Error Recovery for block null bad datanode nodes == null
    10/11/17 11:55:53 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/root/input/conf/" - Aborting…
    put: File /user/root/input/conf/ could only be replicated to 0 nodes, instead of 1
    10/11/17 11:55:53 ERROR hdfs.DFSClient: Exception closing file /user/root/input/conf/ : org.apache.hadoop.ipc.RemoteException: File /user/root/input/conf/ could only be replicated to 0 nodes, instead of 1
    But I can still find conf in input, so I ran the example, and I got
    10/11/17 11:57:52 INFO mapred.FileInputFormat: Total input paths to process : 1 Not a file: hdfs://localhost:9000/user/root/input/conf
    However, if I put a single file, for example 1.txt, into input, the example jar file seems to work.
    (3) The machines in the cluster have very little memory, just 2G; I think the cluster was set up on some PCs. I think nothing else was running on them, but I'm going to confirm that today.

  • marjan h

    marjan h - 2010-11-17

    Hi Dave
    Would you please upload the extracted files, the same as before?

  • Tassadar

    Tassadar - 2010-11-17

    Hi, after I restarted all the computers in the cluster and ran the job,
    I hit the "No node available for block" problem again…
    This cluster is well configured and has been in use for some time.
    Is it because some node is disconnected?

  • Tassadar

    Tassadar - 2010-11-17

    I'm sorry, I think I know what caused the problem above.
    Maybe it's because I forgot to mount the disk containing the data after I restarted the OS.
    Are you running the extractor on a 64-bit OS?

  • David Milne

    David Milne - 2010-11-17


    It doesn't sound like your cluster is well configured at all. The "no node available for block" and "could only be replicated to 0 nodes" errors mean that it can't use a single machine for the distributed file system. I'm really not qualified to help you with setting up Hadoop, and this isn't the right forum to ask about it. It sounds like you are using a Hadoop cluster that someone else set up; are they around to ask for help?

    @mhosseinia: Will do, but bear in mind the file formats may change if I need to extract more information.

  • mcrp now

    mcrp now - 2010-11-17

    Could you please create a Wikipedia-miner_X.X package for the download section, like the other versions? I am having trouble getting this SVN version compiled.

    Thank you

  • David Milne

    David Milne - 2010-11-18

    Hi Jason,

    As I said at the top of the thread, this isn't quite ready for a release yet. What problems are you having? It should just be a matter of checking it out and compiling a jar.

  • Tassadar

    Tassadar - 2010-11-18

    Hi Dave
    Thank you for your reply
    I'll try to configure another cluster or get more RAM for the computers in that cluster.
    On a 64-bit OS, will 4G of RAM be enough for the DumpExtractor?

  • David Milne

    David Milne - 2010-11-18

    I have not tested this, but yes, I expect so. If memory does become an issue, you can also run Hadoop in 32-bit mode.

    In the config file, specify:

    export HADOOP_OPTS=-d32

  • Tassadar

    Tassadar - 2010-11-19

    Thank you Dave, I'll try that.
    I think I will configure a new cluster for the extractor.
    If I run the DumpExtractor successfully, what will the result be? Some CSV files?
    How can I put them into the database?
