Processing Wikipedia Dumps in Russian

aly123
2010-01-21
2013-05-30
  • aly123
    2010-01-21

    Hi,

    So, I had some success using extraction/extractWikipediaData.pl to generate CSV tables for Russian.

    The link to Parse::MediaWikiDump didn't work anymore, so I downloaded the current version from here:
    http://search.cpan.org/dist/Parse-MediaWikiDump/

    They have changed the code a bit, so the Perl script needed modifying. Lines 207, 347, 460, and 1404 should be changed from:

    while(defined($page = $pages->page)) {

    to

    while(defined($page = $pages->next)) {
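
    In case it helps, here is a minimal, self-contained sketch of the iteration style the newer Parse::MediaWikiDump documents (the dump filename is just a placeholder):

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # Newer versions construct a factory object first and rename the
    # page() iterator method to next(); the loop body is otherwise
    # unchanged.
    my $pmwd  = Parse::MediaWikiDump->new;
    my $pages = $pmwd->pages('ruwiki-pages-articles.xml');

    while (defined(my $page = $pages->next)) {
        print $page->title, "\n";    # e.g. list every article title
    }
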
    Then there were no problems generating most of the CSV files, but eventually the program stopped, giving me these errors:

    Complex regular subexpression recursion limit (32766) exceeded at extractWikipediaData.pl line 1478.
    Complex regular subexpression recursion limit (32766) exceeded at extractWikipediaData.pl line 1478.
    Complex regular subexpression recursion limit (32766) exceeded at extractWikipediaData.pl line 1478.
    Complex regular subexpression recursion limit (32766) exceeded at extractWikipediaData.pl line 1478.
    Out of memory during request for 4088 bytes, total sbrk() is 401209344 bytes!
    Out of memory during request for 3208 bytes, total sbrk() is 401209344 bytes!
    Callback called exit at /usr/lib/perl5/5.10/Carp.pm line 39.
            (in cleanup) Goto undefined subroutine &Carp::shortmess_real at /usr/lib/perl5/5.10/Carp.pm line 41 during global destruction.

    The files that were not generated are equivalence and categorylinks, and I think it stopped somewhere in the middle of disambiguation. Any ideas how to fix this?

    Cheers
    Aly

  • Viktor Seifert
    2010-03-16

    I'm having similar problems. I'm using the German Wikipedia.
    I posted about my problems in the 'Help' forum.

    I think I may know the reason for the 'Out of memory' error; are you running Perl on Windows or Linux?

    After searching for a while for the 'Complex regular subexpression recursion limit' problem, I found that the only available solution is to rewrite the regular expressions. Apart from that, the only other option would be to dig into the source of Perl's regular expression engine.
    There is a related post: https://sourceforge.net/projects/wikipedia-miner/forums/forum/676405/topic/3564557
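
    To give an idea of what such a rewrite can look like, here is a generic "unrolling the loop" sketch. The pattern is purely illustrative (a rough wiki-link matcher), not taken from extractWikipediaData.pl:

    use strict;
    use warnings;

    # An alternation repeated by a star backtracks per character and
    # can exhaust the regex engine's recursion limit on long input:
    my $fragile  = qr/\[\[ (?: [^\]] | \](?!\]) )* \]\]/x;

    # Unrolled form: consume runs of "safe" characters with a single
    # character class and handle the rare case (a lone "]") separately,
    # so the engine no longer branches on every character:
    my $unrolled = qr/\[\[ [^\]]* (?: \](?!\]) [^\]]* )* \]\]/x;

    my $text = 'see [[Russia|Russian language]] for details';
    print "$1\n" if $text =~ /($unrolled)/;   # prints [[Russia|Russian language]]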

    Also, you should have a look into the log.txt file generated during the extraction. I get a lot of 'Cannot resolve…' messages in there, and I don't know what effect they have on the CSV files generated.

    I'm using the Maui-Indexer in my bachelor thesis, which I think needs a WikipediaMiner database. So any help on this is highly welcome.

    Cheers, Viktor

  • aly123
    2010-03-17

    Hi,

    I ran the scripts on Windows; what about you?
    I figured the machine is not very powerful and left the project for now.

    Thanks for the pointers on fixing the complex regular expression errors.

    Btw, I am the author of the Maui-Indexer and can tell you that, depending on your task, it will work just fine without WikipediaMiner.
    Let me know what your project is about and perhaps I can help you with it. Write to my lastname @gmail.

    Alyona

  • Viktor Seifert
    2010-03-18

    Hello.

    I ran them on a Windows 64-bit machine which has a lot of memory, so I can tell you it's not a lack of memory that causes the 'out of memory' error.
    I looked into the makefile of Perl and think that the error might come from a C compiler flag which is set in a Windows build but not in a Linux build.

    Finally, I ran the Perl programs in a 'Virtual PC' on the same machine, and there the extraction could complete.
    Since the virtual PC can only access 2 GB of memory, a machine with at least that much should be enough to run the extraction.

    So I think you have 3 options here:
    1. Recompile Perl with the compiler flag removed, then run the extraction
    2. Run the extraction on a Linux machine
    3. Run the extraction in a virtual PC
    'Virtual PC' has a free download version, btw.

    You might run into other problems, though. Contact me if you need help; you should have my email address soon.

    Cheers, Viktor

  • I'm processing the Portuguese Wikipedia and I'm also getting some "Complex regular subexpression recursion limit" warnings. As I've read, this warning occurs when the regular expression matcher reaches its maximum recursion stack size (which is 32766). After that, Perl stops looking for new matches in the string.

  • It doesn't seem to interfere with the process, although it might lose some anchor text counts. But I guess this occurs only with very complex and hard-to-match regexes, which of course would keep those counts close to zero even if we could solve the issue. So I don't think it will interfere with the success rate of the later steps of the Wikification process, for example. What do you think?

  • Viktor Seifert
    2010-03-29

    I don't think it makes a big difference either.