#263 Import: Download and import in parallel, instead of serially

Tracker: PFE
Status: pfe
Owner: nobody
Labels: import (29)
Updated: 2014-04-15
Created: 2013-11-29
Creator: gnosygnu
Private: No

From Anselm D's comment in https://sourceforge.net/p/xowa/discussion/general/thread/72bb24b0/

Did you think about downloading, storing to a file (for later usage), unpacking (with Apache Commons), and "storing" to the database in "one" step (in parallel)? This would reduce installation time, because the bottleneck should be the download.

This would allow a user to import a wiki faster while using no extra space.

Specifically, these are the scenarios:

Scenario 1: Download, unzip, import:

  • Download .xml.bz2 data: 10 GB; 3 hours
  • Unpack the bz2 file to an xml file: 45 GB; 1 hour
  • Import the xml file: 25 GB; 2 hours
  • Total: 80 GB; 5 - 6 hours

Scenario 2: Download, import (directly from the bz2 file):

  • Download .xml.bz2 data: 10 GB; 3 hours
  • Import directly from the bz2 file: 25 GB; 4 - 5 hours
  • Total: 35 GB; 7 - 8 hours

Scenario 3: Import while downloading (proposed):

  • Open an HTTP connection to the .xml.bz2 data: 0 GB; 0 hours
  • Import while downloading. Specifically, the importer would read data in 8 MB chunks and extract data from it (see the sketch after this list): 25 GB; 4 - 5 hours
  • Total: 25 GB; 4 - 5 hours
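
As a rough illustration of Scenario 3, the sketch below streams the dump over HTTP, decompresses the bz2 data on the fly, and reads the decompressed XML in 8 MB chunks. It assumes the Apache Commons Compress bz2 decompressor; the dump URL is only an example, and importChunk() is a hypothetical stand-in for the actual importer.

    // Sketch only: no intermediate file is written, so no extra disk space is used.
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class StreamingImportSketch {
        private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB, as proposed above

        public static void main(String[] args) throws Exception {
            String dumpUrl = "https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2";

            // Open the HTTP connection and decompress the bz2 stream as it arrives.
            try (InputStream http = new URL(dumpUrl).openStream();
                 InputStream xml  = new BZip2CompressorInputStream(http)) {
                byte[] chunk = new byte[CHUNK_SIZE];
                int len;
                while ((len = xml.read(chunk, 0, chunk.length)) != -1) {
                    importChunk(chunk, len); // hand the decompressed XML to the importer
                }
            }
        }

        // Hypothetical placeholder: parse the XML fragment and write pages to the database.
        private static void importChunk(byte[] chunk, int len) {
            // ... parse and store ...
        }
    }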

Discussion

(Page 1 of 9)
  • Anselm D
    2014-02-02

    IMHO, even if you unpack with 7zip and directly pipe into your import process, you can import the wiki faster, because you do not have to write the unpacked data to disk.

    In my special case there will be an additional performance benefit, because I carry the .bz2 dump data to the offline computer on a USB 2.0 device, so the unpacking is done from that device.

     
  • gnosygnu
    2014-02-03

    Thanks. That's an interesting idea, and it will be much simpler to implement than Scenario 3.

    I'll take a look at it this week, and post again here.

     
  • gnosygnu
    2014-02-11

    Sorry. I had planned to take a look at this over the weekend, but I got swamped by Italian wiki issues.

    I'll take another look at it this weekend.

     
  • gnosygnu
    2014-02-16

    I looked at it today, and the first results look pretty good. I'm going to try to get it included in tomorrow's build as an option (with a default for v1.2.4 / v1.3.1).

    Strangely, it's still slower than unzipping and importing. I suppose it might be b/c of extra signalling / marshalling between 7z and Java.

    Stats below.

    stats for simplewiki-latest-pages-articles.xml.bz2: 91.6 MB

    time (s)  method
    --------  ------
          70  unzip bz2 to xml; import xml
          85  read directly from getInputStream on Process (7za e -so C:\wiki.bz2)
         212  read from apache commons bz2 library
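
    For reference, the "getInputStream on Process" row corresponds roughly to the pattern below (a sketch, not XOWA's actual code; the 7za path, dump path, and the commented-out importChunk() call are illustrative):

    // Decompress with an external 7za process and read its stdout directly,
    // so no intermediate .xml file is written to disk.
    import java.io.InputStream;

    public class SevenZipPipeSketch {
        public static void main(String[] args) throws Exception {
            // "-so" makes 7za write the decompressed XML to stdout.
            Process p = new ProcessBuilder("7za", "e", "-so", "C:\\wiki.bz2").start();

            byte[] chunk = new byte[8 * 1024 * 1024]; // large bulk reads, roughly 8 MB per call
            try (InputStream xml = p.getInputStream()) {
                int len;
                while ((len = xml.read(chunk, 0, chunk.length)) != -1) {
                    // importChunk(chunk, len); // feed the importer directly
                }
            }
            p.waitFor();
        }
    }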
    
     
  • Anselm D
    2014-02-17

    I would like to test it. Is it possible to test it without downloading again? I already downloaded the dump; the estimated download time is 6 h 30 min.

     
  • gnosygnu
    2014-02-17

    Yeah, just to be clear, this isn't Scenario 3 (import while downloading). This is basically a better bz2 importer. The current bz2 importer uses Apache Commons which turns out to be fairly slow.

    To test it, do the following:

    • Go to Help:Options/Import
    • Check "Import bz2 by stdout"
    • Make sure "Import bz2 by stdout process" looks correct (it should be)
    • (Optional for Help:Import/List) Change "Custom wiki commands" to "wiki.download,wiki.import"
    • Go to Help:Import/Script and pick your downloaded dump
    • Change "uncompress" to "read from compressed dump".
    • Click Generate script. You should see something like the below. You want to make sure the last argument is "" not "unzip"
    • Click Run script

    Let me know if you run into issues. I tested it on both Windows and Linux, so it should be fine.

    // import wiki from dump file
    app.setup.cmds.cmd_add("wiki.dump_file", "C:\xowa\wiki\#dump\done\simplewiki-latest-pages-articles.xml.bz2", "simple.wikipedia.org", "");
    
     
  • Anselm D
    2014-02-17

    Thank you, it works now.

    Testing with enwiki-20140102-pages-articles.xml.bz2: 9.77 GB (10,492,810,412 bytes).

    At this moment 30% is done. Task Manager reports CPU time of 1 h 42 min for javaw.exe and 40 min for 7za.exe.

    Strangely, it's still slower than unzipping and importing. I suppose it might be b/c of extra signalling / marshalling between 7z and Java.

    Do you use BufferedInputStream?

     
    • gnosygnu
      2014-02-18

      Do you use BufferedInputStream?

      Nope. I just used a standard InputStream. I'm doing stream.read(byte[], pos, len) where len is a large number (about 8 MB). I believe BufferedInputStream only makes a difference if I'm retrieving less than 8 KB (or some other value).

      I tried with new BufferedInputStream(stream, 8 MB), but this made no difference in speed (still around 85 seconds).
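
      For what it's worth, the two read paths being compared look roughly like the sketch below (a sketch only; drain() is a hypothetical stand-in for the import loop, and the timings above were not re-measured):

      // Plain InputStream with 8 MB bulk reads vs. the same stream wrapped in a
      // BufferedInputStream; with reads this large the wrapper adds little.
      import java.io.BufferedInputStream;
      import java.io.IOException;
      import java.io.InputStream;

      class ChunkedReadSketch {
          static long drain(InputStream in) throws IOException {
              byte[] buf = new byte[8 * 1024 * 1024]; // one large read per call
              long total = 0;
              int len;
              while ((len = in.read(buf, 0, buf.length)) != -1) {
                  total += len;
              }
              return total;
          }

          static long drainBuffered(InputStream in) throws IOException {
              // Buffering mainly helps many small reads (e.g. single bytes),
              // which is consistent with the unchanged ~85 s result.
              return drain(new BufferedInputStream(in, 8 * 1024 * 1024));
          }
      }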

       
      Last edit: gnosygnu 2014-02-18
      • Anselm D
        2014-02-19

        I tried with new BufferedInputStream(stream, 8 MB), but this made no difference in speed (still around 85 seconds).

        It was worth a try.

        Nope. I just used a standard InputStream. I'm doing stream.read(byte[], pos, len) where len is a large number (about 8 MB). I believe BufferedInputStream only makes a difference if I'm retrieving less than 8 KB (or some other value).

        OK, I think this is basically what BufferedInputStream does anyway, plus some overhead to manage the buffer.

         