#263 Import: Download and import in parallel, instead of serially

PFE
pfe
nobody
import (31)
2014-12-25
2013-11-29
gnosygnu
No

From Anselm D's comment in https://sourceforge.net/p/xowa/discussion/general/thread/72bb24b0/

Did you think about downloading, storing to a file (for later usage) and unpacking (with apache common) and "storing" to database in "one" step ("parallel")? This will reduce the time for installing, because the bottleneck should be the download.

This would allow a user to import a wiki faster while using no extra space.

Specifically, these are the scenarios

Scenario 1: Download, unzip, import:

  • Download .xml.bz2 data: 10 GB, 3 hours
  • Unpack the bz2 file to an xml file: 45 GB; 1 hour
  • Import the xml file: 25 GB; 2 hours
  • Total: 80 GB; 5 - 6 hours

Scenario 2: Download, import (directly from bz2 file):

  • Download .xml.bz2 data: 10 GB, 3 hours
  • Import directly from the bz2 file: 25 GB; 4 - 5 hours
  • Total: 35 GB; 7 - 8 hours

Scenario 3: Import while downloading (proposed)

  • Open an HTTP connection to the .xml.bz2 data: 0 GB, 0 hours
  • Import while downloading. Specifically, the importer would read data in 8 MB chunks, and extract data from each chunk: 25 GB; 5 hours
  • Total: 25 GB; 4 - 5 hours
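
A minimal sketch of the chunked-read idea in Scenario 3, not XOWA's actual importer. The class name StreamingImport, the ChunkHandler callback, and the URL passed by the caller are all hypothetical. The sketch only pumps raw bytes from the HTTP stream in 8 MB chunks; bz2 decompression and XML parsing would have to happen inside the handler and are not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

// Hypothetical sketch: read a remote dump in fixed-size chunks and hand each
// chunk to a callback, so nothing has to be stored on disk before importing.
final class StreamingImport {
    // Assumed chunk size, taken from the "8 MB chunks" in Scenario 3.
    static final int CHUNK_SIZE = 8 * 1024 * 1024;

    // Hypothetical callback; a real importer would decompress and parse here.
    interface ChunkHandler {
        void handle(byte[] buf, int len) throws IOException;
    }

    // Reads the stream until it ends, passing every non-empty read to the handler.
    // Returns the total number of bytes seen.
    static long pump(InputStream in, int chunkSize, ChunkHandler handler) throws IOException {
        byte[] buf = new byte[chunkSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            if (n > 0) {
                handler.handle(buf, n); // only buf[0..n) is valid
                total += n;
            }
        }
        return total;
    }

    // Opens the URL and pumps its body; the stream is closed when done.
    static long importFromUrl(String url, ChunkHandler handler) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return pump(in, CHUNK_SIZE, handler);
        }
    }
}
```

A caller would pass the dump URL and a handler, for example StreamingImport.importFromUrl(dumpUrl, (buf, len) -> parser.feed(buf, len)), where dumpUrl and parser.feed are placeholders. Note that a single read may return fewer bytes than the chunk size, so the handler has to cope with chunks of any length.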

Discussion

  • Anselm D

    Anselm D - 2014-02-02

    IMHO even if you unpack with 7zip and directly pipe into your import process, you can import the wiki faster because you do not have to write the unpacked data to the disc.

    In my special case, I will see an additional performance benefit, because I carry the .bz2 dump data to the offline computer on a USB 2 device. So the unpacking is done on this device.

     
  • gnosygnu

    gnosygnu - 2014-02-03

    Thanks. That's an interesting idea, and it will be much simpler to implement than Scenario 3.

    I'll take a look at it this week, and post again here.

     
  • gnosygnu

    gnosygnu - 2014-02-11

    Sorry. I had planned to take a look at this over the weekend, but I got swamped by Italian wiki issues.

    I'll take another look at it this weekend.

     
  • gnosygnu

    gnosygnu - 2014-02-16

    I looked at it today, and the first results look pretty good. I'm going to try to get it included for tomorrow's build as an option (with a default for v1.2.4 / v1.3.1)

    Strangely, it's still slower than unzipping and importing. I suppose it might be b/c of extra signalling / marshalling between 7z and Java.

    Stats below.

    stats for simplewiki-latest-pages-articles.xml.bz2: 91.6 MB
    
    time (s)  method
    --------  ------
          70  unzip bz2 to xml; import xml
          85  read directly from getInputStream on Process (7za e -so C:\wiki.bz2)
         212  read from apache commons bz2 library
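
For readers unfamiliar with the second method in the table, here is a hedged sketch of reading a decompressor's stdout from Java. It is not XOWA's code: the class StdoutBz2Reader and its methods are hypothetical, and the way the ~{src} placeholder is substituted is an assumption based on the option text quoted later in this thread. Only ProcessBuilder and Process.getInputStream() are standard Java.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: run an external decompressor (for example "7za e -so <file>")
// and read the decompressed bytes from its standard output.
final class StdoutBz2Reader {
    // Placeholder token for the dump path (assumed substitution scheme).
    static final String SRC_TOKEN = "~{src}";

    // Builds the argument list, replacing the placeholder with the dump path.
    static List<String> buildCommand(String exe, List<String> argTemplates, String srcPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        for (String arg : argTemplates) {
            cmd.add(arg.replace(SRC_TOKEN, srcPath));
        }
        return cmd;
    }

    // Starts the process. stderr is inherited so that the child cannot block
    // on a full stderr pipe that nobody reads.
    static Process start(List<String> cmd) throws IOException {
        return new ProcessBuilder(cmd)
            .redirectError(ProcessBuilder.Redirect.INHERIT)
            .start();
    }

    // Returns the child's stdout, which carries the decompressed xml.
    static InputStream openDecompressedStream(List<String> cmd) throws IOException {
        return start(cmd).getInputStream();
    }
}
```

With the command from the table, buildCommand("7za", Arrays.asList("e", "-so", "~{src}"), "C:\\wiki.bz2") yields the arguments 7za, e, -so, C:\wiki.bz2. The returned stream can then be read like any other InputStream.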
    
     
  • Anselm D

    Anselm D - 2014-02-17

    I would like to test it. Is it possible to test it without downloading? I downloaded it already.
    Estimated download time is 6h 30 minutes.

     
  • gnosygnu

    gnosygnu - 2014-02-17

    Yeah, just to be clear, this isn't Scenario 3 (import while downloading). This is basically a better bz2 importer. The current bz2 importer uses Apache Commons which turns out to be fairly slow.

    To test it, do the following:

    • Go to Help:Options/Import
    • Check "Import bz2 by stdout"
    • Make sure "Import bz2 by stdout process" looks correct (it should be)
    • (Optional for Help:Import/List) Change "Custom wiki commands" to "wiki.download,wiki.import"
    • Go to Help:Import/Script and pick your downloaded dump
    • Change "uncompress" to "read from compressed dump".
    • Click Generate script. You should see something like the below. You want to make sure the last argument is "" not "unzip"
    • Click Run script

    Let me know if you run into issues. I tested it on both Windows and Linux, so it should be fine.

    // import wiki from dump file
    app.setup.cmds.cmd_add("wiki.dump_file", "C:\xowa\wiki\#dump\done\simplewiki-latest-pages-articles.xml.bz2", "simple.wikipedia.org", "");
    
     
  • Anselm D

    Anselm D - 2014-02-17

    Thank you, it works now.

    testing with: 9,77 GB (10.492.810.412 Bytes)
    enwiki-20140102-pages-articles.xml.bz2

    At this moment 30% are done.
    task manager says: CPU Time for
    javaw.exe 1h 42 minutes
    7za.exe 40 minutes

    Strangely, it's still slower than unzipping and importing. I suppose it might be b/c of extra signalling / marshalling between 7z and Java.

    Do you use BufferedInputStream?

     
    • gnosygnu

      gnosygnu - 2014-02-18

      Do you use BufferedInputStream?

      Nope. I just used a standard InputStream. I'm doing stream.read(byte[], pos, len) where len is a large number (about 8 MB). I believe BufferedInputStream only makes a difference if I'm retrieving less than 8 KB (or some other value).

      I tried with new BufferedInputStream(stream, 8 MB), but this made no difference in speed. (still around 85 seconds)
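
One detail worth keeping in mind with stream.read(byte[], pos, len): the call may return fewer than len bytes, which is common when the source is a pipe from another process. The sketch below, with the hypothetical class ChunkFill, shows one way to keep reading until an 8 MB buffer is actually full. Whether XOWA does this is not stated in the thread.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: fill a large buffer from a stream that may deliver
// data in small pieces (for example a pipe from 7za or bzip2).
final class ChunkFill {
    // Reads until buf is full or the stream ends.
    // Returns the number of bytes stored; 0 means the stream was already at its end.
    static int fill(InputStream in, byte[] buf) throws IOException {
        int pos = 0;
        while (pos < buf.length) {
            int n = in.read(buf, pos, buf.length - pos);
            if (n == -1) {
                break; // end of stream
            }
            pos += n;
        }
        return pos;
    }
}
```

On Java 9 and later, InputStream.readNBytes(buf, 0, buf.length) provides the same behavior in the standard library.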

       
      Last edit: gnosygnu 2014-02-18
      • Anselm D

        Anselm D - 2014-02-19

        I tried with new BufferedInputStream(stream, 8 MB), but this made no difference in speed. (still around 85 seconds)

        It was worth a try.

        Nope. I just used a standard InputStream. I'm doing stream.read(byte[], pos, len) where len is a large number (about 8 MB). I believe BufferedInputStream only makes a difference if I'm retrieving less than 8 KB (or some other value).

        OK, I think that is basically what BufferedInputStream does, plus some overhead to manage the buffer.

         
        • gnosygnu

          gnosygnu - 2014-02-20

          It was worth a try.

          Yup. No harm in asking.

           
  • Anselm D

    Anselm D - 2014-02-17

    Windows XP
    Intel Pentium 4 530J
    3000.0 MHz

    20140217_191812.250 bldr done: 2m 9s 625f
    20140217_191812.250
    20140217_191812.250 wiki.init.bgn:simple.wikipedia.org

    read directly from getInputStream on Process
    20140217_185911.859 bldr done: 3m 32s 484f
    20140217_185911.875
    20140217_185911.875 wiki.init.bgn:simple.wikipedia.org

     
    Last edit: Anselm D 2014-02-17
  • Anselm D

    Anselm D - 2014-02-17

    My Ubuntu 64 Bit (with bzip2)
    Intel® Core™ i3-3220T CPU @ 2.80GHz × 4
    with SSD

    20140217_195809.079 bldr done: 48s 131f
    20140217_195809.081
    20140217_195809.082 wiki.init.bgn:simple.wikipedia.org

    read directly from getInputStream on Process
    20140217_200352.811 bldr done: 1m 5s 495f
    20140217_200352.812
    20140217_200352.813 wiki.init.bgn:simple.wikipedia.org

     
    Last edit: Anselm D 2014-02-17
  • gnosygnu

    gnosygnu - 2014-02-18

    Thanks for the stats!

    So it looks like Windows / 7z is 60% slower whereas Ubuntu / bzip2 is about 30% slower.

    Not great, but certainly much better than the current 300% via Apache Commons.
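
The percentages can be rechecked from the "bldr done" lines posted above. The sketch below is only an illustration: the class BldrTimes is hypothetical, and it assumes that the trailing "f" field in durations such as "2m 9s 625f" is milliseconds, which the thread does not confirm. Under that assumption the Windows pair (2m 9s 625f vs 3m 32s 484f) comes out roughly 64% slower and the Ubuntu pair (48s 131f vs 1m 5s 495f) roughly 36% slower.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing the durations printed in the logs.
final class BldrTimes {
    // Matches one field such as "1h", "55m", "17s" or "891f".
    private static final Pattern PART = Pattern.compile("(\\d+)([hmsf])");

    // Converts a duration like "1h 55m 17s 891f" to milliseconds.
    // Assumption: the "f" field is a millisecond fraction.
    static long parseMillis(String text) {
        long total = 0;
        Matcher m = PART.matcher(text);
        while (m.find()) {
            long n = Long.parseLong(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'h': total += n * 3_600_000L; break;
                case 'm': total += n * 60_000L; break;
                case 's': total += n * 1_000L; break;
                default:  total += n; break; // 'f', assumed milliseconds
            }
        }
        return total;
    }

    // How much slower the candidate is than the baseline, in percent.
    static double percentSlower(long baselineMillis, long candidateMillis) {
        return 100.0 * (candidateMillis - baselineMillis) / baselineMillis;
    }
}
```

For example, percentSlower(parseMillis("48s 131f"), parseMillis("1m 5s 495f")) gives the Ubuntu figure.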

    I'm probably going to end up making this the default for v1.3.1. I've seen a number of complaints about the space required for unzipping the bz2 to xml. The Apache Commons alternative was meant to address it, but it was really too slow for me to make it the default. I think the stdout should be quick enough, while other users can still fall back on the unzip and import.

    Thanks again for the suggestion, let me know if there is anything else.

     
  • Anselm D

    Anselm D - 2014-02-19

    Windows XP (32-bit)
    Intel Pentium 4 530J
    3000.0 MHz

    testing "read directly from getInputStream on Process"

    using 7z from the "original" 7zip (9.30 alpha) 32-bit installation :
    7-Zip 9.13 beta Copyright (c) 1999-2010 Igor Pavlov 2010-04-15

    setting "Import bz2 by stdout process" to C:\Programme\7-Zip\7z

    20140219_111419.062 bldr done: 3m 22s 453f
    20140219_111419.078
    20140219_111419.078 wiki.init.bgn:simple.wikipedia.org
    20140219_111419.078 wiki.init.db_mgr

    Using bzip2 32-bit for windows:
    http://gnuwin32.sourceforge.net/packages/bzip2.htm

    setting "Import bz2 by stdout process" to C:\Programme\GnuWin32\bin\bzip2.exe
    -dkc "~{src}"

    20140219_113740.562 cmd end: import.sql.term 625f
    20140219_113740.562 bldr done: 6m 2s 703f
    20140219_113740.562
    20140219_113740.562 wiki.init.bgn:simple.wikipedia.org

    Ubuntu 64-Bit:
    using:
    7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
    p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

    (apt-get install p7zip p7zip-full)

    20140219_120441.504 bldr done: 47s 581f
    20140219_120441.505
    20140219_120441.506 wiki.init.bgn:simple.wikipedia.org

     
    Last edit: Anselm D 2014-02-19
  • gnosygnu

    gnosygnu - 2014-02-20

    That's interesting. It looks like different programs perform differently (with p7zip being the best).

    This leads me to believe that the problem is probably program-side (7z / bzip2), and not Java buffering / marshalling.

    Thanks for the stats. Good to know.

     
  • Anselm D

    Anselm D - 2014-03-05

    Import German Wiki: Time with my "slower" computer with Windows XP:
    20140224_132453.375 bldr done: 1h 55m 17s 891f
    20140224_132453.390
    20140224_132453.500 wiki.init.bgn:de.wikipedia.org
    20140224_132453.500 wiki.init.db_mgr
    20140224_132453.500 wiki.init.lang
    20140224_132453.531 wiki.init.css
    20140224_132453.531 wiki.init.done
    20140224_132453.812 import.end de.wikipedia.org latest pages-articles

     
    Last edit: Anselm D 2014-03-05
  • Anselm D

    Anselm D - 2014-03-05

    Import German Wiki: Time with my faster computer with Ubuntu 64 Bit:
    20140225_225314.544 bldr done: 27m 5s 819f
    20140225_225314.544
    20140225_225314.545 wiki.init.bgn:de.wikipedia.org
    20140225_225314.551 wiki.init.db_mgr
    20140225_225314.553 wiki.init.lang
    20140225_225314.566 wiki.init.css
    20140225_225314.567 wiki.init.done
    20140225_225314.744 wiki.init.bgn:www.wikidata.org
    20140225_225314.744 wiki.init.db_mgr
    20140225_225314.746 wiki.init.lang
    20140225_225314.756 wiki.init.css
    20140225_225314.756 wiki.init.done
    20140225_225315.258 import.end de.wikipedia.org latest pages-articles

     
  • gnosygnu

    gnosygnu - 2014-03-06

    Thanks for the log detail.

    I'm surprised at the difference between the two machines. I'd expect that hardware specs are involved, in addition to the 7za vs bzip2 binaries. My Windows XP machine is 3.4 GHz and does a German Wiki import in about 50 min (I'll post a specific time later)

     
  • Anselm D

    Anselm D - 2014-03-06

    Windows XP 32 bit again:
    I started XOWA manually with the JDK and the --server option (which is not included in my 32-bit JRE).

    20140306_145139.452 bldr done: 1h 35m 34s 922f
    20140306_145139.452
    20140306_145139.530 wiki.init.bgn:de.wikipedia.org
    20140306_145139.546 wiki.init.db_mgr
    20140306_145139.546 wiki.init.lang
    20140306_145139.577 wiki.init.css
    20140306_145139.577 wiki.init.done
    20140306_145140.593 import.end de.wikipedia.org latest pages-articles
    20140306_145142.874 wiki.init.bgn:www.wikidata.org

     
    • gnosygnu

      gnosygnu - 2014-03-07

      Ok. I ran mine without the --server option. About 39m for the dump_mgr -- which is where all the import by bz2 work would take place. See below

      My machine specs are a bit high-end (3.4 GHz), so maybe that's where the difference is?

      20140307_011313.953 wiki.init.bgn:de.wikipedia.org
      20140307_015225.109 cmd end: dump_mgr 39m 8s 63f
      20140307_015438.171 bldr done: 41m 22s 156f
      
       
      • Anselm D

        Anselm D - 2014-03-08

        My machine is old. It runs at 3.0 GHz, but performance also depends on L1/L2/L3 cache, memory, hard disk, etc. I attached a screenshot from hwinfo32.

         
        Last edit: Anselm D 2014-03-08
        • gnosygnu

          gnosygnu - 2014-03-09

          Ok. Thanks for the screenshots / specs. Good to know.

           
        • Anonymous

          Anonymous - 2014-03-09

          My Ubuntu 64-Bit machine is a:
          Intel® Core™ i3-3220T CPU @ 2.80GHz × 4
          with an SSD, and I installed XOWA on the SSD.
          -- Anselm

           
          • gnosygnu

            gnosygnu - 2014-03-10

            Ok. Thanks again for the stats.

            The SSD explains why you're so fast (faster than me). I'm on a 5400 RPM drive.

             
            • Anonymous

              Anonymous - 2014-03-10

              Ubuntu 64-Bit
              OS and Java on SSD
              XOWA and dewiki-latest-pages-articles.xml.bz2 on a (slow) USB 2.0 NTFS hard disk (350 GB)
              20140310_134528.632 bldr done: 32m 49s 392f

              -- Anselm

               