
#223 Offline images (stage 4): Create image database for simplewiki

Milestone: v0.10.*
Status: closed
Owner: nobody
Labels: v0.10.3
Updated: 2013-11-04
Created: 2013-09-22
Creator: gnosygnu
Private: No

For stage 3, see this ticket

Stage 4 involves downloading all 80,000 images into XOWA's database scheme. I will probably have to tweak a good deal of stages 1 - 3 in working on this ticket.

When finished, I should have an estimated 5 SQLite databases of 200 MB each that together represent all the thumbs in simplewiki.
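As a rough illustration of what keeping thumbs in a SQLite database can look like, here is a minimal sketch. The table name `thumb`, its columns, and the helper functions are all hypothetical and are not XOWA's actual scheme; it only shows the general idea of storing image bytes as BLOBs keyed by file name and width.

```python
import sqlite3


def open_thumb_db(path=":memory:"):
    """Open (or create) a thumbnail database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS thumb ("
        " file_name TEXT NOT NULL,"
        " width INTEGER NOT NULL,"
        " data BLOB NOT NULL,"
        " PRIMARY KEY (file_name, width))"
    )
    return conn


def put_thumb(conn, file_name, width, data):
    """Store the raw bytes of one thumbnail, replacing any existing copy."""
    conn.execute(
        "INSERT OR REPLACE INTO thumb (file_name, width, data) VALUES (?, ?, ?)",
        (file_name, width, data),
    )
    conn.commit()


def get_thumb(conn, file_name, width):
    """Return the thumbnail bytes, or None if the image is missing."""
    row = conn.execute(
        "SELECT data FROM thumb WHERE file_name = ? AND width = ?",
        (file_name, width),
    ).fetchone()
    return row[0] if row else None


# Placeholder bytes stand in for a real JPEG thumbnail.
conn = open_thumb_db()
put_thumb(conn, "Example.jpg", 220, b"\xff\xd8fake-jpeg-bytes")
```

A lookup for an image that was never stored returns `None`, which is how a reader of such a database would notice a missing thumb.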

Discussion

  • gnosygnu

    gnosygnu - 2013-10-03

    So, I have a rough version of a file database for simplewiki. Some details:

    • There are 89 thousand images
    • The total database size for all these thumbs is about 1.6 GB
      • Note that the tarball size (the original files) is about 57 GB. So it looks like a saving of about 35x.
      • If English Wikipedia scales the same way, then its 2.2 TB of originals would come to about 62 GB of thumbs. I think this is probably too low, as my earlier estimates were around 120 GB. However, it does make me feel better that it will not be 200 GB.
    • Somewhere between 2% and 3% of images are missing. This is not really noticeable, but it is statistically significant. The major cause of the missing images is the data discrepancy between commons (2013-09-10) and simplewiki (2013-09-24). There are some other issues that are not worth detailing here.
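    The size estimate in the list above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this post; the variable names are illustrative, and 1 TB is assumed to be 1000 GB.

```python
# Figures quoted in the post above.
simplewiki_originals_gb = 57.0   # tarball of original files
simplewiki_thumbs_gb = 1.6       # resulting thumb databases
enwiki_originals_gb = 2.2 * 1000 # 2.2 TB, assuming 1 TB = 1000 GB

# How many times smaller the thumbs are than the originals.
ratio = simplewiki_originals_gb / simplewiki_thumbs_gb

# Projected English Wikipedia thumb size if the same ratio holds.
enwiki_thumbs_gb = enwiki_originals_gb / ratio

print(f"saving: about {ratio:.0f}x")
print(f"projected enwiki thumbs: about {enwiki_thumbs_gb:.0f} GB")
```

    The projection rests entirely on the assumption that English Wikipedia compresses at the same ratio as simplewiki, which is the part the post itself doubts.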

    At this point, I'm going to ask for volunteers to test simplewiki and other smaller wikis. I'm planning to put simplewiki on Google Drive, or maybe some other file locker service. I'll also include a formal notice in v0.10.0 asking for help.

    So with that in mind:

    • Let me know if you're interested in testing simplewiki. A simple comment below would help me gauge if there is interest in simplewiki.
    • If you're not interested in simplewiki, but would like to test another wiki instead, let me know which one. If it's small enough (i.e., not one of the top 10 Wikipedias), I will create a version for it. German Wikibooks or English Wikivoyage would be examples.

    Some caveats:

    • There will be no backward compatibility. I expect to change the database format and XOWA itself, so a database downloaded now will not work with a future version of XOWA (or vice versa). You will have to download a new version of simplewiki. If downloading 1.6 GB is painful, then you should hold off.
    • You will need commonswiki installed (and some satellite databases). This will be another 16 GB of space.
    • The current implementation is very preliminary. The main purpose is to be able to get thumbs for an article without going to the internet. There is a lot of optimization that can be done (in time and disk space).
     
  • Anonymous

    Anonymous - 2013-10-05

    I am happy to help test Simplewiki thumbnails.
    To remind you, a) I am limited to an hour on the local library service and b) for some reason I am unable to run Xowa online at the library.
    I don’t know Google Drive, but I hope that it will enable me a) to use a downloader (I use OrbitDownloader) so that I can break off the download if I run out of time and resume next time, and b) to download the thumbnails without needing Xowa.
    -JosephWebber

     
  • Anonymous

    Anonymous - 2013-10-05

    I am also ready but have a few limitations, some similar to what JosephWebber pointed out above, due to lack of internet access. If you can split the 1.6 GB into two parts, it will be easier for me to download. Thanks.

    hidp123

     
  • gnosygnu

    gnosygnu - 2013-10-07

    Thanks for the replies. I really appreciate the offers.

    I had some personal duties this weekend, and wasn't able to get simplewiki support into v0.10.0. I'll work on this for v0.10.1 and add more info to this thread.

    Sorry about the lack of follow-through on my side.

     
  • gnosygnu

    gnosygnu - 2013-10-14

    v0.10.1 has experimental support for offline thumbnail databases. See http://sourceforge.net/p/xowa/discussion/general/thread/8d325449 for more info.

    I'll use this thread for more technical details.

     
  • gnosygnu

    gnosygnu - 2013-10-28

    I'm going to mark this item done for now. I got independent validation from 3 users that the offline thumb dbs work. I'm going to turn my attention to the larger wikis, as they have their own challenges.

     
  • gnosygnu

    gnosygnu - 2013-10-28
    • status: in-progress --> done
     
  • gnosygnu

    gnosygnu - 2013-11-04
    • status: done --> closed
     
