Offline thumbnail databases: Progress and request thread

gnosygnu
2013-11-11
2014-07-19
  • gnosygnu
    2013-11-11

    This thread will track my progress in creating the offline thumbnail databases.

    I've generated all the image databases for wikis that are listed as "More than 200,000 articles" on http://en.wikipedia.org. In addition, I've added a few wikis from the "More than 50,000 articles" group. The full list is below.

    Some notes concerning requests:

    • If you want a Wikipedia that's not on the list, you can request it in this thread, and I will work on it within the coming week or the next.

    • You do not need to request the sister wikis for your language (Wiktionary, Wikisource, Wikivoyage). I will generate them along with the main Wikipedia.

    • Currently, monthly updates are only being done for English Wikipedia. German, Polish and Chinese are updated about every 3 months, while others fall closer to every 4 or 5 months. I am still trying to set up an automated system for generating updates. If you want an update for your wiki, please make a request below. Otherwise, it may not come for several months.


    Links

    Important pages you should probably visit:


    Current
    • 2014-10-06: Thai wikis (update) and Slovenian wikis (new)


    History

    • 2013-10-14: Simple Wikipedia is done. See Setup Simple Wiki
    • 2013-11-25: English Wikipedia is done. Total size is 74.5 GB. See Setup English Wiki
    • 2013-11-26: English Wiktionary, Wikiquote, Wikivoyage and Wikispecies are done.
    • 2013-12-09: German Wikipedia is done. Total size is 32 GB. See Setup Other Wikipedias
    • 2013-12-09: German Wiktionary, Wikivoyage and Wikiquote are done.
    • 2013-12-16: English Wikipedia update (2013-12-02). All Portal images also included
    • 2013-12-22: English Wikisource, Wikibooks, and Wikiversity are done.
    • 2013-12-30: French Wikipedia, Wiktionary, Wikivoyage, Wikiquote, Wikibooks and Wikiversity are done.
    • 2014-01-06: Polish Wikipedia, Wiktionary, Wikivoyage, Wikiquote, Wikibooks and Wikinews are done.
    • 2014-01-13: English Wikipedia update (2014-01-02); Chinese Wikipedia, Wiktionary, Wikiquote, Wikibooks and Wikinews are done.
    • 2014-01-20: German Wiktionary (with audio) (2014-01-04), German Wikibooks (2014-01-15), German Wikiversity (2014-01-11), German Wikinews (2014-01-13), French Wikinews (2014-01-11), and English Wikinews (2014-01-16).
    • 2014-01-27: English, German, French, Polish, Chinese Wikisource
    • 2014-02-03: Dutch, Latin wikis; Wikimedia Commons (Mainspace only)
    • 2014-02-10: Italian wikis
    • 2014-02-17: Spanish wikis; English Wikipedia update
    • 2014-02-24: Russian wikis; Greek wikis; Chinese Wikivoyage; German Wikipedia update
    • 2014-03-03: Swedish wikis
    • 2014-03-10: Japanese wikis
    • 2014-03-17: Ukrainian wikis
    • 2014-03-24: Arabic wikis
    • 2014-03-31: Hungarian wikis
    • 2014-04-07: Thai wikis; Portuguese wikis
    • 2014-04-14: Norwegian wikis
    • 2014-04-21: English wiki update; Catalan wiki
    • 2014-04-28: Vietnamese wikis
    • 2014-05-05: Finnish wikis; Chinese wikis update
    • 2014-05-12: Polish wikis update; Czech wikis
    • 2014-05-19: English wikis update; Simple wiki update; Species Wiki update
    • 2014-05-26: French (update); Korean (new); Bengali (new)
    • 2014-06-02: Persian (new); German (update)
    • 2014-06-09: Turkish (new); Dutch (update)
    • 2014-06-16: Indonesian (new); Italian (update)
    • 2014-06-23: English (update); Latin (update)
    • 2014-06-30: Romanian (new); Spanish (update)
    • 2014-07-07: Serbian (new); Russian (update) (or Swedish)
    • 2014-07-14: Malay (new); Swedish (update)
    • 2014-07-21: Hebrew (new); English (update)
    • 2014-07-28: English (rebuild), Ukrainian (update) and Greek (update)
    • 2014-08-04: Japanese (update) and Bulgarian (new)
    • 2014-08-11: German (rebuild) and Danish (new)
    • 2014-08-18: French (rebuild) and Polish (rebuild)
    • 2014-08-25: English wiki (images) and Arabic (rebuild)
    • 2014-09-01: Chinese (rebuild) and Serbo-Croatian (new)
    • 2014-09-08: English sister wikis (update) and Croatian wikis (new)
    • 2014-09-15: Hungarian wikis (update) and Esperanto wikis (new)
    • 2014-09-23: Portuguese wikis (update) and Slovak wikis (new)
    • 2014-09-30: English Wikipedia (update) and Waray-Waray Wikipedia (new)

    List of completed wikis
    
    simple.wikipedia.org
    de.wikipedia.org
    fr.wikipedia.org
    pl.wikipedia.org
    zh.wikipedia.org
    nl.wikipedia.org
    la.wikipedia.org
    it.wikipedia.org
    es.wikipedia.org
    ru.wikipedia.org
    el.wikipedia.org
    sv.wikipedia.org
    ja.wikipedia.org
    uk.wikipedia.org
    ar.wikipedia.org
    hu.wikipedia.org
    th.wikipedia.org
    pt.wikipedia.org
    no.wikipedia.org
    ca.wikipedia.org
    vi.wikipedia.org
    fi.wikipedia.org
    cs.wikipedia.org
    ko.wikipedia.org
    bn.wikipedia.org
    fa.wikipedia.org
    tr.wikipedia.org
    id.wikipedia.org
    ro.wikipedia.org
    sr.wikipedia.org
    ms.wikipedia.org
    he.wikipedia.org
    bg.wikipedia.org
    da.wikipedia.org
    sh.wikipedia.org
    hr.wikipedia.org
    eo.wikipedia.org
    sk.wikipedia.org
    war.wikipedia.org
    
    sl.wikipedia.org
    ceb.wikipedia.org
    gl.wikipedia.org
    lt.wikipedia.org
    et.wikipedia.org
    mk.wikipedia.org
    
     
    Last edit: gnosygnu 3 days ago
  • Anselm D
    2013-11-12

    The Main Page inside XOWA said: Help Wanted!
    I can help seed the torrents.

    I would like to have the German Wikipedia too.
    I would like to start testing with the smaller German image wikis:
    dewikibooks
    dewikinews
    dewikiquote
    dewikisource
    dewikiversity
    dewikivoyage
    dewiktionary

    Is the "help wanted information" displayed inside XOWA available anywhere on the web? For blogging / posting / tweeting about XOWA, it would be useful to give people a URL to this information.

     
  • gnosygnu
    2013-11-13

    I can help seed the torrents.

    Great! Thanks for replying, as well as offering help with seeding!

    As the image dbs complete in the coming weeks, I'll post more instructions / links above. I'll try to start seeding simplewiki this weekend.

    I would like to have the German Wikipedia too.

    German Wikipedia will be next after English Wikipedia. Barring any problems, I'm targeting 2013-12-09.

    I would like to start testing with the smaller German image wikis:

    Fair enough. If these are quick, I'll try to squeeze them out with dewiki. If not, do you have any priority for them (in case I can only get to two or three of them)? The list above seems to be alphabetical.

    Is the "help wanted information" displayed inside XOWA available anywhere on the web?

    Sadly, no, especially as I don't have a website. I'm going to do more publicizing when I have full sets, but I really wanted to get as much help as I could from any current users (who use XOWA and know what it can do).

    Thanks!

     
    • Anselm D
      2013-11-13

      I'll try to start seeding simplewiki this weekend.

      If you are ready with it, tell me; I will try to help you seed.

      Fair enough. If these are quick, I'll try to squeeze them out with dewiki. If not, do you have any priority for them (in case I can only get to two or three of them)? The list above seems to be alphabetical.

      If you do the German Wikipedia first, there is no hurry for the other German Wikis. I would like to have them for my father; he is offline.

      Thank you for xowa!
      Anselm

       
      • gnosygnu
        2013-11-14

        If you are ready with it, tell me; I will try to help you seed.

        I'll put together something after this weekend and post here.

        If you do the German Wikipedia first, there is no hurry for the other German Wikis

        Ok. German Wikipedia will be first and should come out around 12-09. I'll try to add dewikivoyage and another "easy" wiki if possible. Otherwise, the rest should be available the following month (barring any issues).

        Thanks again for the feedback!

         
      • gnosygnu
        2013-11-18

        FYI: I added the torrent to the v0.11.2 build. I'm also attaching it below.

        Let me know if you get a chance to try the torrent. I don't need this to be seeded, but I'd just like independent confirmation that it works (i.e.: it brings down a full set).

        Thanks.

         
        • Anselm D
          2013-11-18

          Just started. The tracker is available, but the client cannot find a peer. For a very short time there was a peer and the download started, but now it is gone. Now it is working (speed 90-120 kB/s)... and now the peer is gone again.

           
          Last edit: Anselm D 2013-11-18
          • gnosygnu
            2013-11-18

            I see your connections coming in and out. I'm not sure why they fade to zero. I don't know if this is an issue with qBittorrent (I assume my network settings are correct since you are getting through).

            I'll keep the computer up and look at this more tomorrow (it's late here). I'm glad that you're at least able to get some bytes though.

            Thanks!

             
            • Anselm D
              2013-11-18

              Now it seems to be stable. If you do not pack it, maybe you should publish checksum files?
              Good night, I am going to work now.

               
              • gnosygnu
                2013-11-18

                Good idea. I'll add md5s to the next set.
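
A minimal sketch of how such checksum files could be produced and checked, assuming the unpacked database files sit in one directory. The function names and the manifest name `checksums.md5` are made up for illustration; they are not part of XOWA.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks so large databases are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(directory, manifest_name="checksums.md5"):
    """Write one 'digest  relative/path' line per file under directory (hypothetical manifest name)."""
    root = Path(directory)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.name == manifest_name:
            continue  # do not checksum the manifest itself
        lines.append(f"{md5_of_file(path)}  {path.relative_to(root).as_posix()}")
    manifest = root / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

A downloader can then recompute `md5_of_file` for each file and compare it with the published line. MD5 is enough to catch transfer errors, though it is not a defense against deliberate tampering.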

                 
              • gnosygnu
                2013-11-18

                FYI: I see that there are now 2 seeds, so it looks like the transfer completed.

                Thanks!

                 

  • Anonymous
    2013-11-15

    I myself will probably not download an image database for de.wikipedia. But: Is it possible to create a database for only a few images which are used on many pages? A database that just includes flag icons at 20px (Wikipedia:Ländervorlagen mit Flagge, Wikipedia:Vorlagen subnationaler Einheiten mit Flagge, Wikipedia:Ländervorlagen ehemaliger Staaten und Flaggen, Wikipedia:Ländervorlagen Handels-, Dienst- und Kriegsflagge, Wikipedia:Vorlagen internationaler Organisationen mit Flagge) and locator maps at 240px (Special:Allpages/Vorlage:Positionskarte) (+ File:Red_pog.svg at 8px) should be small, but include most of the images I really miss offline. I could probably provide a page which uses all the proposed images, if that helps. --Schnark

     
    • gnosygnu
      2013-11-16

      But: Is it possible to create a database for only a few images which are used on many pages?

      Yup, it should be possible. After I generate a full set for dewiki, I can do a post-processing step to generate a smaller version. I think the code should be simple.

      I could probably provide a page which uses all the proposed images, if that helps.

      No need. I actually have such a list. I've attached a csv list of 38,943 images below that are used on 4 or more pages. (the three columns are title,file_width,#_of_pages_used)

      I picked "4 or more" b/c it looks like approximately 45,000 images will fit in one GB. This is based on simplewiki's / enwiki's build so far.

      So, 38,943 will represent a lot of the reused images, and still be less than 1 GB. In fact, they will probably be much less than 1 GB (500 MB?) since these are probably small-sized images, and the 45,000 figure is based on an average-sized image.

      I provide a larger breakdown below for images on dewiki.

      2,059,702: total
      1,853,379: = 1
        206,323: > 1
         66,018: > 2
         38,943: > 3 
         29,372: > 4
         24,683: > 5
         21,642: > 6
         19,419: > 7
         17,613: > 8
         16,200: > 9
         14,983: > 10
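
A rough sketch of how a breakdown like this, and the size estimate, could be recomputed from the three-column csv (title, file_width, number of pages used). The sample rows and their counts are invented for illustration, and the 45,000 images per GB figure is only the approximation quoted above.

```python
import csv
import io

IMAGES_PER_GB = 45_000  # approximate figure from the simplewiki / enwiki builds


def usage_breakdown(rows, max_threshold=10):
    """Count images used on exactly 1 page and on more than n pages, for n = 1..max_threshold."""
    counts = [int(pages) for _title, _width, pages in rows]
    breakdown = {"total": len(counts), "= 1": sum(1 for c in counts if c == 1)}
    for n in range(1, max_threshold + 1):
        breakdown[f"> {n}"] = sum(1 for c in counts if c > n)
    return breakdown


def estimated_size_gb(image_count, images_per_gb=IMAGES_PER_GB):
    """Upper-bound style estimate: assumes every image is average-sized."""
    return image_count / images_per_gb


# Invented sample rows in the csv layout: title,file_width,#_of_pages_used
sample = io.StringIO("Flag_of_Germany.svg,20,5\nRed_pog.svg,8,4\nSome_photo.jpg,220,1\n")
rows = list(csv.reader(sample))
print(usage_breakdown(rows, max_threshold=4))
print(round(estimated_size_gb(38_943), 2))
```

With the real numbers, 38,943 / 45,000 comes to roughly 0.87 GB if every image were average-sized, which is why the "used on 4 or more pages" cutoff stays under 1 GB.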
      
       
    • gnosygnu
      2013-12-09

      I uploaded a copy of the "light" German Wikipedia image database to my Google drive: https://drive.google.com/file/d/0B9cb52zjL2rIY1lnNmppY0VVSG8/edit?usp=sharing

      Let me know if you have problems accessing it, and I'll upload it elsewhere.

      The file is about 545 MB in total and has 40k images. To use it, unzip the files to /xowa/file/de.wikipedia.org. When you're done, you should have a file called /xowa/file/de.wikipedia.org/fsdb.main/fsdb.abc.sqlite3
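
The unzip-and-check step can be sketched as below. This assumes the archive contains `fsdb.main/fsdb.abc.sqlite3` at its top level, which the thread does not confirm; the function name and the `xowa_root` parameter are hypothetical.

```python
import zipfile
from pathlib import Path


def install_image_db(zip_path, xowa_root, wiki="de.wikipedia.org"):
    """Extract the archive into <xowa_root>/file/<wiki> and confirm the main database is present."""
    target = Path(xowa_root) / "file" / wiki
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Path taken from the post above; the archive layout itself is an assumption.
    expected = target / "fsdb.main" / "fsdb.abc.sqlite3"
    if not expected.is_file():
        raise FileNotFoundError(f"expected {expected} after unzipping")
    return expected
```

If the check fails, the files were most likely unzipped one directory level too high or too low.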

       
  • Anselm D
    2013-11-18

    If I have different wikis (e.g. English and German) and an "offline thumbnail database" for each wiki, are the thumbnails they have in common stored twice?

     
  • gnosygnu
    2013-11-18

    Yes. Unfortunately, these image databases will be self-contained for each wiki.

    I will look into giving the user the ability to merge them in a future release, but it won't be there for a while.

    Thanks.

     
  • Anselm D
    2013-11-26

    FYI:
    Yesterday I started to download the enwiki images as torrents. In all parts I can see other downloaders, except in ...complete_01. As a test, I added the torrent from your distribution again. My client says, "The torrent you are trying to add is already in the list of torrents" (this is OK for me). But if I add the torrent from archive.org, it is also added (and my client starts checking it, because the files are already there). I can see they have different hash values. The one from your distribution ends with F13E, and the one I downloaded from archive.org (https://archive.org/download/Xowa_enwiki_2013-11-04_images_complete_01/Xowa_enwiki_2013-11-04_images_complete_01_archive.torrent) ends with 21CF.

     
    • gnosygnu
      2013-11-26

      Thanks for noticing.

      For now, use the one from archive.org. It is more up to date.

      I don't know why they're different, but I'm going to guess they changed b/c the meta-data files were updated. Unfortunately, someone from the WikiTeam pointed out that the licenses needed to be corrected: https://archive.org/details/Xowa_enwiki_2013-11-04_images_complete_00. I changed them all today at about 03:00 UTC. This may have rippled out and caused the issues with the torrents you're seeing.

      If this is the case, I don't know what else to recommend. I don't have much control over the matter, as it's all at archive.org...
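
For anyone who wants to compare two .torrent files directly: the hash a client shows is normally the BitTorrent v1 info-hash, the SHA-1 of the bencoded "info" dictionary, so any change to the file list inside the torrent (such as regenerated metadata files) yields a different hash. The sketch below is a minimal, unhardened reader that assumes well-formed input; the function names are made up for illustration.

```python
import hashlib


def _skip_value(data, i):
    """Return the index just past the bencoded value that starts at data[i]."""
    lead = data[i:i + 1]
    if lead == b"i":                      # integer: i<digits>e
        return data.index(b"e", i) + 1
    if lead in (b"l", b"d"):              # list or dict: l...e / d...e
        i += 1
        while data[i:i + 1] != b"e":
            i = _skip_value(data, i)
        return i + 1
    colon = data.index(b":", i)           # byte string: <length>:<bytes>
    return colon + 1 + int(data[i:colon])


def info_hash(torrent_bytes):
    """SHA-1 of the raw bencoded 'info' dictionary, as an uppercase hex string."""
    if torrent_bytes[:1] != b"d":
        raise ValueError("not a bencoded dictionary")
    i = 1
    while torrent_bytes[i:i + 1] != b"e":
        colon = torrent_bytes.index(b":", i)
        key_end = colon + 1 + int(torrent_bytes[i:colon])
        key = torrent_bytes[colon + 1:key_end]
        value_end = _skip_value(torrent_bytes, key_end)
        if key == b"info":
            return hashlib.sha1(torrent_bytes[key_end:value_end]).hexdigest().upper()
        i = value_end
    raise ValueError("no 'info' key found")
```

Reading both files with `open(path, "rb").read()` and passing the bytes to `info_hash` shows whether they describe the same content: a different tracker URL alone leaves the hash unchanged, while a different file list does not.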

       

  • Anonymous
    2013-11-27

    After sending you the message I did some exploring and came to the same conclusion! I use a download manager and it's very fast.
    One question: as you release the other thumbnail sets, presumably I can delete the corresponding tarballs. But are all the images in the thumbnail sets? If so, I assume that once I have thumbnails for all my wikis I can also delete commonswiki and save almost a terabyte of disk space in total. Is that correct?

     
    • gnosygnu
      2013-11-27

      If so, I assume that once I have thumbnails for all my wikis I can also delete commonswiki and save almost a terabyte of disk space in total. Is that correct?

      Note that the thumbnails are smaller versions of the original file. They generally have a lower width and less detail. If you want to see the original file (for example, by clicking on the thumbnail), then you will still need commonswiki.

      If you rarely click on a thumbnail, then you can definitely dump commonswiki. If you click on them somewhat frequently, then you may want to keep the tarball around. However, 1 TB is a lot (and it should be a lot more: English Wikipedia is 2.2 TB) so it's a balance you need to judge.

      Hope this helps.

       
  • Anselm D
    2013-12-09

    On the main page of http://xowa.sourceforge.net/ you have a list of the wikis you have done. I think you want to emphasize which image thumbnails you have prepared? The word "image" is missing only from the 3rd one.

     
    • gnosygnu
      2013-12-09

      Ok. I changed the main page to include the word images. Let me know if you think it should say something else.

      Thanks.

       
  • Anselm D
    2013-12-09

    In this discussion you have a broken link (Setup Other Wikipedias):
    2013-12-09: German Wikipedia is done. Total size is 32 GB. See Setup Other Wikipedias

     