Offline thumbnail databases: Progress and request thread

gnosygnu
2013-11-11
2015-05-05
  • Anselm D

    Anselm D - 2014-04-12

    Oops. No spell check in browser.

    ;-)
    Norweigian->Norwegian

     
    • gnosygnu

      gnosygnu - 2014-04-13

      Ughh.... Thanks for the catch. Fixed it now.

       
  • and0r

    and0r - 2014-05-15

    Images from roughly February through April 2nd, 2014 do not appear in their corresponding articles for me. I suspect that one of the latest enwiki image updates may be broken, but I'm unsure whether it's a local issue or whether the snapshot state of the wiki image dumps is a month or so behind their listed date. I'm quite sure I did everything right.

    Using 1.5.1.1 (x64).
    The most recent enwiki image update I have installed is for April 2nd, 2014,
    and the latest enwiki article dump is for May 2014.

     
    Last edit: and0r 2014-05-15
  • gnosygnu

    gnosygnu - 2014-05-15

    Thanks for all the details. (Also, I deleted the other redundant post)

    When you get a chance, can you give me one or two examples of an article and a missing image?

    Also, you're using the 2014-05 dump. I'm planning to release the update this Monday, and it has 70,000 images. I find that a fair number of articles are missing images when you use a later dump (2014-05) with an earlier image set (2014-04). English Wikipedia has quite a bit of churn / turnover for images -- even with well-established articles (for example, en.w:Earth changes almost every month).

    In other words, the missing images may be part of this Monday's 2014-05 update. Again, an example would be illustrative...

    Some other thoughts:

    • Have you installed all the sets listed at https://archive.org/details/Xowa_enwiki_2014-04-02_images_update?
    • Have you installed them in chronological order? 2013-11 -> 2013-12 -> 2014-01 -> etc
    • In \xowa\file\en.wikipedia.org\fsdb.main, do you have fsdb.bin files numbered from 0000 to 0091? (See the small checker sketch after this list.)
    • Why do you think only the February - April images are missing?
    • I haven't made any changes to the images code in some time. (so v1.5.1.1 should be fine)
    • When I release the updates, I review them (50+ pages of 50+ images). This is in addition to normal day-to-day usage. If there were swaths of missing images in the 2014-04 set, I didn't spot them.
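
    Here's that small checker sketch for the fsdb.bin bullet above. It's a rough Java sketch, not XOWA code, and the exact file names on your disk may differ from what it looks for, so treat it as a starting point only: it scans the fsdb.main folder for anything named like an fsdb.bin database and reports which of the 0000-0091 numbers it can't find.

        import java.io.File;
        import java.util.TreeSet;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class FsdbCheck {
            public static void main(String[] args) {
                // Adjust this path to wherever your XOWA install lives.
                File dir = new File("C:/xowa/file/en.wikipedia.org/fsdb.main");
                File[] files = dir.listFiles();
                if (files == null) { System.out.println("folder not found: " + dir); return; }
                Pattern num = Pattern.compile("(\\d{4})");
                TreeSet<Integer> found = new TreeSet<>();
                for (File f : files) {
                    if (!f.getName().contains("fsdb.bin")) continue;   // assumed naming; adjust if needed
                    Matcher m = num.matcher(f.getName());
                    if (m.find()) found.add(Integer.parseInt(m.group(1)));
                }
                for (int i = 0; i <= 91; i++)
                    if (!found.contains(i)) System.out.printf("missing fsdb.bin #%04d%n", i);
                System.out.println("fsdb.bin databases found: " + found.size());
            }
        }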

    Hope this helps.

     
  • and0r

    and0r - 2014-05-18

    The "other redundant post" was done as anonymous and never appeard on site. I then specifically made an account just to speak with you.

    Have you installed all the sets listed at https://archive.org/details/Xowa_enwiki_2014-04-02_images_update?

    yes

    Have you installed them in chronological order?

    yes

    Why do you think only the February - April images are missing?

    Because they are!
    lol, actually, a simple method I use for testing images is to browse "Wikipedia:Recent_additions" in XOWA (offline, adapter disabled).
    There is a list of preview images spanning the selected month. They do not load for the selected month (March for sure). I've manually checked each article's creation date and also directly checked most articles for the month, all with no results.

    I also double hash-check all my files in my torrent client just to be sure.
    Anyway, great app! I've never seen anything useful written in Java before (that was a Java joke ;)
    Please also include a method of browsing random articles, similar to the Special:Random function on Wikipedia, so that I don't kill Wikipedia's servers all day!

    There are absolutely no images listed for the entire month of March, all the way up to April 2nd. I can tell you that for sure.

    BEST APP EVER

     
    Last edit: and0r 2014-05-19
  • gnosygnu

    gnosygnu - 2014-05-19

    Hi! Thanks for the follow-up. I reply selectively below:

    The "other redundant post" was done as anonymous and never appeard on site.

    Understood. I just wanted to let you know I deleted it, in case you expected to see it. (By the way, thanks for taking the time to make an account).

    is to browse "en.wikipedia.org/wiki/Wikipedia:Recent_additions"

    Nice! I didn't know about that page before, and I will definitely use it in the future as well.

    They do not load for the selected month (March for sure)

    Ok. Thanks for the example. I checked w:Wikipedia:Recent_additions/2014/March, and most of the images didn't load with the 2014-04-02 image set. However, about 1/3 did. For example, the following were the first 3 images on the page that loaded (everything else up until then did not load):

    • 2014-03-30: Bremen Cotton Exchange building has glass mosaics (example pictured)
    • 2014-03-24: that teenager Rywka Lipszyc's diary of her life in the Łódź Ghetto (pictured)
    • 2014-03-23: that Charlotte von Kalb (pictured)

    In case you're interested, I think these images were missing because of the edit date.

    Basically, I use a commons dump to generate the images, so if an image doesn't make the cutoff, it goes into the following month's update.

    At any rate, these images are there with the 2014-05-02 update. I checked the page now, and all images show. I also included a hash file of all the update files if you want to check: https://archive.org/details/Xowa_enwiki_2014-05-02_images_update

    Anyway, great app! I've never seen anything useful written in Java before (that was a Java joke ;)

    ;)
    Though in all seriousness, Java and the SWT Web Browser really did make this much easier. Using Firefox / XULRunner programmatically in Java was one of the nicest surprises in developing XOWA.
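
    In case anyone is curious what that looks like, a bare-bones SWT Browser embedding is roughly the sketch below. This is not XOWA's actual startup code, just an illustration of the idea; it needs the org.eclipse.swt jar for your platform, and the SWT.MOZILLA style is what asks for the XULRunner-backed renderer.

        import org.eclipse.swt.SWT;
        import org.eclipse.swt.browser.Browser;
        import org.eclipse.swt.layout.FillLayout;
        import org.eclipse.swt.widgets.Display;
        import org.eclipse.swt.widgets.Shell;

        public class BrowserSketch {
            public static void main(String[] args) {
                Display display = new Display();
                Shell shell = new Shell(display);
                shell.setLayout(new FillLayout());
                // SWT.MOZILLA requests the XULRunner / Gecko renderer; SWT.NONE takes the platform default.
                Browser browser = new Browser(shell, SWT.MOZILLA);
                browser.setText("<html><body><p>Hello from an embedded browser</p></body></html>");
                shell.setSize(800, 600);
                shell.open();
                while (!shell.isDisposed()) {
                    if (!display.readAndDispatch()) display.sleep();
                }
                display.dispose();
            }
        }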

    Please also include a method of browsing random articles, similar to the Special:Random function

    Actually, I broke the sidebar by mistake in v1.5.2.2. It's fixed in v1.5.3.1: https://sourceforge.net/projects/xowa/files/v1.5.3/. You'll see a "Random article" link in the sidebar, just like on Wikipedia.

    Also, you can always do en.wikipedia.org/wiki/Special:Random as well as Ctrl+Shift+R. (neither of these broke in v1.5.2)

    BEST APP EVER

    Thanks! I really do appreciate the praise!

     
  • Anonymous - 2014-06-17

    Hey gnosygnu,

    It's the guy whose simple couldn't-run-xul_runner.exe-because-of-Windows-reinstall issue you patiently worked out a few days ago. Sorry, too lazy to make an account.

    Just to update, everything went slick as frog spit after your help and I even went ahead and got the ~80 gigs of thumbnails with relative ease.

    Now my (hopefully simple) question is about the following response you gave to someone here who asked a question I also had...

    The short answer is that your setup is fine. The image databases only contain thumbnails. More work will be involved to get the full image.

    A longer answer is that you'd need to do the following:

    • Setup commons.wikimedia.org (this will handle the "could not find file" issue)...
    • For "an incremental set" (saving images one at a time), just navigate to a commons page and click on the image. XOWA will download the file and store it in the fsdb.user databases: D:\xowa\file\en.wikipedia.org\fsdb.user

    So...

    1. How do I "setup commons.wikimedia.org"? I promise I attempted to research this one and I came to the conclusion that I'm kind of slow at anything non-GUI. And...

    2. If I have my main xowa folder on a separate partition (same HDD)* from the one I have my internet browser (firefox) on, will this prevent the "...click on the image. XOWA will download the file and store it in the fsdb.user databases: D:\xowa\file\en.wikipedia.org\fsdb.user " step from being possible? If so, is there an easy modification that still allows for my wiki partition? (I guess maybe installing a separate internet browser in that partition?)

    3. Also, just a shot in the dark, but are there 'categorical/subject-matter' links to full images (i.e. just 'math' or 'science') that might be closer to 50-100 GB that I could download, rather than the extreme entire ~2.2 TB set or inconvenient individual images? (I'm in engineering school, so that would be a help for larger pics of just that stuff, but I'm not getting my hopes up for this one)

    *I partitioned my HDD thinking that I can safely/easily transfer all my isolated mass wiki info in case I get a bad virus. Was this a bad idea for any reason in particular?

     
  • gnosygnu

    gnosygnu - 2014-06-18

    Responding selectively below. Let me know if I missed anything.

    Sorry, too lazy to make an account.

    No worries. Anonymity is perfectly fine.

    Just to update, everything went slick as frog spit after your help and I even went ahead and got the ~80 gigs of thumbnails with relative ease.

    Cool. Glad to hear that downloading all those files was easy.

    How do I "setup commons.wikimedia.org"?

    Again, no worries. I find this is a frequent point of confusion.

    commons.wikimedia.org is a wiki just like any other. You can set it up by doing the following:

    • On the Main Menu: Tools -> Import From List
    • At home/wiki/Help:Import/List, click on the "latest" link next to "commons.wikimedia.org". This will set up the wiki, just like the others

    For more info on Commons, look at home/wiki/Help:Wikis/Commons

    If I have my main xowa folder on a separate partition (same HDD)* from the one I have my internet browser (firefox) on, will this prevent the "...click on the image.

    Nope. XOWA has no dependency on the internet browser location, especially for clicking on an image.

    So, as an example:

    • You've setup the following two wikis from Help:Import/List: en.wikipedia.org and commons.wikimedia.org
    • In XOWA, you navigate to en.wikipedia.org/wiki/Earth
    • You click on the image in the Infobox: "A photomosaic of Earth..."
    • XOWA navigates you to en.wikipedia.org/wiki/File:Earth_Eastern_Hemisphere.jpg (this is really commons.wikimedia.org).
    • You'll see a larger image of File:Earth Eastern Hemisphere.jpg, but still not the original
    • You click on this image.
    • XOWA downloads the original image and stores it in your /file/en.wikipedia.org/fsdb.user databases

    In all the steps above, no internet browser is used (Firefox, Chrome, Internet Explorer). Of course, you will need an internet connection to download File:Earth_Eastern_Hemisphere.jpg.

    Also, just a shot in the dark, but are there 'categorical/subject-matter' links to full images (i.e. just 'math' or 'science') that might be closer to 50-100 GB that I could download rather than the extreme entire ~2.2 TB set or inconvenient individual images?

    No, unfortunately not. I'm still waiting on wikimedia / your.org to update the 2.2 TB set. Without even a full, up-to-date set, there's really no point in me / anyone else trying to do a breakdown by categories.

    You can look at the newsgroup link / bugzilla link from last year. Feel free to add yourself to the bugzilla cc list, or send an email to the newsgroup about status.

    http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html
    https://bugzilla.wikimedia.org/show_bug.cgi?id=51001

    I partitioned my HDD thinking that I can safely/easily transfer all my isolated mass wiki info in case I get a bad virus. Was this a bad idea for any reason in particular?

    Hmmm... You can try putting it all on a 128 GB SD card, though it will be very tight with commons.wikimedia.org.

    Otherwise, a separate partition is fine, though it's sometimes a hassle to manage the space on the two partitions.

    Hope this helps.

     
  • Anonymous - 2014-06-18

    Nope no more questions. Thanks again. I'm loving this!

     
  • gnosygnu

    gnosygnu - 2014-06-18

    Cool! Thanks for the follow-up!

    Also, just as an FYI, I'm working on a pretty neat feature for v1.6.4: Hovercards. See https://www.mediawiki.org/wiki/Extension:Hovercards .

    v1.6.4 will only support text, but IMHO the early prototype in XOWA is pretty useful.

     
  • Anonymous - 2014-07-19

    (This actually has nothing to do with thumbnail dbs, but:) Do you plan to release Wikidata in XOWA's db format? My copy is really old (it predates the SQL format), but when I wanted to update it, I realized that building it will take either MUCH space or MUCH time, depending on whether I unzip the .xml.bz2 file or not. As Wikidata becomes more important to install as more wikis use its data, I'm probably not the only one who would want to download a ready-to-use database. How big is it? For other wikis the size is a bit more than twice the size of the .xml.bz2 file, which would be 5.5 GB, but Wikidata might be different. --Schnark

     
  • gnosygnu

    gnosygnu - 2014-07-19

    Hey... The short answer is I can upload a 6.5 GB compressed copy to archive.org if you want. However, I really think you should try to import it. The import no longer consumes extra temp space, and at "one-click" it should be just as easy as downloading from archive.org.

    I reply further below.


    Do you plan to release Wikidata in XOWA's db format?

    Honestly, I'm not sure. I've been uploading wikis for the smaller languages (Latin; Greek; Chinese) because, well, they're small. However, I've stayed away from uploading them for the larger languages (English; German; French). Wikidata falls in the same category. I don't want to use archive.org disk space unless necessary. For the larger wikis, it's harder for me to justify uploading the XOWA db, since it can be generated directly from the MediaWiki dumps.

    That said, I would like to provide some way for users to download the finished databases. I had hoped it would be torrents, but so far, it looks like the number of seeders is very low.

    My copy is really old (it predates the SQL format)

    Wow, I wouldn't have guessed that.

    it will take either MUCH space or MUCH time, depending on whether I unzip the .xml.bz2 file or not.

    Actually, the MUCH space should be a thing of the past. Anselm's suggestion to use bz2 by stdout works fairly well. The imports use no extra space (nothing is unzipped) and are slower by "only" about 25%. It's enabled by default, but you can check home/wiki/Help:Options/Import.
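
    If you're curious what "bz2 by stdout" means in practice, here's a minimal sketch of the idea. It's not XOWA's actual importer: it assumes a bzip2 binary on your PATH, the file name is only illustrative, and counting <page> tags stands in for the real import work. The point is that the decompressed XML is consumed as a stream, so nothing is ever unpacked to disk.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;

        public class Bz2StdoutSketch {
            public static void main(String[] args) throws Exception {
                // "bzip2 -d -c" decompresses to stdout; we read that stream directly.
                ProcessBuilder pb = new ProcessBuilder(
                    "bzip2", "-d", "-c", "wikidatawiki-latest-pages-articles.xml.bz2");
                pb.redirectErrorStream(true);
                Process proc = pb.start();
                long pages = 0;
                try (BufferedReader rdr = new BufferedReader(
                        new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = rdr.readLine()) != null) {
                        if (line.contains("<page>")) pages++;   // stand-in for the real import work
                    }
                }
                proc.waitFor();
                System.out.println("pages seen: " + pages);
            }
        }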

    That said, Wikidata still takes about two hours to process. Keep in mind it is one-click and always completes reliably. I've been keeping up to date with Wikidata on a bi-weekly basis for close to a year now.

    As Wikidata becomes more important to install as more wikis use its data, I'm probably not the only one who would want to download a ready-to-use database.

    I agree. The only issue is that the XOWA db is about 6.5 GB. And that's compressed. So downloading 6.5 GB and unzipping may not be much faster than downloading 2.3 GB from MW and importing it.

    I have no problems putting this as a torrent on my hard drive. I do have some reservations about uploading it to archive.org and using their disk space...

    How big is it?

    The current uncompressed version weighs in at 13.1 GB. With 7z compression, it reduces to about 6.5 GB.

    For other wikis the size is a bit more than twice the size of the .xml.bz2 file, which would be 5.5 GB, but Wikidata might be different.

    Yeah, the JSON compresses much better than the wikitext, particularly since most of the JSON labels are the same. This is probably why XOWA is 6.5 GB compressed versus MediaWiki's 2.3 GB. XOWA compresses at the row level, so it can't "save" as much as the MediaWiki dump file (which compresses these labels once across the entire dump).
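
    If you want to see that effect for yourself, here's a toy comparison (made-up data, not XOWA's storage code): it deflates 10,000 repetitive "rows" one at a time and then deflates the same rows as a single stream. The per-row total comes out far larger, because each row pays the compression overhead on its own and can't share the repeated labels.

        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.DeflaterOutputStream;

        public class RowVsDumpCompression {
            static int deflate(byte[] src) throws IOException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
                    dos.write(src);
                }
                return bos.size();
            }

            public static void main(String[] args) throws IOException {
                // A made-up "json row" with the kind of repeated labels Wikidata entities share.
                String row = "{\"labels\":{\"en\":{\"language\":\"en\",\"value\":\"Example\"}}}";
                int perRowTotal = 0;
                StringBuilder whole = new StringBuilder();
                for (int i = 0; i < 10_000; i++) {
                    String r = row + i;   // vary each row slightly
                    perRowTotal += deflate(r.getBytes(StandardCharsets.UTF_8));
                    whole.append(r).append('\n');
                }
                int wholeDump = deflate(whole.toString().getBytes(StandardCharsets.UTF_8));
                System.out.println("compressed row by row: " + perRowTotal + " bytes");
                System.out.println("compressed as one    : " + wholeDump + " bytes");
            }
        }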

     
  • Anonymous - 2014-10-08

    Hi,

    I found your application only recently, after having downloaded the en-wikipedia and de-wikipedia dumps and realizing that WikiTaxi doesn't deal with the new {{#invoke directives.

    Is there any way to reuse the dump file with XOWA, or does XOWA have to download it again?

     
  • gnosygnu

    gnosygnu - 2014-10-08

    Hi! Yes, you can reuse the dump file in XOWA. Here are the steps:

    • Use the Main Menu and do Tools -> Import from Script (or go to home/wiki/Help:Import/Script)
    • Change "Wiki" from "Simple English" to "English"
    • Change "Where to get the dump" from "download" to "read from file"
    • In the next box, click "..." and select the file
    • Click "Import now"

    Let me know if you run into issues. Thanks!

     
  • Anonymous - 2014-10-09

    Brilliant! I'll try it this evening for en-wiki.
    Great program, really enjoy being able to browse offline.
    Unfortunately, I already deleted the de-wiki dump before realizing WikiTaxi has a problem with it :-(

    PS: I'm running XOWA on Windows with bigger screen fonts. I just noticed that the status line is cut off at about half height. Just for info, no big deal, lots of
    programs have that. I don't know if the Java API lets you query the size of fonts.

    PPS: The menu item "Import from Script" is a bit misleading/off-putting; maybe it would be more user-friendly to rename the items to "import online" and "import offline"?

     
  • gnosygnu

    gnosygnu - 2014-10-09

    Brilliant! I'll try it this evening for en-wiki.

    Cool. Let me know if you run into issues.

    PS: I'm running xowa on Windows with bigger screen fonts. Just noticed that the status line is cut off at about half height.

    Yeah, this is an issue I need to handle automatically. You can fix it manually now by doing the following:

    • Go to home/wiki/Help:Options/Window
    • Change "Adjustment type" to "relative"
    • Change "Adjustment rect" to "0,0,-6,-28". The -6 will make the main html box 6 pixels thinner and the -28 will make it 28 pixels shorter. The latter should bring the status bar on screen. If my numbers are off, you can play with the numbers.

    maybe it would be more user-friendly to rename the items to "import online" and "import offline"?

    Brilliant idea! I'll change it for this week's release. Thanks!

     
  • Anonymous - 2015-02-15

    I have downloaded the Xowa_commonswiki_2014-01-23_images_complete_1.7z file via torrent, but now I cannot integrate it with the XOWA application. Please help me.

     
  • gnosygnu

    gnosygnu - 2015-02-15

    You should unzip it directly to C:\xowa (or wherever you installed XOWA). This will unzip images to C:\xowa\file\commons.wikimedia.org\wiki . Note that this will only work for files in the Main namespace (commons.wikimedia.org/wiki/Berlin), so its usefulness is limited. It was created per someone else's request.

    If you're still having issues, let me know the following:
    - where XOWA is installed (C:\xowa)
    - where the files have been unzipped to
    - what page you're visiting.

    Thanks!

     
  • Anonymous - 2015-02-16

    I have downloaded the image update files Xowa_enwiki_2014-08-11_images_update.7z, Xowa_enwiki_2014-11-06_images_update.7z, and Xowa_enwiki_2014-09-03_images_update.7z, but these archives could not be extracted; they give the error "unable to create file". The archive Xowa_enwiki_2015-01-12_images_update.7z, however, extracts correctly. Please help.

     
  • gnosygnu

    gnosygnu - 2015-02-17

    Hi! Thanks for the detail. I just downloaded Xowa_enwiki_2014-09-03_images_update.7z (https://archive.org/download/Xowa_enwiki_latest/Xowa_enwiki_2014-09-03_images_update.7z) and it unzipped correctly. I'm going to guess that you had a bad download session. Can you try redownloading the 2014-09-03 version? If it still fails, I'll generate the md5s on my side so you can cross-check.
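
    In the meantime, if you want to compute md5s on your side, something like this works (a quick standalone sketch, not part of XOWA): pass the downloaded .7z files as arguments and it prints one "md5  filename" line each.

        import java.io.InputStream;
        import java.math.BigInteger;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.security.DigestInputStream;
        import java.security.MessageDigest;

        public class Md5Check {
            public static void main(String[] args) throws Exception {
                byte[] buf = new byte[1 << 20];                   // 1 MB read buffer
                for (String arg : args) {                         // each arg is a .7z file path
                    Path file = Paths.get(arg);
                    MessageDigest md5 = MessageDigest.getInstance("MD5");
                    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
                        while (in.read(buf) != -1) { /* digest is updated as we read */ }
                    }
                    System.out.printf("%032x  %s%n", new BigInteger(1, md5.digest()), file.getFileName());
                }
            }
        }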

     
    • Anonymous - 2015-02-17

      Thanks. Yes, it was a problem with my system.

       
  • gnosygnu

    gnosygnu - 2015-02-17

    Cool. Thanks for the follow-up!

     

