Website archive

  • roberthoff82
    2005-01-11

    I downloaded about 8 GB from a website using WinHTTrack, a spider-downloader.  This is too big to have around and I know the data would compress down nicely.

    Here's the issue: It's a website, with links in it, pointing to relative URLs.  It would be nice to be able to browse the site without expanding the entire archive.  This would require an interface where each file in the archive could be identified by a URL, with the folder structure matching that of the archive.  When the browser requests a file, that file gets extracted and "served" to the browser.
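
    Just to illustrate the kind of interface I'm imagining, here's a rough sketch (Python, purely hypothetical: it assumes the site were re-packed as an ordinary zip called site.zip, since a solid 7z archive doesn't allow cheap random access to individual members).  It's a tiny local HTTP server that maps each requested URL onto the matching path inside the archive and serves that one member without unpacking anything else:

      import mimetypes
      import posixpath
      import zipfile
      from http.server import BaseHTTPRequestHandler, HTTPServer

      # Hypothetical archive name; a plain zip is used here because its
      # members can be read individually without unpacking the whole thing.
      ARCHIVE = zipfile.ZipFile("site.zip")

      class ArchiveHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              # Map the URL path onto the matching path inside the archive.
              path = self.path.split("?", 1)[0]
              if path.endswith("/"):
                  path += "index.html"
              name = posixpath.normpath(path.lstrip("/"))
              try:
                  data = ARCHIVE.read(name)  # extract just this one member
              except KeyError:
                  self.send_error(404, "Not in archive")
                  return
              ctype = mimetypes.guess_type(name)[0] or "application/octet-stream"
              self.send_response(200)
              self.send_header("Content-Type", ctype)
              self.send_header("Content-Length", str(len(data)))
              self.end_headers()
              self.wfile.write(data)

      if __name__ == "__main__":
          # Local only, as described above.
          HTTPServer(("127.0.0.1", 8000), ArchiveHandler).serve_forever()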

    I get the impression from reading the forum that Igor's not really into fancy UI enhancements; this is by no means a feature request.  But I was wondering whether the community might have any insights.

    I was thinking maybe an HTTP server could do this (for my purposes, it would just be running locally), but I can't find any FREE software that does it, and I'm more willing to get my own hands dirty than to pay for "Remotezip".  Plus, then I couldn't reap the benefits of 7z compression. :)

    Thoughts?

     
    • Josh Harris
      2005-01-12

      Well, if you're willing to use gzip compression... Most modern web browsers can decompress gzip-encoded text on the fly.  Meaning that a web server can send you a gzip-compressed HTML page (at about 1/10th its original size) and your web browser will decompress it internally and display the original file to you.

      Basically, you need a web server (I have Apache v2.0.52 running locally on WinXP).  All of the HTML pages must be individually compressed into gzip format and placed in the HTTP server path.  You then modify a config file in Apache so that it will tell your web browser that it is sending a gzip-encoded file.  Then just navigate to http://localhost/directory_with_your_files/ with your web browser of choice.
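
      If it helps, here's a rough sketch of the "compress each page individually" step in Python (the directory names are just placeholders, not anything you necessarily have on your machine).  On the Apache side, one common approach is mod_mime's AddEncoding directive so that .gz files are sent with a "Content-Encoding: gzip" header, usually combined with a mod_rewrite rule so a request for page.html is served from page.html.gz (check the Apache 2.0 docs for the exact directives):

        import gzip
        import os
        import shutil

        SRC = r"C:\mirror"          # hypothetical: where HTTrack put the pages
        DST = r"C:\Apache2\htdocs"  # hypothetical: the web server path

        for root, dirs, files in os.walk(SRC):
            out_dir = os.path.join(DST, os.path.relpath(root, SRC))
            os.makedirs(out_dir, exist_ok=True)
            for name in files:
                src_path = os.path.join(root, name)
                if name.lower().endswith((".htm", ".html")):
                    # Gzip each page individually; Apache is then configured
                    # to send it with a "Content-Encoding: gzip" header.
                    dst_path = os.path.join(out_dir, name + ".gz")
                    with open(src_path, "rb") as f_in:
                        with gzip.open(dst_path, "wb") as f_out:
                            shutil.copyfileobj(f_in, f_out)
                else:
                    # Images and everything else get copied as-is.
                    shutil.copy2(src_path, os.path.join(out_dir, name))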

      That's it in a nutshell.  If you're interested in trying it out and need more detail, let me know and I'll see if I can explain it better.  I have a small setup running like this (publicly on the 'net, to save bandwidth and page load time).  The uncompressed HTML takes up 41.8 MB and the gzip-encoded files are only 3.72 MB.

       
      • roberthoff82
        2005-01-12

        Thanks for your help!

        The only feature I'd miss with that solution is the 7z compression.  The meat of this particular data happens to be a bunch of black-and-white .gif images, which probably won't compress much further losslessly on their own.  I was hoping a solid 7z archive might be a different story.

        I think I could hack together a solution with 7z if there was a way to extract a file to stdout.  Let that be my humble feature request--and don't knock Win32 piping, folks; it has its applications!

         
    • Stepho
      2005-01-12

      You can extract to stdout using the -so switch.
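
      For instance, something along these lines (just a sketch in Python, with a made-up archive name and path) would let a little local script or server pull one file out of a .7z on demand:

        import subprocess

        # -so makes 7z write the extracted file data to stdout, so it can be
        # captured here or piped anywhere you like.
        data = subprocess.run(
            ["7z", "e", "site.7z", "some/page/inside.html", "-so"],
            check=True, capture_output=True).stdout

      Keep in mind that with a solid archive 7-Zip still has to decompress everything in the block up to that file, so it may be slow on a big archive.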

       
      • roberthoff82
        2005-01-12

        NICE!

        Thanks so much.