Reading a .7z file from a sequential stream

Help
Goober5000
2010-03-02
2014-07-28
  • Goober5000

    Goober5000 - 2010-03-02

    First, I want to say a huge thank you for your work on this project.  This is a very valuable library and I am very pleased to have found it… I spent a long time searching.  Keep up the good work. :)

    And now I have a developer question.  I am reading a .7z file from a sequential stream of bytes, using a java.io.InputStream object that I convert to a net.sf.sevenzipjbinding.IInStream object.  Unfortunately, this presents some difficulty, because the 7z format is designed to be read by random access (such as java.io.RandomAccessFile).  In order to seek backwards, I must close the stream, reopen it, and seek forward to the new position.  To mitigate this requirement, I am using a byte buffer so that small seeks forwards and backwards do not always require stream access.
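
    For readers who want to see the reopen-and-skip approach in code, here is a minimal sketch.  It is not the poster's actual class and does not touch the 7-Zip-JBinding API: ReopeningSeekableStream and its Opener interface are hypothetical names, and the byte buffer for small seeks is left out to keep the example short.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: emulates absolute seeks on top of a forward-only stream. */
final class ReopeningSeekableStream implements AutoCloseable {

    /** Hypothetical callback that opens the source again at byte 0. */
    interface Opener {
        InputStream open() throws IOException;
    }

    private final Opener opener;
    private InputStream in;
    private long position;    // absolute offset of the next byte to be read
    private int reopenCount;  // how many times the source had to be reopened

    ReopeningSeekableStream(Opener opener) throws IOException {
        this.opener = opener;
        this.in = opener.open();
    }

    /** Moves to an absolute offset; going backwards costs a reopen plus a forward skip. */
    void seek(long target) throws IOException {
        if (target < position) {
            in.close();
            in = opener.open();
            position = 0;
            reopenCount++;
        }
        while (position < target) {
            long skipped = in.skip(target - position);
            if (skipped <= 0) {
                // skip() may legally return 0, so fall back to reading a single byte.
                if (in.read() < 0) {
                    throw new EOFException("Cannot seek past end of stream: " + target);
                }
                skipped = 1;
            }
            position += skipped;
        }
    }

    /** Reads like InputStream.read(byte[], int, int) while tracking the position. */
    int read(byte[] buffer, int offset, int length) throws IOException {
        int n = in.read(buffer, offset, length);
        if (n > 0) {
            position += n;
        }
        return n;
    }

    int reopenCount() {
        return reopenCount;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

    With a source of N bytes, a seek back to offset 0 after reading to the end costs one reopen, and the following seek to the end costs another N bytes of skipping, which is the expense described in the next paragraph.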

    However, there is a significant conflict between the sequential stream and the 7zip API.  When the 7zip file is first opened, the library will seek all the way to the end of the file, so that it can read the table of contents.  This happens before any work is done!  Finally, when I want to extract a file, the stream will reset and perform the extraction as normal.  As you can see, this adds up to two passes over the entire file.  This becomes very expensive on a sequential stream with a large 7zip file!
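
    To illustrate why the library jumps to the end: going by the 7z format description that ships with the 7-Zip sources (7zFormat.txt), the first 32 bytes of an archive hold a small start header that points at the real header (the table of contents), which normally sits after the packed data.  The sketch below only decodes that pointer.  SevenZipStartHeader is a hypothetical class, not part of 7-Zip-JBinding, and it does not verify the CRC fields.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical decoder for the 32-byte signature/start header of a .7z archive. */
final class SevenZipStartHeader {
    static final int SIZE = 32;
    private static final byte[] SIGNATURE =
            {'7', 'z', (byte) 0xBC, (byte) 0xAF, 0x27, 0x1C};

    final long nextHeaderOffset; // relative to the end of this 32-byte header
    final long nextHeaderSize;   // size of the header block in bytes

    private SevenZipStartHeader(long nextHeaderOffset, long nextHeaderSize) {
        this.nextHeaderOffset = nextHeaderOffset;
        this.nextHeaderSize = nextHeaderSize;
    }

    /** Parses the first 32 bytes of an archive; CRC fields are ignored in this sketch. */
    static SevenZipStartHeader parse(byte[] first32) {
        if (first32.length < SIZE) {
            throw new IllegalArgumentException("Need at least " + SIZE + " bytes");
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (first32[i] != SIGNATURE[i]) {
                throw new IllegalArgumentException("Not a 7z archive");
            }
        }
        // Layout: signature (6), version (2), StartHeaderCRC (4),
        // NextHeaderOffset (8), NextHeaderSize (8), NextHeaderCRC (4); little-endian.
        ByteBuffer buf = ByteBuffer.wrap(first32).order(ByteOrder.LITTLE_ENDIAN);
        return new SevenZipStartHeader(buf.getLong(12), buf.getLong(20));
    }

    /** Absolute file position at which the header block starts. */
    long absoluteHeaderPosition() {
        return SIZE + nextHeaderOffset;
    }
}
```

    On a sequential stream, reaching absoluteHeaderPosition() means passing over everything in between, which accounts for the first of the two passes.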

    My question is whether it is possible to disable the initial read of the table of contents.  I know it is possible to do this for other formats… the standard Zip format used in Java also keeps its table of contents at the end of the file; however, there is a java.util.zip.ZipInputStream which allows easy extraction using a sequential stream.  The ZipInputStream does not bother to read the table of contents; it only reads the short file header before each packed file.
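
    For comparison, this is roughly what single-pass extraction looks like with java.util.zip.ZipInputStream.  Only standard JDK classes are used; SequentialZipReader is a hypothetical name, and collecting everything into memory is just for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: reads every file of a zip archive in one forward pass. */
final class SequentialZipReader {

    /** Returns entry name to content, in the order the entries appear in the stream. */
    static Map<String, byte[]> readAll(InputStream source) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(source)) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            // getNextEntry() reads the local file header that precedes each packed file.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = zip.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                files.put(entry.getName(), out.toByteArray());
            }
        }
        return files;
    }
}
```

    The stream is never rewound and the central directory at the end of the archive is never consulted.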

    Thank you in advance for your help.  In exchange for your assistance on this problem, I am willing to send you the classes I have created to link the Java input and output streams with the 7zip API. :)

     
  • Boris Brodski

    Boris Brodski - 2010-03-02

    Hello Goober,

    Thanks for the kind words. It really helps to know that someone likes it :-)

    I saw your post at

    http://www.hard-light.net/forums/index.php?topic=68372.msg1344587;boardseen

    and I had one question: why can't you just seek to the end of the archive and download only the TOC part first, then seek to the beginning? What I mean is, if the web server you are downloading from supports resuming, it supports seeking as well (this is actually the same thing, I think). You could write an excellent smart download client that seeks through your huge 7z archive and downloads only the needed parts/files, extracting them on the fly.

    I wanted to write it myself for JDownloader project. It's still on my todo list ;-)
    See http://board.jdownloader.org/showthread.php?t=9567&page=2

    Currently I am working on archive creation support for 7-Zip-JBinding and this is a top priority for me (see voting results on my web site). But I'm very interested in the solution of your problem as well.

    If you have some time to spend on this task, we could probably solve this problem in a generic way serving your FSO Installer and JDownloader at the same time.

    What do you think?

    Regards,
    Boris

     
  • Goober5000

    Goober5000 - 2010-03-03

    Greetings Boris,

    I am impressed that you found my thread at HLP so quickly!  I want to make it clear that I am ranting at the problems of using sequential streams, not ranting at 7Zip-JBinding. ;)

    I am trying to seek to the end of the archive, but the URL connection provided by Java does not support that.  When I do URL.openStream(), it returns an instance of class sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.  This class extends BufferedInputStream, which implements the skip operation by reading all the bytes rather than jumping past them. :(

    One solution would be to open a URLConnection which guarantees a random access stream if the web server supports random access.  But this does not seem to be provided by the standard JDK, and I have not seen a library for it either.

    I would like it if you worked on the smart download client, because I do not need archive creation support for 7Zip-JBinding. :D  But I know that others do.  Unfortunately, I do not have the time to work on JDownloader myself - I have too many other projects to work on.

    Regards,
    Goober5000

     
  • Boris Brodski

    Boris Brodski - 2010-03-03

    Hello Goober5000,

    the good Piwik (http://piwik.org/) did the trick :-) I can see all the web sites linking to mine ^^

    Now I understand your problem.
    Have you checked out this tutorial:
    http://www.notes411.com/dominosource/tips.nsf/0/480C4E3BE825F69D802571BC007D5AC9!opendocument

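    // Snippet from the tutorial; it appears to use Apache Commons HttpClient 3.x,
    // with `client` an HttpClient and `request` a GetMethod (both assumed here).
    if (localFileSize > 0) {
      // Ask the server for the bytes from localFileSize onward. The server must
      // support partial content (HTTP 206) for the resume to work.
      request.addRequestHeader("Range", "bytes=" + localFileSize + "-");
      if (client.executeMethod(request) != HttpStatus.SC_PARTIAL_CONTENT) {
        return false;
      }
    }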
    

    I think the best bet for solving your problem is to get HTTP seek to work.

    JDownloader: I never meant that you should work on JDownloader :-)
    What I meant is that if you are going to write such a smart downloader using HTTP seek + caching, you could write it as a set of generic classes (detached from your domain code), so that it can be tested and reused easily by other projects. If you would be so kind as to give the code to the community, I could integrate it into JDownloader or possibly into 7-Zip-JBinding. I could help you write this smart downloader, but I can't write it myself right now (probably in 3-4 months).

    Regards,
    Boris

     
  • Goober5000

    Goober5000 - 2014-07-06

    Okay. I finally implemented HTTP seek, and the results are much better: it is no longer necessary to stream the entire file in order to reach the table of contents at the end. The feature required minor changes to InputStreamInStream and InputStreamSource, so I have updated the files in the previous post.

    I have also uploaded a new file that provides a sample implementation of InputStreamSource:
    http://staff.hard-light.net/goober5000/downloads/SampleInputStreamSource.java

    This is the key that allows HTTP seek to work. As Boris described in his post above, the connection is opened with the Range request header, which allows the InputStream to begin streaming at any byte position, not just the beginning.
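
    For anyone who cannot fetch the linked file, the idea can be sketched with the standard java.net.HttpURLConnection alone.  This is not the content of SampleInputStreamSource.java: HttpRangeOpener and its methods are hypothetical names, and the error handling is deliberately simple.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: opens an HTTP stream that starts at a given byte offset. */
final class HttpRangeOpener {

    /** Builds the value of an open-ended Range header, e.g. "bytes=1024-". */
    static String rangeHeaderValue(long start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must not be negative: " + start);
        }
        return "bytes=" + start + "-";
    }

    /** Opens the URL so that the first byte read is the byte at offset `start`. */
    static InputStream openAt(URL url, long start) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (start > 0) {
            conn.setRequestProperty("Range", rangeHeaderValue(start));
        }
        int status = conn.getResponseCode();
        // A server that honors the Range header answers 206 Partial Content.
        // A plain 200 would mean the body starts at byte 0 again.
        int expected = start > 0 ? HttpURLConnection.HTTP_PARTIAL : HttpURLConnection.HTTP_OK;
        if (status != expected) {
            conn.disconnect();
            throw new IOException("Unexpected HTTP status " + status + " for offset " + start);
        }
        return conn.getInputStream();
    }
}
```

    A seek then amounts to closing the current stream and calling openAt(url, newPosition), so reaching the table of contents no longer requires streaming the bytes in front of it.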

    I hope others find this code helpful. Please post if you are able to use this code in your own projects. :)

     
    Last edit: Goober5000 2014-07-06
