
#11 epub unzip problem

open
nobody
None
5
2012-08-22
2012-08-22
No

I just came across an Adobe ADE DRM-encrypted ePub archive that gets partly clobbered when it is unzipped for repackaging for use with calibre ebook management. Running unzip -l (on a 64-bit debian stable system) produced warnings of the type 'mismatching "local" filename', and after examining the archive with a hex editor I discovered that the central directory contains a number of truncated filenames.
For example, the local filename was correctly OEBPS/EXAMPLE123/image/EXAMPLE456_r1.jpg, while the central directory entries kept the correct filename length but contained only "OE" or "OEBPS/EXAMPLE123/image/EXAMPLE456", trailed with zeroes and some control chars. Since extract.c copies the central directory filename over the local filename in line 1323, the correct local filename is replaced by the corrupted one, which then leads to inconsistencies in the epub reference structure later on. So I would like to add a modifier to the zip -F fix archive option that globally replaces the central directory filename with the local one or vice versa, so that standard tools will properly process the modified archive.
Now, before I go deep into fiddling with the zip sources: is there perhaps already an easier way to achieve this? Did I miss something in the zip format specifications that would explain the 'garbage' in the central directory, assuming the epub 2.0.1 OCF container specification (http://idpf.org/epub/201) applies? Is this perhaps a feature of zip archives mentioned somewhere in these application notes (http://www.pkware.com/documents/APPNOTE/APPNOTE_6.2.0.txt etc.) that someone is already aware of?
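
The mismatch described above can be checked without unzip by reading both copies of each filename straight from the archive bytes. The following is a minimal sketch, not part of Info-ZIP: the function name `find_mismatches` is made up for illustration, and it assumes a single-disk archive without Zip64 records and without an archive comment that happens to contain the end-of-central-directory signature.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # end of central directory record
CEN_SIG = b"PK\x01\x02"   # central directory file header
LOC_SIG = b"PK\x03\x04"   # local file header


def find_mismatches(data):
    """Return (central_name_offset, central_name, local_name) for every
    entry whose central directory filename differs from its local one.

    Assumption: plain single-disk zip, no Zip64 extensions.
    """
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no end-of-central-directory record found")
    # Total entry count, central directory size and offset sit at EOCD + 10.
    total, _cd_size, pos = struct.unpack_from("<HLL", data, eocd + 10)

    mismatches = []
    for _ in range(total):
        if data[pos:pos + 4] != CEN_SIG:
            raise ValueError("bad central header at offset %d" % pos)
        # Filename, extra field and comment lengths start at header + 28.
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        # Offset of the matching local header is stored at header + 42.
        (loc,) = struct.unpack_from("<L", data, pos + 42)
        central_name = bytes(data[pos + 46:pos + 46 + name_len])

        if data[loc:loc + 4] != LOC_SIG:
            raise ValueError("bad local header at offset %d" % loc)
        # Local filename length is at header + 26, the name itself at + 30.
        (loc_len,) = struct.unpack_from("<H", data, loc + 26)
        local_name = bytes(data[loc + 30:loc + 30 + loc_len])

        if central_name != local_name:
            mismatches.append((pos + 46, central_name, local_name))
        # The next central header follows the variable-length fields.
        pos += 46 + name_len + extra_len + comment_len
    return mismatches
```

Calling `find_mismatches(open("book.epub", "rb").read())` (the filename is a placeholder) would list the affected entries as raw bytes, so trailing zeroes and control characters stay visible instead of being decoded away.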

Discussion

  • Peter Koellner

    Peter Koellner - 2012-08-22
     
  • Steven Schweda

    Steven Schweda - 2012-08-22

    > example-corrupt-fixed-screenshot.jpg

    Plain text might have been easier than a picture of plain text, but
    that does look like garbage.

    > [...] Since extract.c copies the central directory filename over the
    > local filename in line 1323, [...]

    As usual, information like a source-code line number might be more
    useful if you revealed which version of UnZip you were using.

    > [...] So I would like to add a modifier to the zip -F fix archive
    > option that globally replaces the central directory filename with the
    > local one or vice versa, so standard tools working will properly process
    > the modified archive.

    I'm not an expert in the "zip -F" code, but that sounds possible.
    Before doing much work on a new feature like that, it would be nice to
    know how the defective/misunderstood archive was created, and whether
    the unexpected content has any value.

    > [...] did I miss something in the zip format specifications that would
    > explain the 'garbage' in the central directory [...]

    No, to me, it looks like garbage. Can you ask the people who
    provided the archive how it was made, and whether there's some reason
    for the apparent defect(s)?

     
  • Peter Koellner

    Peter Koellner - 2012-08-22

    > Plain text might have been easier than a picture of plain text, but
    > that does look like garbage.

    Well, highlighting and comparing seemed easier that way.
    In the other two instances, the truncated filename ends with 0x00 0x44 0x00 0x00 a couple of characters before the original filename ends. Only in the one shown is there some sort of data after that.

    >> [...] Since extract.c copies the central directory filename over the
    >> local filename in line 1323, [...]
    >
    >As usual, information like a source-code line number might be more
    >useful if you revealed which version of UnZip you were using.

    Hmmm... debian stable is not THAT old that it would use any other than the most recent stable release:
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
    Debian package version 6.0-4, and the line number is from the latest sources from sourceforge.

    > I'm not an expert in the "zip -F" code, but that sounds possible.
    > Before doing much work on a new feature like that, it would be nice to
    > know how the defective/misunderstood archive was created, and whether
    > the unexpected content has any value.

    No idea. Adobe Digital Editions manages to open the file, but complains about "minor errors". I don't know whether that refers to these defects. I guess it might be some sort of home-brewed serial number marker scheme by the publisher or a bug in their publishing tools. The ePub specifications say nothing about such deviations from the container format.

    > [...] did I miss something in the zip format specifications that would
    > explain the 'garbage' in the central directory [...]

    > No, to me, it looks like garbage. Can you ask the people who
    > provided the archive how it was made, and whether there's some reason
    > for the apparent defect(s)?

    Probably not. It was published by Bantam Books. I don't know what their packaging process looks like, and my experience with the publishing industry is that they are not very forthcoming when ebook format details are being discussed, even if one might be able to reach someone who actually knows someone with the technical expertise...

     
  • Steven Schweda

    Steven Schweda - 2012-08-23

    > Hmmm... debian stable is not THAT old that it would use any other than
    > the most recent stable release:
    > UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
    > Debian package version 6.0-4, and the line number is from the latest
    > sources from sourceforge.

    For the record, we don't track the source code as modified by
    everyone else, and "latest" is not a useful description. My copy of the
    source for:
    UnZip 6.00 of 20 April 2009, by Info-ZIP. [...]
    has this statement at line 1243:
    zfstrcpy(G.filename, G.pInfo->cfilname);
    and nothing relevant at line 1323.

     
  • Peter Koellner

    Peter Koellner - 2012-08-23

    > UnZip 6.00 of 20 April 2009, by Info-ZIP. [...]
    > has this statement at line 1243:
    > zfstrcpy(G.filename, G.pInfo->cfilname);
    > and nothing relevant at line 1323.

    OK, it seems that the https://sourceforge.net/projects/infozip/files/latest/download link does not go to the latest release but to 6.10beta, while debian stable uses 6.0. So, yes, in 6.0 the filename check happens in the block starting with the comment about filename consistency checks at line 1225. Anyway, this would not be the place to fix the filenames, since that would be in the zip sources, not in unzip.

    But I guess it will take a while to get more samples of this type of problem, since I can only check ebooks I bought. I'll try to contact the publisher, but I am not very optimistic about that.

     
  • Peter Koellner

    Peter Koellner - 2012-08-23

    Ah, well, after looking at it from a different angle, I guess the problem could be reduced to the following (probably) fixable situation:

    If the filename size fields of the central directory entry and of the local file header are the same, but one of the two copies contains a zero-terminated string shorter than the given size, the shorter string is probably faulty.

    I have sent a bug report to the retailer where I got the file, since it is a bit unclear who actually does the final DRM-armoured epub packaging. There might be an epub packaging tool out there that produces faulty zip containers. Well, if zip should be able to fix this type of problem, it would probably be a good idea to check for this type of error and apply a fix. I guess there might be some complications involved with UTF-8 filenames etc., so it might not be that trivial. So unless someone with some experience tells me that it would be a good idea to take a look at the source, I won't waste any more time on that.
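
The rule proposed in this post can be sketched on raw bytes, which sidesteps the UTF-8 question because a NUL byte never occurs inside a multi-byte UTF-8 sequence. This is only an illustration and not anything taken from the zip sources: the names `pick_good_name` and `overwrite_name` are hypothetical, and the extra prefix check is an added safety assumption that goes beyond what the post states.

```python
def pick_good_name(central, local):
    """Return the filename copy to trust, or None if the rule does not apply.

    Rule from the post: both copies have the same stored length, but one of
    them holds a NUL-terminated string shorter than that length.
    Added assumption: the truncated part must be a prefix of the other copy.
    """
    if len(central) != len(local) or central == local:
        return None
    c_cut = central.find(b"\x00")  # -1 means no NUL byte in this copy
    l_cut = local.find(b"\x00")
    if c_cut >= 0 and l_cut < 0 and local.startswith(central[:c_cut]):
        return local    # central copy is truncated, trust the local one
    if l_cut >= 0 and c_cut < 0 and central.startswith(local[:l_cut]):
        return central  # local copy is truncated, trust the central one
    return None         # both or neither look damaged: do not guess


def overwrite_name(archive, name_offset, good_name):
    """Overwrite one filename field in place inside a bytearray.

    Because the replacement has exactly the stored length, no other offset
    in the archive moves, so nothing else has to be rewritten.
    """
    archive[name_offset:name_offset + len(good_name)] = good_name
```

For the example in this ticket, `pick_good_name(b"OE" + b"\x00" * 3, b"OEBPS")` selects the local copy, while two names that differ without any NUL byte are left alone rather than guessed at.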

     
