From: Julian N. <ju...@pr...> - 2024-10-15 04:36:05
Hi all,

I'm keeping this off the ticket system as an RFE for now, as I'm guessing it may be perceived as a bit noisy, but I'd like to record some thoughts on zipfs mounts while I have my head in the space.

The zip archive format is optimised for easy appends. The CDR (Central Directory Record) is near the end and can be rewritten cheaply. Essentially: remove the Central Directory File Header record pointing to the old file, append a new local file record, and rewrite the CDR.

That leaves an old copy of the file with nothing pointing to it, which is ugly from a security perspective as well as wasting space, but part of the process could also be to zero out the old contents and rewrite the old name to something like #deleted-filename.

Alternatively, the entire old record could just be blanked out, as gaps are allowed in the zip container. For reusability, though, it might be better either to do the renaming as above, or even to have a metadata file inside the zip that tracks space that has been freed, e.g. something like .zipfs_freespace. That way we don't make assumptions about other apparent gaps or unreferenced regions in the zip file.

The point is that by tracking such gaps we could quite reasonably edit multi-gigabyte zips in place without copying/rewriting the whole file. New files could be added in tracked gaps if they fit, i.e. we could have a writable zipfs mount.

Obviously an edited zip wouldn't be the most compact possible representation, but a separate 'zipfs compact' operation could be made available to rewrite the whole thing when that is deemed desirable (perhaps along with some introspection command to see how much 'empty' space is in a zip).

This potential use case is made more difficult by the current heuristic used by tclZipfs.c to determine where the zip data begins. It assumes that the first file/dir record in the CDR points to the first valid local file header.
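
A rough sketch of what such an 'empty space' introspection might compute, in Python purely for brevity. Nothing here is existing zipfs code: find_gaps is a made-up name, and the sketch assumes no zip64 records, no data descriptors, and a central directory that sits immediately before the end-of-central-directory record.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
EOCD_MIN = 22              # EOCD size with an empty comment
LOCAL_FIXED = 30           # fixed-size part of a local file header


def find_gaps(fh):
    """List (offset, length) regions between the lowest local record and
    the central directory that no central directory entry accounts for."""
    fh.seek(0, io.SEEK_END)
    size = fh.tell()
    tail_len = min(size, EOCD_MIN + 0xFFFF)   # comment is at most 65535 bytes
    fh.seek(size - tail_len)
    tail = fh.read(tail_len)
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("End of Central Directory record not found")
    (cd_size,) = struct.unpack_from("<I", tail, eocd + 12)
    # Assumption: the central directory sits immediately before the EOCD.
    cd_start = (size - tail_len + eocd) - cd_size

    with zipfile.ZipFile(fh) as zf:
        infos = zf.infolist()

    extents = []
    for info in infos:
        # Name/extra lengths are read from the local header, since they
        # may differ from the copy held in the central directory.
        fh.seek(info.header_offset + 26)
        name_len, extra_len = struct.unpack("<HH", fh.read(4))
        end = (info.header_offset + LOCAL_FIXED + name_len + extra_len
               + info.compress_size)
        extents.append((info.header_offset, end))
    extents.sort()

    gaps = []
    pos = extents[0][0] if extents else cd_start
    for start, end in extents:
        if start > pos:
            gaps.append((pos, start - pos))
        pos = max(pos, end)
    if cd_start > pos:
        gaps.append((pos, cd_start - pos))
    return gaps


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello")
        zf.writestr("b.txt", b"world")
    print(find_gaps(buf))   # freshly written archive: []
```

Note that anything before the lowest-offset record is deliberately not reported: with prepended data there is no way to tell a freed region from the exe stub, which is exactly where the heuristic above comes in.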

That mechanism for finding the split between exe and zip data is only necessary because of the changes made in 2021 to use 'file based' offsets. While it may work in most cases, I don't see it as particularly robust, hence my reference to it as a heuristic.

If the Tcl community wants to decide that a zipfs attached to a tcl exe or script *requires* that the first CDR file/dir entry points to the topmost local file header (which is not true under all zip editing scenarios), then that could be made as a specific decision to restrict us to a subset of what the zip container allows. But I didn't see it documented or discussed when the change was made, and the 2021 changes appear to have broken the 'zipfs info' command's ability to determine the exe/zip split offset, as I describe in bug: https://core.tcl-lang.org/tcl/tktview/aaa84fbbc5

By not using 'file based' offsets and instead using simpler 'archive based' offsets (i.e. as you get by just catting file.exe with file.zip), finding the split is simple arithmetic: we know the size of the central directory, so we can compare its recorded offset with its actual absolute position in the file and subtract to get the baseoffset, which is the length of our prepended data.

We could plough ahead with the 'file based' offsets (offset adjustments) and fix 'zipfs info' to match. (I think the problem with the 2021 fix is that 'minoff' was used to calculate the position but zf->baseOffset wasn't adjusted. A cursory look at the divergent androwish implementation suggests to me that maybe it was done right there.) I think it would be a pity to persist with that, as it would seem to make it harder to cater for the use case above and is generally not as simple or flexible, but I'm hoping someone else can clarify which way Tcl wants to go.
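
To spell out the 'archive based' arithmetic mentioned above, here is a minimal sketch in Python (chosen only for brevity). The function name archive_base_offset is made up for illustration, and it assumes no zip64 records and a central directory that ends exactly where the end-of-central-directory record begins.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature


def archive_base_offset(data: bytes) -> int:
    """Length of the data prepended to a zip that uses archive based offsets."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("End of Central Directory record not found")
    # The EOCD stores the central directory size at +12 and its recorded
    # offset (relative to the start of the archive) at +16.
    cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)
    actual_cd_start = eocd - cd_size
    return actual_cd_start - cd_offset


# Simulate "cat file.exe file.zip" with a 1000-byte stand-in for the exe.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("main.tcl", "puts hello\n")
payload = buf.getvalue()
stub = b"\x7fSTUB" * 200

print(archive_base_offset(payload))          # 0
print(archive_base_offset(stub + payload))   # 1000
```

On a file whose offsets have been 'adjusted' to be file based, the same subtraction yields 0, so the split has to be recovered some other way.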

Currently, a Tcl script wanting to split an exezip has to do its own zip parsing, with potentially error-prone heuristics, to work out the split on an offset-'adjusted' file. (There is no unique header representing the start of a zip archive, and from my own tests there are false positives in the tclsh exe binary. Scanning largeish zips/binaries is ugly anyway.)

Cheers,
Julian