Req: Add split-file compatibility to padding

Anonymous
2011-02-06
2012-12-07
  • Anonymous - 2011-02-06

    In short:
    If I add padding to the end of a .7z file, it still works.
    If I add padding to the end of a .7z.001 or .7z.002 (split) file, it doesn't work.
    Please make each split part self-contained, so I can pad any part of a split archive and it stays compatible with the other padded or non-padded parts. I think each split file needs an EOF marker so that whatever is appended after it is ignored.

    In long:
    I usually split large archives. When I upload those files online, they are checksum-checked, so if I try to upload a backup (simply renamed) of those files, it is not accepted because it has the same checksum as the original files.

    The simplest way to change the checksum is to append padding of 0x00 or 0x20 (space) bytes to the end of the file.

    If I do this to a whole .7z archive, it is still readable and works.

    But if I add padding to split files, the archive becomes corrupted and no longer works.
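
    To illustrate, a minimal sketch of the padding I mean (the file names and the 16-byte pad are just examples; the two calls restate the behaviour just described):

        # pad.py - append padding bytes so the checksum changes while the content stays untouched
        PAD = b"\x00" * 16              # example: 16 zero bytes; 0x20 (space) bytes work the same way

        def pad(path):
            with open(path, "ab") as f:  # append only: the original bytes stay in place
                f.write(PAD)

        pad("archive.7z")      # a whole .7z still opens: the padding sits after the end header
        pad("archive.7z.001")  # a split volume breaks: the extra bytes land inside the joined stream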

    For comparison, if I add padding to the split files of a .rar archive, the new padded files are not only readable and working, they are also interchangeable with the original ones. So if I have:

    file.part1.rar
    file.part2.rar
    file.part3.rar

    and I make

    file_padded.part1.rar
    file_padded.part2.rar
    file_padded.part3.rar

    the following combinations work:

    padded archive:
    file_padded.part1.rar
    file_padded.part2.rar
    file_padded.part3.rar

    mixed archive:
    file_padded.part1.rar
    file.part2.rar (needs to be renamed like the others, but the file itself is not padded)
    file_padded.part3.rar

    I opened some .partXX.rar files with a hex editor, and:
    each file has the same 20 hex bytes at the beginning, except the first file of the split, which has 3-4 bytes that differ;
    each file has, a few bytes before the end, 5 identical bytes, except the last file of the split, which has 1 byte that differs.

    7-Zip split files, on the other hand, have no recurring pattern; they look like one large consecutive file split the way HJSplit does it.

    This leads to another problem too:
    if I try to extract a corrupted split archive, 7-Zip only tells me that the contained file is corrupted, without telling me which part is actually damaged.
    I think this happens because each split file is not wrapped in an independent header with a per-split CRC32.

    Thanks for this great tool.

     
  • Anonymous - 2011-02-09

    Maybe I didn't clarify my request.

    I am not asking to change in any way how files are split.

    I'm only asking to wrap the actual split stream in a header (at the start and end of each file) so that 7-Zip reads only what is inside the wrapper. That way, if I append something after the end via a hex editor, it will simply be ignored by 7-Zip instead of causing an error as it does now.

    The existing header could be extended to support a per-file CRC so we can tell which part is corrupted.

    Thanks.

     
  • Lirya

    Lirya - 2011-02-09

    Sooo - let me get that straight:

    Because YOU want to fool some upload-checking script so that you can FLOOD some file portal,
    Igor should change a file format that is used industry-wide and maybe break compatibility?

    Maybe you can provide a real-life process in which this change could come in handy?

     
  • Anonymous - 2011-02-10

    Sure:
    My online host has unlimited space and a one-time lifetime fee (so there is no chance of flooding; I already have access to unlimited space).

    It costs much less than any other backup medium or service would. In fact, the $/GB approaches zero the more you transfer online.

    However, nothing is perfect. Sometimes, when I try to access uploaded parts later (even from another PC - this is the best method to transfer files between PCs that are not connected at the same time, far better than email attachments), they turn out corrupted, or the link is slow, or the server has crashed.

    As of now, whenever I back something up I create 10% par2 redundancy, and so far this has been enough.
    However, for some archives, I want one or two fully redundant backups.
    Achieving this with par2 alone is impracticable, because rebuilds would take hours at 100% CPU, and that costs time and electricity.
    Currently I re-archive the files I want to back up, but that produces new archives which are incompatible with the former ones. So if I can retrieve the former archive only partially, and the new archive only partially, there is no way to combine them to recover the original files. I have made 2 backups with twice the points of failure.
    Instead, if I am able to upload two or three interchangeable sets, I get only 1/2 or 1/3 of the points of failure, because all copies of the same part need to be corrupted to make the archive unreadable, and even then I can use my 10% par2 redundancy to rebuild the missing part for all backups.

    Let me give an example:

    A1   B1   C1
    A2   B2   C2
    A3   B3   C3
    A4   B4   C4

         P1

    A, B, C are interchangeable archive sets, which I was able to upload because the padding changes the checksum.
    P are parity files; they work for all A, B, C parts.

    So if A1, B2, B3, C3 and C4 get corrupted, I can take the surviving parts, pad/trim them and rebuild the missing pieces in seconds.

    X    B1   C1
    A2   X    C2
    A3   X    X
    A4   B4   X

    If A3 gets corrupted too before I can retrieve it, I can still use the par2 files to regenerate A3, B3 and C3.

    So the points of failure are reduced a lot.

    BUT, if I continue as I'm doing now, re-compressing the same files into new archives, then:

    1) I have to generate par2 files for each set (minor problem)
    2) Major problem: if any set gets more parts corrupted than I have parity files for, I cannot regenerate them, and all data is lost.

    In this example, if I have parity files to regenerate only 1 part per set, and each set gets 2 parts corrupted, data is lost no matter what, even though I made three parity files (one for each set).

    Instead, with interchangeable parts, the only way the archive is lost is if the very same two parts are affected in all of the sets, which is less likely than 2 random parts in each set.
    BUT even in this case, to make a fair comparison with non-interchangeable parts, I have to build THREE parity files. That's because if the parts are not interchangeable, I am forced to make a parity file for each set.
    With interchangeable parts I can instead make three par2 files that are valid for ALL parts of all sets.
    In a few words, this means that with three parity files and 3 archives of 4 parts each, the only way to make the files unrecoverable is to have the same 4 parts corrupted in all archives (in this case that means the whole archives, but in general each archive can have 5+ parts).

    In numbers, this means the chance of failure is:
    (chance_single_part_fail) * (1/number_of_parts)^(number_of_interchangeable_archives * (number_of_parity_files + 1))
    so in this example:
    (1/4)^(3*(3+1)) = (1/4)^12 ≈ 5.96e-8, times (chance_single_part_fail)

    As of now, the chance is instead:
    (chance_single_part_fail) * (1/number_of_parts)^(number_of_parity_files_per_archive + 1) * (1/number_of_independent_archives)
    so:
    (1/4)^(1+1) * (1/3) ≈ 0.0208, times (chance_single_part_fail)

    which is more than 5 orders of magnitude (>10^5!) greater!
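
    To double-check the arithmetic, a few lines of script that evaluate both expressions with the numbers above (4 parts, 3 sets, 3 shared parity files versus 1 per set); this just restates the rough model, it is not a rigorous probability analysis:

        # failure_chance.py - re-compute the two rough failure-chance factors from the example
        parts = 4            # parts per archive set
        sets = 3             # interchangeable sets (A, B, C)
        parity_shared = 3    # parity files valid for every set
        parity_per_set = 1   # parity files when each set needs its own

        interchangeable = (1 / parts) ** (sets * (parity_shared + 1))
        independent = (1 / parts) ** (parity_per_set + 1) * (1 / sets)

        print(f"interchangeable sets: {interchangeable:.2e}")                 # ~5.96e-08
        print(f"independent sets:     {independent:.4f}")                     # ~0.0208
        print(f"ratio:                {independent / interchangeable:.0f}x")  # ~350000x, i.e. >10^5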

    The formula also shows that splitting reduces the chance of failure, because each parity file can substitute for any part of the set: if the same data is split into more parts, I split the par2 data into more parity files too, so I can repair more parts, and each one has a chance of failure < 1.

    There is no change to the format that breaks backward compatibility:
    if the header is present, use it to read the split stream;
    if the header is not present, treat the split as a raw stream of bytes.
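
    As a sketch of that rule (the marker string and the 8-byte length field are purely hypothetical, not part of the current .7z format):

        # split_reader_sketch.py - backward-compatible reading of one split part
        import struct

        MARKER = b"7ZSPLITv1"   # hypothetical wrapper signature

        def read_part_payload(path):
            with open(path, "rb") as f:
                head = f.read(len(MARKER))
                if head == MARKER:
                    # new-style part: the wrapper says how many payload bytes follow;
                    # anything appended after them (padding) is simply ignored
                    (payload_len,) = struct.unpack("<Q", f.read(8))
                    return f.read(payload_len)
                # old-style part: no wrapper, the whole file is the raw stream
                return head + f.read()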

     
  • Anonymous - 2011-02-12

    Lirya, you were quick to accuse me of flooding, but now that I have given you the explanation of my request, there is not even a comment from you…
    This means you just wanted to post your hateful message without even caring to read the answer that you yourself requested.
    Are you sure you are not the one flooding here with your messages? Because I have seen others you wrote - and they carry the same hateful stamp.

     
  • SeldomGood

    SeldomGood - 2011-02-13

    So, you can add zero-padding to files…
    then you can also remove it.
    Problem solved.

     
  • Lirya

    Lirya - 2011-02-13

    Sorry, I'm normally not online on weekends.
    Just dropped by to tell you that - see how nice I am! ;-)
    Have a nice Sunday.

     
  • Lirya

    Lirya - 2011-02-14

    Dearest Proxy,
    thanks for this nice description.
    I can see that under some circumstances it could maybe make sense.
    You worked out the parity checking quite well! A nice worked case.

    I thought about that long and hard and I would like to point out some facts:
    - the point bon-de-rado made: why don't you just add some padding, upload the files, and remove the padding before extracting?
    - uploading 3 copies of the same files is still (some) flooding. If you need 100 files, you will upload 300; if 10,000, you will upload 30,000, and so on. It's still the same web portal. It's not without reason that they have this checksum rejecter.
    - since you expect the same portal not to host your files correctly (i.e. errors at different locations within 3 copies of the same file), you should maybe think about switching from that portal to a hoster who can be trusted. I have some quite cheap webspace and use (S)FTP - I have never had any problems. Connection errors occur, but I can resume automatically from the last position.
    - breaking backwards compatibility: if 7z sees a file and it's not in the spec, it will quit. Padded files are not in the spec, so you would break compatibility. Though I really don't get why you can't just remove the padding before extracting. Can you clarify that?

    Will send another reply in some minutes…

     
  • Lirya

    Lirya - 2011-02-14

    So let's say we are totally cheap with our backups and would like to do some fancy risk-distribution while not being error-prone.

    So we want to do this:
    - back up a file (can be an ISO or a TAR, an incremental backup or whatever)
    - the backup file must be secure - not for anybody else to use
    - the backup file has to be 100% correct
    - the backup file can be any size
    - store said file on free web hosters (let's say there are a good number of them out there)

    list of things to consider:
    - free hosters have a limit of (let's say) 100MB per file
    - free hosters could 'lose' a file
    - transferring to and from free hosters could introduce bit errors (though these should be handled by the lower layers)
    - free hosters don't like hosting the same file more than once

    what we need:
    - strong encryption
    - compression
    - splitting files
    - more than one free hoster
    - bit/byte CRC checking

    Please correct this list if it's not true to your intentions.

    So for me it seems easy to solve this:
    1. gather the files together
    2. 7z this to one big file using encryption (password)
    3. split the file using a file splitter (into, let's say, 100MB chunks)
    4. for every chunk, save CRC data (so you can see whether that file is correct or not - you can do some par2 magic to recover stuff if you want); see the sketch at the end of this post

    Example:
    You have 1 GB of data. Split into 100MB chunks, you now have 10 files. For every file you have the CRC data or par2 recovery information.

    5. now take 3 hosters and upload every chunk to each hoster once

    When you want to try to get your data back you need to:
    a) download all chunks once from the hosters (you can mix hosters to download faster, since most have download speed restrictions)
    b) check those chunks on your local drive to see that they are ok
    c) for each chunk which is not okay, download it from the next hoster and repeat b)
    d) join the now totally correct chunks back together using the splitter again: you will get the big file again
    e) extract the big file using 7z and your password

    If you are a good scripter you could maybe script this stuff.
    Also you should make a nice database of all your chunks on those hosters.
    A daily job should go through each hoster and alert you if some chunks cannot be reached anymore.
    An automated correction job should re-upload those missing chunks using data from the other hosters.
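
    A minimal sketch of steps 3-4 above (the 100MB chunk size and file names are only examples; a real tool would add the par2 step and the uploads):

        # split_and_crc.py - split one big file into fixed-size chunks and record a CRC32 per chunk
        import zlib

        CHUNK = 100 * 1024 * 1024        # 100MB, as in the example

        def split_with_crc(big_file):
            with open(big_file, "rb") as src, open(big_file + ".crc", "w") as manifest:
                index = 1
                while True:
                    data = src.read(CHUNK)
                    if not data:
                        break
                    name = f"{big_file}.{index:03d}"
                    with open(name, "wb") as dst:
                        dst.write(data)
                    # the manifest lets you verify each downloaded chunk on its own later
                    manifest.write(f"{zlib.crc32(data) & 0xFFFFFFFF:08X}  {name}\n")
                    index += 1

        split_with_crc("backup.7z")      # hypothetical file name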

     
  • Anonymous - 2011-02-15

    - Trimming is not as simple to batch as padding - padding is just a "copy /b file1+file2 file_padded" - although with some tricks it is still possible (see the sketch below).
    But I thought that adding an EOF marker would not be so difficult if you already know how the split files are made - and just asking costs nothing.
    - Actually the checksum is not there to reject, but to substitute: let's say an online file gets corrupted or the server it sits on crashes. If you can re-upload the file, all existing references simply point to the new copy, so if you keep a database linking to all those files, no record has to be changed. So we can say the checksum is used in a "client-wise" way.
    - I'm not talking about errors while uploading: obviously, if you get such an error, you can just resume or restart the upload and make sure it completes. I'm talking about when you try to access the file later and no longer have the originals on your HD. It's not a matter of trusted or untrusted hosters: so far, out of thousands of files, I have had only one server-side error. There is just some sensitive data I want to back up for a longer time, so I want to be protected even in the worst case.
    - I don't understand what you say about backwards compatibility: if you pad one whole .7z file, it is still readable, because it is already wrapped and has the end-of-archive header. The problem arises only with split archives, which have no EOF marker in each split. If you are referring to opening new files with an older version of 7-Zip, that is like asking to open an LZMA2 .7z with the 4.xx version, which is not a fair requirement. The rule is that archives made with older 7z versions will always be readable by newer releases, while new archives only need to be readable starting from the release they were made with - and an EOF marker keeps this rule true.
    As I already said, I thought that just asking would do no harm, and it could be a nice addition for anyone who, like me, wants to keep files online. Trimming is not as natural as padding: in the best case it needs double the HD space to accommodate the new files, plus the time to generate them. Natively reading padded files would need no extra space and no extra time.
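
    For reference, a trim can also be done in place, without needing space for a second copy, as long as the original part size is known - a sketch with example names and an arbitrary 100MB part size:

        # trim_in_place.py - cut trailing padding off without creating a second copy of the file
        import os

        ORIGINAL_PART_SIZE = 100 * 1024 * 1024   # example: every non-final part is 100MB

        def trim_part(path, size=ORIGINAL_PART_SIZE):
            if os.path.getsize(path) > size:
                os.truncate(path, size)          # shrinks the file back to its original length

        trim_part("file_padded.7z.001")          # hypothetical file name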

    on your second post:
    2-3: actually there is no difference between file splitters and 7z's internal split: they both just cut the byte stream at one file and continue in the next.
    5: actually I have one hoster, with a one-time lifetime fee, unlimited space and no download bandwidth limit. Using three different hosters has some flaws:
    I would have to find as many hosters with a lifetime fee and unlimited space as I want clones of the archive, and then pay them all. Free hosters are not an option - they can delete your files after some time, or have space restrictions.

    The rest is okay, but the question remains - why make things more complicated than adding a few bytes as an EOF marker? With non-split files, this already works.
    After all, if no suggestion to integrate anything into 7-Zip can ever be accepted, I don't see why any new version should be released other than after a change in the compression algorithms…
    Actually, I downloaded the 7-Zip source, looking for a way to make a 7z mod myself, but it is not so simple…

     
  • Lirya

    Lirya - 2011-02-16

    My dear Proxy,

    after reading your last message twice and thinking about it, I think I am getting closer to really understanding your issue in all its entanglements.
    I guess we have different understandings of what padding and trimming files means.
    Also, I'm taking back my argument about the FLOODING.
    But I guess I still don't get everything right.
    Quick question though:
    Why don't you just ADD EOF markers to the split archives yourself?
    And remove them later, before re-joining the files?

     
  • Anonymous - 2011-02-16

    As I said, there is no natural way to trim files back, while the "copy" command is standard.
    If you know a way to trim files, maybe with system utilities, without having to create new files, please tell me.

    I think you need to better explain what you mean by flooding. I pay for unlimited space; is it not my right to use it? If I put 3 GB online, who cares whether it is 3 GB of different files or 3 times 1 GB of the same files? Whatever check is made online exists for one reason only - to let files always be referred to by the same link. Think of it as if you shared a file with a teammate: instead of sending him the whole file by email, you just send him a link; then there is a server problem and the file becomes unreachable or damaged; you just upload it again, and the original link you sent stays the same. If the links were always different, you would have to keep track of everyone you sent it to and update them with the new link. In short, the checksum is computed to give file references long-term stability, provided you can re-upload the file when needed.

    The problem arises if, like me, you want both to keep the files and to free local space, so you want to amplify the robustness of online retention.

    Having split files marked internally would let me read them directly - without needing to generate new files.
    However, I have figured out how I'll try to implement it:
    place at the beginning of the .7z.001 file a string which signals that padding is allowed, e.g. "enabled for padding".
    Split files always have the same size, and this value, known at creation time, can be stored next to the string above, so in the stream-decoding function I can stop reading from the current file once offset X is reached and jump to the next one.
    The only file which is not the same size is the last one, but that is not a problem at all, because the last file contains the end header, and 7-Zip already does not read past it: if you pad the end of the last file, the archive is not affected at all, because the padding sits after the end header.
    Another way would be, instead of stopping at the offset, to place a footer marker (i.e. the string above) in each split file, but this could be a source of problems, so I prefer the offset approach.
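
    A rough sketch of that offset idea (the marker layout and the reader are of course my own invention, not the current 7z format):

        # padded_split_reader.py - sketch of the "stop reading at offset X" idea for padded volumes
        import struct

        MARKER = b"enabled for padding"   # hypothetical string written at creation time

        def concatenate_volumes(parts):
            """parts: volume paths in order. Yields the archive bytes, ignoring trailing padding."""
            with open(parts[0], "rb") as f:
                if f.read(len(MARKER)) != MARKER:
                    raise ValueError("not a padding-enabled split set")
                (part_size,) = struct.unpack("<Q", f.read(8))  # payload bytes per volume
                yield f.read(part_size)            # part 1: stop at the known offset
            for part in parts[1:-1]:
                with open(part, "rb") as f:
                    yield f.read(part_size)        # middle parts: skip anything appended after X
            with open(parts[-1], "rb") as f:
                yield f.read()                     # last part: the end header already bounds the
                                                   # data, so padding after it is harmless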

    Thinking your way, no further improvement would ever make sense to integrate into 7-Zip other than internal changes to the compression algorithms, because for anything one can think of, there will always be a way, no matter how complex, to achieve it externally.
    For my part, I think that the more features are integrated in a simple way, the closer a program is to perfection.
    What I describe may be uncommon, but who knows whether it will become widespread once it is achievable in a simplified way.

    For example, look at parity files:
    par1 was integrated into WinRAR; they just called the files .rev. If you ask in forums, many people know of .rev files and almost nobody knows of the far superior par2. That's because the simple integration of an already existing feature gave it a boost.
    Now someone has made a par2rar program; this gave par2 somewhat wider exposure, but not as much as the .rev/par1 files, which ship with WinRAR itself.
    Now, just imagine if 7-Zip chose to integrate par2 generation in a transparent way: don't you think it would spread naturally?

    Returning to my argument: if people had a simple tool to make this kind of online backup, wouldn't they choose to do it? I don't know of anything else that can approach an asymptotic zero cost.
    Maybe the reason I'm the first to talk about this is that the steps involved are not for an average user, so whoever wants to keep files online sticks with far-from-optimal solutions.

     
  • SeldomGood

    SeldomGood - 2011-02-19

    The "dd.exe" program can be of help in padding and trimming: it can be found at

    http://unxutils.sourceforge.net/
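
    For example, trimming a padded part back to a known size could look like this (the file names and the 100MB size are placeholders; exact option support depends on the dd build):

        rem copy only the first 100MB (104857600 bytes) of a padded part, dropping the padding
        dd if=file_padded.7z.001 of=file.7z.001 bs=1048576 count=100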

     
  • Lirya

    Lirya - 2011-02-22

    My dearest proxy,
    thanks for your explanation - again very complete like every other one.

    I think I do understand your idea with 'implement more features' - but what you maybe don't know is:
    THIS project (7-Zip) is the "par1" here, and there already is a "par2" of 7-Zip by someone else; it performs (slightly) better on many files.
    So 7-Zip tries not to break any compatibility, because it is the standard. Of course there can be numerous spin-offs trying to do it better! You can go and give your ideas there - why not!

    As for your online-backup tool:
    if you built such a tool, that would be great.
    Maybe you could ask those file-portal guys to build it, or to fund you building it.
    I would guess that backup software should be cloud-compatible nowadays, so you might want to jump there directly.

    I still don't get why you can't just add some markers to the files after compressing and remove those very markers before decompressing, but don't bother explaining again - maybe I'm just too dumb. Also, you have your solution, on which I congratulate you.

    On another note, I had another idea: when you split a file, you know exactly how many bytes each piece is.
    Knowing this, you could re-join any file stream properly, because you would know exactly where file1 ends and file2 begins.
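
    A tiny sketch of that re-join (the chunk names, the fixed chunk size and the recorded size of the last chunk are examples):

        # rejoin.py - join chunks back by reading exactly the known number of bytes from each one
        CHUNK = 100 * 1024 * 1024                    # the size every non-final chunk was cut to

        def rejoin(chunks, out_path, last_size):
            """last_size: exact byte count of the final chunk, recorded when splitting."""
            with open(out_path, "wb") as out:
                for chunk in chunks[:-1]:
                    with open(chunk, "rb") as f:
                        out.write(f.read(CHUNK))     # anything appended beyond CHUNK is ignored
                with open(chunks[-1], "rb") as f:
                    out.write(f.read(last_size))     # likewise for the last chunk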

    Maybe the perfect solution for this whole problem would be this:
    - build a nice software solution which provides an input form (and installs some DLLs) with several steps available, which one can tick or untick:
    1. find files (where? which ones?)
    2. make it incremental (by comparing to files saved before)
    3. compress those files into one big file
    4. split the big file into chunks
    5. work some par magic to make it more redundant (also compress the par files afterwards :-) )
    6. upload the files to a specific portal (how often? which portals?)

    and vice versa.

    This software would definitely win some nice prizes if it's free and good to use.

     
  • Lirya

    Lirya - 2011-02-22

    Also, on 'what unlimited space really means' and 'why not to use those online services' there already are quite a lot of discussions.
    Some providers are introducing a 'fair use policy' which allows them to drop files that are too big if the overall free space becomes too small. Some people say that soon all of them will introduce this, and that you can't really consider your data 'safe' under these (albeit paid) circumstances.
    I myself will use the big cloud services if I need an online backup that is available really fast (Amazon S3 has a very nice offer right now), and for anything else I have my good old (S)FTP server somewhere… or an external USB drive.
    What I have to admit is that I haven't seen software which is really nice for backups!
    So feel free to contribute to that - it would definitely be worth it.

     
