From: Peter B C. <pe...@sh...> - 2003-01-13 12:12:39
Hello Michael,

On Monday, January 13, 2003, 1:19:36 AM, you wrote:

MN> Hi,

MN> Before this lull in the discussion becomes an extended
MN> break, I thought I would propose a detailed optional extension
MN> that allows for a Packed format - one that supports greater than
MN> 32k files by allowing files to start on non-slice boundaries.

MN> In this optional extension, there would be 2 new packet
MN> types - one that substitutes for the Main Packet and another that
MN> substitutes for the Recovery Slice Packet.

MN> The Packed Main Packet would be the same as the Main
MN> Packet but add a new field - the sub-slice size. The sub-slice
MN> size must evenly divide the slice size. The sub-slice size
MN> must be a multiple of 4. Files always start on the next sub-slice
MN> boundary. File order is the same as for the original Main Packet.

Sounds good.

MN> The Packed Recovery Slice Packet has exactly the same
MN> format as the Recovery Slice Packet. Only the type field in the
MN> header would differ.

I don't understand the reason for this. Since there would be no change in the meaning of the packet or in the method of computing it, there is little point in having a "Packed Recovery Slice Packet".

MN> This is basically what I proposed before. A possible
MN> modification to it would be to allow the sub-slice boundary to be
MN> equal to 1, which would allow files packed end-to-end as in
MN> Peter's initial proposal. I don't think that is necessary,
MN> because setting the sub-slice size to 4 has at most 3 bytes of
MN> padding and keeps files 4-byte aligned, which we've already said
MN> is a good thing.

It is not the total amount of padding that was most important in my original suggestion; it is the fact that there is any padding at all between files. The difference between using 4-byte sub-slices and 1-byte sub-slices is very dramatic:

1) With 4-byte sub-slices you are still limited to 32K files.
2) With 1-byte sub-slices there is no limit on the number of files.
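To make the padding trade-off concrete, here is a minimal sketch (the function name and toy file sizes are illustrative, not from the PAR2 spec) of a packer that rounds each file's start up to the next sub-slice boundary and reports the total padding inserted:

```python
def total_padding(file_sizes, subslice_size):
    """Bytes of padding needed when each file starts on a sub-slice boundary."""
    offset = 0
    padding = 0
    for size in file_sizes:
        # Round the current offset up to the next sub-slice boundary.
        rem = offset % subslice_size
        if rem:
            pad = subslice_size - rem
            offset += pad
            padding += pad
        offset += size
    return padding

files = [5, 3, 10, 7]             # toy file sizes in bytes
print(total_padding(files, 4))    # -> 6: up to 3 bytes lost before each file
print(total_padding(files, 1))    # -> 0: files packed end-to-end, no padding
```

With a sub-slice size of 1 the rounding is always a no-op, which is exactly the "packed end-to-end" case: no padding at all, regardless of how many files there are.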
In general, when handling very large numbers of files, the sub-slice size would have to be roughly equal to the average file size. In that case you would have a considerable amount of wastage due to padding, and you would also be forced to use extremely large RS matrices. The only time you could use a smaller sub-slice size is when the number of files is small (and the slice size is already smaller than the average file size); using sub-slices would then reduce the padding as much as you want.

Only when you completely eliminate all padding between files do you completely remove the limit on how many files can be handled, and you eliminate the wastage due to padding at the same time. Additionally, you would be able to use much smaller RS matrices. As has been discussed, however, both the use of sub-slices and complete packing introduce the potential to require multiple RS matrices when recovering (although this can be mitigated if sufficient extra recovery packets are available). The 4-byte alignment issue for structures was mainly a memory optimisation, and need not be related to the way files are packed or padded.

MN> As always, comments are welcome and encouraged.

MN> Mike

If we are going to start putting in official proposals for new packets, then I would like to add the "Alternate File Verification Packet". It would be identical to the existing "Input File Slice Checksum Packet" except that it would have an extra field specifying the slice size used for verification of that file. The purpose of the packet is to permit file verification to take place with a different granularity from that used when error recovery is carried out.

e.g. With 1GB of data, instead of using 4000 x 256KB slices for both verification and RS computation, you could choose to use 1000 x 1MB slices for RS computation and 8000 x 128KB slices for verification.
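In round binary figures (taking the data size as roughly 1 GiB, which is what the three slice-count figures imply, so the exact counts below are illustrative), the granularity split works out as:

```python
GIB = 1024 ** 3              # assumed input data size, ~1 GiB

rs_slice = 1024 * 1024       # 1 MB slices for RS computation
vfy_slice = 128 * 1024       # 128 KB slices for verification

rs_count = GIB // rs_slice   # 1024 RS slices (~1000 in the example)
vfy_count = GIB // vfy_slice # 8192 verification slices (~8000 in the example)

# Because the verification slice size evenly divides the RS slice size,
# each RS slice spans a whole number of verification slices, so damage
# located at verification granularity maps cleanly onto RS slices.
print(rs_count, vfy_count, rs_slice // vfy_slice)  # -> 1024 8192 8
```

The key design point is that the two granularities nest: verification can be fine-grained without forcing the RS matrix to grow to match.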
Dropping from 4000 to 1000 slices for RS computation results in a significant speed-up (i.e. 4 times as fast), but a reduction in recovery probability; increasing to 8000 slices for verification increases the likelihood of finding usable data within individual articles when posted on Usenet (and hence reduces the amount of recovery data needed).

-- 
Best regards,
Peter