From: Peter B C. <pe...@sh...> - 2003-01-13 12:12:39
Hello Michael,

On Monday, January 13, 2003, 1:19:36 AM, you wrote:

MN> Hi,

MN> Before this lull in the discussion becomes an extended
MN> break, I thought I would propose a detailed optional extension
MN> that allows for a Packed format - one that supports greater than
MN> 32k files by allowing files to start on non-slice boundaries.

MN> In this optional extension, there would be 2 new packet
MN> types - one that substitutes for the Main Packet and another that
MN> substitutes for the Recovery Slice Packet.

MN> The Packed Main Packet would be the same as the Main
MN> Packet but add a new field - the sub-slice size. The sub-slice
MN> size must evenly divide the slice size. The sub-slice size
MN> must be a multiple of 4. Files always start on the next sub-slice
MN> boundary. File order is the same as for the original Main Packet.

Sounds good.

MN> The Packed Recovery Slice Packet has exactly the same
MN> format as the Recovery Slice Packet. Only the type field in the
MN> header would differ.

I don't understand the reason for this. Since there would be no change in the meaning of the packet or in the method of computing it, there is little point in having a "Packed Recovery Slice Packet".

MN> This is basically what I proposed before. A possible
MN> modification to it would be to allow the sub-slice boundary to be
MN> equal to 1, which would allow files packed end-to-end as in
MN> Peter's initial proposal. I don't think that is necessary,
MN> because setting the sub-slice size to 4 has at most 3 bytes of
MN> padding and keeps files 4-byte aligned, which we've already said
MN> is a good thing.

It is not the total amount of padding that was most important in my original suggestion; it is the fact that there is any padding at all between files. The difference between using 4-byte sub-slices and 1-byte sub-slices is very dramatic:

1) With 4-byte sub-slices you are still limited to 32K files.
2) With 1-byte sub-slices there is no limit on the number of files.
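To make the padding trade-off concrete, here is a minimal sketch (the function name and toy file sizes are illustrative, not from the PAR2 spec) of a packer that rounds each file's start up to the next sub-slice boundary and reports the total padding inserted:

```python
def total_padding(file_sizes, subslice_size):
    """Bytes of padding needed when each file starts on a sub-slice boundary."""
    offset = 0
    padding = 0
    for size in file_sizes:
        # Round the current offset up to the next sub-slice boundary.
        rem = offset % subslice_size
        if rem:
            pad = subslice_size - rem
            offset += pad
            padding += pad
        offset += size
    return padding

files = [5, 3, 10, 7]             # toy file sizes in bytes
print(total_padding(files, 4))    # -> 6: up to 3 bytes lost before each file
print(total_padding(files, 1))    # -> 0: files packed end-to-end, no padding
```

With a sub-slice size of 1 the rounding is always a no-op, which is exactly the "packed end-to-end" case: no padding at all, regardless of how many files there are.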
In general, when handling very large numbers of files, the sub-slice size would have to be roughly equal to the average file size. In that case you would have a considerable amount of wastage due to padding, and you would also be forced to use extremely large RS matrices. The only time you could use a smaller sub-slice size is when the number of files is small (and the slice size is already smaller than the average file size); using sub-slices would then reduce the padding as much as you want.

Only when you completely eliminate all padding between files do you completely remove the limit on how many files can be handled, and you eliminate the wastage due to padding at the same time. Additionally, you would be able to use much smaller RS matrices. As has been discussed, however, both the use of sub-slices and complete packing introduce the potential to require multiple RS matrices when recovering (although this can be mitigated if sufficient extra recovery packets are available). The 4-byte alignment issue for structures was mainly a memory optimisation, and need not be related to the way files are packed or padded.

MN> As always, comments are welcome and encouraged.

MN> Mike

If we are going to start putting in official proposals for new packets, then I would like to add the "Alternate File Verification Packet". It would be identical to the existing "Input File Slice Checksum Packet" except that it would have an extra field specifying the slice size used for verification of that file. The purpose of the packet is to permit file verification to take place with a different granularity from that used when error recovery is carried out.

e.g. With 1GB of data, instead of using 4000 x 256KB slices for both verification and RS computation, you could choose to use 1000 x 1MB slices for RS computation and 8000 x 128KB slices for verification.
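In round binary figures (taking the data size as roughly 1 GiB, which is what the three slice-count figures imply, so the exact counts below are illustrative), the granularity split works out as:

```python
GIB = 1024 ** 3              # assumed input data size, ~1 GiB

rs_slice = 1024 * 1024       # 1 MB slices for RS computation
vfy_slice = 128 * 1024       # 128 KB slices for verification

rs_count = GIB // rs_slice   # 1024 RS slices (~1000 in the example)
vfy_count = GIB // vfy_slice # 8192 verification slices (~8000 in the example)

# Because the verification slice size evenly divides the RS slice size,
# each RS slice spans a whole number of verification slices, so damage
# located at verification granularity maps cleanly onto RS slices.
print(rs_count, vfy_count, rs_slice // vfy_slice)  # -> 1024 8192 8
```

The key design point is that the two granularities nest: verification can be fine-grained without forcing the RS matrix to grow to match.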
Dropping from 4000 to 1000 slices for RS computation results in a significant speed-up (i.e. 4 times as fast), but a reduction in recovery probability; increasing to 8000 slices for verification increases the likelihood of finding usable data within individual articles when posted on Usenet (and hence reduces the amount of recovery data needed).

-- 
Best regards,
Peter