Re: metadata

Tom Metro wrote:

> Shachar Shemesh wrote:
>
>> The symmetric key is all the information about the encryption procedure
>> (keys, parameters, etc.).
>
>
> Ah, so it already is a block of meta data.
>
> Would you mind if we came up with a new name for this file in the
> documentation?

Tom, with all due respect, I think it is time someone put their money
where their mouth is. If you want to discuss documentation changes, do
so with a patch against CVS.

> Aren't you concerned that loss or corruption of 'filelist' could
> render an entire collection of files as near useless?

No, not really. There is an unencrypted version, as well as an encrypted
version. The encrypted version is accessible through the same private
key that unlocks the rest of the encrypted files.

> Why choose a single file model for this data, when you choose multi
> file model for the symmetric keys?

Because, unlike the symmetric keys, this file has to be at a known
location, and cannot be done without. Putting the data currently in the
symmetric keys into the single file is an option to consider, but I'm
not sure I have managed to wrap my mind around all implications of doing
so yet.

Another reason just came to mind. You need the information in filelist
in order to find which file you are referring to. This means that if you
wanted to store this information in separate files, you would need to
read each and every one of them anyway.

> You said above that the symmetric key files really contain more than
> the actual key, so why not extend it to include this additional meta
> data?

Because the key file contains data about the encryption, while filelist
contains data about the unencrypted file. It's just not the same thing.

> I would think it would be worth breaking backwards compatibility for
> the vast benefits of having the block of meta data stored inside the
> file be identical to the block stored externally (with the exception
> that one is encrypted, of course).

See, that's the whole point, though, isn't it? Information crucial
for finding which file is which is of a different level of importance
than information about a specific file.

> Consider that you can then use the same chunk of code to process the
> meta data, regardless of where it was stored. And that you can ditch
> all the special case code you'll have to add for dealing with
> 'filelist'. And 'filelist', being a "sequence of 'chunks'," is
> essentially a database, which is bound to require even more code to
> manage, as well as introduce potential memory issues when dealing with
> huge file sets.

I guess I'll give you the same answer I gave you in my previous email. I
think you are suggesting unimplementable solutions here, but feel free
to prove me wrong by sending in patches against CVS. Grabbing the latest
CVS according to the instructions on the site will almost always get you
the most up to date version I have.

> I don't follow why that requires either an external file or a separate
> file. Yes, an external file is necessary to avoid needing the private
> key on decryption, but you've already got an external meta data file.
> (And if the user doesn't have the external meta data file on hand,
> then they need the private key anyway.)

If a file called "/etc/passwd" is stored in the encrypted archive as
"As9sm23irmsk", and the only way to correlate the latter name to the
former is through correlation data, how on earth do you propose to
store this correlation data inside "As9sm23irmsk", and encrypted at
that? It means that if you ask to decrypt "/etc/passwd", rsyncrypto has
to go over all the files in the archive, decrypting the private key
header of each, and trying to locate the right one.
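
To make the cost concrete, here is a minimal Python sketch of the two
lookup strategies. It is not rsyncrypto's code: the record fields, the
second file name, and the decrypt_header callback are all made up for
illustration.

```python
# Hypothetical in-memory view of filelist: one record per archived file.
# Field names are illustrative, not rsyncrypto's real on-disk layout.
filelist = [
    {"original": "/etc/passwd", "encoded": "As9sm23irmsk"},
    {"original": "/etc/hosts", "encoded": "Qp4kz81ltnvb"},
]

# With a single external filelist, one pass builds an index.
index = {rec["original"]: rec["encoded"] for rec in filelist}


def find_with_filelist(original_name):
    """One dictionary lookup, no decryption needed."""
    return index.get(original_name)


def find_without_filelist(original_name, archive, decrypt_header):
    """If the mapping lived only inside each encrypted file, headers
    would have to be decrypted one by one until the name matches.
    Returns (encoded_name, number_of_decryptions)."""
    decryptions = 0
    for encoded_name, blob in archive.items():
        decryptions += 1
        if decrypt_header(blob) == original_name:
            return encoded_name, decryptions
    return None, decryptions
```

In the worst case the second function performs one private key
operation per file in the archive, which is the scan described above.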

This is even before I start talking about limited RSA block sizes and
other technical problems with encrypting arbitrary length data using an
asymmetric cipher.
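
For a rough sense of the block size limit, the sketch below computes
the largest message a single RSA operation can carry under the two
standard PKCS #1 padding schemes. Which padding rsyncrypto uses is not
stated here, so treat the function name and its defaults as
assumptions for illustration only.

```python
def rsa_max_plaintext(modulus_bits, padding="pkcs1v15", hash_len=20):
    """Largest message, in bytes, that fits in one RSA encryption.

    pkcs1v15: k - 11 bytes; oaep: k - 2*hash_len - 2 bytes, where k is
    the modulus length in bytes (hash_len=20 corresponds to SHA-1).
    """
    k = (modulus_bits + 7) // 8
    if padding == "pkcs1v15":
        return k - 11
    if padding == "oaep":
        return k - 2 * hash_len - 2
    raise ValueError("unknown padding: " + padding)
```

A 2048-bit key therefore carries at most 245 bytes per operation with
PKCS #1 v1.5 padding, so an arbitrary length name list would have to be
split into blocks or wrapped with a symmetric key.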

> It would add to the prerequisites, but might have been less work to
> link in an XML parser. (One of the ideas behind XML is to write a
> decent parser once, and not have to reinvent one for every project.)

Could be. Too late for 1.0.0.16, but maybe in the future. See, the
parser is already written :-).

> Otherwise the data structure seems decent. A magic number, which would
> permit locating the file or meta data chunk in the event of
> corruption. Variable number of blocks, and variable size blocks. And
> the concept that unknown block types should be ignored, helping to
> maintain backwards compatibility.

Actually, the magic number only serves to identify the file in case we
need to change the basic structure in the future (say, moving the file
over to XML format), while maintaining backwards compatibility. In any
case, thanks for giving me marks.

>   A writer must always issue all mandatory blocks for the file version
>   generated by it (as determined by the magic number at the start of the
>   file).
>
> You might want to make the magic number fixed and have the version be
> a separate attribute. Other programs/tools might want to be able to
> recognize the magic number, but only your program needs to be able to
> interpret the contents.

It's easier to just switch magics if something fundamental needs to be
changed. This also saves the trouble of trying to figure out how to
handle version 5 with magic 2 etc.

>   All strings are NULL terminated.
>
> Seems redundant if you're storing sizes,

I wasn't aware that I was storing sizes. Not of strings, in any case.

> unless you plan to pack multiple strings into a single block.

Could be necessary in the future, yes.

>   All blocks start on a file offset that is 4 bytes aligned. If a
>   natural block size is not a multiple of 4, writers must pad the block
>   with zero (null) bytes. The block length must include the padding, and
>   must divide by 4.
>
> What's the benefit of this? A bit of a performance boost once the
> structure is put into word-aligned memory?

Exactly.
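
The padding rule quoted above amounts to rounding each block length up
to the next multiple of 4. A minimal Python sketch, with helper names
that are mine rather than anything in rsyncrypto:

```python
def padded_length(natural_len, align=4):
    """Round natural_len up to the next multiple of align."""
    return (natural_len + align - 1) // align * align


def pad_block(payload, align=4):
    """Append zero (null) bytes until the payload length divides by align."""
    missing = padded_length(len(payload), align) - len(payload)
    return payload + b"\x00" * missing
```

A payload that is already a multiple of 4 bytes long is returned
unchanged.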

> What about a block and/or chunk checksum?

What about them? What good is a checksum if there is nothing you can do
in case it's wrong?

>   == Block FFFF - End of Chunk ==
>
>   Writers must place this block at the end of each chunk. Readers should
>   assume that any data after this chunk is the beginning of the next
>   chunk.
>
> I'm not sure that serves a purpose. If the file is not corrupted, then
> the chunk header tells you when you are done, and if the file is
> corrupted, FFFF probably isn't adequately unique to assist in
> reconstruction.

I would be delighted to hear in what way the chunk header provides this
information.
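
To show why the end marker matters, here is a small Python sketch of a
reader that splits a stream of blocks into chunks. The byte layout is
an assumption made for the example (a little-endian 16-bit type, a
16-bit payload length that includes padding, then the payload); the
real format may differ, and the file magic is assumed to have been
consumed already. Only the block type numbers come from the format
document.

```python
import struct

END_OF_CHUNK = 0xFFFF
KNOWN = {
    0x0000: "platform",
    0x0001: "original_name",
    0x0002: "encoded_name",
    0x0003: "posix_mode",
}


def make_block(btype, payload=b""):
    """Build one block under the assumed layout, zero padded to 4 bytes."""
    payload += b"\x00" * (-len(payload) % 4)
    return struct.pack("<HH", btype, len(payload)) + payload


def read_chunks(data):
    """Split data into chunks; each chunk maps block name to raw payload."""
    chunks, current, pos = [], {}, 0
    while pos + 4 <= len(data):
        btype, blen = struct.unpack_from("<HH", data, pos)
        payload = data[pos + 4:pos + 4 + blen]
        pos += 4 + blen
        if btype == END_OF_CHUNK:
            # Nothing else says where a chunk ends, so this block does.
            chunks.append(current)
            current = {}
        elif btype in KNOWN:
            current[KNOWN[btype]] = payload
        # Unknown block types are skipped, as the format requires.
    return chunks
```

Without the FFFF block the reader would have no way to tell where one
file's blocks stop and the next file's blocks start.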

> If you stick with the idea of a single 'filelist' file, you might also
> want to use a magic number to mark the start of each chunk.

Why?

>   == Block 0000 - Platform ==
>   == Block 0001 - Original File Name ==
>   == Block 0002 - Encoded File Name ==
>   == Block 0003 - Posix File Permission ==
>
> What about an MD5 or SHA digest of the file, or is that stored
> elsewhere? What about the original file size, which could be utilized
> by -c?

Good ideas. I'll be expecting the patch by end of next week, which is
when 0.16 must, come rain or high water, be released.

> In your document you might also want to address that you aren't
> scrambling the files' time stamps, which theoretically is a leak of
> information, but a necessity in order for rsync to operate.

The document documents the filelist file format.

             Shachar