Re: metadata

Shachar Shemesh wrote:
>Tom Metro wrote:
>>Aren't you concerned that loss or corruption of 'filelist' could
>>render an entire collection of files as near useless?
> 
> No, not really. There is an unencrypted version, as well as an encrypted
> version.

There are two copies of 'filelist'? I guess I missed that in your 
write-up. Though it doesn't change the situation much.

So one copy goes into the root of the destination directory, and gets 
encrypted, and the other copy goes into the keys directory and is left 
as plain text?

>>Why choose a single file model for this data, when you choose multi
>>file model for the symmetric keys?
> 
> Because, unlike the symmetric keys, this file has to be at a known
> location, and cannot be done without.
[...]
> You need the information in filelist
> in order to find which is the file you refer to. This means that if you
> wanted to store this information in separate files, you will need to
> read each and every one of them anyways.
[...]
> If a file called "/etc/passwd" is stored in the encrypted archive as
> "As9sm23irmsk", and the only way to correlate the latter name to the
> former is through a correlation data... 

Sounds like a pretty good point, though consider the usage scenarios.

I started with the assumption that the "meta files" would retain the 
original file name.

So to extract "/etc/passwd" the program would simply read in 
"keys/etc/passwd", and get the translation to 
"dest/a97a66d03c4a/As9sm23irmsk". (I presume you're scrambling directory 
names rather than mapping to a flat hierarchy.)

Are there any scenarios in which the program would be given the 
encrypted file name, and then need to locate the meta file? If you're 
doing a batch operation (decrypting all files), you could avoid the 
issue by iterating over the meta files instead of the encrypted files. 
Or if such an operation is rare, you simply accept the overhead and 
extract the meta data from the encrypted file's header (if it gets 
stored there).

Another trick that would make navigation to meta data stored in 
individual files faster would be to create a parallel hierarchy using 
hard links (which are supported on both UNIX and NTFS). Then 
"keys/a97a66d03c4a/As9sm23irmsk" resolves to the same file as 
"keys/etc/passwd". Though I'm not convinced that this is at all necessary.

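To make the hard link idea concrete, here is a minimal Python sketch. 
The function name link_meta and the directory layout are made up for 
illustration; nothing here reflects how rsyncrypto actually lays out 
its keys directory.

```python
import os

def link_meta(keys_dir, plain_name, scrambled_name):
    """Make keys/<scrambled_name> a hard link to keys/<plain_name>.

    Both names are paths relative to keys_dir. This is a hypothetical
    helper, not part of rsyncrypto.
    """
    src = os.path.join(keys_dir, plain_name)
    dst = os.path.join(keys_dir, scrambled_name)
    # Create the scrambled directory (e.g. keys/a97a66d03c4a) if needed.
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.exists(dst):
        os.link(src, dst)  # same inode, second directory entry
    return dst
```

After link_meta(keys, "etc/passwd", "a97a66d03c4a/As9sm23irmsk"), 
either path opens the same meta file, so a lookup keyed on the 
encrypted name needs no scan.
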
As for the idea of storing the encrypted file name translation in the 
encrypted file's header...

> ...how on earth do you propose to
> store this correlation data inside "As9sm23irmsk", and encrypted at
> that. It means that if you ask to decrypt "/etc/passwd", rsyncrypto has
> to go over all the files in the archive, decrypting the private key
> header of each, and trying to locate the right one.

Yes, if you don't have the external "meta files" on hand.

I consider the meta files to be like a cache. They're nice to have to 
speed things up, but if you don't have them, and it necessitates 
decrypting all files (or their headers) in a set to find a specific 
file, that seems like a reasonable price.
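
The "cache" behaviour I have in mind looks roughly like the Python 
sketch below. Every name in it is hypothetical: meta_index stands in 
for whatever the external meta files provide, and read_header_name 
stands in for "decrypt this file's header and return the original 
name", which the current format may or may not support.

```python
def locate(plain_name, meta_index, archive, read_header_name):
    """Return the encrypted name that holds plain_name, or None.

    meta_index       -- dict: original name -> encrypted name (may be empty)
    archive          -- iterable of encrypted file names
    read_header_name -- hypothetical callable: encrypted name -> original name
    """
    hit = meta_index.get(plain_name)
    if hit is not None:
        return hit                      # fast path: meta files are on hand
    for enc_name in archive:            # slow path: scan every header
        if read_header_name(enc_name) == plain_name:
            meta_index[plain_name] = enc_name   # repopulate the cache
            return enc_name
    return None
```

The slow path is linear in the number of files, which is the price 
discussed above; the fast path is a single lookup.
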

Again, I'd consider usage scenarios. If the intended purpose of 
rsyncrypto is the storage of backup files, extraction will be a rare 
operation, and it is acceptable for it to be slow. Any user who recovers 
their files after a loss of the originals is going to be a happy user, 
and isn't going to mind that the process might be an order of magnitude 
slower than a simple copy of unencrypted files.

>>Consider that you can then use the same chunk of code to process the
>>meta data, regardless of where it was stored.
> 
> I think you are suggesting unimplementable solutions here...

Does the above clarification resolve that concern?

> Putting the data currently in the symmetric keys into the single file is
> an option to consider...

Interesting thought, but not an approach I'm voting for.

>>You said above that the symmetric key files really contain more than
>>the actual key, so why not extend it to include this additional meta
>>data?
> 
> Because the key file contains data about the encryption, while filelist
> contains data about the unencrypted file. It's just not the same thing.

Yet both are needed in order to recover the original (with the exception 
that if the 'filelist' files are lost, you're hosed).

Consider it from an operational standpoint: on initial encryption, 
you're writing stuff to both the symmetric key file and filelist 
describing how the file was packaged, and on decryption you are 
consulting both of those files to determine how to recover the original.

Practically speaking, there is little differentiating the two sources of 
information, except that a lost symmetric key is recoverable.

> This is even before I start talking about limited RSA block sizes and
> other technical problems with encrypting arbitrary length data using an
> asymmetric cipher.

That's a good point. But given the choice between storing the complete 
meta data only in external files, or taking on the extra overhead of 
storing the meta data in an additional AES encrypted chunk as part of 
the encrypted file's structure, I'd take the latter.

>>  All strings are NULL terminated.
>>
>>Seems redundant if you're storing sizes,
> 
> I wasn't aware that I was storing sizes. Not of strings, in any case.

If a block is defined like:

   == Block 0002 - Encoded File Name ==

   2 bytes : block length
   2 bytes : block type, always 0002
   string : The name of the file (ASCII)

unless you add additional variable length elements to that block, you've 
effectively defined the length of the string.
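
For example, a reader for that block could look like the Python sketch 
below. It assumes big-endian fields and that "block length" counts the 
whole block including its 4 header bytes; the write-up quoted above 
doesn't state either, so treat both as guesses.

```python
import struct

def parse_name_block(buf, offset=0):
    """Parse one 'Encoded File Name' block; return (name, next_offset).

    Assumed layout: 2-byte length (whole block), 2-byte type (0x0002),
    then the ASCII name, possibly NUL terminated.
    """
    length, btype = struct.unpack_from(">HH", buf, offset)
    if btype != 0x0002:
        raise ValueError("not a file name block")
    if length < 4 or offset + length > len(buf):
        raise ValueError("bad block length")
    # The block length alone bounds the string; the NUL is just stripped.
    payload = buf[offset + 4 : offset + length]
    return payload.rstrip(b"\0").decode("ascii"), offset + length
```

Note that the parser never searches for the terminator to find the end 
of the string.
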

>>unless you plan to pack multiple strings into a single block.
> 
> Could be necessary in the future, yes.

Right, so best to leave it as you have it.

>>What about a block and/or chunk checksum?
> 
> What good is a checksum if there is nothing you can do
> in case it's wrong?

You don't blindly create scrambled output when trying to restore a file. 
You can notify the user that there is a problem. You can write a 
recovery tool that cleans a 'filelist' by throwing out corrupted chunks, 
allowing at least partial recovery of the file set.
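
A sketch of such a recovery pass in Python, assuming each chunk gets a 
trailing 4-byte CRC-32 (my assumption; the format as written has no 
checksum field) and that the chunks have already been split apart:

```python
import struct
import zlib

def seal(chunk_body):
    """Append a big-endian CRC-32 of the chunk body (hypothetical field)."""
    return chunk_body + struct.pack(">I", zlib.crc32(chunk_body))

def salvage(sealed_chunks):
    """Return (good_bodies, bad_count), dropping chunks whose CRC fails."""
    good, bad = [], 0
    for raw in sealed_chunks:
        if len(raw) < 4:
            bad += 1
            continue
        body = raw[:-4]
        (stored,) = struct.unpack(">I", raw[-4:])
        if zlib.crc32(body) == stored:
            good.append(body)
        else:
            bad += 1    # report it to the user rather than restore garbage
    return good, bad
```

A nonzero bad count is what lets the tool warn the user instead of 
silently producing scrambled output.
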

>>I'm not sure [an end block] serves a purpose. If the file is not corrupted,
>>then the chunk header tells you when you are done...
> 
> I would be delighted to hear in what way the chunk header provides this
> information.

You define the chunk as:

   == Chunk Format ==

   Each chunk is composed of a series of specific data. The first two
   bytes are the number of data blocks in this chunk.

If I know there are N blocks in a chunk, I know I'm done processing a 
chunk after I've seen N blocks. No need for an end block marker. Same 
deal as knowing the size of a string vs. null terminated.
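
In Python terms, a count-driven reader could be as simple as the sketch 
below. Byte order (big-endian) and the meaning of "block length" (whole 
block, header included) are my assumptions, not something the write-up 
specifies.

```python
import struct

def read_chunk(buf, offset=0):
    """Read one chunk; return (blocks, next_offset).

    Assumed layout: 2-byte block count, then that many blocks, each
    starting with a 2-byte length and a 2-byte type.
    """
    (count,) = struct.unpack_from(">H", buf, offset)
    offset += 2
    blocks = []
    for _ in range(count):              # stop after N blocks, no end marker
        length, btype = struct.unpack_from(">HH", buf, offset)
        if length < 4 or offset + length > len(buf):
            raise ValueError("bad block length")
        blocks.append((btype, buf[offset + 4 : offset + length]))
        offset += length
    return blocks, offset
```

The returned offset is where the next chunk starts, so the caller can 
loop over the whole file without ever looking for FFFF.
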

>> ...and if the file is corrupted, FFFF probably isn't adequately unique
>> to assist in reconstruction.
> 
>>If you stick with the idea of a single 'filelist' file, you might also
>>want to use a magic number to mark the start of each chunk.
> 
> Why?

If the start of each chunk has a unique identifier, you can write a 
recovery tool that can re-synch after scanning past one or more corrupt 
chunks that don't have the expected block count or block size.
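
A rough Python sketch of that re-synch loop follows. The 4-byte MAGIC 
value is invented for illustration, and the chunk layout (magic, 2-byte 
block count, then blocks with a 2-byte length and 2-byte type, all 
big-endian) is my reading of the write-up plus the proposed marker.

```python
import struct

MAGIC = b"\xf1\x1e\x15\x70"   # hypothetical chunk marker

def try_parse(data, start):
    """Parse the chunk at start; return (blocks, end) or None if malformed."""
    pos = start + len(MAGIC)
    if pos + 2 > len(data):
        return None
    (count,) = struct.unpack_from(">H", data, pos)
    pos += 2
    blocks = []
    for _ in range(count):
        if pos + 4 > len(data):
            return None
        length, btype = struct.unpack_from(">HH", data, pos)
        if length < 4 or pos + length > len(data):
            return None
        blocks.append((btype, data[pos + 4 : pos + length]))
        pos += length
    return blocks, pos

def recover(data):
    """Collect every chunk that parses cleanly, skipping corrupt ones."""
    chunks = []
    pos = data.find(MAGIC)
    while pos != -1:
        parsed = try_parse(data, pos)
        if parsed is None:
            pos = data.find(MAGIC, pos + 1)   # re-synch on the next marker
        else:
            blocks, end = parsed
            chunks.append(blocks)
            pos = data.find(MAGIC, end)
    return chunks
```

This only catches damage that breaks the structure. A corrupt chunk 
whose counts and sizes still happen to line up would slip through, 
which is where a checksum earns its keep, and a marker this short could 
also turn up inside a payload by chance.
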

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: https://www.linkedin.com/e/fps/3452158/