Is my understanding correct that lessfs will work well with deduping files that are tarred as long as they are not compressed with gzip?
Compression will most certainly defeat deduplication, but tar itself may also present a problem. Consider a tar archive containing two files, 'a' and 'b', concatenated in that order. If the size of 'a' changes by even a single byte, the block alignment of 'b' shifts. This can be solved with 'post' (offline) deduplication using variable block sizes, but that is a much more complicated technique than block-based inline deduplication, which considers each data block as it is written.
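To see why a one-byte shift defeats fixed-block inline dedup, here is a small illustration (a sketch only, not lessfs code; the 4096-byte block size and SHA-256 hashing are assumptions):

```python
import hashlib

BLOCKSIZE = 4096  # assumed lessfs block size


def block_hashes(data: bytes, blocksize: int = BLOCKSIZE):
    """Hash every fixed-size block, as an inline deduplicator would."""
    return {hashlib.sha256(data[i:i + blocksize]).hexdigest()
            for i in range(0, len(data), blocksize)}


# file 'a': uniform filler; file 'b': 16 KiB of non-repeating data
a = b"A" * 10000
b_data = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))

archive1 = a + b_data          # tar-like concatenation: 'a' then 'b'
archive2 = a + b"X" + b_data   # 'a' grew by one byte; 'b' is now misaligned

shared = block_hashes(archive1) & block_hashes(archive2)
print(len(shared))  # 1 -- only the all-'A' blocks of 'a' still match;
                    # every block covering 'b' changed, although 'b' did not
```

So the unchanged 16 KiB of 'b' contributes nothing to dedup once its alignment shifts.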
For some versions of 'tar' it is possible to adjust the blocking factor from the default of 20 (a 10240-byte record size) to something that matches the underlying filesystem. Changing it to a power of two (e.g. 'tar -b 8' for 4096-byte records) might solve the problem, but I have not tested this myself.
I might add that, in contrast to 'tar', 'dump' stores raw disk blocks for each file in the filesystem and will thus never affect the block alignment of the files as they are backed up. 'dump' is limited to ext2/ext3 filesystems, so this may or may not be useful to you.
The blocking factor in tar affects the I/O size to/from the medium. Aligning this with the block size selected for the lessfs filesystem would likely yield better results, but it won't affect the internal layout of the file. At least it didn't in my testing.
A suggested feature would be a pluggable module that detects certain file types and optimizes their storage. The lessfs_write routine could check the incoming file type (borrowing detection code from the "file" command) and hand the I/O off to a pluggable module, so that new file types could be added easily. That module would be responsible for breaking the stream into appropriate blocks, and it would also be called during lessfs_read to reconstitute the original stream. For a detected tar stream, each member could be split so that file data always begins on a block boundary. This would greatly enhance the filesystem's ability to dedup data. One can imagine a plugin for most archive formats (zip, tar, cpio, pax, etc.).
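As a rough illustration of that tar-plugin idea (a sketch only, not lessfs code: `member_chunks`, the 4096-byte block size, and SHA-256 are all hypothetical), chunking each member's data from its own offset 0 makes an unchanged file produce identical chunks no matter where it sits in the archive:

```python
import hashlib
import io
import tarfile


def member_chunks(stream: bytes, blocksize: int = 4096):
    """Chunk a tar stream for dedup: restart the chunking at every
    member, so each file's data begins on a chunk boundary. A real
    plugin would also have to store the headers and a chunk map so
    that lessfs_read could reconstitute the original byte stream."""
    hashes = set()
    with tarfile.open(fileobj=io.BytesIO(stream)) as tf:
        for member in tf:
            f = tf.extractfile(member)
            data = f.read() if f else b""
            for i in range(0, len(data), blocksize):
                hashes.add(hashlib.sha256(data[i:i + blocksize]).hexdigest())
    return hashes


def make_tar(files):
    """Build an uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


big = b"payload-" * 3000  # ~24 KB file that appears in both archives
t1 = make_tar([("a", b"x" * 100), ("big", big)])
t2 = make_tar([("a", b"x" * 101), ("big", big)])  # 'a' grew by one byte

# with per-member chunking, every chunk of 'big' is shared between the
# two archives, despite the size change of the file stored before it
print(member_chunks(t1) & member_chunks(t2) >= member_chunks(make_tar([("big", big)])))  # True
```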
The consequences of such a scheme from the file system's point of view are mind-boggling; consider a request to seek() to an arbitrary offset of a file stored this way. Where is the block data located on disk when block != offset/blocksize? The first and foremost priority of any file system must be to guarantee that whatever you put in comes out exactly the same, within a reasonably predictable time. I'm not saying it can't be done, but it would require a completely different approach to the design of the file system.
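Concretely, once a plugin has inserted alignment padding, a seek() can no longer be served by simple division; the filesystem would need something like a per-file extent map. A minimal sketch (all names and numbers below are invented for illustration):

```python
import bisect

# Hypothetical extent map for one stored file: each entry maps a run
# of logical bytes to its position in the realigned on-disk stream.
# (logical_start, physical_start, length)
extents = [(0, 0, 1808), (1808, 4096, 4096), (5904, 8192, 4096)]


def logical_to_physical(offset: int) -> int:
    """Serve a seek(): find the extent covering `offset`. The answer
    is no longer simply offset // blocksize."""
    starts = [e[0] for e in extents]
    i = bisect.bisect_right(starts, offset) - 1
    lstart, pstart, length = extents[i]
    assert lstart <= offset < lstart + length, "hole or out of range"
    return pstart + (offset - lstart)


print(logical_to_physical(2000))  # 4288: second extent, 192 bytes in
```

Every read and write would have to go through such a translation, and the map itself would have to be kept crash-consistent, which is exactly the kind of redesign meant above.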
A more viable approach would probably be to add some kind of file-to-block alignment as an optional feature in tar itself, to make it more suitable for block-level deduplication in general. Such a tweak could probably be made backwards compatible with very little effort. Of course, this is slightly outside the scope of lessfs :-)
I totally agree with Andreas on this subject. The file-to-block alignment should be implemented in tar or whatever is used to archive the data.