I have looked at http://tukaani.org/xz/xz-file-format.txt and it seems it is possible to extract bytes n..n+1000000 from an .xz archive without having to read it all. Using the output from 'xz --verbose --robot --list', I would think it would be fairly easy to build such an extractor, but does one already exist? (e.g. xzbytes --from 10000 --to 1000000 foo.xz)
In bioinformatics we use compression a lot: many file formats are text files. Some are line based (e.g. SAM), some come in quad-lines (e.g. FASTQ), and some use multiple lines per record (e.g. FASTA). Unfortunately, most of them are variable length.
Common to them is that they are record based. For several tasks it would be handy to be able to extract records x..y without having to extract 0..x. To do that we need an index. Looking at XZ's indexes, it ought to be fairly easy to build on top of them: have an index name (e.g. NEWLINE) and, for every block, an entry saying 'record number, uncompressed start byte':
The above would indicate that line 1244566 starts at byte 102838340. Using the uncompressed byte and the standard XZ index, it is easy to determine which blocks to unpack.
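The lookup this enables can be sketched in a few lines of Python. The index entries below are hypothetical (only the pair 1244566/102838340 comes from the example above); the idea is just a binary search over per-block '(first record number, uncompressed start byte)' pairs:

```python
import bisect

# Hypothetical per-block index: (first record number, uncompressed start byte).
# Only the middle entry matches the example above; the rest are made up.
index = [(0, 0), (1244566, 102838340), (2500000, 205000000)]

def block_for_record(record_number):
    """Return (block position, index entry) for the block holding the record."""
    first_records = [entry[0] for entry in index]
    i = bisect.bisect_right(first_records, record_number) - 1
    return i, index[i]
```

For instance, block_for_record(1300000) returns (1, (1244566, 102838340)), i.e. start decompressing at the second block, whose first record begins at uncompressed byte 102838340.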
The NEWLINE index would cover both SAM and FASTQ (you just need to multiply the record number by 4 to get the line number), but it would not work for FASTA. For FASTA a separate index is needed:
This indicates that record 10433 starts at uncompressed byte 2030451.
If the decoder of the index has no knowledge of the format (e.g. FASTA), then the decoder will need to know the beginning of each record. One way of doing that is simply a list of record lengths in bytes, starting with record 0:
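Turning such a length list into record start offsets is just a prefix sum. A minimal sketch, with made-up record lengths:

```python
from itertools import accumulate

# Hypothetical record lengths in bytes, starting with record 0.
lengths = [120, 98, 240, 75]

# Prefix sums over all but the last length give each record's
# uncompressed start byte; record 0 starts at byte 0.
starts = [0] + list(accumulate(lengths[:-1]))
# starts == [0, 120, 218, 458]
```

So the decoder only needs to store lengths; the absolute offsets can be reconstructed on load.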
It would be fairly easy to implement such indexes in separate files, but I would find it so much more beautiful if it could be contained in the .xz-file itself.
Do you see a way of opening up for making user configured indexes?
XZ for Java has an easy-to-use random access implementation. I still don't have comparable C code. It can be done with liblzma, and xz does it to some extent for the --list option, but actual random-access decompression isn't done by the xz tool.
I think I didn't fully understand the details of your use case but below are a few thoughts.
If the sizes of records don't vary too much, maybe it would be good enough to put, say, a thousand records per XZ Block. To find record 2345 one would seek to the third XZ Block and skip the first 345 records to find the beginning of record 2345.
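The block arithmetic in this scheme is a single divmod; the records-per-block constant here is the assumed fixed count from the paragraph above:

```python
RECORDS_PER_BLOCK = 1000  # assumption: a fixed number of records per XZ Block

def locate(record_number):
    """Map a record number to (block position, records to skip in that block)."""
    return divmod(record_number, RECORDS_PER_BLOCK)

# locate(2345) == (2, 345): decompress the third block, skip 345 records.
```

No per-record index is needed at all, at the cost of decoding up to a block's worth of unwanted records.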
If the above isn't good enough, then one possibility is to encode an index of the byte offsets of the records into the last XZ Block (or even a separate XZ Stream at the end of the file). One would need to decompress that first and keep it in RAM for random access. A downside is that if one decompresses the file as a whole, the index will also appear at the end of the uncompressed output, which might be unacceptable.
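The downside can be demonstrated with Python's lzma module, which decompresses concatenated .xz streams as one. The record contents and offsets below are made up; the point is only that the trailing index stream leaks into plain decompression:

```python
import lzma

# Sketch: record data in one .xz stream, a byte-offset index in a
# second, trailing stream (hypothetical records of 5 bytes each).
data_stream = lzma.compress(b"rec0\nrec1\nrec2\n")
index_stream = lzma.compress(b"0 5 10")  # record start offsets
combined = data_stream + index_stream

# Plain decompression walks both streams, so the index text ends up
# appended to the data -- the downside mentioned above.
out = lzma.decompress(combined)
# out == b"rec0\nrec1\nrec2\n0 5 10"
```

A random-access reader, by contrast, would locate and decompress only the trailing stream first, then use the offsets to seek within the data stream.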
I have planned to add metadata support to XZ. The record index could be put into a metadata part of the file, and thus it wouldn't be visible when the file is decompressed in the normal way, since normal decompression will skip over the metadata. This idea isn't usable in the near future, since metadata isn't even in the .xz spec yet.