> 1. The chunks have a logical size of 64MB, but they seem to have a
> variable real size. GFS2 is moving to a smaller chunk size; is there
> any reason not to allow the chunk size to vary, either fixed on a
> per-file basis or varying on a block-to-block basis? This would have some
> implications for seeking within a file where records were being appended
> to multiple blocks simultaneously, but perhaps the benefits outweigh the costs.
The code is set up with a 64MB chunk size---the largest any chunk in a
file can be. You can change the constant, but it would still be fixed for
the deployment. The code changes to allow varying the max chunk size on a
per-file or per-block basis are non-trivial.
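To make the trade-off concrete, here is a minimal sketch (not KFS code; the constant and helper names are illustrative) of why a fixed max chunk size is convenient: an offset maps to a chunk with plain arithmetic, with no per-file or per-chunk lookup table.

```python
# Illustrative only: with a fixed max chunk size (a build-time constant,
# as in the current code), a byte offset maps to a chunk by arithmetic.
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB

def locate(offset):
    """Return (chunk_index, offset_within_chunk) for a file byte offset."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE
```

With variable chunk sizes, this computation would instead require consulting per-file metadata for every seek, which is part of why the change is non-trivial.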
> 2. given that it is now possible to append "records" atomically is there
> any interest in providing the record size and/or schema
> (protobufs/avro/etc.) as metadata?
The record sizes are variable. The schema etc. are outside the FS; the
app can append the schema as a "record" at the beginning of the file.
> 3. if there were records what about variable size records a la
> protobufs, avro, etc. ?
This is supported today.
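Since record framing is left to the application, one common convention (hypothetical here, not a KFS wire format) is a length prefix per record; the schema record mentioned above would simply be the first record written this way:

```python
import struct

# Hypothetical app-level framing: each record is a 4-byte big-endian
# length prefix followed by the payload. Not a KFS wire format.
def pack_record(payload: bytes) -> bytes:
    return struct.pack(">I", len(payload)) + payload

def unpack_records(buf: bytes):
    """Yield payloads from a buffer of concatenated length-prefixed records."""
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">I", buf, pos)
        yield buf[pos + 4 : pos + 4 + n]
        pos += 4 + n
```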
> 4. If there were variable-size records, would there be interest in having
> an index which would permit seeking to different record numbers?
This'd be nice to have (otherwise, you have to build an index outside the
record-appended file). The chunkservers could conceivably build the
index as they receive records and then stash the entries into a separate
file; so, whenever a new chunk is allocated for record append, you could
also allocate a new "index" chunk into which the index records go.
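The index idea above can be sketched as follows (a toy in-memory version; the class and method names are hypothetical, and a chunkserver would persist entries like these in the parallel "index" chunk):

```python
class RecordAppendIndex:
    """Toy index: maps record number -> (offset, length).

    A chunkserver could emit one such entry per appended record into a
    separate "index" file, enabling seeks to arbitrary record numbers.
    """
    def __init__(self):
        self.entries = []  # (offset, length) per record, in append order
        self.end = 0       # current end-of-file offset

    def on_append(self, record_len):
        """Record the placement of a newly appended record."""
        self.entries.append((self.end, record_len))
        self.end += record_len

    def seek_to(self, record_no):
        """Return the (offset, length) needed to read record record_no."""
        return self.entries[record_no]
```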
> 5. If there were variable-size records and an index, would there be any
> interest in being able to write/update records which are variable size
> and still preserve the ability to seek/read to other records (in other
> words, change the chunk size to handle the new record size)?
You could do this as long as you don't blow past the max chunk size.
Alternately, you could "delete" the old record and append the new one.
> It seems like this would be a logical extension of the latest work
> (which looks really nice BTW).
> On 6/7/2010 10:23 PM, Sriram Rao wrote:
>> We are happy to provide a new release of KFS (kfs-0.5). This release
>> adds new features (particularly, atomic record append) as well as
>> stability/performance improvements over previous releases. In a bit of detail:
>> 1. Add support for atomic record append. This capability enables multiple
>> writers to append records to a file. Writers can be writing to the same chunk of a
>> file or to different chunks of a file. The system guarantees that records will
>> not be split across chunk boundaries. The support for atomic record append
>> entails three parts: (1) metaserver to allocate chunks for append, (2)
>> chunkserver to receive data from multiple clients and interleave the data, and
>> (3) support in the client to construct records and send them to chunkserver.
>> To limit the # of concurrent writers to a chunk, we employ a space reservation
>> policy: clients reserve space on the chunkserver; if the reservation fails, the
>> client will interface with the metaserver, ask for a new allocation,
>> and retry. The atomic record append
>> operation can be used, for instance, to do log aggregation in a
>> cluster: Logger processes on individual
>> nodes in a cluster open a file in KFS and atomically append log
>> records to the file.
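From the client's side, the space-reservation policy described in item 1 might look roughly like this (a sketch only: the `metaserver` and `chunk` objects and their methods are stand-ins, not the real KFS client API):

```python
# Sketch of the client-side reservation loop described above.
# `metaserver` and the chunk objects it returns are hypothetical stand-ins.
def record_append(metaserver, fid, record):
    chunk = metaserver.allocate_chunk_for_append(fid)
    while not chunk.reserve(len(record)):
        # Reservation failed (too many concurrent writers on this chunk):
        # go back to the metaserver for a new allocation and retry.
        chunk = metaserver.allocate_chunk_for_append(fid)
    return chunk.append(record)
```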
>> 2. Add reliability support in record append. To the writer, the reliability
>> protocol provides exactly-once semantics. If the writer can't determine whether
>> the write is committed at the multiple servers, the write will fail.
>> 3. Add support for doing adler-32 using Intel's IPP (Intel's performance
>> primitives library).
>> 4. Add support for a "chunk coalesce" operation: data written to chunks in
>> different files can be coalesced into a single file. That is, a container
>> file can be created and content from different files can be appended to the
>> container file.
>> 5. Add support for async read/write in the KFS client. With async read/write,
>> the app can issue reads for data from multiple files/chunks concurrently; the
>> client code multiplexes I/O from multiple chunkservers concurrently.
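A rough analogue of the multiplexed reads in item 5, using a thread pool (the `read_chunk` callable is a hypothetical stand-in for a per-chunk read against a chunkserver, not the KFS client API):

```python
from concurrent.futures import ThreadPoolExecutor

# Rough analogue of async reads: issue reads for many (file, chunk) pairs
# with several in flight at once, returning results in request order.
def read_many(read_chunk, requests, max_in_flight=8):
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(lambda req: read_chunk(*req), requests))
```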
>> 6. Add a rebalancer tool: the tool takes as input chunk sizes/locations and
>> then constructs a plan (which lists what chunk needs to be moved where); the
>> plan should then be uploaded to a running metaserver, which then executes the
>> plan.
>> 7. Modifications to the write protocol for reliability. On each write sync,
>> the client sends the adler-32 over the data that it has sent; the chunk master
>> and the replicas must agree on the checksums; otherwise, the write fails
>> and the client will retry.
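The checksum agreement in item 7 can be illustrated with Python's standard `zlib.adler32` (the actual wire protocol between client, master, and replicas is simplified away here):

```python
import zlib

# The master and each replica compute adler-32 over the data the client
# sent; the write commits only if every copy matches the client's checksum.
def write_committed(client_data, replica_copies):
    expected = zlib.adler32(client_data)
    return all(zlib.adler32(copy) == expected for copy in replica_copies)
```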
>> 8. Performance tweaks to the metaserver for scale.
>> Acknowledgments: The code for atomic record append and the other features
>> that comprise the 0.5 release was funded solely by Quantcast Corp. It
>> was work done by Mike Ovassianikov and Sriram Rao. Thanks, Quantcast!
>> ThinkGeek and WIRED's GeekDad team up for the Ultimate
>> GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
>> lucky parental unit. See the prize list and enter to win:
>> Kosmosfs-users mailing list
> Kosmosfs-devel mailing list