Hello,

I found that the metaserver sometimes crashes silently during I/O operations (I was using the Fs2Kfs tool). The symptoms are: the last message in the log is "Starting layout for req:xxx", and the backtrace from the core dump looks like this:
------
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000457740 in boost::detail::atomic_increment (pw=0x49)
    at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:66
66          );
(gdb) bt
#0  0x0000000000457740 in boost::detail::atomic_increment (pw=0x49)
    at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:66
#1  0x00000000004577b5 in boost::detail::sp_counted_base::add_ref_copy (this=0x41)
    at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:133
#2  0x00000000004578b6 in shared_count (this=0x6e4938, r=@0x706da8) at /usr/include/boost/detail/shared_count.hpp:170
#3  0x000000000045795b in shared_ptr (this=0x6e4930) at /usr/include/boost/shared_ptr.hpp:106
#4  0x0000000000457c28 in __gnu_cxx::new_allocator<boost::shared_ptr<KFS::ChunkServer> >::construct (this=0x7094e0,
    __p=0x6e4930, __val=@0x706da0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/ext/new_allocator.h:104
#5  0x000000000048f961 in std::vector<boost::shared_ptr<KFS::ChunkServer>, std::allocator<boost::shared_ptr<KFS::ChunkServer> > >::push_back (this=0x7094e0, __x=@0x706da0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:606
#6  0x0000000000485153 in KFS::LayoutManager::AllocateChunk (this=0x6da3e0, r=0x709480)
    at /root/kosmosfs-0.1.2/kfs/src/cc/meta/LayoutManager.cc:359
#7  0x000000000046034b in handle_allocate (r=0x709480) at /root/kosmosfs-0.1.2/kfs/src/cc/meta/request.cc:413
#8  0x000000000045c035 in KFS::process_request () at /root/kosmosfs-0.1.2/kfs/src/cc/meta/request.cc:693
#9  0x000000000046f8c7 in request_consumer (dummy=0x0) at /root/kosmosfs-0.1.2/kfs/src/cc/meta/startup.cc:82
#10 0x0000003e936062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003e922d0fbd in clone () from /lib64/libc.so.6
--------

The cause is a typo in LayoutManager::AllocateChunk(MetaAllocate *r) (file src/cc/meta/LayoutManager.cc). The following code

    for (i = 0; r->servers.size() < (uint32_t) r->numReplicas &&
            i < mChunkServers.size(); i++) {

must be replaced with

    for (i = 0; r->servers.size() < (uint32_t) r->numReplicas &&
            i < candidates.size(); i++) {
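
To illustrate why the loop bound matters, here is a minimal standalone sketch. It is not the KFS code: Server, ServerPtr and pickServers are hypothetical names, and I am assuming (based on the push_back frame in the backtrace) that the loop body copies candidates[i] into r->servers. The sketch uses at() instead of operator[], so an index past the end of candidates throws std::out_of_range instead of reading garbage memory as the metaserver did.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for KFS::ChunkServer; not the real class.
struct Server {
    std::string name;
};
using ServerPtr = std::shared_ptr<Server>;

// Copies up to numReplicas entries out of `candidates`.
// `bound` plays the role of the loop limit from the snippets above:
// pass candidates.size() for the fixed loop, or the size of the full
// server list to reproduce the typo.
// Assumption: the real loop body does the equivalent of
// r->servers.push_back(candidates[i]).
std::vector<ServerPtr> pickServers(const std::vector<ServerPtr>& candidates,
                                   std::size_t numReplicas,
                                   std::size_t bound) {
    std::vector<ServerPtr> picked;
    for (std::size_t i = 0; picked.size() < numReplicas && i < bound; i++) {
        // at() checks the index; operator[] would silently read past
        // the end of `candidates` when bound > candidates.size().
        picked.push_back(candidates.at(i));
    }
    return picked;
}
```

With 5 known servers, 2 candidates and 3 requested replicas, calling pickServers(candidates, 3, candidates.size()) returns the 2 candidates, while passing the size of the full server list as the bound walks off the end of candidates and throws. If there are at least as many candidates as requested replicas, the wrong bound is never reached, which would be consistent with the crash happening only sometimes.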


Sriram is aware of this issue and is including the fix in the next release.



--
Best regards,
Alexey Timanovsky.