Possible lost messages

Elbert
2009-08-17
2013-04-11
  • Elbert
    2009-08-17

    In my testing, I'm encountering what appears to be message loss. I'm back to testing with a 25MB file to simulate a sustained stream of data. My other option is to write a small script that opens a connection to scribe, sends a large amount of data through it, and then closes the connection.
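
    Roughly, what I have in mind is the sketch below. It assumes the Thrift-generated Python bindings for scribe (the same ones scribe_cat imports) are on the PYTHONPATH; the host, port, category, and sizes are just placeholders for my setup:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe

    def send_burst(host='localhost', port=1464, category='default',
                   payload_size=64 * 1024, count=300):
        # Framed transport + non-strict binary protocol, matching what the
        # scribe server expects (and what scribe_cat sets up).
        sock = TSocket.TSocket(host=host, port=port)
        transport = TTransport.TFramedTransport(sock)
        protocol = TBinaryProtocol.TBinaryProtocol(transport, False, False)
        client = scribe.Client(protocol, protocol)
        transport.open()

        message = 'x' * payload_size
        refused = 0
        for _ in range(count):
            entry = scribe.LogEntry(category=category, message=message)
            # Log() returns OK or TRY_LATER; TRY_LATER means the batch was
            # refused and it's the caller's job to retry or count it as lost.
            if client.Log([entry]) == scribe.ResultCode.TRY_LATER:
                refused += 1

        transport.close()
        print('attempted %d messages, %d refused with TRY_LATER' % (count, refused))

    if __name__ == '__main__':
        send_burst()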

    In a set of 300 file sends, with the receiving server up to 900Mb/s Rx and the sending server up to 738Mb/s Tx, I am observing two separate but probably related scenarios.

    1) On the sending server (which is itself receiving from scribe_cat), I get "denied for queue size" messages, but when I stop the test and let the messages catch up, the "received good" count ends up greater than the "sent" count (a sketch of reading these counters is below).

    received good: 363
    denied for queue size: 118
    sent: 152

    It just gets wedged there, and when I look for where the buffered messages should be, I don't see anything buffered. I've checked the secondary (file) store, and no messages are pending. Also, when I restart that local scribe server, it finds nothing buffered to send.

    When I check the destination scribe server, it has received the 152 messages but not the remainder.

    2) This can also happen even when I do not receive "denied for queue size" messages, as shown in this excerpt.

    received good: 300
    sent: 202
    ---

    Again, the secondary (file) store shows nothing buffered, and when I restart that local scribe server, it finds nothing buffered to send. So the messages appear to have simply disappeared. Is there another location I should be checking for the messages? The point of this scribe server is to act as a relay for an edge client, so every message it receives should be immediately turned around and forwarded to the central server.

    I've checked the two dirs listed in this config (/var/1464 and /var/elbert2), and the dirs are empty except for ./scribe_stats in /var/elbert2.
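
    For what it's worth, the "received good" / "denied for queue size" / "sent" numbers above are scribe's fb303 counters (what scribe_ctrl reports); here's a minimal sketch of reading them programmatically, assuming the fb303 Python bindings generated by thrift are importable:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from fb303 import FacebookService

    def dump_counters(host='localhost', port=1464):
        # scribe implements the fb303 admin service on its main port, so
        # getCounters() returns the server's counters (including the values
        # quoted above) as a string -> i64 map.
        sock = TSocket.TSocket(host=host, port=port)
        transport = TTransport.TFramedTransport(sock)
        protocol = TBinaryProtocol.TBinaryProtocol(transport, False, False)
        client = FacebookService.Client(protocol)
        transport.open()
        counters = client.getCounters()
        transport.close()
        for name in sorted(counters):
            print('%s: %d' % (name, counters[name]))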

    Here's my local config (the values probably aren't ideal, but I intend to tweak them to something respectable before this goes into production):

    port=1464
    max_msg_per_second=2000000
    max_queue_size=3000000000
    check_interval=3

    # DEFAULT - forward all messages to Scribe on port 1463
    <store>
    category=default
    type=buffer

    target_write_size=20480
    max_write_interval=1
    buffer_send_rate=1
    retry_interval=5
    retry_interval_range=2

    <primary>
    type=network
    remote_host=localhost
    remote_port=1463
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/var/1464
    base_filename=thisisoverwritten
    max_size=3000000
    </secondary>
    </store>

    # ADS - Send across the network only
    <store>
    category=ads
    type=buffer

    target_write_size=1096
    max_write_interval=0
    buffer_send_rate=1
    retry_interval=5
    retry_interval_range=2

    <primary>
    type=network
    remote_host=cc101.sc9.int
    remote_port=1463
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/var/elbert2
    base_filename=thisisoverwritten
    max_size=30000000
    </secondary>
    </store>

    Thanks!

    • Why are you testing with 25MB messages?  Do you expect this to be your actual workload?  I have not done any performance testing using Scribe to stream such large individual messages.

      Are you checking to see if scribe_cat returned successfully?  I am guessing that the scribe_cat script is returning 'TRY_LATER' and not actually sending the message to Scribe because scribe is currently overloaded.

      I would recommend you also increase the value of max_queue_size on all your machines.  That should definitely fix the problems you're seeing in your smaller test case.  I am not sure how high you would need to tune this value to support your 25MB stress test, but again, I think you should reconsider whether that test represents your expected use case.
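
      For example, something like this in the global section of the relay's config (the number is purely illustrative; size it to your memory budget):

      max_queue_size=5000000000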

      -Anthony

      • Elbert
        2009-08-18

        Anthony,

        I'm testing with both 67kB and 25MB message sizes. In the normal use case, I expect messages to hover in the kB range, but I was using 25MB to simulate an extreme spike in traffic; it also let me push traffic up to the maximum range I expect. I've seen traffic sustained at 900Mb/s on the receiving scribe server with no corresponding "denied for queue size" messages. In the error cases where I did encounter "denied" messages, there were corresponding TRY_LATER results returned from scribe_cat as well. However, now that I've raised max_queue_size, I no longer get the "denied" or TRY_LATER messages, not even in the runs with missing messages.

        One other interesting detail: either of my two sending servers can fail to send all of its messages, but in any given trial only one of them fails, never both.

        It may very well be an issue in scribe_cat rather than in scribe itself; I haven't looked at that code in much detail. Your point is well taken: I don't expect normal message sizes to be anywhere near this large, and in production I won't be using scribe_cat :) I will definitely make sure that any homegrown apps that use scribe enforce a cap on message size.
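
        Something like this hypothetical wrapper is what I have in mind for those apps (the cap value is made up, and the client/LogEntry types are the same Thrift-generated scribe bindings as in the earlier sketch):

        from scribe import scribe

        # Made-up cap for illustration; the real limit would be tuned per deployment.
        MAX_MESSAGE_BYTES = 64 * 1024

        def log_capped(client, category, message):
            # Refuse oversized payloads up front rather than handing the relay
            # a multi-megabyte message it was never sized for.
            if len(message) > MAX_MESSAGE_BYTES:
                raise ValueError('message is %d bytes, cap is %d'
                                 % (len(message), MAX_MESSAGE_BYTES))
            entry = scribe.LogEntry(category=category, message=message)
            return client.Log([entry])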

        The one scenario where I might send files this large would be if scribe became a drop-in replacement for scp'ing log files from remote servers to the central depot. However, any individual file would be copied once, not 300 times, and we don't plan to implement that scenario anyway.

        FYI, I just sent ~1M 67kB messages from 2 scribe servers to 1 central scribe server without dropping anything. The average rate handled was ~280 msgs/sec. Rx hovered around 40Mb/sec and scaled up to a sustained 145Mb/sec as I added more send instances. I'm pretty certain the ceilings are much higher than this.

        Thanks for all the help!