Is there a cap on how fast scribe can receive messages? I've been doing some performance testing, and it looks like whenever I hit 100Mb/sec, scribe begins to fall behind. Here's my scenario.
I have 3 test servers, each with identical hardware and OS (Debian, kernel 2.6.26-2-amd64, 8 CPUs, 16GB RAM), scribe 2.01, Thrift version 20080411-exported, and I'm testing via the example scribe_cat.
I have used two test files: one is 67k and one is 25MB. The results seem to be the same in both cases.
For the 67k file, I'm testing with a for loop.
On servers A & B:
for i in `seq 1 100000`; do cat testfile | ./scribe_cat -h localhost:1464 ads; done
I can see the log scrolling as each message is sent. To see how much bandwidth this could consume, I ran several concurrent instances of the test, and once the aggregate rate, as measured on server C, reached 100Mb/sec, servers A & B both got backed up: I start seeing failure messages from server C in the logs of servers A & B, and they begin writing to their secondary store, which is local disk.
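The concurrent instances were just backgrounded copies of the same loop, roughly like this (the count of 8 here is arbitrary):

    for j in `seq 1 8`; do
        # each backgrounded inner loop is one concurrent sender
        for i in `seq 1 100000`; do cat testfile | ./scribe_cat -h localhost:1464 ads; done &
    done
    wait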
I observe the same thing when I use the testfile of 25MB, which makes sense.
Should I be able to achieve more than 100Mb/sec reliably? I don't expect any individual server to send data that quickly, but in the future I want a scenario where over 100 servers talk to 1 central scribe server (or possibly a cluster of highly available, load-balanced scribe servers using Linux-HA), and I expect the aggregate sustained rate to constantly exceed 100Mb/s.
Also, is it possible to throttle the rate not by messages, but by packets/sec or Mb/sec? Regardless of how large the test file is, it's sent as 1 message, so throttling by msgs/sec isn't going to work.
Finally, I tried scenarios where I started multiple scribed instances on servers A & B to test their behavior sending to 1 scribed on server C. Is this recommended? Or do you recommend having at most 1 scribed instance on a given server at any time, with all helper scripts talking to just that instance? I can see the config files getting pretty large and hairy in that scenario.
Servers A & B are acting as clients sending messages to server C.
Config file for A&B:
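Roughly like the following — a buffer store with a network primary pointing at server C and a file secondary on local disk, matching the failover behavior described above (the hostname, ports, and paths are placeholders, and the shape follows the stock scribe buffer-store example, not my exact file):

    # client-side scribed on servers A & B (placeholder values)
    port=1464
    max_msg_per_second=2000000
    check_interval=3

    <store>
    category=default
    type=buffer

    retry_interval=30
    retry_interval_range=10

    <primary>
    type=network
    # assumption: server C's hostname and port
    remote_host=serverC
    remote_port=1464
    </primary>

    <secondary>
    type=file
    fs_type=std
    # assumption: local-disk spill path used when server C is unreachable
    file_path=/tmp/scribetest
    base_filename=ads
    max_size=3000000
    </secondary>
    </store>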
Config for server C:
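Also roughly, with placeholder paths — a buffer store writing to local disk. Note that no max_queue_size is set, so the default applies (which turns out to matter below):

    # central scribed on server C (placeholder values)
    port=1464
    max_msg_per_second=2000000
    check_interval=3

    <store>
    category=default
    type=buffer

    <primary>
    type=file
    fs_type=std
    # assumption: where server C lands the aggregated logs
    file_path=/data/scribe
    base_filename=ads
    max_size=1000000000
    rotate_period=hourly
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/tmp/scribe
    base_filename=ads
    max_size=3000000
    </secondary>
    </store>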
I forgot to include: the NICs are all gigabit.
Check Scribe's counters on server C (see the examples directory to learn how to do this). From your description, it sounds like the 'denied for queue size' counter is increasing. Scribe has a configurable limit on how much data it will buffer in memory before it tells the upstream Scribe servers to throttle their sending; while Scribe is waiting on disk I/O, this in-memory buffer can fill up.
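If you have the stock scribe_ctrl script from the examples directory, checking should be something like this (assuming server C is listening on port 1464):

    ./scribe_ctrl counters 1464

Watch whether the 'denied for queue size' value climbs while the senders are backed up.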
I would suggest setting max_queue_size at the beginning of server C's config file. The default value is 5000000 bytes, so try setting it higher and see what happens. If you increase it to, say, 50-100MB, you should get better throughput on your 67k-file test case. I have servers running that can do over 300Mb/sec, depending on how fast disk I/O is on each server.
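Concretely, that's a single line in the global section at the top of the config, before any <store> blocks; 100000000 here is the 100MB end of that range:

    # global section of server C's config; value in bytes
    max_queue_size=100000000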
I don't routinely test messages as large as 25MB, but I would imagine you'd need to increase max_queue_size even further for server C to process many large messages from many servers. Unless you're planning to use Scribe to send messages this large, I would test by sending many smaller messages instead.
Let me know if this works.
That indeed helps. I added it to both scribe server configs, and I'm now able to sustain 500Mb/sec. I'm trying to see how close I can get to gigabit speed: 500Mb/sec seems perfectly stable, and I was able to reach 810Mb/s, but after about 60 seconds in that range I began to see 'denied for queue size' messages. However, once I stopped generating new traffic, the queued messages caught up quickly.