My scribe daemon is using gigabytes of RAM - within minutes of starting it's at a few hundred MB, then within an hour or so it has reached 2.5GB virtual, 1.8GB resident. After a few days it reached 4GB virtual, 3GB resident. On a 4GB server, that's getting just a bit greedy.
It appears to be roughly proportional to the amount of data being processed. This particular instance is handling about 3GB of logs per hour, streamed straight to file (3 files open at any given time), rotated hourly. Other instances that handle much less data tend to sit around 500MB virtual and almost nothing resident.
Is this just an old fashioned memory leak, or something in the configuration?
Using release version 2.01, Linux kernels 2.6.18 & 2.6.24, config shown below:
Can you try downloading the latest version of Thrift and rebuilding Scribe? I know there were some optimizations made to Thrift a couple months ago to reduce memory usage.
Scribe shouldn't take up that much memory for your workload. I have some Scribe instances running that do almost 100GB per hour and use less memory.
Tried Thrift from SVN head, the latest Scribe, and Boost 1.39 - the situation is not much different.
It's clear that this is mainly a problem on the Scribe instances that aggregate messages from upstream Scribes. Below is memory usage data from (A) the final aggregator, (B) the 4 intermediate aggregators and (C) the 60 or so servers that originate the messages:
A. 6894300 virtual, 3758300 resident
B. 10791308 virtual, 4710076 resident
C. 401676 virtual, 7124 resident
(... and lots more like C, using even less memory).
Setting new_thread_per_category=false helped a little (it hasn't crashed out from reaching the 4GB limit since), though as you can see, the usage is still quite high.
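For anyone following along, here's roughly where that setting goes - a minimal scribe.conf sketch with a buffer store; the host names, paths, and sizes are illustrative, not our actual config:

```
port=1463
new_thread_per_category=false

<store>
category=default
type=buffer
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=aggregator.example.com
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/var/tmp/scribe
base_filename=default
# caps the size of each buffer file written while the network is down
max_size=104857600
</secondary>
</store>
```

new_thread_per_category is a global setting, while max_size belongs to the file store used as the buffer's secondary.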
When a large backlog of messages comes through, it definitely causes a step increase in memory usage (not surprisingly). The increase is not monotonic - from time to time usage drops significantly back to reasonable numbers (MB rather than GB), but it soon climbs again. This is probably normal behaviour due to heap fragmentation, but it still seems to be allocating much more memory than it actually needs.
I'm trying to put together a test case to demonstrate the problem in a controlled environment. In the meantime, any other suggestions?
Have found the problem. max_size on the secondary store on all servers is set to 1GB. At some stage, due to a network interruption, the secondary store was used and had several 1GB files waiting to be sent. When the connection was re-established, an attempt was made to resend each whole file. On the receiving end, an error occurred (resource temporarily unavailable), so the files were put back in the queue to be resent again later.
Thus the receiver was malloc-ing gigabytes at a time to handle the large message sets, and it kept happening due to all the resending. Blew through over 100GB of our bandwidth in a couple of days...
The solution is to set max_size to a suitably smaller number - however, in some circles this particular behaviour might be called a bug, or at least an opportunity for improvement. Ideas include:
a. add a note to the documentation that max_size for the file is also the size of the buffer to be sent, or
b. add a second parameter which controls the max size of any single message buffer to be sent over the network, and/or
c. have the receiver report a count of messages received, so that full buffers do not need to be resent in their entirety in case of error.
Just some ideas.
Thanks for taking a look into this issue. It sounds like your analysis of this problem is dead on.
I will add a note about this to the Scribe documentation as you suggested. (Currently the wiki is down and I have opened a ticket with SourceForge to get it back up.)
I think your suggestion (b) sounds like a good idea for a fix if reducing max_size isn’t a good workaround for you. With some changes that are in the most recent revision of Scribe source, this fix would probably be pretty easy to implement. I believe only NetworkStore::handleMessages() would need to be changed.