Scribe memory usage

Jon Schutz
2009-08-14
2013-04-11
  • Jon Schutz

    Jon Schutz - 2009-08-14

    My scribed is using GBs of RAM - within minutes of starting it's at a few hundred MB, then within an hour or so it has reached 2.5GB virtual, 1.8GB resident.  After a few days it had reached 4GB virtual, 3GB resident.  On a 4GB server, that's getting just a bit greedy.

    It appears to be related to the amount of data being processed.  This particular instance is handling about 3GB of logs per hour, streamed straight to file (3 files open at any given time), rotated hourly.  Other instances that are handling much less data tend to sit around 500MB virtual and almost nothing resident.

    Is this just an old fashioned memory leak, or something in the configuration?

    Using release version 2.01, linux kernels 2.6.18 & 2.6.24, config shown below:

    port=1463
    max_msg_per_second=2000000
    check_interval=3

    # DEFAULT                                                                            
    <store>
    category=default
    type=file
    file_path=/share/logs
    rotate_period=hourly
    rotate_hour=0
    rotate_minute=5
    create_symlink=yes
    max_size=10000000000
    </store>

     
    • Anthony Giardullo

      Can you try downloading the latest version of Thrift and rebuilding Scribe?  I know there were some optimizations made to Thrift a couple months ago to reduce memory usage.

      Scribe shouldn't take up that much memory for your workload.  I have some Scribe instances running that do almost 100GB per hour and use less memory.

      -Anthony

       
      • Jon Schutz

        Jon Schutz - 2009-08-17

        Tried thrift from SVN head, the latest scribe and boost-1-39, and the situation is not much different.

        It's clear that this is mainly a problem on the scribe instances that aggregate messages from upstream scribes.  Below is memory usage data from (A) the final aggregator, (B) the 4 intermediate aggregators and (C) the 60 or so servers that originate the messages:

               SIZE      RSS

        A.  6894300  3758300

        B. 10791308  4710076
            2013972  1168124
            1581852   630228
             209724    90536

        C.   401676     7124
             207220     6952
             401308     6608
             204436     4632
             204352     4612
             205032     4604
             305856     4488
             204144     4388
             235096     4112
             781528     3852
             336592     3848
             398120     3728
             688184     3444
             678848     3424
             398048     3412
             647208     3364
             230176     3084
              96852     2700
              86952     2516
        (... and lots more the same using less memory).

        Setting new_thread_per_category=false helped a little (it hasn't crashed out due to reaching the 4GB limit since), though as you can see, the usage is still quite high.
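
        For reference, here's a minimal sketch of how I have that set, at the top level of the config; the other values just mirror the config in my first post:

        port=1463
        max_msg_per_second=2000000
        check_interval=3
        new_thread_per_category=false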

        When a large backlog of messages comes through it definitely causes a step increase in memory requirement (not surprisingly).  The memory increase is not monotonic - from time to time it will drop significantly back to reasonable numbers (MB rather than GB) but soon increases again.  This is probably normal behaviour due to heap fragmentation, but nevertheless, it still seems to be allocating much more memory than it actually needs.

        I'm trying to put together a test case to demonstrate the problem in a controlled environment.  In the meantime, any other suggestions?

         
    • Jon Schutz

      Jon Schutz - 2009-08-17

      Have found the problem. The max_size on the secondary store on all servers is set to 1GB. At some stage, due to network interruption, the secondary store was used and had several 1GB files waiting to be sent. When the connection is re-established, an attempt is made to resend the whole file. On the receiving end, an error occurred (resource temporarily unavailable), so the files were put back in the queue to be resent again later.

      Thus the receiver was malloc-ing GBs at a time to handle the large message sets, and it kept happening due to all the resending.  Blew over 100GB of our bandwidth in a couple of days...

      The solution is to set max_size to a suitably smaller number (a config sketch follows the list below) - however, in some circles this particular problem might be called a bug, or at least an opportunity for improvement - ideas include:

      a. add a note to the documentation that max_size for the file is also the size of the buffer to be sent, or

      b. add a second parameter which controls the max size of any single message buffer to be sent over the network, and/or

      c. have the receiver report a count of messages received, so that full buffers do not need to be resent in their entirety in case of error.

      Just some ideas.
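
      For anyone who hits the same thing, here is a rough sketch of the kind of sender-side buffer store I mean, with max_size on the secondary (file) store turned down; the host, path and sizes here are illustrative rather than my real values:

      <store>
      category=default
      type=buffer
      retry_interval=30
      retry_interval_range=10

      <primary>
      type=network
      remote_host=aggregator.example.com
      remote_port=1463
      </primary>

      <secondary>
      type=file
      fs_type=std
      file_path=/share/scribe_buffer
      base_filename=default
      max_size=100000000
      </secondary>
      </store>

      With max_size at 100000000 (100MB) here instead of 1GB, a replayed secondary file caps any single resend at roughly that size.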

       
    • Anthony Giardullo

      Thanks for looking into this issue.  It sounds like your analysis of the problem is dead on.

      I will add a note about this to the Scribe documentation as you suggested.  (Currently the wiki is down; I have opened a ticket with SourceForge to get it back up.)

      I think your suggestion (b) sounds like a good idea for a fix if reducing max_size isn’t a good workaround for you.  With some changes that are in the most recent revision of Scribe source, this fix would probably be pretty easy to implement.  I believe only NetworkStore::handleMessages() would need to be changed.
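
      Roughly, the idea would be to cap how many bytes of buffered messages get handed to the network in one go and loop over the resulting batches.  Just to make the shape of it concrete, here is a sketch in plain C++ - it is not the actual Scribe code, and the 16MB cap and the names in it are made up:

      #include <cstddef>
      #include <iostream>
      #include <string>
      #include <vector>

      // Illustrative cap on the payload handed to a single network send.
      static const std::size_t kMaxBatchBytes = 16 * 1024 * 1024;  // 16MB

      // Split a buffered batch of messages into consecutive smaller batches,
      // each holding at most kMaxBatchBytes of payload (a single oversized
      // message still gets its own batch).  Copies the strings for simplicity.
      std::vector<std::vector<std::string> > splitIntoBatches(
          const std::vector<std::string>& messages) {
        std::vector<std::vector<std::string> > batches;
        std::vector<std::string> current;
        std::size_t currentBytes = 0;
        for (std::size_t i = 0; i < messages.size(); ++i) {
          std::size_t msgBytes = messages[i].size();
          if (!current.empty() && currentBytes + msgBytes > kMaxBatchBytes) {
            batches.push_back(current);
            current.clear();
            currentBytes = 0;
          }
          current.push_back(messages[i]);
          currentBytes += msgBytes;
        }
        if (!current.empty()) {
          batches.push_back(current);
        }
        return batches;
      }

      int main() {
        // ~100MB of 1KB messages ends up as 7 capped batches instead of one big send.
        std::vector<std::string> messages(100000, std::string(1024, 'x'));
        std::cout << splitIntoBatches(messages).size() << " batches" << std::endl;
        return 0;
      }

      The point is that the receiver then never has to allocate more than the cap for any single call, no matter how large the secondary-store file on the sender was.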

      Thanks,
      Anthony

       
