#3 Pseudo code for avoiding memory leak?

v1.0 (example)
closed
nobody
None
5
2005-02-09
2004-11-22
Anonymous
No

I'm using Judy to insert IP addresses from web requests
as the Index with a custom Value, along with the time of
insertion for the ip address in a separate JudyL array.
Every 10 seconds or so I iterate through the Judy arrays
to dump the current contents of the judy array to a
hash file on disk, and to expire the judy entries based on
the times of insertion. With just 128 ip addresses
memory usage has already grown to 1548 as reported by
JudyLMemUsed. After a day I see about 1gig used
according to top, with the 10 second snapshot file still
showing ip's in the low hundreds. Is there some
recommended pseudo code to deal with such situations
in a robust manner or common type of race condition I
should be looking out for?

Discussion

  • Logged In: YES
    user_id=558029

    Dear xxx:

    My first reply just got eaten by windows. So this reply is
    going to be shorter -- sorry.

    I actually do not understand your question. So I will tell you
    what I do know. JudyL keeps track of how much memory it
    malloc()s and free()s. JudyLMemUsed() simply reports the
    current total. JudyLFreeArray() free()s all the memory in the
    array. If the total does not result zero (0), it reports a
    corrupt array. So I do not think there is a memory leak in
    Judy itself. The memory usage of 1548 bytes for 128 Indexes
    is about normal for JudyL. I take it you are using IP address
    as Indexes to one Judy array and some kind of a time stamp
    as an Index to another -- with the IP address as the value.
    The word "expire" suggest that you are doing a JudyLDel() of
    old IP addresses. If a pre-existing IP address shows up, are
    you also deleting the old time stamp?. Hopefully, you are not
    accessing the arrays from separate threads without
    approprate locks?. You did not mention the OS that you are
    using? Judy can store a lot in 1gig or memory. Does
    JudyLMemUsed report that amount of memory used? Most
    malloc()s will use about 5% more memory than JudyLMemUsed
    () reports. See "work_load_analyzer" in the Judy Examples
    section on http://judy.sf.net for a method of using sequence
    numbers for aging numbers -- ideas for a different solution to
    your problem.

    Good Luck,

    Doug Baskins

     
  • Logged In: NO

    Thanks for your response, I reviewed the workload
    example shortly after you posted that comment. I
    checked the system today and currently I see
    JudyLMemUsed showing 19600 for each of my Judy
    Arrays, so they are for sure in sync, and and I have 2091
    records in my hash file which is just an iteration through
    one of the Judy arrays. That memory usage seems right
    but if I look at top on the system I see SIZE 2525M, RSS
    434M, SHARE 1500. This is on GNU/Linux, kernel 2.6.8.1,
    gcc 3.3.2, libc-2.2.4. I believe thats the original libc from
    when the system was installed as RedHat 7.1. Just to
    reviewe what my application does, a JudyL array is
    populated by IP Address for the Index, and the Value's
    are unique codes that are fed to the app as well for
    insertions. Every so often I iterate through the entire
    JudyL array and output these IP and unique-code pairs to
    a hash file. To keep the JudyL array from growing
    exceedingly large, I routinely expire items from this first
    JudyL array by keeping a second JudyL array that again
    has the same IP addresses as the Index, but 32bit
    insertion times for the values. When I output my
    complete IPs and unique-code pairs, I also expire as
    needed and this seems to be working fine as my two
    JudyL arrays are of reasonable size, and exactly the
    same size. Looking at your comments and the work flow
    example you cited it seems I may be able to eliminate
    the bulky second JudyL array for expirations and instead
    use a Judy1 array, for example I currently a full day as
    my expiration interval, or 86400 seconds, but
    alternatively I could for example expire by inserting an
    initial 86400 IPs into the Judy1 array, and then later as I
    iterate through my entire IP list I can use J1C to count
    how far away that IP is from the end of the Judy1 array.
    If its beyond 86400, I remove it from Judy1, and my
    initial JudyL, hence cleaning up both of them. I think this
    could work, and I should probably go ahead and
    implement it, and if I still have memory issues I'll have
    to provide an even simpler test case to try to narrow
    down what the memory bloat is.

     
  • Logged In: YES
    user_id=558029

    Dear Mr. Nobody:

    I did not put a "serialize" or "unload" function in
    JudyHS() because I did not think of a good way to
    specify how to do that. (Read: the sematics of the
    calling routine(s)). It is fairly easy to get the
    following information from a JudyHS array.

    1: The number of different length of strings (character
    arrays). This number is typically fairly small --
    less than 200, if the maximum length string is less
    than 200 bytes.

    2: The number of strings of each length. If there were 200
    different lengths, then an array of 200 is necessary to
    contain the number of each length. (For example
    suppose you have 10 strings of length 3, 4500 or length
    4, 1000 of length 50...)

    Off the top of my head, this suggests something like:

    Word_t NLengths; // number of different length strings
    Word_t *PLengths; // array of number of each length
    void **PBuffers; // array of pointer to each length

    /* get the number of different lengths of strings */
    NLengths = JudyHSnumberOfLengths(JHSArray);
    if (NLengths == 0) goto Done;

    /* get memory to store number (sorted) each length of string */
    PLengths = (Word_t *)malloc(NLengths * sizeof(Word_t) );
    if (PLengths == NULL) goto MallocError;

    /* and memory to store the buffer addresses of each length */
    PBuffers = (void **)malloc(NLengths * sizeof(Word_t) );
    if (PBuffers == NULL) goto MallocError;

    /* get the array of lengths -- I.E. 200 in above example */
    Error = JudyHSGetNumbOfEachLength(JHSArray, PLengths, NLengths);
    if (Error) goto CallerBug;

    /* get memory of each length of string and fill buffer */
    for (ii = 0; ii < NLengths; ii++)
    {
    Word_t Length = PLengths[ii];
    void *Buffer = (void *)malloc(Length);

    Error = JudyHSGetStringsOfLength(JHSArray, Buffer, Length);
    if (Error) goto CallerBug;

    PBuffers[ii] = Buffer;
    }

    Now we have N buffers full of strings N different lengths. How
    you want to store this buffers on the disk is up to you.

    Notes:
    1) There is still a small problem that the "non-word" length
    strings are not necessarly on a word boundry.
    2) The strings in each buffer are not in any special order
    (hash).

    3) If JudyHS.c is compiled with -DDONOTUSEHASH, then all strings
    will be returned in sorted order in the buffers.

    4) There may be some "endian" problems if the strings are
    stored on a
    different endian machine they are read -- I need to
    think on that one.

    Let me know if you think this is reasonable, let me know and I
    will implement it. Or perhaps you just laugh.

    Doug Baskins

     
  • Logged In: YES
    user_id=558029

    OOPS, I put the next response in the wrong place and I do
    not know how to move it. So just ignore the next response by me

    Doug Baskins

     
  • Logged In: NO

    That certainly does look interesting. Just so you know the
    way I've been doing this currently is to include the equivalent
    of cdbmake code from the package cdb-0.75 into my
    program, so that as I iterate through a JudyL array I write out
    an encoded .cdb hash file. I was thinking the other day that
    my memory exhaustion problem may be do to not closing the
    file properly before I rename the new .cdb over the old one,
    so I will look into this. In my case the benefit of having the IP
    address pre-decoded as ascii text in JudyHS wouldn't outway
    the benefit of leaving the IP in its original compact 32bit as
    received from the network and placed into a JudyL. Although
    if I'm outputting these hash snapshots frequently, and there
    aren't too many IP's, I could use the extra memory to save
    cpu speed as your 3hour document explains between
    memory/cpu tradeoffs ;) perhaps there would be a benefit
    there. In any case I'd like to ask you what you think about
    just using an occasional memcpy of the entire JudyL or
    JudyHS structure and placing it into a file. Then, all my
    applications that open and read from this frequently
    updated .cdb can instead use the Judy library and mmap the
    JudyL or JudyHS data from file to retrieve what it needs. I've
    used .cdb files in this manner for many purposes, though not
    mmaped, with frequent opens and closes it is still high
    performing because the data is stored in the systems buffer
    cache. The Judy file may benefit from the mmap so its easily
    treated as memory. Finally could you comment again about
    the use of J1C for my uses of expiration, am I correct in
    assuming that to do a daily or other time interval expiration I
    need a JudyL with time values but for just keeping to a max
    number of items in a Judy structure to a specific level I can
    use J1C and count to the end, just like the work load
    example, and am I correct that removing off the first item
    from a Judy1 will always be the oldest item added to it?

     
  • Logged In: NO

    Actually I realize a memcpy to file wouldn't work because the
    data references itself everewhere within the Judy structure.
    Perhaps a followup function would be needed to run through
    the data in linear time to map the pointer addresses to a
    standard spot or offset, like the start of the Judy structure,
    so when its reloaded it can be adjusted again -- I think you
    may have mentioned its quicker to just run the data through
    the JLI insert functions again to repopulate from scratch.
    Currently for my app I keep Judy arrays up to date and in
    sync on a few machines by multicasting packets with small
    insert and delete payloads in them. But, I do use the cdb
    output elsewhere. I'm still tracing down my memory leak and
    related problems, I hope to have that cleared up soon :)
    Thanks for your time,

     
  • Logged In: NO

    I ended up having to run alloc_free() a few times for every
    update of the cdb file, otherwise the few alloc calls that get
    run each time for cdb_make build up over time and consume
    vast amounts of memory. So, my problem memory problem is
    solved and this thread can be closed out, though I am still
    looking forward to variations of that JudyHS output that was
    described here, and potentially other ways of saving Judy
    data to file and back ;)

     
  • Logged In: YES
    user_id=558029

    Dear nobody:
    I still do not know all the requirements of your application.
    So, given what I do know, I have the below suggestion. I do
    not know if the size of your "cache" is in the hundreds,
    thousands or millions. With Gigabyte sizes of RAM, the cache
    can easily be in the 10's of millions -- which is probably
    enough for the entire planet.

    a) I will assume that the size of the "cache" of IP
    addresses is static (fixed in size).

    b) When the cache is full and a new (not present in the cache)
    IP address needs inserting, then you want to "retire" the IP
    the address that was accessed in the most distant past.

    c) You want an up to date version of the "cache" to reside
    in a disk file that may be accessed by other programs.

    Try a memory mapped file of the "cache" ie[100000]. The
    minumum
    data per each entry would be the IP address and a
    "sequence"
    number that indicate the "age" of the last access to that IP
    address.

    typedef struct IP_ENTRY {
    Word_t ie_IP;
    Word_t ie_sequence;
    xxx_t ie_anything_else_you_need;
    } ie_t, *Pie_t;

    ie_t ie[100000]; // a static cache of IP's

    Next use 2 JudyL arrays:

    1) maps IP to the index into ie[]. Note you must not use the
    first entry -- ie[0] entry, because you will need the zero to
    tell if it is a new IP address to be inserted (from JudyLIns()).

    2) maps sequence number into ie[];

    Everytime an IP is encountered the sequence number is
    updated and
    if the cache is full, the oldest sequence number is "retired"
    from the
    2nd JudyL array. The first JudyL array would find the entry
    into ie[],
    given the IP address and the second JudyL array would tell
    you the
    oldest entry when it is necessary to "retire" one.

    I have left out some of the details, but I hope you get the idea.
    The sequence number will eventually roll over, so that needs
    a little more thinking.

    Doug Baskins

     
  • Troy Heber
    Troy Heber
    2005-02-09

    • status: open --> closed