#14 Impossible to serialize JudyHS

Judy-1.0.0
closed-out-of-date
nobody
None
5
2005-02-09
2004-10-22
Jeff Bowden
No

My application needs to be able to store and load
arrays that are the result of expensive computations.
I was able to do this quite easily with JudyL and
JudySL because of the presence of the First and Next
functions, but since JudyHS lacks these functions it
seems that such arrays cannot possibly be serialized.
Am I missing something? Is it inherent in the design
that walking through the tree is impossible? The order
in which First/Next go through the keys is not
important for this purpose.

Discussion

  • Logged In: NO

    Jeff:
    Excellent comment. The First and Next functions did not
    seem appriate for JudyHS. I would be happy to add the
    "unload" function to JudyHS. It is actually fairly easy --
    start with the JudyHSFreeArray code -- it has to delete all
    entrys. Anyway, please give me the name and sematics of the
    function you want and I should be able to write it in
    several days. (There is a load/unload function in Judy1/L,
    but I was never happy with the sematics, so I never
    documented it, and of course never wrote one for JudyHS --
    yet). Looking forward to getting this unfinished part of
    JudyHS done.

    Doug Baskins <dougbaskins,AT.yahoo.com>

     
  • Jeff Bowden
    Jeff Bowden
    2004-10-23

    Logged In: YES
    user_id=321043

    Well, I was initially thinking something like

    JHSF(PV, PArray, PIndex, Count)
    JHSN(PV, PArray, PIndex, Count)

    but after looking at the implementation I see that this
    might not be the most efficient interface. Perhaps some
    sort of opaque struct could be defined to hold the iteration
    state and be required parameter:

    IterStateStruct IterState;
    JHSF(PV, PArray, PIndex, Count, IterState);
    JHSN(PV, PArray, PIndex, Count, IterState);

     
  • Jeff Bowden
    Jeff Bowden
    2004-10-23

    Logged In: YES
    user_id=321043

    After a little more thought I realize what I suggest is
    still not ideal. For serializing it would be best if there
    was some way to know

    - how long the longest index string
    - how long the next index string in the iteration

    That way you could have the index string be copied into the
    output buffer directly and know when it needs to be flushed.
    The length-of-next-index doesn't need to be exact either,
    it just needs to be an upper bound so that the user can
    efficiently avoid buffer overruns. If the tree were
    guaranteed to be walked in order of longest strings to
    shortest strings then you could avoid having an additional
    call to get the next length and just have the user assume it
    will be <= the current length. So the full API would be:

    Date: 2004-10-23 16:12
    Sender: offense
    Logged In: YES
    user_id=321043

    Well, I was initially thinking something like

    JHSF(PV, PArray, PIndex, Count)
    JHSN(PV, PArray, PIndex, Count)

    but after looking at the implementation I see that this
    might not be the most efficient interface. Perhaps some
    sort of opaque struct could be defined to hold the iteration
    state and be required parameter:

    IterStateStruct IterState;

    JHSLongest(PArray, Count);
    JHSF(PV, PArray, PIndex, Count, IterState);
    JHSN(PV, PArray, PIndex, Count, IterState);

    where it is guaranteed that Count from JHSF will be equal to
    Count from JHSLongest and Count from each call to JHSN will
    be less than or equal Count from the last call to JHSN.

     
  • Jeff Bowden
    Jeff Bowden
    2004-10-23

    Logged In: YES
    user_id=321043

    Had more thoughts: Could eliminate the oddball JHSLongest
    function by using the passed-in value of Count as an upper
    limit. If Count was too small, it comes back modified to
    the actual size required to hold the index string. Usage
    would be something like:

    #include <Judy.h>

    #define INITIAL_BUF 4000
    #define HEAD_SIZE (2 * sizeof(Word_t))

    void serializeHS(int fd, Pcvoid_t array) {
    char *buf;
    Word_t buflen, bufsize, count;
    PWord_t pv;

    bufsize = INITIAL_BUF;
    buf = malloc(bufsize);
    buflen = 0;
    count = 0;

    JHSF(pv, array, buf, count);
    bufsize = max(bufsize * 2, HEAD_SIZE + count);
    buf = malloc(bufsize);
    JHSF(pv, array, buf + HEAD_SIZE, count);
    while (pv != NULL) {
    ((PWord_t)(buf + buflen))[0] = count;
    ((PWord_t)(buf + buflen))[1] = pv[0];
    buflen += count;
    count = bufsize - HEAD_SIZE;
    JHSN(pv, array, buf + buflen + HEAD_SIZE, count);
    if (count > bufsize - HEAD_SIZE) {
    write(fd, buf, buflen);
    if (count + HEAD_SIZE > bufsize) {
    free(buf);
    bufsize = max(bufsize * 2, HEAD_SIZE + count);
    buf = malloc(bufsize);
    }
    buflen = 0;
    JHSN(pv, array, buf + HEAD_SIZE, count);
    }
    }
    if (buflen) {
    write(fd, buf, buflen);
    }
    free(buf);
    }

    Ooops, I forgot IterStruct, but you get the idea.

     
  • Logged In: YES
    user_id=558029

    Dear offense:

    Your code dated 2004-10-23 does not handle the case of
    where the "Value area" may contain a pointer to some data
    you want to save.

    As a comment, JudyHS keeps an array internally of the
    number of each length of string and how many of each they
    are. These are fairly fast access. Perhaps a "walk" routine
    with a callback function that is passed the "string and length"
    and the "Value" pointer from each string?. However, I hate
    callback functions ever since I tried using the c-library qsort
    () and friends.

    Another comment: Inside the test program
    test/StringCompare.c there is a package called "JLHash". It
    is almost as fast as JudyHS, but not as memory efficient. It
    is a very simple implementation -- similar to a hash scheme,
    but it uses a JudyL array as a 2^32 element hash table.
    Walking the JudyL hash table and then walking the (very few)
    synonyms in a linked list is easy code to write. "JLHashIns(),
    JLHashGet(), JLHashDel() and JLHashFreeArray() are
    implemented. I will probably be writting up these routines as
    a "simple" application note.

    Doug Baskins

     
  • Jeff Bowden
    Jeff Bowden
    2004-10-26

    Logged In: YES
    user_id=321043

    Yeah, it wasn't meant to be a general purpose serializer. I
    don't even know if it compiles as I just typed it in
    directly to this web comment box. It was just to show an
    example of how the proposed API could be used.

    I'll take a look at JLHash.

    Thanks.

     
  • Logged In: NO

    Hi.

    I am having a similar issue with JudyHS.

    I am certainly interested in addition of methods to walk the
    tree and whatever other enhacenments there may be
    (perhaps all ideas put in as a separate test file).

    Thanks in advance.

     
  • Logged In: YES
    user_id=558029

    Dear Offense and Nobody:

    I did not put a "serialize" or "unload" function in
    JudyHS() because I did not think of a good way to
    specify how to do that. (Read: the sematics of the
    calling routine(s)). It is fairly easy to get the
    following information from a JudyHS array.

    1: The number of different length of strings (character
    arrays). This number is typically fairly small --
    less than 200, if the maximum length string is less
    than 200 bytes.

    2: The number of strings of each length. If there were 200
    different lengths, then an array of 200 is necessary to
    contain the number (and size) of each length. (For example
    suppose you have 10 strings of length 3, 4500 or length
    4, 1000 of length 50...)

    Off the top of my head, this suggests something like:

    Word_t NLengths; // number of different length strings
    Word_t *PLengths; // array of number of strings of
    each length
    void **PBuffers; // array of pointer to each length

    /* get the number of different lengths of strings */
    NLengths = JudyHSnumberOfLengths(JHSArray);
    if (NLengths == 0) goto Done;

    /* get memory to store number (sorted) each length of string */
    PLengths = (Word_t *)malloc(NLengths * sizeof(Word_t) );
    if (PLengths == NULL) goto MallocError;

    /* and memory to store the buffer addresses of each length */
    PBuffers = (void **)malloc(NLengths * sizeof(Word_t) );
    if (PBuffers == NULL) goto MallocError;

    /* get the array of lengths -- I.E. 200 in above example */
    Error = JudyHSGetNumbOfEachLength(JHSArray, PLengths, NLengths);
    if (Error) goto CallerBug;

    /* get memory of each length of string and fill buffer */
    for (ii = 0; ii < NLengths; ii++)
    {
    Word_t TBytes = PLengths[ii] * ;
    void *Buffer = (void *)malloc(TBytes);

    Error = JudyHSGetStringsOfLength(JHSArray, Buffer, TBytes);
    if (Error) goto CallerBug;

    PBuffers[ii] = Buffer;
    }

    Now we have N buffers full of strings N different lengths. How
    you want to store this buffers on the disk is up to you.

    Notes:
    1) There is still a small problem that the "non-word" length
    strings are not necessarly on a word boundry.
    2) The strings in each buffer are not in any special order
    (hash).

    3) If JudyHS.c is compiled with -DDONOTUSEHASH, then all strings
    will be returned in sorted (numerical and dictionary
    because they are the
    same length) order in the buffers.

    Let me know if you think this is reasonable, and I will
    implement it. Or perhaps you will just laugh. There must
    be some simple way? How do
    people store a (variable number of) strings (non-text) of
    different lengths
    in a file? At least JudyHS() keeps, internally, strings of
    the same length
    in its own table.

    OOPS, I just realized the above code does not tell you what
    the lengths
    of each size of string is and the potential "Values"
    associated with each string. The size of the "Values" are
    only known by the programer that put them there -- not Judy.
    Perhaps sombody will fix that with improved code.
    Then I will implement it. Please, give me a general
    solution, not one that
    just solves your "special" problem. Perhaps a array
    structs (tuples?) of
    1) Buffer pointer,
    2) Length of string,
    3) Number of strings should be returned.

    (Sizeof Buffer is Length * Number) and the number of
    structs (NLength) is
    the number of different lengths of strings.

    Doug Baskins

     
  • Troy Heber
    Troy Heber
    2005-02-09

    • status: open --> closed
     
  • Troy Heber
    Troy Heber
    2005-02-09

    • status: closed --> closed-out-of-date