General purpose dynamic array - Judy / Bugs / #14 Impossible to serialize JudyHS

Nobody/Anonymous - 2004-10-23

Logged In: NO

Jeff:
Excellent comment. The First and Next functions did not
seem appriate for JudyHS. I would be happy to add the
"unload" function to JudyHS. It is actually fairly easy --
start with the JudyHSFreeArray code -- it has to delete all
entrys. Anyway, please give me the name and sematics of the
function you want and I should be able to write it in
several days. (There is a load/unload function in Judy1/L,
but I was never happy with the sematics, so I never
documented it, and of course never wrote one for JudyHS --
yet). Looking forward to getting this unfinished part of
JudyHS done.

Doug Baskins <dougbaskins,AT.yahoo.com>

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Bowden - 2004-10-23

Logged In: YES
user_id=321043

Well, I was initially thinking something like

JHSF(PV, PArray, PIndex, Count)
JHSN(PV, PArray, PIndex, Count)

but after looking at the implementation I see that this
might not be the most efficient interface. Perhaps some
sort of opaque struct could be defined to hold the iteration
state and be required parameter:

IterStateStruct IterState;
JHSF(PV, PArray, PIndex, Count, IterState);
JHSN(PV, PArray, PIndex, Count, IterState);

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Bowden - 2004-10-23

Logged In: YES
user_id=321043

After a little more thought I realize what I suggest is
still not ideal. For serializing it would be best if there
was some way to know

- how long the longest index string
- how long the next index string in the iteration

That way you could have the index string be copied into the
output buffer directly and know when it needs to be flushed.
The length-of-next-index doesn't need to be exact either,
it just needs to be an upper bound so that the user can
efficiently avoid buffer overruns. If the tree were
guaranteed to be walked in order of longest strings to
shortest strings then you could avoid having an additional
call to get the next length and just have the user assume it
will be <= the current length. So the full API would be:

Date: 2004-10-23 16:12
Sender: offense
Logged In: YES
user_id=321043

Well, I was initially thinking something like

JHSF(PV, PArray, PIndex, Count)
JHSN(PV, PArray, PIndex, Count)

but after looking at the implementation I see that this
might not be the most efficient interface. Perhaps some
sort of opaque struct could be defined to hold the iteration
state and be required parameter:

IterStateStruct IterState;

JHSLongest(PArray, Count);
JHSF(PV, PArray, PIndex, Count, IterState);
JHSN(PV, PArray, PIndex, Count, IterState);

where it is guaranteed that Count from JHSF will be equal to
Count from JHSLongest and Count from each call to JHSN will
be less than or equal Count from the last call to JHSN.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Bowden - 2004-10-23

Logged In: YES
user_id=321043

Had more thoughts: Could eliminate the oddball JHSLongest
function by using the passed-in value of Count as an upper
limit. If Count was too small, it comes back modified to
the actual size required to hold the index string. Usage
would be something like:

#include <Judy.h>

#define INITIAL_BUF 4000
#define HEAD_SIZE (2 * sizeof(Word_t))

void serializeHS(int fd, Pcvoid_t array) {
char *buf;
Word_t buflen, bufsize, count;
PWord_t pv;

bufsize = INITIAL_BUF;
buf = malloc(bufsize);
buflen = 0;
count = 0;

JHSF(pv, array, buf, count);
bufsize = max(bufsize * 2, HEAD_SIZE + count);
buf = malloc(bufsize);
JHSF(pv, array, buf + HEAD_SIZE, count);
while (pv != NULL) {
((PWord_t)(buf + buflen))[0] = count;
((PWord_t)(buf + buflen))[1] = pv[0];
buflen += count;
count = bufsize - HEAD_SIZE;
JHSN(pv, array, buf + buflen + HEAD_SIZE, count);
if (count > bufsize - HEAD_SIZE) {
write(fd, buf, buflen);
if (count + HEAD_SIZE > bufsize) {
free(buf);
bufsize = max(bufsize * 2, HEAD_SIZE + count);
buf = malloc(bufsize);
}
buflen = 0;
JHSN(pv, array, buf + HEAD_SIZE, count);
}
}
if (buflen) {
write(fd, buf, buflen);
}
free(buf);
}

Ooops, I forgot IterStruct, but you get the idea.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Douglas L. Baskins - 2004-10-26

Logged In: YES
user_id=558029

Dear offense:

Your code dated 2004-10-23 does not handle the case of
where the "Value area" may contain a pointer to some data
you want to save.

As a comment, JudyHS keeps an array internally of the
number of each length of string and how many of each they
are. These are fairly fast access. Perhaps a "walk" routine
with a callback function that is passed the "string and length"
and the "Value" pointer from each string?. However, I hate
callback functions ever since I tried using the c-library qsort
() and friends.

Another comment: Inside the test program
test/StringCompare.c there is a package called "JLHash". It
is almost as fast as JudyHS, but not as memory efficient. It
is a very simple implementation -- similar to a hash scheme,
but it uses a JudyL array as a 2^32 element hash table.
Walking the JudyL hash table and then walking the (very few)
synonyms in a linked list is easy code to write. "JLHashIns(),
JLHashGet(), JLHashDel() and JLHashFreeArray() are
implemented. I will probably be writting up these routines as
a "simple" application note.

Doug Baskins

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Bowden - 2004-10-26

Logged In: YES
user_id=321043

Yeah, it wasn't meant to be a general purpose serializer. I
don't even know if it compiles as I just typed it in
directly to this web comment box. It was just to show an
example of how the proposed API could be used.

I'll take a look at JLHash.

Thanks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2004-12-05

Logged In: NO

Hi.

I am having a similar issue with JudyHS.

I am certainly interested in addition of methods to walk the
tree and whatever other enhacenments there may be
(perhaps all ideas put in as a separate test file).

Thanks in advance.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Douglas L. Baskins - 2004-12-08

Logged In: YES
user_id=558029

Dear Offense and Nobody:

I did not put a "serialize" or "unload" function in
JudyHS() because I did not think of a good way to
specify how to do that. (Read: the sematics of the
calling routine(s)). It is fairly easy to get the
following information from a JudyHS array.

1: The number of different length of strings (character
arrays). This number is typically fairly small --
less than 200, if the maximum length string is less
than 200 bytes.

2: The number of strings of each length. If there were 200
different lengths, then an array of 200 is necessary to
contain the number (and size) of each length. (For example
suppose you have 10 strings of length 3, 4500 or length
4, 1000 of length 50...)

Off the top of my head, this suggests something like:

Word_t NLengths; // number of different length strings
Word_t *PLengths; // array of number of strings of
each length
void **PBuffers; // array of pointer to each length

/* get the number of different lengths of strings */
NLengths = JudyHSnumberOfLengths(JHSArray);
if (NLengths == 0) goto Done;

/* get memory to store number (sorted) each length of string */
PLengths = (Word_t *)malloc(NLengths * sizeof(Word_t) );
if (PLengths == NULL) goto MallocError;

/* and memory to store the buffer addresses of each length */
PBuffers = (void **)malloc(NLengths * sizeof(Word_t) );
if (PBuffers == NULL) goto MallocError;

/* get the array of lengths -- I.E. 200 in above example */
Error = JudyHSGetNumbOfEachLength(JHSArray, PLengths, NLengths);
if (Error) goto CallerBug;

/* get memory of each length of string and fill buffer */
for (ii = 0; ii < NLengths; ii++)
{
Word_t TBytes = PLengths[ii] * ;
void *Buffer = (void *)malloc(TBytes);

Error = JudyHSGetStringsOfLength(JHSArray, Buffer, TBytes);
if (Error) goto CallerBug;

PBuffers[ii] = Buffer;
}

Now we have N buffers full of strings N different lengths. How
you want to store this buffers on the disk is up to you.

Notes:
1) There is still a small problem that the "non-word" length
strings are not necessarly on a word boundry.
2) The strings in each buffer are not in any special order
(hash).

3) If JudyHS.c is compiled with -DDONOTUSEHASH, then all strings
will be returned in sorted (numerical and dictionary
because they are the
same length) order in the buffers.

Let me know if you think this is reasonable, and I will
implement it. Or perhaps you will just laugh. There must
be some simple way? How do
people store a (variable number of) strings (non-text) of
different lengths
in a file? At least JudyHS() keeps, internally, strings of
the same length
in its own table.

OOPS, I just realized the above code does not tell you what
the lengths
of each size of string is and the potential "Values"
associated with each string. The size of the "Values" are
only known by the programer that put them there -- not Judy.
Perhaps sombody will fix that with improved code.
Then I will implement it. Please, give me a general
solution, not one that
just solves your "special" problem. Perhaps a array
structs (tuples?) of
1) Buffer pointer,
2) Length of string,
3) Number of strings should be returned.

(Sizeof Buffer is Length * Number) and the number of
structs (NLength) is
the number of different lengths of strings.

Doug Baskins

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Troy Heber - 2005-02-09

status: open --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Troy Heber - 2005-02-09

status: closed --> closed-out-of-date
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Impossible to serialize JudyHS

Group

Searches

Help

#14 Impossible to serialize JudyHS

Discussion