|
From: Christophe F. <Chr...@fn...> - 2000-09-07 15:49:11
|
----- Original Message -----
From: Hans B Pufal <ha...@di...>
To: <com...@li...>
Sent: Thursday, September 07, 2000 2:14 AM
Subject: Re: [Comp-hist-devel] C parser
> Yes I agree that seek_node is not very fast, especially when the number
> of nodes increases! I think it would be better to find a better search
> algorithm though, the check for a duplicate node is important since we
> could lose information if a duplicate node occurs.
>
I think this search should only be done in a debug mode, to ensure we don't
lose information, but it should be disabled by default: in 'normal' use, we
can assume the data files are error free and skip this check.
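To illustrate the idea, here is a minimal sketch of a duplicate check that is
compiled out unless a debug flag is defined. The flag name, struct layout and
function name are all hypothetical, not the actual parser.c code:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical node type; the real parser.c structures differ. */
struct Node {
    const char *id;
    struct Node *next;
};

/* Linear duplicate scan, compiled in only when CHECK_DUPLICATES is
 * defined (e.g. with -DCHECK_DUPLICATES in a debug build). */
static int is_duplicate(const struct Node *head, const char *id)
{
#ifdef CHECK_DUPLICATES
    for (; head != NULL; head = head->next)
        if (strcmp(head->id, id) == 0)
            return 1;
#else
    (void)head;  /* check disabled: assume the data file is error free */
    (void)id;
#endif
    return 0;
}
```

In a release build the function collapses to a constant, so the O(n) scan per
insertion disappears entirely.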
> I think it was your idea to use b-trees to improve search time.
I modified your code to use a BST. It is noticeably faster when I parse
dump1.txt several times: it only takes 7 seconds to parse it 32 times (it
took at least 20 s with a list). It could be even more efficient if the tree
were balanced, but that is quite painful to implement. Currently, since the
node ids are often sorted, the tree we get is probably poorly balanced. Maybe
we should have a look at glib (from GTK), which implements balanced search
trees.
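For reference, the unbalanced-BST approach boils down to something like the
following sketch, keyed on node id strings. The struct and function names are
illustrative, not the ones in parser.c (and, as noted, sorted input degrades
this into a linked list):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal unbalanced BST keyed on node id strings. */
struct TreeNode {
    const char *id;
    struct TreeNode *left, *right;
};

static struct TreeNode *bst_insert(struct TreeNode *root, const char *id)
{
    if (root == NULL) {
        struct TreeNode *n = malloc(sizeof *n);
        n->id = id;
        n->left = n->right = NULL;
        return n;
    }
    int cmp = strcmp(id, root->id);
    if (cmp < 0)
        root->left = bst_insert(root->left, id);
    else if (cmp > 0)
        root->right = bst_insert(root->right, id);
    /* cmp == 0: duplicate id -- ignore here, report in a debug build */
    return root;
}

static struct TreeNode *bst_find(struct TreeNode *root, const char *id)
{
    while (root != NULL) {
        int cmp = strcmp(id, root->id);
        if (cmp == 0)
            return root;
        root = (cmp < 0) ? root->left : root->right;
    }
    return NULL;
}
```

With random ids, lookup averages O(log n) instead of the list's O(n); a
balanced tree (as in glib) would guarantee O(log n) even for sorted input.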
I hope I didn't break anything in parser.c. I did not update html.c, so it
probably does not work anymore.
> > in parse_files, when the parser processes LINK fields and
> > turns the strings in struct Nodelink *, it now frees the memory which
> > was occupied by the string
>
> No! The string was not malloc'd, so freeing it is an error. All memory
> allocation is done as blocks so as to minimize the calls on malloc and
> to better use memory (fewer calls to malloc = lower block allocation
> overhead).
>
Indeed, it is a mistake. When the LINK fields could only hold one link per
node, this free was valid, but I forgot that you had modified the memory
allocation since then.
Also, I don't really see the point of preserving the file order in the list
we create for multiple-field specifications. It would be easier and faster to
build the list in reverse file order and, in my opinion, it wouldn't change
anything. However, if we do want to preserve the file order and be more
efficient (if the list has many elements), we should build the lists in
reverse order and then reverse them in place just before creating a new node.
Currently it does not matter because the lists have at most 3 elements, but
it may be useful one day.
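That scheme amounts to O(1) prepends followed by a single O(n) in-place
reversal per node. A sketch, using a hypothetical element type standing in
for struct Nodelink:

```c
#include <stdlib.h>

/* Hypothetical list element, standing in for struct Nodelink. */
struct Link {
    int value;
    struct Link *next;
};

/* O(1) prepend: builds the list in reverse file order. */
static struct Link *link_prepend(struct Link *head, int value)
{
    struct Link *l = malloc(sizeof *l);
    l->value = value;
    l->next = head;
    return l;
}

/* In-place reversal, run once just before the node is created,
 * restoring file order in O(n) total. */
static struct Link *link_reverse(struct Link *head)
{
    struct Link *prev = NULL;
    while (head != NULL) {
        struct Link *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```

Appending at the tail instead would cost O(n) per insertion (O(n²) per node),
which is why prepend-then-reverse is the usual idiom.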
I also found a typo in dump1.txt: in the QL node, there is an Ingo node
instead of an Info node.
Christophe
>
> -- Hans
> _______________________________________________
> Comp-hist-devel mailing list
> Com...@li...
> http://lists.sourceforge.net/mailman/listinfo/comp-hist-devel
|