From: Hans B P. <ha...@di...> - 2000-09-07 19:10:31
Christophe Fergeau wrote:
> > Yes I agree that seek_node is not very fast, especially when the number
> > of nodes increases! I think it would be better to find a better search
> > algorithm though, the check for a duplicate node is important since we
> > could lose information if a duplicate node occurs.
>
> I think this search should only be done in a debug mode to ensure we don't
> lose information, but I think it should be disabled by default: in
> 'normal' use, we can assume the data files are error free and disable this
> check.

I do not have a problem making it an option. But how often do we run the
parser? Until the parser becomes part of a CGI script, at which point speed
would be an issue, I do not see the necessity of making it super fast.

> > I think it was your idea to use b-trees to improve search time.
>
> I modified your code to use a BST. It is noticeably faster when I parse
> dump1.txt several times: it only takes 7 seconds to parse it 32 times (it
> took at least 20 s with a list). It could be even more efficient if the
> tree were balanced, but that is quite painful to implement. Currently,
> since the node ids are often sorted, the tree we get is probably poorly
> balanced. Maybe we should have a look at glib (from GTK), which implements
> balanced search trees.

How does your BST time compare with the time when duplicate checking is
disabled?

> I hope I didn't break anything in parser.c. I did not look at html.c so it
> should not work anymore.

What you are doing is internal to the parser; it should not affect html.c,
which simply traverses the linked list produced by the parser.

> Also, I don't really see the point of preserving the order of the file in
> the list we create for multiple field specifications. It would be easier
> and faster to reverse the order of the file and, in my opinion, it
> wouldn't change anything.
> However, if we want to preserve the order of the file and to be more
> efficient (if the list has many elements), we should create the lists in
> reverse order and then mirror them just before creating a new node.
> Currently it does not matter because the lists have at most 3 elements,
> but it may be useful one day.

Where we talk about links I agree, but this feature can also be used for
simple string fields where order may be important, e.g. instead of using
one long text string for Info:, we could use multiple Info: records,
making it easier to edit and read.

> I also found a typo in dump1.txt: in the QL node, we have an Ingo node
> instead of an Info

Ah, yes, the parser is set not to complain about unknown field names since
there would be many. Can you go into the QL source file and fix it? It
should be in the sinclair file.

-- 
Hans