From: Christophe de V. <cde...@al...> - 2004-03-03 11:02:28
Hi Laurent,

This looks interesting; I'll have a closer look at it a bit later.
However, the integration into libxml++ will not occur in 2.6, since we
froze the API on Monday.
We'll soon discuss the start of a new unstable branch, in which the
HtmlParser should probably have its place.

Thanks,

Christophe

Laurent Hoss wrote:
> Hi all,
>
> I discovered the cool libxml++ yesterday on my quest for the best C++
> XML parser (bindings, coz libxml2 seems to be the best C parser anyway
> ;). Libxml is not new to me though; I used it extensively in Perl
> thanks to the very complete XML::LibXML CPAN module.
> Now one of my main motivations is to parse HTML files into a DOM tree
> where I can extract nodes with XPath.
> In Perl that was easy; it has the HTML parser included.
> Therefore, after a thorough search in the API, I was a bit disappointed
> that there was no HTML parser support in libxml++...
> But thanks to the clean APIs of libxml(++) and after a little
> reading, I had no difficulties at all building my own subclass (based
> on domparser.cc) except some little quirks (like the extra encoding
> parameter in some HTML parser functions) :)
>
> In fact libxml2 has a really tolerant HTML parser (I used it in Perl
> for mirroring/parsing whole dynamic websites :D ); it even returns a
> good XML document when it had parser errors, but to get a document returned
> in such a case one has to turn off the 'wellformedness' check, which I
> did in my temporary htmlparser implementation.
> (Unfortunately there's always a segfault at the end of a run of my edited
> 'dom_xpath/main.cc' HTML parsing example app when ignoring
> '!context_->wellFormed' ?! The experimenting was done in the
> 'HtmlParser::parse_context' method.)
>
> I hope HTML parsing can be included in the main distribution (maybe better
> with the wellFormed check on)...
> To compile the whole library with my htmlparser class, I added the
> class in all the files (Makefile.am files, libxml++.h...) containing
> 'domparser'.
>
> Included are the C++ and include files of the htmlparser class (or should
> I have taken diffs from the domparser.cc/h originals?) plus my HTML
> parsing example, which shows all the //a[@href] links with their
> attribute contents.
>
> Hopefully the segfault can be easily solved with the knowledge of the
> lead developers (which I don't have yet ;).
> I guess it's just something I'm missing; otherwise I'll try to find the
> memory leak using a debugger (or is there a better way??)
>
> Thanks,
> Laurent