|
From: Mark H. <ma...@fi...> - 2009-01-29 23:58:20
|
>Mark Hellegers wrote:
>> Hi all,
>>
>> I just checked in a complete rewrite of the HTML parser.
>> It is still very unstable, but it can already parse some pages correctly,
>> for example the google homepage.
>> I don't think it is very useful for anyone to test it, if you don't know
>> where to find the problem in the code when a page doesn't parse correctly.
>> I know it still gets confused on a lot of pages, causing error messages,
>> or worse infinite loops.
>>
>> That said, if you are working on another part of Themis (hint hint ;) and
>> you need a particular page parsed correctly, I'll be happy to have a look
>> at what the problem is.
>>
>> Mark
>>
>
>:) I noticed the devcvs messages earlier. :) I'm just finishing up a
>project for work that's had me tied up for the last two months, so I'll
>be getting back to work myself [again] soon. When I was last working with
>the code in November, I noticed a few bugs in processing set-cookie
>headers on certain sites. (Oddly enough, only on Microsoft owned sites.)
>So while it might not be directly related to HTML parsing, I'll be sure
>to keep an eye on what happens.

Hi Raymond,

I have this week off from work, so I could spend the time to get the new
HTML parser into a usable state. It took quite a bit more effort than I
thought. :)

I'm seeing some odd things though:
- On some sites I get the request to start parsing twice, but the second
  time I don't get any data. I built a check into the parser to prevent
  the second parse, but the second request shouldn't happen in the first
  place. I get this on www.osnews.com, for example.
- I don't get the complete data for themis.sf.net, although I can see that
  the cache file does have all the data.

I might look into it if I have the time, but getting the most serious bugs
out of the parser has a higher priority right now.

Mark

-- 
Spangalese for beginners:

'Blu shef farn wahr'
'May I please have my leg back'
|