[Htmlparser-user] [operations with source code of a web page]
Brought to you by:
derrickoswald
From: myer <my...@o2...> - 2006-02-08 11:12:49
Hello dear users and developers,

I am currently writing my bachelor's thesis, in which I use the functionality of HTML Parser. In my program I need almost the same result as SiteCapturer produces, so I have started to learn how it works in order to adapt it for my project. But some points are not fully clear to me.

01) How does HTML Parser obtain the source code of a web page before parsing? (In the following I will speak about the SiteCapturer example.) Does it start with a 'null' filter to get all the nodes of a web page first, and only then apply the other filters indicated by the user? Or does it parse 'on the fly': get the first node of the source, compare it with the node filter, and save it into a data structure (say, a node list) only if it passes the filter check? What I need is the whole 'untouched' source code of a web page before parsing. Should I go the way described in this thread http://sourceforge.net/forum/message.php?msg_id=3005740 or is there a more intelligent solution? Perhaps an already implemented method exists, something like page.getSource()? How does SiteCapturer solve this problem?

02) Is the source code of a web page normalized in any way before the actual parsing? Are any attempts made to supply the parser with validated HTML source, or is it better to use products from other developers, e.g. JTidy?

03) I would also like to save the source code of a web page in its original encoding or in Unicode. I do not want to lose any international character of the source. I need to save the page source into a database and be able to retrieve it in its original form if necessary. Does HTML Parser support converting the source code into Unicode?

--
Best regards,
Myer
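Regarding question 01: independently of whatever HTML Parser does internally, one way to guarantee an 'untouched' copy of the source is to read the raw bytes of the connection yourself before handing anything to the parser. The sketch below uses only the standard library; the class name `RawSource` is mine, not part of HTML Parser, and this is a workaround under the assumption that no ready-made `page.getSource()` style method is available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawSource {

    /** Reads every byte from the stream without interpreting it,
        so the page source is captured exactly as the server sent it. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In real use the stream would come from
        // new URL(address).openConnection().getInputStream();
        // here a local byte array stands in for the network.
        byte[] html = "<html><body>hello</body></html>".getBytes("ISO-8859-1");
        byte[] copy = readAll(new ByteArrayInputStream(html));
        System.out.println(copy.length);
    }
}
```

The captured byte array can then be stored in the database as-is and, when needed, decoded to a string and fed to the parser, so parsing never alters the stored original.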
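Regarding question 03: Java strings are already Unicode internally (UTF-16), so the conversion problem reduces to decoding the raw bytes with the charset the server actually declared, and re-encoding deterministically (e.g. as UTF-8) for storage. A minimal sketch, again with standard-library calls only; the class and method names are hypothetical, and the ISO-8859-1 fallback is just the historical HTTP default, not something HTML Parser prescribes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingExample {

    /** Decodes raw page bytes into a Java String (Unicode) using the
        charset declared by the server, falling back to ISO-8859-1. */
    public static String toUnicode(byte[] raw, String declaredCharset) {
        Charset cs = (declaredCharset != null)
                ? Charset.forName(declaredCharset)
                : StandardCharsets.ISO_8859_1;
        return new String(raw, cs);
    }

    /** Re-encodes the Unicode string as UTF-8 for database storage,
        so no international character is lost. */
    public static byte[] toStorageBytes(String unicodeText) {
        return unicodeText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };          // 'e-acute' in ISO-8859-1
        String text = toUnicode(latin1, "ISO-8859-1");
        byte[] stored = toStorageBytes(text);
        System.out.println(stored.length);        // UTF-8 needs two bytes here
    }
}
```

Storing the raw bytes plus the declared charset name preserves the original form exactly, while the UTF-8 copy gives a lossless Unicode version for searching.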