[Htmlparser-user] [operations with source code of a web page]
From: myer <my...@o2...> - 2006-02-08 11:12:49
Hello dear users and developers,
I am currently writing my bachelor's thesis, in which I use the
functionality of HTML Parser. In my program I need almost the same
result as SiteCapturer produces, so I have started studying how it
works and adapting it for my project. But some points are not fully
clear to me.
01) How does HTML Parser obtain the source code of a web page before
parsing? In what follows I will refer to the SiteCapturer example.
Does it start with a 'null' filter to get all the nodes of a web page
the first time, and only then apply the other filters specified by
the user? Or does it parse 'on the fly': get the first node of the
source, compare it against the node filter, and save it into a data
structure (say, a node list) only if it passes the filter check?
What I need is to get the whole 'untouched' source code of a web
page before parsing. Should I go the way mentioned in this thread
http://sourceforge.net/forum/message.php?msg_id=3005740
or is there a more intelligent solution? Perhaps there is an already
implemented method, something like page.getSource()? How does
SiteCapturer solve this problem?
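For what it's worth, one workaround that does not depend on HTML Parser's internals is to read the input stream into a String yourself before any parsing happens, store that untouched copy, and then feed the same String to the parser. A minimal sketch in plain Java (the method name `readAll` and the fixed charset are my own illustrative assumptions, not part of HTML Parser's API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawSourceCapture {

    // Read the whole stream into a String before any parsing happens,
    // so the untouched source can be stored and re-parsed later.
    static String readAll(InputStream in, String charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toString(charset);
    }

    public static void main(String[] args) throws IOException {
        // "Gr\u00fc\u00dfe" = "Gruesse" with international characters,
        // to check that nothing is lost on the way in.
        String html = "<html><body>Gr\u00fc\u00dfe</body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes("UTF-8"));
        String source = readAll(in, "UTF-8");
        System.out.println(source.equals(html));
    }
}
```

Having the full source in a String, you can save it to the database first and construct the parser from that String afterwards, so capturing and parsing are decoupled.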
02) Is the source code of a web page normalized in any way before the
actual parsing? Are any attempts made to supply the parser with
validated HTML source, or is it better to use products from other
developers, e.g. JTidy?
03) I would also like to save the source code of a web page in its
original encoding or in Unicode. I do not want to lose any
international characters from the source. I need to save the page
source into a database and be able to retrieve it in its original
form if necessary. Does HTML Parser support converting the source
code into Unicode?
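Independently of whatever HTML Parser itself offers, the round trip you describe can be done with standard Java: a Java String is already Unicode (UTF-16) internally, so decoding the raw bytes with the page's declared charset gives you a Unicode copy, and re-encoding with the same charset restores the original bytes. A sketch (the charset name and helper names are illustrative assumptions):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingRoundTrip {

    // Decoding with the page's declared charset yields a Unicode String,
    // since Java Strings are UTF-16 internally.
    static String toUnicode(byte[] raw, String declaredCharset) {
        return new String(raw, Charset.forName(declaredCharset));
    }

    // Re-encoding with the same charset restores the original byte form,
    // which is what you would store if you need the page "as received".
    static byte[] backToOriginal(String unicode, String declaredCharset) {
        return unicode.getBytes(Charset.forName(declaredCharset));
    }

    public static void main(String[] args) {
        // "Gr\u00fc\u00dfe aus K\u00f6ln" contains umlauts and a sharp s;
        // all of them are representable in ISO-8859-1.
        byte[] original =
            "Gr\u00fc\u00dfe aus K\u00f6ln".getBytes(Charset.forName("ISO-8859-1"));

        String unicode = toUnicode(original, "ISO-8859-1");
        byte[] restored = backToOriginal(unicode, "ISO-8859-1");

        System.out.println(Arrays.equals(original, restored));
    }
}
```

Note the caveat: this is lossless only if the declared charset actually matches the bytes, so storing the original byte array alongside the charset name is the safest way to guarantee you can reproduce the page exactly.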
--
Best regards, Myer