01) Looking at the source code, SiteCapturer walks a NodeIterator, but
calling the Parser.parse (NodeFilter) method with a null filter would do
the same thing. The relevant fragment:
// fetch the page and gather the list of nodes
mParser.setURL (url);
try
{
    list = new NodeList ();
    for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
        list.add (e.nextNode ()); // URL conversion occurs in the tags
}
catch (EncodingChangeException ece)
{
    // fix bug #998195 SiteCapturer just crashed
    // try again with the encoding now set correctly
    // hopefully mPages, mImages, mCopied and mFinished won't be corrupted
    mParser.reset ();
    list = new NodeList ();
    for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
        list.add (e.nextNode ());
}
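For comparison, here is a minimal sketch of the null-filter alternative
mentioned above. It assumes the HTML Parser 1.6 API, where
Parser.parse (NodeFilter) returns a NodeList; the class name and URL are
placeholders:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FetchAll
{
    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://example.com/");
        // a null filter accepts every node, so the whole page comes back
        NodeList list = parser.parse (null);
        System.out.println (list.size () + " top level nodes");
    }
}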
02) No validation is done on the page. However, the heuristics built into
the tag parsing will insert terminating nodes where end tags are
required; such inserted tags can be identified by 0 ==
(tag.getEndPosition () - tag.getStartPosition ()). A sketch of detecting
them follows.
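As a rough illustration (not taken from SiteCapturer), a recursive walk
over the node tree could flag those zero-length tags. It assumes the 1.6
API (Node.getChildren (), NodeList.elementAt (), Tag.getTagName ()); the
class name and URL are placeholders:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FindVirtualEndTags
{
    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://example.com/");
        for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
            check (e.nextNode ());
    }

    // an inserted end tag occupies no characters in the source
    static void check (Node node)
    {
        if (node instanceof Tag)
        {
            Tag tag = (Tag)node;
            if (0 == (tag.getEndPosition () - tag.getStartPosition ()))
                System.out.println ("inserted end tag: " + tag.getTagName ());
        }
        NodeList children = node.getChildren ();
        if (null != children)
            for (int i = 0; i < children.size (); i++)
                check (children.elementAt (i));
    }
}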
03) After parsing the entire page, the source is available (as
characters or a String) from the Page/Source, which is exposed on the
parser as getPage(). Strings in Java are UTF-16 encoded Unicode. Any
errors in conversion (using the encoding specified by the HTTP header or
HTML meta tags) will already have been committed by then; see the sketch
below.
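As a rough sketch of retrieving the converted source after a parse: this
goes through the lexer (Parser.getLexer ().getPage ()) in case getPage ()
is not directly available on the Parser in your version; Page.getText ()
and Page.getEncoding () are assumed here from the 1.6 API, and the class
name and URL are placeholders:

import org.htmlparser.Parser;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.ParserException;

public class SaveSource
{
    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://example.com/");
        // parse everything first so the whole page has been read and decoded
        parser.parse (null);
        Page page = parser.getLexer ().getPage ();
        String source = page.getText ();       // UTF-16 String, already converted
        String encoding = page.getEncoding (); // the charset used for conversion
        System.out.println (encoding + ": " + source.length () + " characters");
    }
}

Note this is the page after character conversion; as above, any decoding
errors have already been committed by the time the String is available.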
myer wrote:
>Hello dear users and developers,
>
> I am currently writing my bachelor's thesis, where I use the
> functionality of HTML Parser. In my program I need almost the same
> result as SiteCapturer produces, so I've started to learn how it works
> and adapt it for my project. But some points are not fully clear
> to me.
>
> 01) How does HTML Parser obtain the source code of a web page before
> parsing? In the following I will speak about the SiteCapturer
> example. Does it start with a 'null' filter to get all the nodes of a
> web page the very first time, and only then apply the other filters
> indicated by the user? Or does it parse 'on the fly': get the first
> node of the source, compare it with the node filter, and save it into
> a data structure (say, a node list) only if it passes the filter
> check? What I need is to get the whole 'untouched' source code of a
> web page before parsing. Should I go the way mentioned in this thread
> http://sourceforge.net/forum/message.php?msg_id=3005740
> or are there other, more intelligent solutions? Perhaps there is an
> already implemented method, something like page.getSource()? How does
> SiteCapturer solve this problem?
>
> 02) Is the source code of a web page normalized in any way before the
> actual parsing? Are any attempts made to supply the parser with
> validated HTML source? Or is it better to use other developers'
> products, e.g. JTidy?
>
> 03) I would also like to save the source code of a web page in its
> original encoding or in Unicode. I do not want to lose any
> international character of the source. I need to save the source of a
> page into the database and be able to obtain it in its original form
> if necessary. Does HTML Parser support source code conversion into
> Unicode?
>