Re: [Htmlparser-user] [operations with source code of a web page]
From: Derrick O. <Der...@Ro...> - 2006-02-08 12:46:39
01) Looking at the source code, the SiteCapturer code goes through a NodeIterator, but the Parser.parse (NodeFilter) method with a null filter would do the same thing.

        // fetch the page and gather the list of nodes
        mParser.setURL (url);
        try
        {
            list = new NodeList ();
            for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
                list.add (e.nextNode ()); // URL conversion occurs in the tags
        }
        catch (EncodingChangeException ece)
        {
            // fix bug #998195 SiteCapturer just crashed
            // try again with the encoding now set correctly
            // hopefully mPages, mImages, mCopied and mFinished won't be corrupted
            mParser.reset ();
            list = new NodeList ();
            for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
                list.add (e.nextNode ());
        }

02) No validation is done on the page. However, the heuristics built into the tag parsing will insert terminating nodes (identified by 0 == (tag.getEndPosition () - tag.getStartPosition ())) where end tags are required.

03) After parsing the entire page, the source is available (as characters or as a String) from the Page/Source, which is exposed on the parser as getPage(). Strings in Java are UTF-16 encoded Unicode. Any errors in conversion (using the encoding specified by the HTTP header or HTML meta tags) will already have been committed by then.

myer wrote:
>Hello dear users and developers,
>
> currently I am writing my bachelor's thesis, in which I use the
>functionality of HTML Parser. In my program I need almost the same
>result as SiteCapturer produces, so I've started to learn how it works
>and to adapt it for my project. But some points are not fully clear
>to me.
>
> 01) How does HTML Parser obtain the source code of a web page before
>parsing? In the following I will speak about the SiteCapturer
>example. Does it start with a 'null' filter to get all the nodes of a
>web page the very first time, and only then apply the other
>filters indicated by the user?
>Or does it parse 'on the fly': get the first
>node of the source, compare it with the node filter, and only if it
>passes the filter check save it into a data structure,
>say a node list? What I need is to get the whole 'untouched'
>source code of a web page before parsing. Should I go the way
>mentioned in this thread
>http://sourceforge.net/forum/message.php?msg_id=3005740
>or is there a more intelligent solution? Perhaps there
>already exists an implemented method, something like
>page.getSource()? How does SiteCapturer solve this problem?
>
> 02) Is the source code of a web page normalized in any way before the
>actual parsing? Are any attempts made to supply the parser with
>validated HTML source? Or is it better to use products of other
>developers, e.g. JTidy?
>
> 03) I would also like to save the source code of a web page in its
>original encoding or in Unicode. I do not want to lose any
>international characters from the source. I need to save the source of a
>page into the database and be able to obtain it in its original form
>if necessary. Does HTML Parser support source code conversion into
>Unicode?
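To make points 01-03 above concrete, here is a minimal sketch using the HTML Parser API as described in the answer. It is not the library's documented usage verbatim: the URL is hypothetical, and Page.getText() is an assumption based on the statement that the source is "available (as characters or String) from the Page/Source".

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RawSourceDemo
{
    public static void main (String[] args) throws ParserException
    {
        // hypothetical URL, for illustration only
        Parser parser = new Parser ("http://example.com/");

        // point 01: a null filter accepts everything, so this gathers
        // the same node list the SiteCapturer loop builds by hand
        NodeList list = parser.parse (null);

        // point 02: count the 'virtual' end tags the heuristics inserted
        // (zero-length nodes where a required end tag was missing)
        int inserted = 0;
        for (int i = 0; i < list.size (); i++)
        {
            Node node = list.elementAt (i);
            if ((node instanceof Tag)
                && (0 == (node.getEndPosition () - node.getStartPosition ())))
                inserted++;
        }
        System.out.println (inserted + " inserted end tags at the top level");

        // point 03: after parsing, the page source (already decoded to
        // UTF-16 Strings) is reachable from the Page object; getText()
        // here is assumed to return the whole source
        String source = parser.getLexer ().getPage ().getText ();
        System.out.println (source.length () + " characters of source");
    }
}
```

Note that parse(null) returns only the top-level nodes, so a real check for inserted end tags would recurse into each node's children as well.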
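On point 03 specifically: the byte-to-character decode happens once, when the page is fetched, so any charset mistake at that moment is unrecoverable later. This standalone snippet (plain JDK, no HTML Parser classes) shows why saving the original bytes, or decoding with the correct charset, matters:

```java
import java.nio.charset.Charset;

public class CharsetDemo
{
    public static void main (String[] args)
    {
        // "café" as ISO-8859-1 bytes: the é is the single byte 0xE9
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };

        // decoding with the correct charset preserves the character
        String right = new String (raw, Charset.forName ("ISO-8859-1"));

        // decoding with the wrong charset substitutes U+FFFD; the
        // original byte is lost and cannot be recovered afterwards
        String wrong = new String (raw, Charset.forName ("US-ASCII"));

        System.out.println (right);                       // café
        System.out.println (wrong.charAt (3) == '\uFFFD'); // true
    }
}
```

So to keep the "untouched" source, either store the raw bytes together with the charset name, or make sure the decode uses the encoding declared by the HTTP header or meta tag (which is what the EncodingChangeException retry in the SiteCapturer code is handling).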