Re: [Htmlparser-user] [operations with source code of a web page]
From: Derrick O. <Der...@Ro...> - 2006-02-08 12:46:39
01) Looking at the source code, the SiteCapturer code goes through a NodeIterator, but the Parser.parse (NodeFilter) method with a null filter would do the same thing.

        // fetch the page and gather the list of nodes
        mParser.setURL (url);
        try
        {
            list = new NodeList ();
            for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
                list.add (e.nextNode ()); // URL conversion occurs in the tags
        }
        catch (EncodingChangeException ece)
        {
            // fix bug #998195 SiteCapturer just crashed
            // try again with the encoding now set correctly
            // hopefully mPages, mImages, mCopied and mFinished won't be corrupted
            mParser.reset ();
            list = new NodeList ();
            for (NodeIterator e = mParser.elements (); e.hasMoreNodes (); )
                list.add (e.nextNode ());
        }

02) No validation is done on the page. However, the heuristics built into the tag parsing will insert terminating nodes (identified by 0 == (tag.getEndPosition () - tag.getStartPosition ())) where end tags are required.

03) After parsing the entire page, the source is available (as characters or as a String) from the Page/Source, which is exposed on the parser as getPage(). Strings in Java are UTF-16 encoded Unicode. Any errors in conversion (using the encoding specified by the HTTP header or HTML meta tags) will already have been committed by then.

myer wrote:
>Hello dear users and developers,
>
> currently I am writing my bachelor's thesis, in which I use the
>functionality of HTML Parser. In my program I need almost the same
>result as SiteCapturer produces, so I've started to learn how it works
>and to adapt it for my project. But some points are not fully clear
>to me.
>
> 01) How does HTML Parser obtain the source code of a web page before
>parsing? In the following I will speak about the SiteCapturer
>example. Does it start with a 'null' filter to get all the nodes of a
>web page the very first time, and only then apply the other
>filters indicated by the user?
>Or does it parse 'on the fly': get the first
>node of the source, compare it with the node filter, and only if it
>passes the filter check save it into a data structure,
>say a node list? What I need is to get the whole 'untouched'
>source code of a web page before parsing. Should I go the way
>mentioned in this thread
>http://sourceforge.net/forum/message.php?msg_id=3005740
>or is there a more intelligent solution? Perhaps there
>already exists an implemented method, something like
>page.getSource()? How does SiteCapturer solve this problem?
>
> 02) Is the source code of a web page normalized in any way before the
>actual parsing? Are any attempts made to supply the parser with
>validated HTML source? Or is it better to use products of other
>developers, e.g. JTidy?
>
> 03) I would also like to save the source code of a web page in its
>original encoding or in Unicode. I do not want to lose any
>international characters from the source. I need to save the source of a
>page into the database and be able to obtain it in its original form
>if necessary. Does HTML Parser support source code conversion into
>Unicode?
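To make points 01-03 above concrete, here is a minimal sketch using the HTML Parser API as described in the answer. It is not the library's documented usage verbatim: the URL is hypothetical, and Page.getText() is an assumption based on the statement that the source is "available (as characters or String) from the Page/Source".

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RawSourceDemo
{
    public static void main (String[] args) throws ParserException
    {
        // hypothetical URL, for illustration only
        Parser parser = new Parser ("http://example.com/");

        // point 01: a null filter accepts everything, so this gathers
        // the same node list the SiteCapturer loop builds by hand
        NodeList list = parser.parse (null);

        // point 02: count the 'virtual' end tags the heuristics inserted
        // (zero-length nodes where a required end tag was missing)
        int inserted = 0;
        for (int i = 0; i < list.size (); i++)
        {
            Node node = list.elementAt (i);
            if ((node instanceof Tag)
                && (0 == (node.getEndPosition () - node.getStartPosition ())))
                inserted++;
        }
        System.out.println (inserted + " inserted end tags at the top level");

        // point 03: after parsing, the page source (already decoded to
        // UTF-16 Strings) is reachable from the Page object; getText()
        // here is assumed to return the whole source
        String source = parser.getLexer ().getPage ().getText ();
        System.out.println (source.length () + " characters of source");
    }
}
```

Note that parse(null) returns only the top-level nodes, so a real check for inserted end tags would recurse into each node's children as well.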
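On point 03 specifically: the byte-to-character decode happens once, when the page is fetched, so any charset mistake at that moment is unrecoverable later. This standalone snippet (plain JDK, no HTML Parser classes) shows why saving the original bytes, or decoding with the correct charset, matters:

```java
import java.nio.charset.Charset;

public class CharsetDemo
{
    public static void main (String[] args)
    {
        // "café" as ISO-8859-1 bytes: the é is the single byte 0xE9
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };

        // decoding with the correct charset preserves the character
        String right = new String (raw, Charset.forName ("ISO-8859-1"));

        // decoding with the wrong charset substitutes U+FFFD; the
        // original byte is lost and cannot be recovered afterwards
        String wrong = new String (raw, Charset.forName ("US-ASCII"));

        System.out.println (right);                       // café
        System.out.println (wrong.charAt (3) == '\uFFFD'); // true
    }
}
```

So to keep the "untouched" source, either store the raw bytes together with the charset name, or make sure the decode uses the encoding declared by the HTTP header or meta tag (which is what the EncodingChangeException retry in the SiteCapturer code is handling).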