[Htmlparser-user] [operations with source code of a web page]
Brought to you by:
derrickoswald
From: myer <my...@o2...> - 2006-02-08 11:12:49
Hello dear users and developers,

I am currently writing my bachelor's thesis, in which I use the functionality of HTML Parser. In my program I need almost the same result as SiteCapturer produces, so I have started to learn how it works in order to adapt it for my project. But some points are not fully clear to me.

01) How does HTML Parser obtain the source code of a web page before parsing? (In the following I will speak about the SiteCapturer example.) Does it start with a 'null' filter to get all the nodes of a web page first, and only then apply the other filters indicated by the user? Or does it parse 'on the fly': get the first node of the source, compare it with the node filter, and save it into a data structure (say, a node list) only if it passes the filter check? What I need is the whole 'untouched' source code of a web page before parsing. Should I go the way described in this thread http://sourceforge.net/forum/message.php?msg_id=3005740 or is there a more intelligent solution? Perhaps an already implemented method exists, something like page.getSource()? How does SiteCapturer solve this problem?

02) Is the source code of a web page normalized in any way before the actual parsing? Are any attempts made to supply the parser with validated HTML source, or is it better to use products from other developers, e.g. JTidy?

03) I would also like to save the source code of a web page in its original encoding or in Unicode. I do not want to lose any international character of the source. I need to save the page source into a database and be able to retrieve it in its original form if necessary. Does HTML Parser support converting the source code into Unicode?

--
Best regards,
Myer
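Regarding question 01: independently of whatever HTML Parser does internally, one way to guarantee an 'untouched' copy of the source is to read the raw bytes of the connection yourself before handing anything to the parser. The sketch below uses only the standard library; the class name `RawSource` is mine, not part of HTML Parser, and this is a workaround under the assumption that no ready-made `page.getSource()` style method is available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawSource {

    /** Reads every byte from the stream without interpreting it,
        so the page source is captured exactly as the server sent it. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In real use the stream would come from
        // new URL(address).openConnection().getInputStream();
        // here a local byte array stands in for the network.
        byte[] html = "<html><body>hello</body></html>".getBytes("ISO-8859-1");
        byte[] copy = readAll(new ByteArrayInputStream(html));
        System.out.println(copy.length);
    }
}
```

The captured byte array can then be stored in the database as-is and, when needed, decoded to a string and fed to the parser, so parsing never alters the stored original.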
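Regarding question 03: Java strings are already Unicode internally (UTF-16), so the conversion problem reduces to decoding the raw bytes with the charset the server actually declared, and re-encoding deterministically (e.g. as UTF-8) for storage. A minimal sketch, again with standard-library calls only; the class and method names are hypothetical, and the ISO-8859-1 fallback is just the historical HTTP default, not something HTML Parser prescribes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingExample {

    /** Decodes raw page bytes into a Java String (Unicode) using the
        charset declared by the server, falling back to ISO-8859-1. */
    public static String toUnicode(byte[] raw, String declaredCharset) {
        Charset cs = (declaredCharset != null)
                ? Charset.forName(declaredCharset)
                : StandardCharsets.ISO_8859_1;
        return new String(raw, cs);
    }

    /** Re-encodes the Unicode string as UTF-8 for database storage,
        so no international character is lost. */
    public static byte[] toStorageBytes(String unicodeText) {
        return unicodeText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };          // 'e-acute' in ISO-8859-1
        String text = toUnicode(latin1, "ISO-8859-1");
        byte[] stored = toStorageBytes(text);
        System.out.println(stored.length);        // UTF-8 needs two bytes here
    }
}
```

Storing the raw bytes plus the declared charset name preserves the original form exactly, while the UTF-8 copy gives a lossless Unicode version for searching.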