Re: [Htmlparser-user] unable to parse a page from ncbi

The XML you see in your browser isn't actually XML - it's HTML-encoded
XML. Therefore it's actually text. So:

- Parse the document in HTML Parser, look for that div, then look for
the text nodes within the div.
- You now have the XML as HTML-encoded text, and you have to convert
it into XML. You can convert it in a number of ways, but the easiest
would be to just replace the strings &lt; and &gt; with < and >.
- You'll now have XML, so use an XML parser. HTMLParser might be able to
handle it - what you could do is register the various XML tags in
there as CompositeTags in the PrototypicalNodeFactory to make it
easier to deal with.
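
A rough sketch of the last two steps, using only the JDK rather than
HTMLParser. It assumes you have already pulled the encoded text out of the
div as a String. The class name, the helper names and the Record/Title
tags are made up for illustration, and handling &quot; and &amp; is an
extra assumption about how the page escapes things (only &lt; and &gt; are
mentioned above):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HTMLParser.
class EncodedXmlSketch {

    // Undo one level of HTML encoding. "&amp;" is replaced last so that
    // "&amp;lt;" becomes "&lt;" (an entity the XML parser understands)
    // instead of a raw "<".
    static String decodeEntities(String encoded) {
        return encoded
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Parse the decoded string with the JDK's DOM parser.
    static Document parseXml(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("text is not well-formed XML", e);
        }
    }

    // Text content of the first element with the given tag name, or null.
    static String firstText(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Stand-in for the text node found inside <div class='recordbody'>.
        String encoded =
                "&lt;Record&gt;&lt;Title&gt;A &amp;amp; B&lt;/Title&gt;&lt;/Record&gt;";
        Document doc = parseXml(decodeEntities(encoded));
        System.out.println(firstText(doc, "Title"));
    }
}
```

If you only need 3-4 tags, calling firstText once per tag name is probably
enough. For a large record a SAX handler would avoid building the whole
tree in memory.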

Ian

On 1/4/07, Jay Bhavsar <kin...@gm...> wrote:
> Hey guys,
>   I have looked through all the examples and javadocs but I am still
> unsuccessful.  Here is what I am trying to do.
>
> I would like to follow a link like the following.
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256
>
> This displays a report in XML format.  I want to parse the XML
> section out of the web page and then parse 3-4 tags from the XML
> section.  The text in XML format is in between the <div
> class='recordbody'> ... </div> tags.  I was using
>
> ------------------
> NodeList divs = list.extractAllNodesThatMatch (new TagNameFilter ("TITLE"));
>
>                 NodeIterator i = divs.elements();
>
>                 while (i.hasMoreNodes()){
>                      System.out.println("has more nodes");
>                      processMyNodes(i.nextNode());
>                 }
> --------------------
>
> based on the example from the javadocs.  But anything other than
> HTML in TagNameFilter returns nothing in divs.elements().  (It never
> prints "has more nodes")
>
> Can anyone help extract the XML part from this web page? or is there a
> way I can directly extract what I need from this site without saving
> the XML part first and then using SAX XMLParser to extract it?
>
> Note everything between <div class="recordbody">...</div> is text, it
> is not an XML document.  If you view the source you will see what I mean,
> (sorry I don't mean to insult anyone's intelligence but I just want
> to be thorough about my problem.)