Thread: [Htmlparser-user] unable to parse a page from ncbi

Hey guys,
  I have looked through all the examples and javadocs but I am still
unsuccessful.  Here is what I am trying to do.

I would like to follow a link like the following.

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256

This displays a report in XML format.  I want to extract the XML
section from the web page and then parse 3-4 tags from that
section.  The XML text sits between the <div
class='recordbody'> ... </div> tags.  I was using

------------------
NodeList divs = list.extractAllNodesThatMatch(new TagNameFilter("TITLE"));

NodeIterator i = divs.elements();

while (i.hasMoreNodes()) {
    System.out.println("has more nodes");
    processMyNodes(i.nextNode());
}
--------------------

based on the example from the javadocs.  But any tag name other than
HTML passed to TagNameFilter leaves divs.elements() empty.  (It never
prints "has more nodes".)

Can anyone help me extract the XML part from this web page?  Or is there a
way to pull what I need directly from this site without first saving
the XML part and then running a SAX XML parser over it?

Note that everything between <div class="recordbody">...</div> is text; it
is not an XML document.  If you view the source you will see what I mean.
(Sorry, I don't mean to insult anyone's intelligence, but I just want
to be thorough about my problem.)
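
To make the "without saving the XML part first" idea concrete, here is a
minimal sketch using only the JDK.  It assumes the text inside the
recordbody div has already been pulled out as a String, that it is
well-formed XML with HTML-escaped markup (&lt; and so on), and that it does
not need an external DTD.  The class name, the helper names, and the sample
tags (Record, Id, Name) are made up for illustration; they are not part of
htmlparser or of the NCBI output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns the escaped text of the div into a DOM tree in memory.
class RecordBodyXml {

    // Undo one level of HTML escaping. "&amp;" is replaced last so that
    // "&amp;lt;" becomes "&lt;" (an XML entity) rather than a raw "<".
    public static String unescape(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&amp;", "&");
    }

    // Parse the XML straight from the String; nothing is written to disk.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("text is not well-formed XML", e);
        }
    }

    // Text of the first element with the given tag name, or null if there is none.
    public static String firstText(Document doc, String tagName) {
        NodeList hits = doc.getElementsByTagName(tagName);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text found inside the div.
        String divText = "&lt;Record&gt;&lt;Id&gt;12345&lt;/Id&gt;"
                + "&lt;Name&gt;A &amp;amp; B&lt;/Name&gt;&lt;/Record&gt;";
        Document doc = parse(unescape(divText));
        System.out.println(firstText(doc, "Id"));
        System.out.println(firstText(doc, "Name"));
    }
}
```

With a tree like this, the 3-4 tags of interest could be read with
firstText(doc, "...") one by one.  Getting the div text out of the page in
the first place is not covered by this sketch.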
