[Htmlparser-user] unable to parse a page from ncbi
From: Jay B. <kin...@gm...> - 2007-01-04 23:46:52
Hey guys, I have looked through all the examples and javadocs but I am still unsuccessful. Here is what I am trying to do: I would like to follow a link like the following.

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256

This displays a report in XML format. I want to parse the XML section out of the web page and then pull 3-4 tags out of that XML. The text in XML format sits between the <div class='recordbody'> ... </div> tags. I was using

------------------
NodeList divs = list.extractAllNodesThatMatch(new TagNameFilter("TITLE"));
NodeIterator i = divs.elements();
while (i.hasMoreNodes()) {
    System.out.println("has more nodes");
    processMyNodes(i.nextNode());
}
--------------------

based on the example from the javadocs. But anything other than "HTML" in the TagNameFilter returns nothing in divs.elements() (it never prints "has more nodes").

Can anyone help me extract the XML part from this web page? Or is there a way I can directly extract what I need from this site without saving the XML part first and then using a SAX XML parser on it?

Note that everything between <div class="recordbody">...</div> is text; it is not an XML document. If you view the source you will see what I mean. (Sorry, I don't mean to insult anyone's intelligence, I just want to be thorough about my problem.)
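
For reference, here is a minimal sketch of how the recordbody div might be pulled out with htmlparser's filter classes. It assumes the org.htmlparser 1.6-style API (Parser, TagNameFilter, HasAttributeFilter, AndFilter); the URL string is a stand-in for the full entrez link above, and toPlainTextString() is used on the assumption that the XML inside the div is escaped text rather than nested tags.

------------------
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RecordBodyExtractor {
    public static void main(String[] args) throws ParserException {
        // Stand-in for the full entrez viewer.fcgi URL quoted above
        String url = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?...";
        Parser parser = new Parser(url);

        // Match only <div> tags whose class attribute equals "recordbody"
        NodeList divs = parser.extractAllNodesThatMatch(
                new AndFilter(new TagNameFilter("div"),
                              new HasAttributeFilter("class", "recordbody")));

        for (NodeIterator i = divs.elements(); i.hasMoreNodes(); ) {
            Node div = i.nextNode();
            // toPlainTextString() returns the textual content of the div,
            // which here should be the (escaped) XML of the report
            System.out.println(div.toPlainTextString());
        }
    }
}
--------------------

Once the text is in hand it could be handed to a SAX or DOM parser in memory (for example via a StringReader) instead of being saved to a file first.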