Re: [Htmlparser-user] unable to parse a page from ncbi
From: Ian M. <ian...@gm...> - 2007-01-05 11:47:57
The XML you see in your browser isn't actually XML - it's HTML-encoded
XML. Therefore, as far as the page is concerned, it's just text. So:

- Parse the document in HTML Parser, look for that div, then look for
  the text nodes within the div.
- You now have the XML as HTML-encoded text, and you have to convert it
  into XML. You can convert it in a number of ways, but the easiest
  would be to just replace the strings &lt; and &gt; with < and >.
- You'll now have XML; use an XML parser on it. HTMLParser might be able
  to handle it - what you could do is register the various XML tags in
  there as CompositeTags in the PrototypicalNodeFactory to make it
  easier to deal with.

Ian

On 1/4/07, Jay Bhavsar <kin...@gm...> wrote:
> Hey guys,
> I have looked through all the examples and javadocs but I am still
> unsuccessful. Here is what I am trying to do.
>
> I would like to follow a link like the following.
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256
>
> This displays a report in XML format. I want to parse the XML
> section out of the web page and then parse 3-4 tags from the XML
> section. The text in XML format is in between the <div
> class='recordbody'> ... </div> tags. I was using
>
> ------------------
> NodeList divs = list.extractAllNodesThatMatch (new TagNameFilter ("TITLE"));
>
> NodeIterator i = divs.elements();
>
> while (i.hasMoreNodes()){
>     System.out.println("has more nodes");
>     processMyNodes(i.nextNode());
> }
> --------------------
>
> based on the example from the javadocs. But anything other than
> HTML in TagNameFilter returns nothing in divs.elements(). (It never
> prints "has more nodes")
>
> Can anyone help extract the XML part from this web page? Or is there a
> way I can directly extract what I need from this site without saving
> the XML part first and then using a SAX XML parser to extract it?
>
> Note: everything between <div class="recordbody">...</div> is text; it
> is not an XML document. If you view the source you will see what I
> mean. (Sorry, I don't mean to insult anyone's intelligence, but I just
> want to be thorough about my problem.)
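
The second and third steps of the reply (turning the HTML-encoded text back into XML, then handing it to an XML parser) can be sketched roughly as follows. This is only an illustration using the JDK's built-in DOM parser, and it assumes the text inside the recordbody div has already been extracted as a string. The class name EncodedXmlDemo, the helper names, and the Record/Title tags are made up for the example and are not taken from the NCBI output or from HTML Parser. Decoding &amp;amp; and &amp;quot; in addition to the two entities mentioned in the reply is also an assumption about what the page contains.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from HTML Parser.
public class EncodedXmlDemo {

    /** Turns HTML-encoded markup (as found in the div's text) back into markup. */
    public static String decodeMarkup(String encoded) {
        // "&amp;" is decoded last so that "&amp;lt;" ends up as the literal
        // text "&lt;" instead of being decoded twice into "<".
        return encoded.replace("&lt;", "<")
                      .replace("&gt;", ">")
                      .replace("&quot;", "\"")
                      .replace("&amp;", "&");
    }

    /** Parses an XML string into a DOM document. */
    public static Document parseXml(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the text of the first element with the given name, or null. */
    public static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from <div class="recordbody">.
        String encoded =
                "&lt;Record&gt;&lt;Title&gt;Sample &amp;amp; demo&lt;/Title&gt;&lt;/Record&gt;";
        Document doc = parseXml(decodeMarkup(encoded));
        System.out.println(firstText(doc, "Title")); // prints "Sample & demo"
    }
}
```

A DOM parser is used here only because it keeps the example short; for pulling 3-4 tags out of a large record, a SAX handler would work the same way on the decoded string.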