[Htmlparser-user] Re: Re: Tag Nodes not getting recognized... Please Help
From: k <km...@re...> - 2007-07-30 12:25:45
Thanks a ton, Derrick, for your message; your help is highly appreciated.

I tried parser.setEncoding("UTF-8") earlier, but it did not work as expected. Today I tried getting the contents of the file as a string, using parser.setInputHTML(getContentsAsString(testFile)), but that did not work either. The only thing that worked was opening the HTML file in TextPad, saving it again with encoding 'ANSI', and then running my code on the new file.

Could you please suggest a way to do the above using htmlParser or by any other means? I tried reading the file a line at a time and converting it as follows:

    byte[] stringBytesUTF = line.getBytes("UTF-8");
    ansiString = new String(stringBytesUTF, "ANSI");

But it seems "ANSI" is not a valid argument. Any advice in this respect is highly valuable to me.

Thanking you,
Kumar.

On Sat, 28 Jul 2007 13:48:36 -0700 (PDT) htmlparser user list wrote:

It appears the file is Unicode, probably UTF-8, so you'll need to get the contents as a string yourself, or try parser.setEncoding ("UTF-8") before performing the parse. Some operating systems support a byte order mark prefix (like 0xFEFF) within the file to identify such files as other than plain ASCII.

----- Original Message ----
From: k
To: htm...@li...
Sent: Saturday, July 28, 2007 8:12:19 AM
Subject: [Htmlparser-user] Tag Nodes not getting recognized...Please Help

Hi All,

First of all, thanks very much for your precious time. I hope I will get help here, as I have no other way.

For more than 2 days I have been trying to parse (and process all nodes of) one of my HTML files using the different parsers available, but I was not able to get the tag node list for this particular HTML file. When I tried to process it with HtmlParser, it did not detect the TagNodes; it just detected the whole HTML page as one TextNode. But when I try other simple HTML files, it does detect TagNodes. Please kindly help me out with this issue.
I am not sure whether my HTML file's character set is different, or whether I should choose some encoding option.

Here is my code. Also attached is my HTML file. It has images, but I am not attaching them.

    parser = new Parser("atest.htm");
    for (NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
        processMyNodes(i.nextNode());
    }

    static void processMyNodes(Node node) throws ParserException {
        if (node instanceof TextNode) {
            // Plain text between tags: print it.
            TextNode text = (TextNode) node;
            System.out.println(text.getText());
        }
        if (node instanceof RemarkNode) {
            // HTML comment: nothing is done with it yet.
            RemarkNode remark = (RemarkNode) node;
        } else if (node instanceof TagNode) {
            // Tag: recurse into its children, if any.
            TagNode tag = (TagNode) node;
            NodeList nl = tag.getChildren();
            if (null != nl)
                for (NodeIterator i = nl.elements(); i.hasMoreNodes(); )
                    processMyNodes(i.nextNode());
        }
    }

Kumar.
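
A note on the conversion attempt quoted in the thread: Java has no charset named "ANSI". What Windows editors label ANSI is, on Western locales, usually windows-1252 (alias Cp1252). Encoding a string to UTF-8 bytes and decoding those bytes with a different charset also does not convert anything; it only garbles non-ASCII characters. The usual approach is to decode the file's raw bytes once, with the charset the file was actually written in. Below is a minimal sketch of the "get the contents as a string yourself" suggestion, using only the JDK. The class and method names (BomAwareDecoder, detectCharset, decode, readFile) are hypothetical, and the sketch assumes the file starts with a UTF-8 or UTF-16 byte order mark; without one it falls back to a caller-supplied charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of htmlparser: decodes a file's bytes
// according to its byte order mark (BOM), if it has one.
final class BomAwareDecoder {

    /** Returns the charset indicated by a leading BOM, or the fallback if there is none. */
    static Charset detectCharset(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // UTF-32 BOMs are not handled in this sketch.
        return fallback;
    }

    /** Decodes the bytes and drops the BOM character (U+FEFF) if it is present. */
    static String decode(byte[] data, Charset fallback) {
        String text = new String(data, detectCharset(data, fallback));
        boolean hasBom = !text.isEmpty() && text.charAt(0) == '\uFEFF';
        return hasBom ? text.substring(1) : text;
    }

    /** Reads a whole file into memory and decodes it. */
    static String readFile(String path, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(Paths.get(path)), fallback);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // "windows-1252" is the JDK name for what is usually called ANSI on Western Windows.
        Charset fallback = Charset.forName("windows-1252");
        if (args.length > 0) {
            System.out.println(readFile(args[0], fallback));
        } else {
            // In-memory demo: "<p>" encoded as UTF-16LE with a BOM.
            byte[] sample = {(byte) 0xFF, (byte) 0xFE, '<', 0, 'p', 0, '>', 0};
            System.out.println(decode(sample, fallback));
        }
    }
}
```

The resulting string could then be handed to parser.setInputHTML(...), the call already mentioned in the thread. Whether this resolves the original problem depends on what encoding the attached file really uses, which the thread does not confirm.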