Re: [Htmlparser-user] Re :Re: Tag Nodes not getting recognized...Please Help
From: Derrick O. <der...@ro...> - 2007-07-30 22:33:30
The ISO-8859-1 encoding contains ASCII, so you might try that. If there aren't any funny characters in the file it should work OK.

----- Original Message ----
From: k <km...@re...>
To: htm...@li...
Sent: Monday, July 30, 2007 8:24:07 AM
Subject: [Htmlparser-user] Re :Re: Tag Nodes not getting recognized...Please Help

Thanks a ton, Derrick, for your message; your help is highly appreciated.

I tried parser.setEncoding("UTF-8") earlier, but it did not work as expected. Today I tried reading the contents of the file into a string with parser.setInputHTML(getContentsAsString(testFile)), but that did not work either.

The only thing that worked was opening the HTML file in TextPad, saving it again with encoding 'ANSI', and then running my code on the new file. Could you please suggest a way to do the same thing using HtmlParser or by any other means?

I tried reading the file a line at a time and using the following for the conversion:

    byte[] stringBytesUTF = line.getBytes("UTF-8");
    ansiString = new String(stringBytesUTF, "ANSI");

But it seems ANSI is not a valid argument. Any advice in this respect is highly valuable to me.

Thanking you,
Kumar.

On Sat, 28 Jul 2007 13:48:36 -0700 (PDT) htmlparser user list wrote:

It appears the file is Unicode, probably UTF-8, so you'll need to get the contents as a string yourself, or try parser.setEncoding("UTF-8") before performing the parse. Some operating systems support a byte order mark prefix (like 0xFEFF) within the file to identify such files as other than plain ASCII.

----- Original Message ----
From: k
To: htm...@li...
Sent: Saturday, July 28, 2007 8:12:19 AM
Subject: [Htmlparser-user] Tag Nodes not getting recognized...Please Help

Hi All,

First of all, thanks very much for your precious time. I hope I will get help here, as I have no other way. For more than 2 days I have been trying to parse (and process all nodes of) one of my HTML files using the different parsers available.
But I was not able to get the list of tag nodes for this particular HTML file. When I tried to process it with HtmlParser, it did not detect the TagNodes; it detected the whole HTML page as one TextNode. With other simple HTML files it does detect TagNodes.

Please kindly help me out with this issue. I am not sure whether my HTML file's character set is different, or whether I should choose some encoding option.

Here is my code. My HTML file is also attached; it has images, but I am not attaching them.

    parser = new Parser("atest.htm");
    for (NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
        processMyNodes(i.nextNode());
    }

    static void processMyNodes(Node node) throws ParserException {
        if (node instanceof TextNode) {
            // Print the text of every text node.
            TextNode text = (TextNode) node;
            System.out.println(text.getText());
        }
        if (node instanceof RemarkNode) {
            RemarkNode remark = (RemarkNode) node;
        } else if (node instanceof TagNode) {
            // Recurse into the children of every tag node.
            TagNode tag = (TagNode) node;
            NodeList nl = tag.getChildren();
            if (null != nl)
                for (NodeIterator i = nl.elements(); i.hasMoreNodes(); )
                    processMyNodes(i.nextNode());
        }
    }

Kumar.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
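
For anyone landing on this thread with the same problem, here is a minimal sketch of the two ideas discussed above: checking for a byte order mark to guess the file's encoding, and re-saving the file as ISO-8859-1 from Java instead of through TextPad. It uses only the JDK. The class and method names (EncodingSniffer, detectCharset, decode) are made up for illustration and are not part of HTMLParser. It assumes a BOM is present when the file is Unicode; it does not handle UTF-32 or BOM-less UTF-16, and it falls back to ISO-8859-1 as suggested in the reply. "ANSI" is not a Java charset name; on many Western Windows installs it refers to the windows-1252 code page, but that is an assumption about the machine in question.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HTMLParser.
class EncodingSniffer {

    /** Guess the charset from a leading byte order mark, else return the fallback. */
    public static Charset detectCharset(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback; // no BOM found: trust the caller's guess
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Decode the bytes and drop the BOM character (U+FEFF) if it is present. */
    public static String decode(byte[] data, Charset fallback) {
        String text = new String(data, detectCharset(data, fallback));
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java EncodingSniffer <input.htm> <output.htm>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        String html = decode(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        // Characters that ISO-8859-1 cannot represent are written as '?'.
        Files.write(out, html.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Instead of writing a new file, the decoded string could presumably be handed straight to parser.setInputHTML(...), the call mentioned earlier in the thread. Whether that alone fixes the parse depends on the actual file, which is not available here.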