Re: [Htmlparser-user] Tag Nodes not getting recognized...Please Help
From: Derrick O. <der...@ro...> - 2007-07-31 20:05:18
No, sorry, I can't do your job for you.

A standard Java InputStreamReader takes the encoding as a constructor argument. I suggest trying "UTF-8".

If you don't want to turn the file into a String first, the Page class in the lexer package has a similar constructor:

    Page (InputStream stream, String charset)

You can pass the page into the Lexer and thence on to the Parser with something like:

    new Parser (new Lexer (new Page (mystream, "UTF-8")))

----- Original Message ----
From: k <km...@re...>
To: htm...@li...
Sent: Tuesday, July 31, 2007 1:39:51 PM
Subject: [Htmlparser-user] Re :Re: Re :Re: Tag Nodes not getting recognized...Please Help

Hi Derrick,

Thanks very much again. I have tried ISO-8859-1, but no luck. The original HTML file is in Unicode (probably UTF-8). I have tried many, many ways and was not able to do it. Could you please try htmlParser once on my HTML file and advise me? I know it takes your valuable time, but it would be very helpful to me. I am attaching the file again.

Kumar.

On Mon, 30 Jul 2007 15:33:22 -0700 (PDT) htmlparser user list wrote

The ISO-8859-1 encoding contains ASCII, you might try that. If there aren't any funny characters in the file it should work OK.

----- Original Message ----
From: k
To: htm...@li...
Sent: Monday, July 30, 2007 8:24:07 AM
Subject: [Htmlparser-user] Re :Re: Tag Nodes not getting recognized...Please Help

Thanks a ton, Derrick, for your message; your help is highly appreciated.

I tried parser.setEncoding("UTF-8") earlier, but it did not work as expected. Today I tried getting the content of the file into a string using parser.setInputHTML(getContentsAsString(testFile)), but that did not work either. The only thing that worked was opening the HTML file in TextPad, saving it again with encoding 'ANSI', and then running my code on the new file.

Could you please suggest a way that I can do the above using htmlParser or by any other means?
I tried reading the file a line at a time and using the following for the conversion:

    byte[] stringBytesUTF = line.getBytes("UTF-8");
    ansiString = new String(stringBytesUTF, "ANSI");

But it seems "ANSI" is not a valid argument. Any advice in this respect is highly valuable to me.

Thanking you,
Kumar.

On Sat, 28 Jul 2007 13:48:36 -0700 (PDT) htmlparser user list wrote

It appears the file is Unicode, probably UTF-8, so you'll need to get the contents as a string yourself, or try parser.setEncoding ("UTF-8") before performing the parse. Some operating systems support a byte order mark prefix (like 0xFEFF) within the file to identify such files as other than plain ASCII.

----- Original Message ----
From: k
To: htm...@li...
Sent: Saturday, July 28, 2007 8:12:19 AM
Subject: [Htmlparser-user] Tag Nodes not getting recognized...Please Help

Hi All,

First of all, thanks very much for your precious time. I hope I will get help here, as I have no other way.

For more than 2 days I have been trying to parse (and process all nodes of) one of my HTML files using the different parsers available, but I was not able to get the list of tag nodes for this particular HTML file. When I tried to process this HTML file with HtmlParser, it did not detect the TagNodes; it just detected the whole HTML page as one TextNode. When I try other, simple HTML files, it does detect TagNodes.

Please kindly help me out with this issue. Is the character set of my HTML file perhaps different? Or should I choose some encoding option?

Also attached is my HTML file. It has images, but I am not attaching them. Here is my code:
    parser = new Parser("atest.htm");
    for (NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
        processMyNodes(i.nextNode());
    }

    static void processMyNodes (Node node) throws ParserException {
        if (node instanceof TextNode) {
            // Text between tags: print it
            TextNode text = (TextNode)node;
            System.out.println (text.getText ());
        }
        if (node instanceof RemarkNode) {
            // HTML comment: nothing is done with it yet
            RemarkNode remark = (RemarkNode)node;
        } else if (node instanceof TagNode) {
            // Tag: recurse into its children, if it has any
            TagNode tag = (TagNode)node;
            NodeList nl = tag.getChildren ();
            if (null != nl)
                for (NodeIterator i = nl.elements (); i.hasMoreNodes (); )
                    processMyNodes (i.nextNode ());
        }
    }

Kumar.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
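
For readers following the thread: the advice above to "get the contents as a string yourself" can be sketched with the standard Java library alone. This is only a minimal sketch, assuming the file really is UTF-8 and a reasonably recent JDK; the class name Utf8HtmlReader and its methods are hypothetical helpers, not part of HTMLParser. It decodes the bytes as UTF-8 and drops a leading byte order mark (U+FEFF) if one is present.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of HTMLParser: reads an HTML file as UTF-8 text.
class Utf8HtmlReader {

    /** Decodes UTF-8 bytes and drops a leading byte order mark (U+FEFF), if any. */
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1); // the BOM is not part of the markup
        }
        return text;
    }

    /** Reads the whole file into memory and decodes it with decodeUtf8. */
    static String readUtf8(Path file) throws IOException {
        return decodeUtf8(Files.readAllBytes(file));
    }
}
```

The resulting string could then be handed to parser.setInputHTML(...), the call already mentioned earlier in the thread. If the file turns out to be UTF-16 rather than UTF-8, the charset passed to the String constructor would have to change accordingly.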