My code:
Parser parser = new Parser ("http://www.google.co.th");
parser.setEncoding("tis-620");
TextExtractingVisitor visitor = new TextExtractingVisitor ();
parser.visitAllNodesWith (visitor);
System.out.println (visitor.getExtractedText());
This website is in Thai.
The site uses charset=windows-874,
but that charset is not supported in Java.
The error message after running is:
unable to determine cannonical charset name for windows-874 - using ISO-8859-1
This has been a problem in the past. The encoding/charset for the page is not supported or unknown in Java. The solution that probably will be adopted is to provide a static accessor pair on the Page class to set/get the default charset so the fallback character set is the correct one. You can code this yourself...
In the Page class, add a static class variable, initialized to the original default:
static String mDefaultCharset = DEFAULT_CHARSET;
Create accessor methods to get and set it:
static void setDefaultCharset (String charset) { mDefaultCharset = charset; }
static String getDefaultCharset () { return (mDefaultCharset); }
Then use this accessor in the getCharset method (line 259?):
ret = getDefaultCharset (); // was DEFAULT_CHARSET
Rebuild htmlparser.jar (ant task: jar; see the other building instructions).
Then, set up the default in your program:
// parser.setEncoding("tis-620");
Page.setDefaultCharset ("tis-620");
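To see the whole change in one place, here is a minimal stand-alone sketch of the accessor pattern. It is not the real Page class: PageSketch and resolveCharset are hypothetical names, the lookup through java.nio.charset.Charset is an assumption about how the fallback is reached, and the ISO-8859-1 default is taken from the error message above.

```java
import java.nio.charset.Charset;

// Stand-alone mock of the pattern described above. This is NOT the real
// org.htmlparser Page class, only an illustration of the accessor pair.
class PageSketch {
    // Original hard-coded fallback (value taken from the error message).
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Mutable default, initialized to the original constant.
    static String mDefaultCharset = DEFAULT_CHARSET;

    static void setDefaultCharset(String charset) { mDefaultCharset = charset; }

    static String getDefaultCharset() { return mDefaultCharset; }

    // Hypothetical stand-in for the lookup inside getCharset: returns the
    // JVM's canonical name for the declared charset, or the configurable
    // default when the JVM does not know the name.
    static String resolveCharset(String declared) {
        String ret;
        try {
            ret = Charset.forName(declared).name();
        } catch (IllegalArgumentException e) { // unsupported or illegal name
            ret = getDefaultCharset(); // was DEFAULT_CHARSET
            System.out.println("unable to determine cannonical charset name for "
                    + declared + " - using " + ret);
        }
        return ret;
    }

    public static void main(String[] args) {
        // Set the fallback once, before any page is parsed.
        PageSketch.setDefaultCharset("UTF-8");
        System.out.println(resolveCharset("x-no-such-charset-123"));
    }
}
```

The point of the static pair is that the fallback is chosen once by the application instead of being fixed inside the library.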
The error message will still be generated, but now it should say:
unable to determine cannonical charset name for windows-874 - using tis-620
Of course, you need to use the correct canonical name for the character set you want, which may not be "tis-620".
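One way to find out which names your JVM accepts is to ask java.nio.charset.Charset directly. The sketch below uses only the standard library; CharsetProbe and canonicalNameOrNull are hypothetical names, and the candidate list is just a set of guesses to probe, since the result depends on which charsets your Java installation ships with.

```java
import java.nio.charset.Charset;

class CharsetProbe {
    // Returns the JVM's canonical name for the given charset name or alias,
    // or null if the name is unsupported or not a legal charset name.
    static String canonicalNameOrNull(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) { // illegal charset name
            return null;
        }
    }

    public static void main(String[] args) {
        // Candidate names to probe; these are only examples.
        String[] candidates = {"windows-874", "x-windows-874", "tis-620", "TIS620"};
        for (String c : candidates) {
            String canonical = canonicalNameOrNull(c);
            System.out.println(c + " -> "
                    + (canonical == null ? "not supported" : canonical));
        }
    }
}
```

Whichever candidate reports a canonical name is the one to pass to Page.setDefaultCharset.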
If this works for you, let us know so it can be incorporated as a permanent solution.
Thanks for your answer.
It works, but when I replace the URL with an HTML file on my local drive, nothing is output.
Could you give me another suggestion?
If there isn't an error saying it couldn't open the file, it's found the file and is processing it.
If so, nodes are being returned.
If there isn't any output I would check the logic of your visitor, assuming you are using one like in your original post.
Use a debugger and break on visitTag() or visitStringNode() as appropriate.
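Before stepping through the visitor, it can also help to rule out the file itself. The sketch below does not use the parser at all; it only checks, with the standard library, that the file is readable and decodes to non-empty text in the charset you expect. LocalFileCheck, readDecoded, and the page.html path are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LocalFileCheck {
    // Reads the whole file and decodes it with the named charset.
    // Throws IOException if the file cannot be read.
    static String readDecoded(Path file, String charsetName) throws IOException {
        if (!Files.isReadable(file)) {
            throw new IOException("cannot read " + file.toAbsolutePath());
        }
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical defaults; pass your own path and charset as arguments.
        Path file = Paths.get(args.length > 0 ? args[0] : "page.html");
        String charset = args.length > 1 ? args[1] : "UTF-8";

        String text = readDecoded(file, charset);
        System.out.println(file.toAbsolutePath() + ": " + text.length() + " chars");
        // Show the start of the file so the decoding can be checked by eye.
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}
```

If this prints readable text, the file and charset are fine and the problem is more likely in the visitor logic.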
Thanks very much for your help