Rupanu,

I'm not sure where your problem lies. The exception was raised because the encoding the stream was opened with didn't agree with the encoding declared by the HTML within it. The ConnectionManager method that opens a disk file, URLConnection openConnection (String string), delegates to the overload URLConnection openConnection (URL url), with the URL being the file name prefixed by "file://localhost".

So it's up to the JVM and operating system to figure out the encoding of the text file on disk. Apparently the file was not written with the expected encoding bytes at the beginning, or something similar, so the encoding couldn't be determined and the file was opened as ISO-8859-1 instead of UTF-8.

To fix it, either the HTML text file needs to be written differently, or you need to open it differently, perhaps by passing your own stream to the Page constructor.
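
As a rough illustration of the "open it differently" route, here is a minimal sketch using only the JDK. It assumes the mismatch in your exception (0xef versus 0xfeff at offset 0) comes from a UTF-8 byte order mark at the start of the file, which I have not verified against your file. The class and method names (BomAwareReader, openSkippingUtf8Bom, readUtf8) are made up for this example and are not part of htmlparser.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlparser.
public class BomAwareReader {

    // Opens the file and positions the stream just past a UTF-8 BOM
    // (bytes EF BB BF) if one is present; otherwise at the first byte.
    static InputStream openSkippingUtf8Bom(String fileName) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(fileName));
        try {
            in.mark(3); // remember the start so we can rewind
            boolean bom = in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
            if (!bom) {
                in.reset(); // no BOM: rewind to the first byte
            }
            return in;
        } catch (IOException e) {
            in.close();
            throw e;
        }
    }

    // Reads the whole file as UTF-8 text, without the BOM character.
    static String readUtf8(String fileName) throws IOException {
        try (InputStream in = openSkippingUtf8Bom(fileName)) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

The stream returned by openSkippingUtf8Bom could then be handed to the parser with an explicit charset, along the lines of new Parser (new Lexer (new Page (stream, "UTF-8"))). Please check that against the javadoc of the htmlparser version you have, since I am writing the constructor signatures from memory.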

Derrick

----- Original Message ----
From: Rupanu Ranjaneswar <rupanu_pal@yahoo.com>
To: htmlparser-user@lists.sourceforge.net
Sent: Wednesday, September 26, 2007 2:07:27 AM
Subject: [Htmlparser-user] Encoding issue

Hello there,

Well, I copied and pasted the code you gave, but there seems to be an issue with encoding. I am trying to read from a non-Unicode htm/html file, extract its contents, and write them to a text file.
Here's the code
*********************************
String inputfile = args[0];
Parser parser = new Parser (inputfile);
StringBean sb = new StringBean ();
parser.visitAllNodesWith (sb);
String content = sb.getStrings();
String outputfilename = "E:\\outputfile.txt";
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename));    //, "UTF8"
osw.write(content);
osw.close();
**********************************************
and here is the exception I get
org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0xfeff] != old: ï [0xef]) for encoding change from ISO-8859-1 to UTF-8 at character offset 0

However, I then wrote the following code, which served my purpose to some extent. But could you please explain what the issue was there, and how I can determine the encoding of an htm/html file (offline, saved on my hard drive)?

***************
StringExtractor strext = new StringExtractor(input);
String content = strext.extractStrings(false);

String outputfilename = "output.txt";
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename), "UTF8");
osw.write(content);
osw.close();    // flush the buffered text to disk
*************

