Re: [Htmlparser-user] Encoding issue
From: Derrick O. <der...@ro...> - 2007-09-26 11:54:42
Rupanu,

I'm not sure where your problem lies. The exception was raised because the encoding of the stream didn't agree with the encoding stated in the HTML within it. The code in ConnectionManager that opens a disk file - URLConnection openConnection (String string) - uses the overload - URLConnection openConnection (URL url) - with the url being the file name prefixed by "file://localhost".

So it's up to the JVM and operating system to figure out the encoding of the text file on disk. Apparently, the file was not written with the correct encoding bytes at the beginning of the file or something, so this couldn't be figured out and it was opened with ISO-8859-1 instead of UTF-8 encoding.

To fix it, the text file of HTML needs to be written differently, or you need to open it differently, perhaps using your own stream passed to the Page constructor.

Derrick

----- Original Message ----
From: Rupanu Ranjaneswar <rup...@ya...>
To: htm...@li...
Sent: Wednesday, September 26, 2007 2:07:27 AM
Subject: [Htmlparser-user] Encoding issue

Hello there,

Well, I copied and pasted the code you gave but there seems to be an issue with encoding. I am trying to read from a non-unicode htm/html file and extract its contents and write them into a text file.
Here's the code
*********************************
String inputfile = args[0];
Parser parser = new Parser (inputfile);
StringBean sb = new StringBean ();
parser.visitAllNodesWith (sb);
String content = sb.getStrings();
String outputfilename= "E:\\outputfile.txt";
OutputStreamWriter osw= new OutputStreamWriter(new FileOutputStream(outputfilename)); //, "UTF8"
osw.write(content);

osw.close();
**********************************************
and here is the exception I get
org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0xfeff] != old: [0xefï]) for encoding change from ISO-8859-1 to UTF-8 at character offset 0

However then I wrote the following code which served my purpose to some extent. But could you please explain what was the issue there and how can i render the encoding of an htm/html file (offline/saved in my hard drive).

***************
StringExtractor strext = new StringExtractor(input);
String content = strext.extractStrings(false);

String outputfilename="output.txt";
OutputStreamWriter osw= new OutputStreamWriter(new FileOutputStream(outputfilename), "UTF8");
osw.write(content);
*************
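
For what it's worth, the values in the exception (0xef under ISO-8859-1, 0xfeff under UTF-8) are what you would see if the file begins with a UTF-8 byte order mark (bytes EF BB BF) and is first read as ISO-8859-1. Below is a rough sketch of the "open it differently using your own stream" idea from the reply above, using only the JDK. The class and method names (Utf8Html, skipUtf8Bom, decode) are made up for illustration and are not part of HTMLParser, and the Page constructor mentioned in the comment is an assumption that should be checked against the javadoc of your HTMLParser version.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names here are illustrative and not part of HTMLParser.
final class Utf8Html {

    // Wraps the stream so that a leading UTF-8 byte order mark (EF BB BF)
    // is skipped if present; any other leading bytes are left untouched.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream is shorter than three bytes
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: push the bytes back unchanged
        }
        return pb;
    }

    // Decodes raw file bytes as UTF-8, dropping a leading BOM if there is one.
    static String decode(byte[] raw) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(raw))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With a real file, the input would be new FileInputStream(args[0]),
        // and the wrapped stream could be handed to the parser with an
        // explicit charset, along the lines of (assumed, please verify):
        //   new Parser(new Lexer(new Page(skipUtf8Bom(fileStream), "UTF-8")));
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>', 'h', 'i'};
        System.out.println(decode(sample));
    }
}
```

Stating the charset explicitly means the parser does not have to guess ISO-8859-1 first and then switch when it meets the meta tag, which is the point at which the mismatch was reported.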