Re: [Htmlparser-user] Encoding issue
From: Derrick O. <der...@ro...> - 2007-09-26 11:54:42
Rupanu,

I'm not sure where your problem lies. The exception was raised because the encoding of the stream didn't agree with the encoding stated in the HTML within it. The code in ConnectionManager that opens a disk file - URLConnection openConnection (String string) - uses the overload - URLConnection openConnection (URL url) - with the url being the file name prefixed by "file://localhost".

So it's up to the JVM and operating system to figure out the encoding of the text file on disk. Apparently, the file was not written with the correct encoding bytes at the beginning of the file or something, so this couldn't be figured out and it was opened with ISO-8859-1 instead of UTF-8 encoding.

To fix it, the text file of HTML needs to be written differently, or you need to open it differently, perhaps using your own stream passed to the Page constructor.

Derrick

----- Original Message ----
From: Rupanu Ranjaneswar <rup...@ya...>
To: htm...@li...
Sent: Wednesday, September 26, 2007 2:07:27 AM
Subject: [Htmlparser-user] Encoding issue

Hello there,

Well, I copied and pasted the code you gave, but there seems to be an issue with encoding. I am trying to read from a non-Unicode htm/html file, extract its contents, and write them into a text file.
Here's the code:
*********************************
String inputfile = args[0];
Parser parser = new Parser (inputfile);
StringBean sb = new StringBean ();
parser.visitAllNodesWith (sb);
String content = sb.getStrings();
String outputfilename = "E:\\outputfile.txt";
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename)); //, "UTF8"
osw.write(content);

osw.close();
**********************************************
and here is the exception I get:
org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0xfeff] != old: [0xefï]) for encoding change from ISO-8859-1 to UTF-8 at character offset 0

However, I then wrote the following code, which served my purpose to some extent. But could you please explain what the issue was there, and how I can determine the encoding of an htm/html file (offline, saved on my hard drive)?

***************
StringExtractor strext = new StringExtractor(input);
String content = strext.extractStrings(false);

String outputfilename = "output.txt";
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename), "UTF8");
osw.write(content);
*************
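
For what it's worth, the two values in the exception quoted above (0xfeff versus 0xef) are consistent with a file that begins with the UTF-8 byte order mark, the bytes EF BB BF: decoded as ISO-8859-1 the first character is 0xEF, while decoded as UTF-8 it is U+FEFF. That reading is an inference from the message, not something confirmed in this thread. Below is a minimal sketch, using only the JDK, of the "open it differently" option: peek at the first three bytes and pick the charset yourself. The class name BomAwareReader and its two methods are hypothetical and are not part of HTMLParser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HTMLParser.
class BomAwareReader {

    /** Returns the first character obtained by decoding the bytes with the given charset. */
    static int firstChar(byte[] bytes, Charset charset) throws IOException {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            return r.read();
        }
    }

    /**
     * Opens a Reader over the stream. If the stream starts with the UTF-8 BOM
     * (EF BB BF), the BOM is skipped and UTF-8 is used; otherwise the bytes
     * read so far are pushed back and the fallback charset is used.
     */
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {                       // read up to three bytes
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;                 // stream shorter than three bytes
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);           // no BOM: give the bytes back
        }
        return new InputStreamReader(pin, bom ? StandardCharsets.UTF_8 : fallback);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        // The same bytes give different first characters under the two charsets.
        System.out.printf("ISO-8859-1: 0x%x%n", firstChar(bytes, StandardCharsets.ISO_8859_1)); // 0xef
        System.out.printf("UTF-8:      0x%x%n", firstChar(bytes, StandardCharsets.UTF_8));      // 0xfeff
        try (Reader r = open(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            System.out.println("After BOM:  " + (char) r.read());                               // <
        }
    }
}
```

For a file on disk, the ByteArrayInputStream would be replaced by a FileInputStream. The decoded text could then be handed to the parser as a string, or the detected charset passed alongside your own stream to Page; check which constructors your HTMLParser version provides.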