[Htmlparser-user] Parsing files / charset issues
From: Nilius F. <Fab...@co...> - 2006-05-30 09:39:39
This is more a feature request than a question, since our solution
seems to work.

We are parsing existing HTML files, directly from the file system.
It seems to be quite complicated to handle charset/Unicode issues
correctly.

One of the basic problems is that Parser.createParser doesn't take a
byte[] as argument, but a String. To turn a File (which is basically
a byte[]) into a String, I need to know the charset/encoding. To find
that out, I would like to use parser.getEncoding(), which reads the
meta tags (Content-Type). So, in the sample code below, we read the
HTML file twice: once with the platform default encoding (which should
be OK for the HTML HEAD, since it is normally plain ASCII), and once
with the encoding then provided by the HTML parser.

It would be great if the HTML parser could accept a byte[] and sort
out the encoding itself (some guessing might also be done, e.g.
handling BOMs (byte order marks)).

Thanks
Fabian
// Needs: java.io.*, org.htmlparser.Parser
String readHtmlFile (String fileName) throws IOException,
        UnsupportedEncodingException
{
    String source;
    String result;
    // First pass: read with the platform default encoding.
    try
    {
        source = _readFile (fileName, null);
    }
    catch (UnsupportedEncodingException e)
    {
        throw new RuntimeException ("Programming error: Default encoding unsupported?", e);
    }
    // Let the parser report the encoding it determined for this source.
    Parser parser = Parser.createParser (source, null);
    String sourceCodepage;
    String encoding = parser.getEncoding ();
    // Second pass: read the file again with the reported encoding.
    try
    {
        sourceCodepage = _readFile (fileName, encoding);
        result = sourceCodepage;
    }
    catch (UnsupportedEncodingException e)
    {
        System.err.println ("Unsupported HTML encoding \"" + encoding
                + "\", using default.");
        result = source;
    }
    return result;
}
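String _readFile (String fileName, String codepage)
        throws FileNotFoundException, UnsupportedEncodingException,
        IOException
{
    File file = new File (fileName);
    long length = file.length ();
    // For common encodings the byte length is an upper bound on the
    // number of chars produced by decoding.
    char[] buffer = new char[(int) length];
    FileInputStream fileInputStream = new FileInputStream (file);
    try
    {
        InputStreamReader inputStreamReader;
        if (codepage == null)
        {
            // No codepage given: use the platform default encoding.
            inputStreamReader = new InputStreamReader (fileInputStream);
        }
        else
        {
            inputStreamReader = new InputStreamReader (fileInputStream,
                    codepage);
        }
        BufferedReader bufferedReader = new BufferedReader
                (inputStreamReader);
        // A single read() may return fewer chars than requested,
        // so loop until the buffer is full or the stream ends.
        int total = 0;
        int noCharRead;
        while (total < buffer.length
                && (noCharRead = bufferedReader.read (buffer, total,
                        buffer.length - total)) != -1)
        {
            total += noCharRead;
        }
        return new String (buffer, 0, total);
    }
    finally
    {
        fileInputStream.close ();
    }
}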
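
For comparison, the byte[]-based handling requested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of HTMLParser: the class CharsetSniffer and its methods detectBom, decode and readHtml are hypothetical names, only UTF-8 and UTF-16 BOMs are recognized, the charset declaration is searched for in the first 4096 bytes only, and ISO-8859-1 is an assumed fallback. It reads the file once, checks for a BOM, and otherwise looks for a charset=... declaration in the head.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
class CharsetSniffer {

    // Matches charset=... as found in
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name implied by a leading BOM, or null if none is found. */
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /** Decodes HTML bytes using a BOM if present, else a declared charset, else ISO-8859-1. */
    static String decode(byte[] bytes) {
        String bom = detectBom(bytes);
        if (bom != null) {
            // Skip the BOM bytes so they do not show up as a character in the result.
            int skip = bom.equals("UTF-8") ? 3 : 2;
            return new String(bytes, skip, bytes.length - skip, Charset.forName(bom));
        }
        // ISO-8859-1 maps every byte to exactly one char, so decoding the head
        // this way cannot fail and keeps the ASCII parts readable.
        int headLen = Math.min(bytes.length, 4096); // assumed limit
        String head = new String(bytes, 0, headLen, StandardCharsets.ISO_8859_1);
        Charset charset = StandardCharsets.ISO_8859_1; // assumed fallback
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new String(bytes, charset);
    }

    /** Reads the file once and decodes it with the sniffed charset. */
    static String readHtml(String fileName) throws IOException {
        return decode(Files.readAllBytes(Paths.get(fileName)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(readHtml(args[0]));
        }
    }
}
```

The resulting String could then be passed to Parser.createParser as in the code above, so the file is only read from disk once.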