[Htmlparser-user] Parsing files / charset issues
From: Nilius F. <Fab...@co...> - 2006-05-30 09:39:39
This is more of a feature request than a question, since our solution seems to work.

We are parsing existing HTML files directly from the file system, and it seems to be quite complicated to handle charset/Unicode issues correctly. One of the basic problems is that Parser.createParser takes a String as its argument, not a byte[]. To turn a File (which is basically a byte[]) into a String, I need to know the charset/encoding. To find that out, I would like to use parser.getEncoding(), which reads the meta tags (Content-Type).

So, in the sample code attached, we read the HTML file twice: once with the default encoding (treated as plain ASCII, which should be OK for the HTML HEAD), and once with the encoding then reported by the HTML parser.

It would be great if the HTML parser could accept a byte[] and sort out the encoding itself (some guessing might also be done, e.g. handling BOMs (byte order marks)).

Thanks
Fabian

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;

    import org.htmlparser.Parser;

    String readHtmlFile (String fileName)
        throws IOException, UnsupportedEncodingException
    {
        String source;
        String result;

        // First pass: read with the default encoding, just to find the meta tags.
        try
        {
            source = _readFile (fileName, null);
        }
        catch (UnsupportedEncodingException e)
        {
            throw new RuntimeException ("Programming error: Default encoding unsupported?", e);
        }

        Parser parser = Parser.createParser (source, null);
        String encoding = parser.getEncoding ();

        // Second pass: read again with the encoding reported by the parser.
        try
        {
            result = _readFile (fileName, encoding);
        }
        catch (UnsupportedEncodingException e)
        {
            System.err.println ("Unsupported HTML encoding \"" + encoding + "\", using default.");
            result = source;
        }
        return result;
    }

    String _readFile (String fileName, String codepage)
        throws FileNotFoundException, UnsupportedEncodingException, IOException
    {
        File file = new File (fileName);
        long length = file.length ();
        // Buffer sized by the file length in bytes.
        char[] buffer = new char[(int) length];
        FileInputStream fileInputStream = new FileInputStream (file);
        try
        {
            InputStreamReader inputStreamReader;
            if (codepage == null)
            {
                // No codepage given: use the platform default encoding.
                inputStreamReader = new InputStreamReader (fileInputStream);
            }
            else
            {
                inputStreamReader = new InputStreamReader (fileInputStream, codepage);
            }
            BufferedReader bufferedReader = new BufferedReader (inputStreamReader);

            // A single read () may return fewer chars than requested,
            // so keep reading until end of file or until the buffer is full.
            int total = 0;
            int noCharRead;
            while (total < buffer.length
                   && (noCharRead = bufferedReader.read (buffer, total, buffer.length - total)) > 0)
            {
                total += noCharRead;
            }
            return new String (buffer, 0, total);
        }
        finally
        {
            fileInputStream.close ();
        }
    }
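
For illustration, here is a minimal sketch of what the requested byte[]-based handling could look like, so that the file is read only once. It uses only the JDK and is not part of HTMLParser: the class HtmlCharsetSniffer and its methods detectCharset, decode and readHtmlFile are hypothetical names. It assumes that checking for UTF-8 and UTF-16 BOMs is enough (UTF-32 BOMs are not handled), that any charset declaration sits within the first 4096 bytes, and that a crude regular expression is good enough to find it. It also uses java.nio.file, which requires Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an HTMLParser class.
final class HtmlCharsetSniffer {

    // Crude match for "charset=..." as found in a Content-Type meta tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration appears within this many leading bytes.
    private static final int HEAD_WINDOW = 4096;

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the number of BOM bytes at the start of the data, or 0 if none.
    static int bomLength(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return 3;
        }
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) {
            return 2;
        }
        return 0;
    }

    // Guesses the charset: BOM first, then a meta charset declaration, then the fallback.
    static Charset detectCharset(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot
        // fail and leaves the ASCII text of the HEAD readable.
        int headLength = Math.min(data.length, HEAD_WINDOW);
        String head = new String(data, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (matcher.find()) {
            try {
                return Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    // Decodes the bytes with the detected charset, skipping any BOM.
    static String decode(byte[] data, Charset fallback) {
        Charset charset = detectCharset(data, fallback);
        int skip = bomLength(data);
        return new String(data, skip, data.length - skip, charset);
    }

    // Reads the file once and decodes it, falling back to the platform default.
    static String readHtmlFile(String fileName) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(fileName));
        return decode(data, Charset.defaultCharset());
    }
}
```

The resulting String could then be handed to Parser.createParser as in the code above. The point of the design is that the first, throwaway decode works on bytes already in memory, so the file itself is opened only once.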