Re: [Htmlparser-user] Parsing files / charset issues
From: Derrick O. <Der...@Ro...> - 2006-05-30 11:04:51
Nilius,
I'm surprised it works as you've coded it.
I would have thought you would need to perform
parser.parse (null);
before the getEncoding ();
Otherwise it would still be set to the default encoding.
Derrick
Nilius Fabian wrote:
>This is more a feature request than a question, since our solution
>seems to work.
>
>We are parsing existing html files, directly from the file system.
>It seems to be quite complicated to handle charset/unicode issues
>correctly.
>
>One of the basic problems is that Parser.createParser doesn't take a
>byte[] as an argument, but a String.
>
>To convert a File (which is basically a byte[]) to a String, I need to
>know the charset/encoding. To find it, I would like to use
>parser.getEncoding() to read the meta tags (Content-Type). So, in the
>sample code attached, we read the HTML file twice: once with plain
>ASCII encoding (which should be OK for the HTML HEAD), and once with
>the encoding then provided by the HTML parser.
>
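The two-pass idea described above does not strictly require reading the file twice. A minimal sketch, assuming the bytes are already in memory and the encoding is ASCII-compatible (so it will not help with UTF-16): decode the first few kilobytes as ISO-8859-1, look for a charset declaration, then decode the same byte[] again. The class name MetaCharsetSniffer, the method name decode, and the 4096-byte limit are hypothetical choices made for illustration and are not part of the HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the HTML parser library.
final class MetaCharsetSniffer {
    // Matches "charset=NAME", as found in a Content-Type META tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Decodes data with the charset declared near its start, else with fallback. */
    static String decode(byte[] data, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this pass cannot fail.
        int headLength = Math.min(data.length, 4096); // arbitrary limit (assumption)
        String head = new String(data, 0, headLength, StandardCharsets.ISO_8859_1);

        Charset charset = fallback;
        Matcher matcher = CHARSET.matcher(head);
        if (matcher.find()) {
            try {
                charset = Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        // Second pass over the same bytes, now with the declared charset.
        return new String(data, charset);
    }
}
```

A call such as MetaCharsetSniffer.decode(bytes, StandardCharsets.ISO_8859_1) would then yield the String to hand to Parser.createParser. The regular expression is deliberately crude and will also match a charset= that appears outside a META tag.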
>It would be great if the HTML parser could handle byte[] and sort out
>the encoding itself (some guessing might also be done, e.g. handling
>BOMs (byte order marks)).
>
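The BOM guessing mentioned above can be approximated outside the parser. The following is only a sketch of the idea, not something the HTML parser provides: it checks the first bytes of a byte[] for the UTF-8 and UTF-16 byte order marks. The class name BomSniffer and the method name detectBom are hypothetical, and UTF-32 marks are deliberately ignored (a UTF-32LE mark starts with the same two bytes as UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the HTML parser library.
final class BomSniffer {
    /** Returns the charset implied by a leading byte order mark, or null if there is none. */
    static Charset detectBom(byte[] data) {
        // UTF-8 BOM: EF BB BF
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little-endian BOM: FF FE (UTF-32LE is not distinguished here)
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the caller has to guess some other way
    }
}
```

When detectBom returns null, the caller still needs another source for the encoding, such as the META tag or a configured default.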
>Thanks
>
>Fabian
>
>
>
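> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.io.UnsupportedEncodingException;
>
> import org.htmlparser.Parser;
>
> String readHtmlFile (String fileName) throws IOException,
>     UnsupportedEncodingException
> {
>     String source;
>     String result;
>
>     // First pass: read the file with the platform default encoding.
>     try
>     {
>         source = _readFile (fileName, null);
>     }
>     catch (UnsupportedEncodingException e)
>     {
>         throw new RuntimeException (
>             "Programming error: Default encoding unsupported?", e);
>     }
>
>     Parser parser = Parser.createParser (source, null);
>
>     String sourceCodepage;
>
>     String encoding = parser.getEncoding ();
>
>     // Second pass: read the file again with the encoding from the parser.
>     try
>     {
>         sourceCodepage = _readFile (fileName, encoding);
>         result = sourceCodepage;
>     }
>     catch (UnsupportedEncodingException e)
>     {
>         System.err.println ("Unsupported HTML encoding \"" + encoding
>             + "\", using default.");
>         result = source;
>     }
>
>     return result;
> }
>
>
> String _readFile (String fileName, String codepage)
>     throws FileNotFoundException, UnsupportedEncodingException,
>     IOException
> {
>     File file = new File (fileName);
>     long length = file.length ();
>     // The number of chars never exceeds the number of bytes.
>     char[] buffer = new char[(int) length];
>
>     FileInputStream fileInputStream = new FileInputStream (file);
>     try
>     {
>         InputStreamReader inputStreamReader;
>         if (codepage == null)
>         {
>             inputStreamReader = new InputStreamReader (fileInputStream);
>         }
>         else
>         {
>             inputStreamReader = new InputStreamReader (fileInputStream,
>                 codepage);
>         }
>
>         BufferedReader bufferedReader = new BufferedReader
>             (inputStreamReader);
>
>         // A single read () may return fewer chars than requested,
>         // so keep reading until the buffer is full or EOF is reached.
>         int total = 0;
>         int noCharRead;
>         while (total < buffer.length
>             && (noCharRead = bufferedReader.read (buffer, total,
>                 buffer.length - total)) != -1)
>         {
>             total += noCharRead;
>         }
>         return new String (buffer, 0, total);
>     }
>     finally
>     {
>         // Also release the file if the codepage turns out to be unsupported.
>         fileInputStream.close ();
>     }
> }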
>
>
>