Re: [Htmlparser-user] Parsing files / charset issues
From: Derrick O. <Der...@Ro...> - 2006-05-30 11:04:51
Nilius,

I'm surprised it works as you've coded it. I would have thought you would
need to perform

    parser.parse (null);

before the getEncoding (); call. Otherwise the encoding would still be set
to the default.

Derrick

Nilius Fabian wrote:

>This is more a feature request than a question, since our solution
>seems to work.
>
>We are parsing existing HTML files, directly from the file system.
>It seems to be quite complicated to handle charset/Unicode issues
>correctly.
>
>One of the basic problems is that Parser.createParser doesn't take a
>byte[] as argument, but a String.
>
>To convert a File (which is basically a byte[]) into a String, I need
>to know the charset/encoding. To find it, I would like to use
>parser.getEncoding(), which reads the meta tags (Content-Type). So, in
>the sample code attached, we are reading the HTML file twice: once with
>plain ASCII encoding (which should be OK for the HTML HEAD), and once
>with the encoding then provided by the HTML parser.
>
>It would be great if the HTML parser could handle byte[] and sort out
>the encoding stuff itself (some guessing might also be done, e.g.
>handling BOMs (byte order marks)).
>
>Thanks
>
>Fabian
>
>
>  String readHtmlFile (String fileName) throws IOException, UnsupportedEncodingException
>  {
>    String source;
>    String result;
>
>    // First pass: read the file with the default encoding.
>    try
>    {
>      source = _readFile (fileName, null);
>    }
>    catch (UnsupportedEncodingException e)
>    {
>      throw new RuntimeException ("Programming error: Default encoding unsupported?", e);
>    }
>
>    Parser parser = Parser.createParser (source, null);
>
>    String sourceCodepage;
>
>    String encoding = parser.getEncoding ();
>
>    // Second pass: re-read the file with the encoding reported by the parser.
>    try
>    {
>      sourceCodepage = _readFile (fileName, encoding);
>      result = sourceCodepage;
>    }
>    catch (UnsupportedEncodingException e)
>    {
>      System.err.println ("Unsupported HTML encoding \"" + encoding
>                          + "\", using default.");
>      result = source;
>    }
>
>    return result;
>  }
>
>
>  // Reads the whole file into a String; a null codepage means the default encoding.
>  String _readFile (String fileName, String codepage)
>    throws FileNotFoundException, UnsupportedEncodingException, IOException
>  {
>    File file = new File (fileName);
>    long length = file.length ();
>    char[] buffer = new char[(int) length];
>
>    FileInputStream fileInputStream = new FileInputStream (file);
>
>    InputStreamReader inputStreamReader;
>    if (codepage == null)
>    {
>      inputStreamReader = new InputStreamReader (fileInputStream);
>    }
>    else
>    {
>      inputStreamReader = new InputStreamReader (fileInputStream, codepage);
>    }
>
>    BufferedReader bufferedReader = new BufferedReader (inputStreamReader);
>
>    int noCharRead = bufferedReader.read (buffer, 0 /* offset */, (int) length);
>    return new String (buffer, 0, noCharRead);
>  }
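
For reference, the byte[]-based handling asked for above (check for a BOM, otherwise look at the META tag in the head) could be sketched with the plain JDK roughly as follows. This is only an illustration under stated assumptions, not something HTML Parser provides: the class name CharsetSniffer, its methods, and the 4096-byte sniff limit are all hypothetical, and the META lookup is a simple regular expression rather than a real parse. The head is decoded as ISO-8859-1 for sniffing because that charset maps every byte to a character. UTF-32 BOMs and UTF-16 files without a BOM are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HTML Parser): guesses the charset of raw HTML bytes. */
final class CharsetSniffer {
    // Finds charset=NAME inside a <meta ...> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the META tag, if present, sits within the first 4096 bytes.
    private static final int SNIFF_LIMIT = 4096;

    /** Returns the charset indicated by a BOM or a META tag, else the fallback. */
    static Charset detect(byte[] bytes, Charset fallback) {
        // Step 1: a byte order mark wins over anything declared in the markup.
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(bytes, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(bytes, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: decode the head byte-for-byte and look for a declared charset.
        int n = Math.min(bytes.length, SNIFF_LIMIT);
        String head = new String(bytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the bytes with the detected charset, dropping any BOM. */
    static String decode(byte[] bytes, Charset fallback) {
        Charset charset = detect(bytes, fallback);
        int skip = bomLength(bytes);
        return new String(bytes, skip, bytes.length - skip, charset);
    }

    private static int bomLength(byte[] bytes) {
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return 3;
        if (startsWith(bytes, 0xFE, 0xFF) || startsWith(bytes, 0xFF, 0xFE)) return 2;
        return 0;
    }

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

With something like this the file is read from disk only once, as bytes, and the resulting String from CharsetSniffer.decode (bytes, fallback) can then be handed to Parser.createParser as before.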