From: jm <jmu...@gm...> - 2008-05-17 13:10:35
I don't have a problem showing my custom code. I didn't post it before because I am not totally sure it works better than the current one in Aperture in all cases. I had some tests (I cannot look into them now) where the two extractors returned different amounts of text from the same HTML; mine returned more, so I kept it, but I don't remember all the details. I do remember that I had to do some extra work to handle encodings better. Anyway, as I mentioned, maybe the new code in htmlparser is better. Here is mine:

// Imports inferred from the calls below: log4j 1.x, Commons IO,
// Commons Lang 2.x and htmlparser. MigConstants is a project class not shown here.
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.Translate;

public class ExtractHtml {

    private static final Logger logger = Logger.getLogger(ExtractHtml.class);

    /** Regex that matches an encoding String in an xml head; group 1 is the quoted value. */
    private static final Pattern XML_ENCODING_REGEX =
            Pattern.compile("encoding\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Regex that matches an xml head. */
    private static final Pattern XML_HEAD_REGEX =
            Pattern.compile("<\\s*\\?.*\\?\\s*>", Pattern.CASE_INSENSITIVE);

    public static String extractPlainText(File htmlfile) throws ParserException, IOException {
        // close the file stream once the text has been extracted
        InputStream in = new FileInputStream(htmlfile);
        try {
            return extractPlainText(in, null);
        } finally {
            in.close();
        }
    }

    /**
     * Extracts the xml encoding setting from an xml file that is contained in a String by parsing the xml head.
     * <p>
     *
     * This is useful if you have a byte array that contains a xml String, but you do not know the xml encoding setting.
     * Since the encoding setting in the xml head is usually encoded with standard US-ASCII, you usually just create a
     * String of the byte array without encoding setting, and use this method to find the 'true' encoding. Then create a
     * String of the byte array again, this time using the found encoding.
     * <p>
     *
     * This method will return <code>null</code> in case no xml head or encoding information is contained in the
     * input.
     * <p>
     *
     * @param content
     *            the xml content to extract the encoding from
     * @return the extracted encoding, or null if no xml encoding setting was found in the input
     */
    public static String extractXmlEncoding(String content) {
        String result = null;

        Matcher xmlHeadMatcher = XML_HEAD_REGEX.matcher(content);
        // if (xmlHeadMatcher.find()) {
        // String xmlHead = xmlHeadMatcher.group();
        Matcher encodingMatcher = XML_ENCODING_REGEX.matcher(content);
        if (encodingMatcher.find()) {
            // take the value between the quotes; this also works when there is
            // whitespace around '=' or further attributes follow on the same line
            String charset = encodingMatcher.group(1);
            if (Charset.isSupported(charset)) {
                result = charset;
            }
        }
        // }
        return result;
    }

    /** the pattern string to extract the encoding from HTML file, if any */
    private static final String META_PATTERN =
            "<meta.*?content\\s*=\\s*[\"']\\s*text/html\\s*;\\s*charset\\s*=\\s*(\\S+?)[\"'].*>";

    /** compiled pattern to extract the encoding from HTML file, if any */
    private static final Pattern pattern = Pattern.compile(META_PATTERN);

    /** Return encoding of HTML file (lowercased), or "" if none is defined before </head> */
    private static String extractHtmlEncoding(byte[] content) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(new ByteArrayInputStream(content)));
        StringBuffer buffer = new StringBuffer();
        try {
            String line;
            // readLine() returns null at end of input
            while ((line = reader.readLine()) != null) {
                buffer.append(line.toLowerCase());
                Matcher matcher = pattern.matcher(buffer);
                if (matcher.find())
                    return matcher.group(1);
                if (buffer.indexOf("</head") >= 0)
                    break;
            }
        } finally {
            reader.close();
        }
        return "";
    }

    // we set the charset in this way
    // 1. if the html itself has a charset= inside, we take that
    // 2. if not, we take the one given to us, if it exists
    // 3. if not, we take the default
    public static String extractPlainText(InputStream ir, Charset tcs) throws IOException, ParserException {
        byte[] thebytes = IOUtils.toByteArray(ir);
        // find out its encoding (it is declared inside the document)
        String encoding = extractHtmlEncoding(thebytes);
        if (StringUtils.isBlank(encoding)) {
            if (tcs != null) {
                encoding = tcs.name();
            } else {
                // project-specific default charset name
                encoding = MigConstants.XML_ENCODING;
            }
        }

        String content = new String(thebytes, encoding);

        StringBean sb = new StringBean();
        Parser parser = new Parser();
        parser.setInputHTML(content); // String with html
        parser.visitAllNodesWith(sb);
        sb.setLinks(false);
        String res = sb.getStrings();
        String ares = "";
        // now decode html entities so &copy; = &#169; = (c)
        if (!StringUtils.isBlank(res)) {
            ares = Translate.decode(res);
        }
        return ares;
    }

    public static void main2(String[] args) {
        boolean links;
        String url;
        ExtractHtml se;

        links = false;
        url = null;
        for (int i = 0; i < args.length; i++)
            if (args[i].equalsIgnoreCase("-links")) {
                links = true;
            } else {
                url = args[i];
            }
        if (null != url) {
            // se = new ExtractHtml(url);
            // try {
            // //System.out.println(se.extractStrings(links));
            // }
            // catch (ParserException e) {
            // e.printStackTrace();
            // }
        } else {
            System.out.println("Usage: java -classpath htmlparser.jar org.htmlparser.parserapplications.HtmlTextExtractor [-links] url");
        }
    }
}

On Sat, May 17, 2008 at 1:40 PM, Antoni Myłka <ant...@gm...> wrote:
> jm wrote:
>> In case it helps, Aperture uses htmlparser. I am not using the
>> Aperture extractor, but a custom one, also using htmlparser. I think I
>> decided to use a custom one because extraction with the default one was
>> not good enough in encodings other than ISO-8859-1.
>>
>> Just some days ago I was checking out htmlparser again to see if there
>> was a new release and I browsed through the patches etc. I think I
>> saw a patch to better handle encodings; you might want to check it
>> out.
>>
>> javi
>>
>
> We all would benefit if you could give some advice how to improve the
> current html extractor. Are there any freely-available examples of html
> that caused problems with the current htmlextractor/htmlparser
> combination and that work fine with your custom one...
>
> Antoni Mylka
> ant...@gm...
>
> _______________________________________________
> Aperture-devel mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/aperture-devel
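
The charset selection order described in the comments of extractPlainText above (the charset declared in the HTML, then the one passed in by the caller, then a default) can be sketched with the JDK alone. This is only a minimal illustration under stated assumptions, not the code from this thread or from htmlparser: the class name CharsetFallbackSketch, its method names and the simplified meta regex are all made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Aperture or htmlparser. */
final class CharsetFallbackSketch {

    /** Simplified pattern: a charset=... value anywhere inside a meta tag. */
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared before </head>, or null if there is none. */
    static String detectMetaCharset(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot
        // fail and leaves the ASCII markup readable whatever the real encoding is.
        String text = new String(html, StandardCharsets.ISO_8859_1);
        int headEnd = text.toLowerCase(Locale.ROOT).indexOf("</head");
        if (headEnd >= 0) {
            text = text.substring(0, headEnd);
        }
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Step 1: declared charset; step 2: caller's charset; step 3: default. */
    static Charset chooseCharset(byte[] html, Charset callerCharset, Charset defaultCharset) {
        String declared = detectMetaCharset(html);
        if (declared != null) {
            try {
                if (Charset.isSupported(declared)) {
                    return Charset.forName(declared);
                }
            } catch (IllegalCharsetNameException e) {
                // malformed name in the page: fall through to the next step
            }
        }
        return callerCharset != null ? callerCharset : defaultCharset;
    }

    /** Decodes the bytes with the charset picked by chooseCharset. */
    static String decode(byte[] html, Charset callerCharset, Charset defaultCharset) {
        return new String(html, chooseCharset(html, callerCharset, defaultCharset));
    }
}
```

A call such as CharsetFallbackSketch.decode(bytes, null, StandardCharsets.UTF_8) would then yield the String to hand to the parser. Unlike the posted code, an unsupported or malformed declared charset falls back to the next step instead of raising an exception; that is a design choice of this sketch, not something the thread prescribes.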