Re: [Aperture-devel] Charsets and Extraction


I don't have a problem showing my custom code. If I didn't post it
before, it's because I am not totally sure it works better than the
current one in aperture in all cases. I had some tests (I cannot look
into them now) where the two returned different amounts of text from the
same html; mine returned more, so I kept it, but I don't remember all
the details. I do remember I had to do some extra work to handle
encodings better.

Anyway, as I mentioned, maybe the new code in htmlparser is better.

Here is mine:

public class ExtractHtml {
    private static final Logger logger = Logger.getLogger(ExtractHtml.class);

    /** Regex that matches an encoding String in an xml head. */
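    // [^"']+ stops at the first closing quote, so later attributes on the
    // same line are not swallowed into the match
    private static final Pattern XML_ENCODING_REGEX = Pattern.compile(
            "encoding\\s*=\\s*[\"'][^\"']+[\"']", Pattern.CASE_INSENSITIVE);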

    /** Regex that matches an xml head. */
    private static final Pattern XML_HEAD_REGEX =
            Pattern.compile("<\\s*\\?.*\\?\\s*>", Pattern.CASE_INSENSITIVE);

    public static String extractPlainText(File htmlfile)
            throws ParserException, IOException {
        // close the file stream even if reading or parsing fails
        try (InputStream in = new FileInputStream(htmlfile)) {
            return extractPlainText(in, null);
        }
    }

    /**
     * Extracts the xml encoding setting from an xml file that is contained
     * in a String by parsing the xml head.
     * <p>
     *
     * This is useful if you have a byte array that contains an xml String,
     * but you do not know the xml encoding setting. Since the encoding
     * setting in the xml head is usually encoded with standard US-ASCII, you
     * usually just create a String of the byte array without encoding
     * setting, and use this method to find the 'true' encoding. Then create
     * a String of the byte array again, this time using the found encoding.
     * <p>
     *
     * This method will return <code>null</code> in case no xml head or
     * encoding information is contained in the input.
     * <p>
     *
     * @param content
     *            the xml content to extract the encoding from
     * @return the extracted encoding, or null if no xml encoding setting
     *         was found in the input
     */
    public static String extractXmlEncoding(String content) {
        String result = null;
        Matcher xmlHeadMatcher = XML_HEAD_REGEX.matcher(content);
        // if (xmlHeadMatcher.find()) {
        // String xmlHead = xmlHeadMatcher.group();
        Matcher encodingMatcher = XML_ENCODING_REGEX.matcher(content);
        if (encodingMatcher.find()) {
            String encoding = encodingMatcher.group();
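            // take everything after '=', trim the whitespace the regex
            // allows there, then strip the surrounding quotes
            String quoted = encoding.substring(encoding.indexOf('=') + 1).trim();
            String charset = quoted.substring(1, quoted.length() - 1);
            try {
                if (Charset.isSupported(charset)) {
                    result = charset;
                }
            } catch (IllegalArgumentException e) {
                // not a legal charset name; leave result as null
            }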
        }
        // }
        return result;
    }

    /** the pattern string to extract the encoding from HTML file, if any */
    private static final String META_PATTERN =
            "<meta.*?content\\s*=\\s*[\"']\\s*text/html\\s*;\\s*charset\\s*=\\s*(\\S+?)[\"'].*>";
    /** compiled pattern to extract the encoding from HTML file, if any */
    private static final Pattern pattern = Pattern.compile(META_PATTERN);

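    /** Return encoding of HTML file if defined, otherwise an empty String */
    private static String extractHtmlEncoding(byte[] content) throws IOException {
        // ISO-8859-1 maps every byte to one char, so the meta tag stays
        // readable for any ASCII-compatible encoding and the result does not
        // depend on the platform default charset
        BufferedReader reader = new BufferedReader(new java.io.InputStreamReader(
                new ByteArrayInputStream(content), "ISO-8859-1"));
        StringBuffer buffer = new StringBuffer();
        try {
            String line;
            // readLine() returns null at end of input
            while ((line = reader.readLine()) != null) {
                buffer.append(line.toLowerCase(java.util.Locale.ENGLISH));
                Matcher matcher = pattern.matcher(buffer);
                if (matcher.find())
                    return matcher.group(1);
                if (buffer.indexOf("</head") >= 0)
                    break;
            }
        } finally {
            // also runs on the early return above
            reader.close();
        }
        return "";
    }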

    // we set the charset in this way
    // 1. if the html itself has a charset= inside, we take that
    // 2. if not we take the one given in to us, if it exists
    // 3. if not, we take the default
    public static String extractPlainText(InputStream ir, Charset tcs)
            throws IOException, ParserException {
        byte[] thebytes = IOUtils.toByteArray(ir);
        // charset to use when the html declares none, or one we cannot decode
        String fallback = (tcs != null) ? tcs.name() : MigConstants.XML_ENCODING;
        // find out its encoding (it is declared inside the html)
        String encoding = extractHtmlEncoding(thebytes);
        if (StringUtils.isBlank(encoding)) {
            encoding = fallback;
        }
        String content;
        try {
            content = new String(thebytes, encoding);
        } catch (java.io.UnsupportedEncodingException e) {
            // the declared charset is not known to this JVM
            content = new String(thebytes, fallback);
        }
        StringBean sb = new StringBean();
        sb.setLinks(false); // configure the bean before it visits the nodes
        Parser parser = new Parser();
        parser.setInputHTML(content); // String with html
        parser.visitAllNodesWith(sb);
        String res = sb.getStrings();
        String ares = "";
        // now decode html entities, e.g. &copy; and &#169; both become ©
        if (!StringUtils.isBlank(res)) {
            ares = Translate.decode(res);
        }
        return ares;
    }

    public static void main2(String[] args) {
        boolean links;
        String url;
        ExtractHtml se;

        links = false;
        url = null;

        for (int i = 0; i < args.length; i++)
            if (args[i].equalsIgnoreCase("-links")) {
                links = true;
            } else {
                url = args[i];
            }

        if (null != url) {
            // se = new ExtractHtml(url);

            // try {
            // //System.out.println(se.extractStrings(links));
            // }
            // catch (ParserException e) {
            // e.printStackTrace();
            // }
        } else {
            System.out.println("Usage: java -classpath htmlparser.jar "
                    + "org.htmlparser.parserapplications.HtmlTextExtractor [-links] url");
        }
    }

}
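
For illustration only, the charset choice described in the comments above (declared in the html, else the one passed in, else a default) can be reduced to a JDK-only sketch. It is not the code I run: the class name CharsetSniffDemo, the simplified META_CHARSET regex and the ISO-8859-1 default are assumptions made up for this example, and it leaves out htmlparser entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffDemo {
    // Simplified, hypothetical stand-in for META_PATTERN; captures the charset value.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Pick the charset: declared in the html, else the caller's, else the default. */
    public static Charset chooseCharset(byte[] html, Charset given, Charset defaultCharset) {
        // Pass 1: ISO-8859-1 maps each byte to one char, so decoding never fails
        // and the ASCII meta tag stays readable in any ASCII-compatible encoding.
        String probe = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // unknown or illegal charset name: fall through to the fallbacks
            }
        }
        return given != null ? given : defaultCharset;
    }

    /** Pass 2: decode the same bytes again with the charset chosen above. */
    public static String decode(byte[] html, Charset given) {
        return new String(html, chooseCharset(html, given, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head><body>caf\u00e9</body></html>";
        byte[] bytes = page.getBytes(StandardCharsets.UTF_8);
        // The accented character survives because pass 2 decodes as UTF-8.
        System.out.println(decode(bytes, null).contains("caf\u00e9")); // prints true
    }
}
```

The point of the two passes is that the bytes are never decoded with a guessed charset for the real text, only for finding the declaration. A page in a non-ASCII-compatible encoding such as UTF-16 would not be detected this way.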

On Sat, May 17, 2008 at 1:40 PM, Antoni Myłka <ant...@gm...> wrote:
> jm wrote:
>> In case it helps, aperture uses htmlparser. I am not using the
>> aperture extractor, but a custom one, also using htmlparser. I think I
>> decided to use a custom one because extraction with the default one was
>> not good enough in encodings other than ISO-8859-1.
>>
>> Just some days ago I was checking out htmlparser again to see if there
>> was a new release, and I browsed through the patches etc. I think I
>> saw a patch to better handle encodings; you might want to check it
>> out.
>>
>> javi
>>
>
> We all would benefit if you could give some advice on how to improve the
> current html extractor. Are there any freely-available examples of html
> that caused problems with the current htmlextractor/htmlparser
> combination and that work fine with your custom one?
>
> Antoni Mylka
> ant...@gm...