#93 Malformed base tag causes crawling error

Milestone: v1.2
Status: open
Owner: None
Priority: 5
Updated: 2012-09-13
Created: 2011-05-20
Creator: Henry Sudhof
Private: No

If the href attribute of a base tag is missing or malformed, the crawler will fail.

The cause is this line of getBaseHref in com.jaeksoft.searchlib.parser.HtmlParser:

URL(DomUtils.getAttributeText(list.get(0), "href"));

Either list.get(0) and its href attribute should be checked for a missing value before the URL is constructed, or the resulting MalformedURLException should be caught.

Cheers,
~H

Discussion

  • Thank you for your help. A fix has been committed to the current trunk and is planned for release in 1.2.3.

    private static URL getBaseHref(Document doc) {
        String[] p = { "html", "head", "base" };
        List<Node> list = DomUtils.getNodes(doc, p);
        if (list == null)
            return null;
        if (list.size() == 0)
            return null;
        Node node = list.get(0);
        if (node == null)
            return null;
        String url = DomUtils.getAttributeText(node, "href");
        if (url == null)
            return null;
        try {
            return new URL(url);
        } catch (MalformedURLException e) {
            Logging.logger.warn(e);
            return null;
        }
    }
    
     
  • Fixed in v1.2.3-rc1
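
For anyone who wants to check the behaviour outside the crawler, here is a minimal standalone sketch that uses only the JDK. It is not the project's code: the class name BaseHrefSketch and the helper parseBaseHref are hypothetical, and the helper takes the href string directly instead of going through DomUtils. It only illustrates the same idea as the committed fix, namely treating a missing or malformed href as "no base URL" rather than letting the exception propagate.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical standalone sketch; not part of the project's code base.
final class BaseHrefSketch {

    // Returns the parsed base URL, or null when the href is missing or malformed.
    static URL parseBaseHref(String href) {
        if (href == null) {
            return null; // <base> tag without an href attribute
        }
        try {
            return new URL(href);
        } catch (MalformedURLException e) {
            // Malformed href: report "no base URL" instead of failing the crawl.
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed sample inputs: missing, empty, malformed, and well-formed hrefs.
        String[] samples = { null, "", "not a url", "http://example.com/docs/" };
        for (String sample : samples) {
            System.out.println("[" + sample + "] -> " + parseBaseHref(sample));
        }
    }
}
```

A caller that gets null back would presumably fall back to the URL of the page itself when resolving relative links; that fallback is an assumption here and is not shown in the ticket.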