XQJ: HTML page screen scrape

Help
Chris
2011-02-16
2012-10-08
  • Chris
    2011-02-16

    I have been trying to use the XQJ API to screen scrape an HTML page that has
    been loaded into a DOM via HTML Tidy. The query I am trying to run is:

    <run>
    {
    for $x in //div
    return <race>{$x}</race>
    }
    </run>

    When I run the query, the result is just an empty <run></run>. However, if I
    change for $x in .//div to for $x in // the query returns the HTML between
    the tag, which is what I want. I need to filter the results down further
    but cannot, as even // /div/h3 returns empty <run> tags.

    I have checked my original query using the xquisitor GUI tool and it works
    there, just not when I try to implement it using XQJ.

    The XQJ code I am using is:
    // fetches the page, runs HTML Tidy, and returns a W3C DOM document
    Document doc = fetchPage();
    SaxonXQDataSource ds = new SaxonXQDataSource(config);
    XQConnection con = ds.getConnection();

    XQItem item = con.createItemFromNode(doc.getChildNodes().item(1),
    con.createNodeType());
    XQPreparedExpression xpres = con.prepareExpression(queryabove);
    xpres.bindItem(XQConstants.CONTEXT_ITEM, item);

    XQResultSequence seq = xpres.executeQuery();

    I'm new to XQuery and XQJ, so I'm not sure whether the problem is with my
    XQJ code or with the XQuery I'm trying to run.

     
  • Michael Kay
    2011-02-16

    If you look more closely at your source XML you will almost certainly find
    that the elements are in a namespace, probably
    http://www.w3.org/1999/xhtml. So if you want
    to select elements from this namespace, you will need to start your query with

    declare default element namespace "http://www.w3.org/1999/xhtml";
    

    Unfortunately this will have the side-effect of putting your output elements
    (run and race) in this namespace as well, which is probably not what you want.
    The workaround is to bind a specific prefix

    declare namespace h = "http://www.w3.org/1999/xhtml";
    

    then write

    for $x in //h:div[...] return ...
    

    This is a weakness in the design of the XQuery language.
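
As a rough illustration of the namespace mismatch described above, the following sketch uses the JDK's built-in XPath API rather than XQJ or Saxon (an assumption, made so that it runs without extra libraries). The sample page, the class name XhtmlNamespaceDemo and the helper methods parse and count are hypothetical. It shows that an unprefixed //div finds nothing in an XHTML document, while //h:div with the prefix h bound to the XHTML namespace does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XhtmlNamespaceDemo {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for a tidied page; the real page is not shown in the thread.
    public static final String PAGE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div><h3>Race 1</h3></div>"
        + "<div><h3>Race 2</h3></div>"
        + "</body></html>";

    // Parse the string into a namespace-aware W3C DOM document.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
    }

    // Count the nodes selected by an XPath expression, with prefix "h" bound to XHTML.
    public static int count(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(PAGE);
        // The root element reports the XHTML namespace, not "no namespace".
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println(count(doc, "//div"));        // prints 0
        System.out.println(count(doc, "//h:div"));      // prints 2
        System.out.println(count(doc, "//h:div/h:h3")); // prints 2
    }
}
```

Printing getNamespaceURI() on the document element of the real tidied page is a quick way to check whether this is the cause. In the XQuery itself, the equivalent of the prefix binding here is the declare namespace h = "..." line in the prolog.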

    Please note that this forum isn't really intended for general XQuery coding
    help that's independent of the Saxon product. You should try the
    talk@x-query.com mailing list, or stackoverflow.com.