
Need help generating AST, or query optimizer bug?

Help
2014-04-04
2014-04-07
  • Peter Amstutz
    2014-04-04

    Hello, based on the pointers from my previous post, I have been working on implementing permissions for my SPARQL endpoint by annotating the abstract syntax tree of the WHERE clause. In general this approach seems like it will work, but I'm stuck on the following problem with filters.

    SPARQL
    
    SELECT * 
    where { 
     { ?subject ?pred ?obj . }
    FILTER strStarts(str(?subject), "http://arvados.org/schema/modified") .
    } LIMIT 100
    

    This query is annotated with the following permission check. The intention is to test whether the user is an admin or whether the user 'can_read' the subject resource. Here is the equivalent SPARQL query, although the code generates the AST nodes directly. I'm not using FILTER EXISTS for the permission check, on the assumption that knowing the cardinality of things visible to the user is useful to the optimizer.

    ?authorization_principal arvados:api_token <api_token> .
    { ?authorization_principal arvados:user_is_admin true }
     union
    { ?authorization_principal can_read ?subject }
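
For readers less familiar with this pattern, the intent of the check can be sketched outside SPARQL as a plain boolean predicate. This is only a hypothetical in-memory model: the `User` class, `mayRead`, `demoUsers`, and the sample tokens and URIs are illustrative names, not part of Bigdata or Arvados. It also ignores cardinality, which the UNION form above preserves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of the permission rule; not Bigdata or Arvados code.
final class PermissionSketch {
    /** A principal: the API token it owns, an admin flag, and the subjects it can read. */
    static final class User {
        final String apiToken;
        final boolean isAdmin;
        final Set<String> canRead;

        User(String apiToken, boolean isAdmin, String... canRead) {
            this.apiToken = apiToken;
            this.isAdmin = isAdmin;
            this.canRead = new HashSet<>(Arrays.asList(canRead));
        }
    }

    /** True if a user owning the token is an admin OR can read the subject. */
    static boolean mayRead(List<User> users, String apiToken, String subject) {
        for (User u : users) {
            if (u.apiToken.equals(apiToken) && (u.isAdmin || u.canRead.contains(subject))) {
                return true;
            }
        }
        return false;
    }

    /** Made-up sample data: one admin and one ordinary user. */
    static List<User> demoUsers() {
        return Arrays.asList(
                new User("token:admin", true),
                new User("token:alice", false, "http://example.org/a"));
    }

    public static void main(String[] args) {
        List<User> users = demoUsers();
        System.out.println(mayRead(users, "token:alice", "http://example.org/a"));
        System.out.println(mayRead(users, "token:alice", "http://example.org/b"));
        System.out.println(mayRead(users, "token:admin", "http://example.org/b"));
    }
}
```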
    

    So, with this query what I would expect to see is up to 100 statements returned where the subject URI starts with "http://arvados.org/schema/modified", and indeed this is exactly what happens if the query is run on its own without the permissions annotations. However, when I add my permissions annotations, the filter is no longer applied to ?subject, and I end up getting rows with subjects that don't match the filter.

    What's particularly puzzling is that ?subject VarNode(s) doesn't appear in the joinVars or projectInVars, which makes me suspicious that the VarNode(s) in the permission check is being bound independently of the VarNode(s) in the outer StatementPatternNode, which would explain why I'm getting unfiltered values assigned to ?subject.

    Here is the output of query explain:

    Original AST
    
    QueryType: SELECT
    includeInferred=true
    SELECT * 
      JoinGroupNode {
        JoinGroupNode {
          StatementPatternNode(VarNode(s), VarNode(p), VarNode(o)) [scope=DEFAULT_CONTEXTS]
          UnionNode {
            JoinGroupNode {
              StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(32889U)), ConstantNode(XSDBoolean(true))) [scope=DEFAULT_CONTEXTS]
            }
            JoinGroupNode {
              StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(11421U)), VarNode(s)) [scope=DEFAULT_CONTEXTS]
            }
          }
        }
        FILTER( com.bigdata.rdf.sparql.ast.FunctionNode(FunctionNode(com.bigdata.rdf.internal.constraints.StrBOp(s)[ com.bigdata.rdf.internal.constraints.IVValueExpression.namespace=kb.lex, com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1]),ConstantNode(TermId(0L)[http://arvados.org/schema/modified]))[ com.bigdata.rdf.sparql.ast.FunctionNode.scalarVals=null, com.bigdata.rdf.sparql.ast.FunctionNode.functionURI=http://www.w3.org/2005/xpath-functions#starts-with, valueExpr=com.bigdata.rdf.internal.constraints.StrstartsBOp(com.bigdata.rdf.internal.constraints.StrBOp(s)[ com.bigdata.rdf.internal.constraints.IVValueExpression.namespace=kb.lex, com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1],TermId(0L)[http://arvados.org/schema/modified])] )
        StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(32018U)), ConstantNode(TermId(1510U))) [scope=DEFAULT_CONTEXTS]
      }
    slice(limit=100)
    
    Optimized AST
    
    QueryType: SELECT
    includeInferred=true
    SELECT VarNode(s) VarNode(p) VarNode(o)
      JoinGroupNode {
        StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(32018U)), ConstantNode(TermId(1510U))) [scope=DEFAULT_CONTEXTS]
          com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=1
          com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
        UnionNode [joinVars=[-authorization_principal]] [projectInVars=[-authorization_principal]] {
          JoinGroupNode [joinVars=[-authorization_principal]] [projectInVars=[-authorization_principal]] {
            StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(32889U)), ConstantNode(XSDBoolean(true))) [scope=DEFAULT_CONTEXTS]
              com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=4
              com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
          } JOIN ON (-authorization_principal)
          JoinGroupNode [joinVars=[-authorization_principal]] [projectInVars=[-authorization_principal]] {
            StatementPatternNode(VarNode(-authorization_principal)[anonymous], ConstantNode(TermId(11421U)), VarNode(s)) [scope=DEFAULT_CONTEXTS]
              com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=369
              com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
          } JOIN ON (-authorization_principal)
        } JOIN ON (-authorization_principal)
          FILTER( com.bigdata.rdf.sparql.ast.FunctionNode(FunctionNode(com.bigdata.rdf.internal.constraints.StrBOp(s)[ com.bigdata.rdf.internal.constraints.IVValueExpression.namespace=kb.lex, com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1]),ConstantNode(TermId(0L)[http://arvados.org/schema/modified]))[ com.bigdata.rdf.sparql.ast.FunctionNode.scalarVals=null, com.bigdata.rdf.sparql.ast.FunctionNode.functionURI=http://www.w3.org/2005/xpath-functions#starts-with, valueExpr=com.bigdata.rdf.internal.constraints.StrstartsBOp(com.bigdata.rdf.internal.constraints.StrBOp(s)[ com.bigdata.rdf.internal.constraints.IVValueExpression.namespace=kb.lex, com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1],TermId(0L)[http://arvados.org/schema/modified])] )
        StatementPatternNode(VarNode(s), VarNode(p), VarNode(o)) [scope=DEFAULT_CONTEXTS]
          com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=301975
          com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=SPOC
      }
    slice(limit=100)
    

    Here is the code that adds the annotation:

        void visitJoinGroupNode(JoinGroupNode jgn) throws InvalidClauseException {
            Vector<TermNode> subjects = new Vector<TermNode>();
            Vector<FilterNode> fn = new Vector<FilterNode>();
            for (Object ob : jgn) {
                if (ob instanceof StatementPatternNode) {
                    StatementPatternNode spn = (StatementPatternNode)ob;
                    subjects.add(spn.s());
                }
                else if (ob instanceof JoinGroupNode) {
                    visitJoinGroupNode((JoinGroupNode)ob);
                }
                else if (ob instanceof FilterNode) {
                    //visitFilterNode((FilterNode)ob);
                    //fn.add((FilterNode)ob);
                }
                else {
                    throw new InvalidClauseException(ob.getClass().getName());
                }
            }
    
            for (TermNode t : subjects) {
                UnionNode un = new UnionNode();
                un.addChild(new JoinGroupNode(new StatementPatternNode(user, is_admin, literal_true)));
    
                JoinGroupNode cn = new JoinGroupNode(new StatementPatternNode(user, can_read, (TermNode)t.clone()));
    
                un.addChild(cn);
    
                jgn.addChild(un);
            }
        }
    

    Am I doing something wrong in my AST generation, or is the optimizer messing up my query?

     
    • Bryan Thompson
      2014-04-04

      Two thoughts.

      For new AST variables, make sure that you use Var.var(name), which follows the singleton pattern. We use reference tests on AST variables.
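
To illustrate why reference tests require canonical instances, here is a minimal, self-contained sketch of an interning factory in the style of `Var.var(name)`. The `InternedVar` class and its `fresh` method are stand-ins invented for this example. They are not Bigdata's `Var` class and make no claim about how it is implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for illustration only; NOT Bigdata's Var class.
final class InternedVar {
    // One canonical instance per variable name.
    private static final Map<String, InternedVar> CACHE = new ConcurrentHashMap<>();

    private final String name;

    private InternedVar(String name) {
        this.name = name;
    }

    /** Returns the canonical instance for this name, creating it on first use. */
    static InternedVar var(String name) {
        return CACHE.computeIfAbsent(name, InternedVar::new);
    }

    /** Bypasses the cache, simulating a variable constructed directly with `new`. */
    static InternedVar fresh(String name) {
        return new InternedVar(name);
    }

    String name() {
        return name;
    }

    public static void main(String[] args) {
        // Two lookups of the same name yield the same object, so == succeeds.
        System.out.println(var("s") == var("s"));
        // A directly constructed variable has the same name but is a different object,
        // so code that compares variables by reference treats it as unrelated.
        System.out.println(var("s") == fresh("s"));
    }
}
```

If variables are compared by reference, a directly constructed variable named `s` would not be recognized as the same variable as the canonical `s`, even though the names match.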

      You might need to resolve the IVs for the URLs or non-inline Literals that you inject into the query. In this part of the AST I see a 0L, which is an unresolved term identifier. That winds up acting like a variable.

      com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1]),ConstantNode(TermId(0L)[http://arvados.org/schema/modified]))

      We batch resolve the RDF Value constants when the query is parsed. You probably want to batch resolve additional value objects at the same time. Or, better yet, use a custom Vocabulary so the URLs in the security schema are predefined and do not require an index lookup to translate from a URL to an IV. There are several examples of custom vocabularies in the code. Just be careful if you wind up needing to modify a vocabulary after it has been deployed. You need to create a new version of the vocabulary class to avoid breaking the one that is already deployed.

      Bryan

      On Apr 4, 2014, at 4:37 PM, "Peter Amstutz" <tetron@users.sf.net> wrote:

      com.bigdata.rdf.internal.constraints.IVValueExpression.timestamp=-1]),ConstantNode(TermId(0L)[http://arvados.org/schema/modified]))

       
      • Peter Amstutz
        2014-04-04

        Thank you for responding so quickly. However, I don't think those are the root causes of my problems, although they may help me clean up my code a bit. The TermId(0) is a filter node parameter which is generated by the Bigdata parser, and the subject TermId (which could be a variable or literal) is extracted from the statement pattern node, which also comes from the Bigdata parser. I am resolving my constants to IVs, and graph queries that don't use filters seem to work as expected. More to the point, I am able to reproduce the behavior with straight SPARQL (next post), which suggests the problem is not in my AST generation.

         
  • Peter Amstutz
    2014-04-04

    I've played around with queries written out in SPARQL, instead of being generated with AST annotations, to try to probe the behavior a bit more. Here is the SPARQL equivalent of the logic in the first post; it exhibits the same behavior (?s is not filtered at all).

    prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT * 
    where { 
      ?s ?p ?o . 
    FILTER strStarts(str(?s), "http://arvados.org/schema/modified") .
    { ?user <http://arvados.org/schema/user_is_admin> "true"^^xsd:boolean }
    union { ?user <http://arvados.org/schema/permission/can_read> ?s }
    ?user <http://arvados.org/schema/api_token> <token:ckeddwmn3586gsdadbzxosz331crwpfxax58r0k8iq299nmwf> 
    } LIMIT 100
    

    Next, I tried moving the "api_token" statement pattern to the top. Running this query, I get no results:

    prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT * 
    where { 
    ?user <http://arvados.org/schema/api_token> <token:ckeddwmn3586gsdadbzxosz331crwpfxax58r0k8iq299nmwf> .
      ?s ?p ?o . 
    FILTER strStarts(str(?s), "http://arvados.org/schema/modified") .
    { ?user <http://arvados.org/schema/user_is_admin> "true"^^xsd:boolean }
    union { ?user <http://arvados.org/schema/permission/can_read> ?s }
    } LIMIT 100
    

    If I remove the union with "can_read" and only test for "user_is_admin", the query finally yields the expected results:

    prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT * 
    where { 
    ?user <http://arvados.org/schema/api_token> <token:ckeddwmn3586gsdadbzxosz331crwpfxax58r0k8iq299nmwf> .
      ?s ?p ?o . 
    FILTER strStarts(str(?s), "http://arvados.org/schema/modified") .
    { ?user <http://arvados.org/schema/user_is_admin> "true"^^xsd:boolean }
    
    } LIMIT 100
    

    Since these queries are logically equivalent as far as my understanding of SPARQL goes, but I get three different results, either I'm deeply confused about SPARQL or I am tripping over some kind of bug in Bigdata's handling of filters and unions.
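
As a sanity check on the expectation that moving the "api_token" pattern should not change the result, here is a minimal sketch of the group semantics: the patterns in a group are joined, the UNION contributes the solutions of both branches, and the FILTER applies to the whole group wherever it is written. It is a toy evaluator over made-up triples (the `user:` and `token:` values and the `label` predicate are hypothetical) and says nothing about how Bigdata actually evaluates the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy evaluator for illustration; the data below is hypothetical.
final class GroupSemanticsSketch {
    static final String NS = "http://arvados.org/schema/";
    static final String PREFIX = NS + "modified";
    static final String[][] TRIPLES = {
        {"user:alice", NS + "api_token", "token:alice"},
        {"user:alice", NS + "permission/can_read", PREFIX + "/1"},
        {"user:alice", NS + "permission/can_read", NS + "other/2"},
        {"user:root", NS + "api_token", "token:root"},
        {"user:root", NS + "user_is_admin", "true"},
        {PREFIX + "/1", NS + "label", "visible"},
        {NS + "other/2", NS + "label", "filtered out"},
    };

    /** Binds a term to a value; terms starting with '?' are variables. */
    static boolean bind(Map<String, String> b, String term, String value) {
        if (!term.startsWith("?")) return term.equals(value);
        String old = b.putIfAbsent(term, value);
        return old == null || old.equals(value);
    }

    /** Solutions of a single triple pattern. */
    static List<Map<String, String>> match(String s, String p, String o) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String[] t : TRIPLES) {
            Map<String, String> b = new TreeMap<>();
            if (bind(b, s, t[0]) && bind(b, p, t[1]) && bind(b, o, t[2])) out.add(b);
        }
        return out;
    }

    /** Join: merge every pair of solutions that agree on their shared variables. */
    static List<Map<String, String>> join(List<Map<String, String>> l, List<Map<String, String>> r) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> a : l) {
            for (Map<String, String> b : r) {
                Map<String, String> m = new TreeMap<>(a);
                boolean ok = true;
                for (Map.Entry<String, String> e : b.entrySet()) ok &= bind(m, e.getKey(), e.getValue());
                if (ok) out.add(m);
            }
        }
        return out;
    }

    /** Evaluates the group with the api_token pattern joined first or last. */
    static List<String> evaluate(String token, boolean tokenFirst) {
        List<Map<String, String>> spo = match("?s", "?p", "?o");
        // UNION: the solutions of both branches, concatenated.
        List<Map<String, String>> union = new ArrayList<>(match("?user", NS + "user_is_admin", "true"));
        union.addAll(match("?user", NS + "permission/can_read", "?s"));
        List<Map<String, String>> tok = match("?user", NS + "api_token", token);
        List<Map<String, String>> joined =
                tokenFirst ? join(join(tok, spo), union) : join(join(spo, union), tok);
        // FILTER: applied to the solutions of the whole group.
        return joined.stream()
                .filter(m -> m.get("?s").startsWith(PREFIX))
                .map(Object::toString)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(evaluate("token:alice", false));
        System.out.println(evaluate("token:alice", true));
        System.out.println(evaluate("token:alice", true).equals(evaluate("token:alice", false)));
    }
}
```

In this toy model both join orders return the same solutions, and only subjects starting with the filter prefix survive, which is the behavior the first two queries above were expected to show.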

     
  • Anton Kulaga
    2014-04-06

    I also have to deal with the Bigdata AST. It is very uncomfortable to work with and sometimes behaves in mysterious ways.
    One of the hardest things is that a lot of steps are needed just to create anything: register everything in the Lexer, wrap constants in ConstantNode, provide expressions for each SPARQL function, and so on and so forth. It would be really great if somebody made a factory that handled the creation of constant nodes, filters, and SPARQL functions, freeing us from all that headache.
    Another problem is that the API itself is highly mutable and has a lot of side effects, so when I call a method that returns something, I do not know what state it changes.

     
  • Mike Personick
    2014-04-07

    Peter,

    I tested an alternative query for ticket 874. Please try it out on your data.

    prefix xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT * 
    where { 
    ?user <http://arvados.org/schema/api_token> <token:ckedd> .
    {
      ?user <http://arvados.org/schema/user_is_admin> true .
      ?s ?p ?o . 
      FILTER strStarts(str(?s), "http://arvados.org/schema/modified") .
    } 
    union 
    { 
      ?user <http://arvados.org/schema/user_is_admin> false .
      ?user <http://arvados.org/schema/permission/can_read> ?s .
      ?s ?p ?o . 
      FILTER strStarts(str(?s), "http://arvados.org/schema/modified") .
    }
    }
    

    Thanks,
    Mike