
Regex case-insensitivity and Unicode chars

2013-03-19
2014-02-19
  • Osma Suominen

    Osma Suominen - 2013-03-19

    Hi!

    I'm trying to do case-independent matching of literals with Bigdata SPARQL queries (RELEASE_1_2_0 from a few days ago). I noticed that the case-insensitive flag "i" works for English/ASCII letters, but not for Unicode accented characters, e.g. the Scandinavian characters åäö.

    Test data:

    PREFIX ns: <http://example.org/ns#>
    INSERT DATA 
    { GRAPH ns:graph { ns:auml ns:label "Ä", "ä" } }
    

    Test query:

    PREFIX ns: <http://example.org/ns#>
    SELECT * { 
      GRAPH ns:graph {
        ?s ?p ?o
        FILTER(regex(?o, "ä", "i"))
      }
    }
    

    Bigdata returns only the triple with lowercase 'ä'. Jena ARQ+TDB returns both triples, which I think is the correct behavior. SPARQL inherits its regex syntax from XQuery, which, as I read it, says that case-insensitive matching is performed according to Unicode rules (not, e.g., depending on locale).

    Is this a bug or am I doing something wrong?

    Best regards,
    Osma Suominen

     
  • Bryan Thompson

    Bryan Thompson - 2013-03-19

    It is possible that there is a problem here, but it might also have to do with the way you have set up Unicode support.

    Bigdata uses icu4j to handle Unicode processing.  The dictionary indices (especially the TERM2ID index) are configured using Unicode collation rules that produce a sort key.  See the

    com.bigdata.btree.keys.KeyBuilder.Options
    

    interface for the configuration options that you can use to control how those sort keys are generated.  You need to do this when the KB is created. 

    Make sure that your configuration does NOT include the following.

    com.bigdata.btree.keys.KeyBuilder.collator=ASCII
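
    For reference, a minimal sketch of the opposite, Unicode-aware configuration (the property name is the one above; the ICU value is an assumption based on the KeyBuilder.Options documentation and should be verified there):

```properties
# Assumed: select the ICU collator so sort keys use Unicode collation rules.
# Must be set when the KB is created; it has no effect on an existing KB.
com.bigdata.btree.keys.KeyBuilder.collator=ICU
```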
    

    However, those things mainly affect the B+Tree keys, while regex is a runtime function over strings. The relevant bit of code for REGEX appears to be this snippet from RegexBOp. I notice that there is a 'u' flag that can be used to force Unicode case folding. Can you test that flag and see if it provides the desired behavior? If so, then maybe the "fix" is to always enable that flag rather than conditionally enabling it as in this code.

        private static Pattern getPattern(final Value parg, final Value farg) 
                throws IllegalArgumentException {
            
            if (log.isDebugEnabled()) {
                log.debug("regex pattern: " + parg);
                log.debug("regex flags: " + farg);
            }
            
            if (QueryEvaluationUtil.isSimpleLiteral(parg)
                    && (farg == null || QueryEvaluationUtil.isSimpleLiteral(farg))) {
                final String ptn = ((Literal) parg).getLabel();
                String flags = "";
                if (farg != null) {
                    flags = ((Literal)farg).getLabel();
                }
                int f = 0;
                for (char c : flags.toCharArray()) {
                    switch (c) {
                        case 's':
                            f |= Pattern.DOTALL;
                            break;
                        case 'm':
                            f |= Pattern.MULTILINE;
                            break;
                        case 'i':
                            f |= Pattern.CASE_INSENSITIVE;
                            break;
                        case 'x':
                            f |= Pattern.COMMENTS;
                            break;
                        case 'd':
                            f |= Pattern.UNIX_LINES;
                            break;
                        case 'u':
                            f |= Pattern.UNICODE_CASE;
                            break;
                        default:
                            throw new IllegalArgumentException();
                    }
                }
                final Pattern pattern = Pattern.compile(ptn, f);
                return pattern;
            }
    
            throw new IllegalArgumentException();
    
        }
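
    For what it's worth, the difference between the two flags can be seen with plain java.util.regex, independent of Bigdata; a minimal sketch using the 'ä'/'Ä' pair from the test data above:

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds case within the US-ASCII range.
        boolean asciiOnly = Pattern.compile("ä", Pattern.CASE_INSENSITIVE)
                .matcher("Ä").matches();
        // Adding UNICODE_CASE enables Unicode-aware case folding.
        boolean unicodeAware = Pattern
                .compile("ä", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher("Ä").matches();
        System.out.println(asciiOnly + " " + unicodeAware); // false true
    }
}
```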
    

    Thanks,
    Bryan

     
  • Bryan Thompson

    Bryan Thompson - 2013-03-19

    It would be great to get some test cases that we could use to exercise this properly. I am unsure what will happen if I try to cut and paste from this forum post into local files, due to internationalization difficulties. To avoid such problems, could you create a ticket and attach some data files with the source triples, the query, and the expected results? Then we could incorporate them into the CI test suite.

    You can see examples of how we do this in the following package

    bigdata-rdf/src/test/com/bigdata/rdf/sparql/ast/eval
    

    That package includes both TestXXX classes and the source data, query, and expected-results files. This is very similar to the DAWG manifest-driven test suites, except that we do not use the manifest. Some of the test classes explicitly override the KB initialization properties. That might be something we need to do for internationalization tests, e.g. to make sure that an appropriate collation is used when indexing the RDF Values in the TERM2ID index.

    Thanks,
    Bryan

     
  • Osma Suominen

    Osma Suominen - 2013-03-20

    Thanks again for your very quick response and your suggestions.

    I tried the 'u' regex flag as suggested. It didn't work right away, but I then noticed that there are actually two separate issues here.

    I'm using the NanoSparqlServer web interface with a Firefox browser. Bigdata is running as a webapp inside Tomcat.
    My RWstore.properties configuration is the stock configuration, except I've enabled quads mode and the text index. There is no collator=ASCII option or similar.

    1. The web interface doesn't seem to be "UTF-8 clean". For the NanoSparqlServer form at http://localhost:8080/bigdata/ the browser reports a content type of "text/html;charset=utf-8", set using <meta http-equiv="content-type" …/>, and accordingly renders the form as UTF-8. But when I type Unicode characters into the SPARQL Update box, they are apparently not processed correctly and seem to end up as raw UTF-8 byte sequences in the database instead of proper Unicode characters (whatever the internal representation is supposed to be). At least subsequent SPARQL queries seem to return doubly UTF-8-encoded characters. This may be a Tomcat configuration issue; I have to check that. In this situation the 'u' flag doesn't help, because the literals in the database are already wrongly encoded at insert time.
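
    If Tomcat turns out to be the culprit, the usual suspects are the connector's URIEncoding (which governs GET query strings) and the request character encoding for POSTed forms. A hedged sketch of the server.xml change (attribute values other than URIEncoding are illustrative):

```xml
<!-- conf/server.xml: decode URI query strings as UTF-8 (the default is
     ISO-8859-1 on older Tomcat versions). POST bodies additionally need
     request.setCharacterEncoding("UTF-8"), e.g. via a filter such as
     org.apache.catalina.filters.SetCharacterEncodingFilter. -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />
```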

    2. If I work around this problem by replacing the raw Unicode characters with SPARQL escape sequences like this:

    Insert statement:

    PREFIX ns: <http://example.org/ns#>
    INSERT DATA { GRAPH ns:graph { ns:auml ns:label "\u00C4", "\u00E4" } }
    

    Test query:

    PREFIX ns: <http://example.org/ns#>
    SELECT * { GRAPH ns:graph { ?s ?p ?o FILTER(regex(?o, "\u00E4", "i")) } }
    

    Then I still get only one result for the query: the triple with 'ä', which is \u00E4. But if I now add the 'u' flag to the regex, I get both triples as results, so this seems to be a viable workaround. Always setting the UNICODE_CASE flag sounds like a good idea, and in fact Jena ARQ seems to do that when given the 'i' flag:

        public static int makeMask(String modifiers)
        {
            if ( modifiers == null )
                return 0 ;
            int newMask = 0 ;
            for ( int i = 0 ; i < modifiers.length() ; i++ )
            {
                switch(modifiers.charAt(i))
                {
                    //case 'i' : newMask |= Pattern.CASE_INSENSITIVE;     break ;
                    case 'i' : 
                        // Need both (Java 1.4)
                        newMask |= Pattern.UNICODE_CASE ;   
                        newMask |= Pattern.CASE_INSENSITIVE;
                        break ;
                    case 'm' : newMask |= Pattern.MULTILINE ;           break ;
                    case 's' : newMask |= Pattern.DOTALL ;              break ;
                    //case 'x' : newMask |= Pattern.;  break ;
                 
                    default  : 
                        throw new QueryParseException("Illegal flag in regex modifiers: "+modifiers.charAt(i), -1, -1) ;
                }
            }
            return newMask ;
        }
    

    (code from RegexJava class in jena-arq)

    I can make a ticket with a test case if it's necessary, though the above queries which use SPARQL escape sequences should be safe to copy and paste as they are plain ASCII.

    There are rather many Unicode characters to test though… Maybe it's enough to try with a few characters and trust that icu4j will do the right thing as long as it is given proper options.
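
    To that end, a small standalone check over a few Scandinavian pairs (a sketch using plain java.util.regex, so it exercises the JDK's case folding rather than Bigdata itself):

```java
import java.util.regex.Pattern;

public class FoldCheck {
    public static void main(String[] args) {
        final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        String[][] pairs = { { "å", "Å" }, { "ä", "Ä" }, { "ö", "Ö" } };
        for (String[] p : pairs) {
            // With UNICODE_CASE set, each uppercase form matches its
            // lowercase pattern; all three pairs print true.
            boolean ok = Pattern.compile(p[0], flags).matcher(p[1]).matches();
            System.out.println(p[0] + " ~ " + p[1] + " : " + ok);
        }
    }
}
```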

    Osma

     
  • Bryan Thompson

    Bryan Thompson - 2013-03-20

    I have filed a ticket for the Unicode REGEX issue and committed a fix (r7018). Please verify that the fix works for you.

    I am also putting together a page on Unicode and Internationalization support  that is linked from the main wiki page .

    Thanks,
    Bryan

    http://sourceforge.net/apps/trac/bigdata/ticket/655 (SPARQL REGEX operator does not perform case-folding correctly for Unicode data)

     
  • Osma Suominen

    Osma Suominen - 2013-03-20

    That was fast! Thanks a lot!

    I verified that the fix works for me, both with the test case above and with my actual application where I originally noticed the problem.

    You seem to have forgotten URLs for  and  from your previous post?

    Thanks,
    Osma

     

