Menu

#187 Support <?xml-model?> link to RelaxNG schema in XML Plugin

open
nobody
5
2019-12-09
2019-12-03
Conal Tuohy
No

A patch to add support for <?xml-model> processing instructions as a way to associate a RelaxNG schema

1 Attachments

Discussion

  • Conal Tuohy

    Conal Tuohy - 2019-12-03

    Adding the XML plugin artifact as an attachment

     
  • Eric Le Lay

    Eric Le Lay - 2019-12-08

    Thanks for the patch.

    Regarding documentation (good that you updated it):
    1. tei hrefs (http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_tite.rng) end in 404 errors. Is it expected?
    2. a pointer to the validation section would be great, for another "external" way to specify RNG schemas.
    3. the new version in CHANGES.txt is 3.0.7

    Regarding the code
    1. processingInstruction()
    1.1 more cases should be supported based on type and schematypens: rnc, xsd, RelaxNG, as in https://www.w3.org/TR/xml-model/#d0e689
    1.2 schematron is not actually supported, is it?
    1.3 you should get the doc's systemId using xml.PathUtilities.pathToURL(systemId) instead of using it directly
    2. getPseudoAttributes
    2.1 the value part of the regex is incorrect: it should match until the quote, not until '=' and support both single and double quotes, as in the spec (dt-parsing)
    2.2 what about & and <? Can you verify if it's unescaped by the parser? If not we have to document the limitation or do it ourself...

     
  • Conal Tuohy

    Conal Tuohy - 2019-12-09

    Documentation:
    1) the 404 on the TEI schemas is expected (the TEI-C website has a configuration problem). The error has been reported, and a redirect is going to be put in place. These HTTP URLs are the officially published ones at the moment though.
    2) I don't quite understand what you're suggesting. But I did read the "validation" section and noticed the outdated text there (which said that a RelaxNG schema could not be specified in a document), which I've now corrected.
    3) fixed
    I'll update the patch shortly

     
  • Conal Tuohy

    Conal Tuohy - 2019-12-09

    Regarding the code:

    1 processingInstruction method
    1.1 Yes I have a plan to also support schematron, but I prioritised RelaxNG because the RelaxNG schema language and the <?xml-model?> PI are the most commonly used in the Text Encoding community, and they were not supported.
    1.2 I am working on Schematron next, but of course the validation itself is very different.
    1.3 Done, thank you.
    2 getPseudoAttributes
    2.1 I have reviewed the pseudo-attribute parsing with more care, and dealt with both the double- and single-quote cases.
    2.2 I believe I have now correctly parsed the PI including, including dealing with encoded chars.

    Thanks for your review!

    See new attached diff

     
  • Eric Le Lay

    Eric Le Lay - 2019-12-09

    1.1 I would also prioritize RelaxNG. In fact, I don't ask you to implement Schematron at all if you don't need it ;-) What I don't understand is that you match on "http://purl.oclc.org/dsdl/schematron" not "http://relaxng.org/ns/structure/1.0".

    1. Yes, PI parsing looks perfect! Only a style issue: I prefer final static fields upper-cased.
      Also, I would store the compiled Pattern in the static field (private static final Pattern PSEUDO_ATTRIBUTE = Pattern.compile("...")) when applicable. Same for the pattern in getPseudoAttributeValue().

    Thanks,

     
    • Alan Ezust

      Alan Ezust - 2019-12-09

      I just want to say, reading this thread and seeing you help each other,
      that makes me very happy!
      Especially for working on the XML plugin, which is my favorite plugin.

       
      😄
      1
  • Conal Tuohy

    Conal Tuohy - 2019-12-09

    Thanks very much for spotting the namespace confusion, Eric!

    I was wondering how I could have made that mistake and still had the schema actually work!

    It turns out that the sample TEI XML files I've been working with all have ?xml-model? PIs for both Schematron and RelaxNG,, and the href pseudo-attribute value is the same for both the schema types (the RelaxNG schema file actually has Schematron rules embedded in it). I had made a stupid copy-and-paste error, but everything had worked for me, because my "schematron" schema was actually also a RelaxNG schema. I've fixed that now, and I've made the stylistic change to the static field names, and I've made the Pattern objects final and static too.

    Once more, with the diff:

     
  • Eric Le Lay

    Eric Le Lay - 2019-12-09

    Cool! Now I've only one request: put curly braces around cases in getPseudoAttributeValue because it's such a common error to add another indented line to the last case and think that it only applies in it, but it doesn't : it always applies. Or for short code you can put it on the same line as the if. Then braces are not necessary.

                if (matcher.group(1) != null) unescapedValue.append("&"); // &amp;
                else if (matcher.group(2) != null) unescapedValue.append("<"); // &lt;
                else if (matcher.group(3) != null) unescapedValue.append(">"); // &gt;
                else if (matcher.group(4) != null) unescapedValue.append("\""); // &quot;
                else if (matcher.group(5) != null) unescapedValue.append("'"); // &apos;
                else if (matcher.group(6) != null) {
                    unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6))); // decimal unicode codepoint
                } else if (matcher.group(7) != null) {
                    unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16)); // hex unicode codepoint
                } else if (matcher.group(8) != null) 
                    unescapedValue.append(matcher.group(8)); // any other character is copied unchanged
                }
    
     
  • Conal Tuohy

    Conal Tuohy - 2019-12-09

    OK I have chosen to use curly braces in all cases, even where the append call is short enough that it would have fitted on one line; I think the consistency makes the method more readable.

    if (matcher.group(1) != null) {// &amp;
        unescapedValue.append("&"); 
    } else if (matcher.group(2) != null) {// &lt;
        unescapedValue.append("<"); 
    } else if (matcher.group(3) != null) {// &gt;
        unescapedValue.append(">"); 
    } else if (matcher.group(4) != null) {// &quot;
        unescapedValue.append("\""); 
    } else if (matcher.group(5) != null) {// &apos;
        unescapedValue.append("'"); 
    } else if (matcher.group(6) != null) {// decimal unicode codepoint
        unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(6))); 
    } else if (matcher.group(7) != null) { // hex unicode codepoint
        unescapedValue.appendCodePoint(Integer.parseInt(matcher.group(7),16)); 
    } else if (matcher.group(8) != null) { // any other character copied unchanged
        unescapedValue.append(matcher.group(8)); 
    }
    
     

    Last edit: Conal Tuohy 2019-12-09
  • Eric Le Lay

    Eric Le Lay - 2019-12-09

    That's fine :-)

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.