#578 partial and recursive segmentation of s-units

Milestone: GREEN
Status: open
Owner: Martin Holmes
Labels: None
Priority: 1(low)
Updated: 2015-07-29
Created: 2013-06-05
Creator: Kevin Hawkins
Private: No

The content model of <s> allows this element to nest inside itself. However, the definition has a note that says, "For segmentation which is partial or recursive, the seg should be used instead." Given that note, either <s> should not be allowed to nest inside itself, or we should drop the note.

Discussion

  • Lou Burnard
    2013-06-05

    The distinction between <s> and <seg> is precisely that the former
    may not self-nest. In P3 and earlier SGML-based versions of the
    Guidelines this was enforced by means of an inclusion exception. In P4
    it was not enforced, and the note you refer to was added. In P5 there is
    a schematron rule to enforce this constraint, so I would question your
    assertion that <s> can self-nest. The original intention btw was also
    that <s> should provide an end-to-end segmentation of a text, but we
    have not yet added a constraint to that effect.

     
    Last edit: Kevin Hawkins 2013-06-05
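
To make the non-nesting constraint described above concrete, here is a minimal sketch in Python. It is not the Schematron rule from the TEI source; `find_nested_s` and the two sample fragments are hypothetical and only illustrate what such a rule would flag.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
S_TAG = f"{{{TEI_NS}}}s"


def find_nested_s(root):
    """Return every <s> element that has another <s> as an ancestor."""
    nested = []

    def walk(elem, inside_s):
        is_s = elem.tag == S_TAG
        if is_s and inside_s:
            nested.append(elem)
        for child in elem:
            walk(child, inside_s or is_s)

    walk(root, False)
    return nested


# <seg> inside <s> is fine; only <s> inside <s> is reported.
flat = ET.fromstring(
    f'<p xmlns="{TEI_NS}"><s>One <seg>nested seg</seg>.</s><s>Two.</s></p>'
)
recursive = ET.fromstring(
    f'<p xmlns="{TEI_NS}"><s>Outer <s>inner</s> text.</s></p>'
)

print(len(find_nested_s(flat)))       # 0
print(len(find_nested_s(recursive)))  # 1
```

A check of this kind sits alongside the content model rather than replacing it, which is the division of labour discussed in this ticket.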
  • Piotr Banski
    2013-06-05

    If <s> is used together with <phr> and <w> to directly reflect the underlying syntactic constituent structure, it makes every sense to let <s> self-nest. It makes no sense not to let it self-nest, in fact.

     
    Last edit: Piotr Banski 2013-11-10
  • Piotr Banski
    2013-06-05

    I think we're looking at an unfortunate mixture of two interpretations of <s>: as a span within running text, and as a syntactic node in a syntactic representation. The note on <seg> that Kevin quotes might make some sense on the former interpretation. It doesn't make any sense whatsoever on the latter interpretation.

     
  • Kevin Hawkins
    2013-06-05

    I was looking at the content model, not the presence of any Schematron constraints. I see now that the content model of <s> uses macro.phraseSeq, so I assume that we decided it was more elegant to keep that and add a Schematron constraint rather than set up a content model that includes all of macro.phraseSeq except for <s>.

    I suggest revising the note from:

    For segmentation which is partial or recursive, the seg should be used instead.

    to:

    For end-to-end segmentation which is partial or recursive, seg should be used instead.

     
  • Piotr Banski
    2013-06-05

    What does "end-to-end" mean, please? Especially in connection with "partial".

     
  • Kevin Hawkins
    2013-06-05

    I took that language from the previous sentence in the note in the element spec:

    The s element may be used to mark orthographic sentences, or any other segmentation of a text, provided that the segmentation is end-to-end, complete, and non-nesting.

    Lou used it as well, and I think it is being used the way we sometimes use "tessellating": that is, encoding all of the character data in exactly one instance of the element in question.
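
One way to read "end-to-end" (or "tessellating") as described above is that every non-whitespace piece of character data sits inside exactly one <s>. The following is a rough sketch of that reading only. The helper names are hypothetical and the samples omit the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def s_depths_of_text(elem, depth=0):
    """Yield (text, number of enclosing <s> elements) for each text chunk."""
    inner = depth + (1 if elem.tag == "s" else 0)
    if elem.text:
        yield elem.text, inner
    for child in elem:
        yield from s_depths_of_text(child, inner)
        if child.tail:
            # Tail text follows the child, so it belongs to this element.
            yield child.tail, inner


def is_tessellated(container):
    """True if all non-whitespace text is inside exactly one <s>."""
    return all(
        depth == 1
        for text, depth in s_depths_of_text(container)
        if text.strip()
    )


complete = ET.fromstring("<p><s>First.</s> <s>Second <seg>part</seg>.</s></p>")
partial = ET.fromstring("<p><s>Marked.</s> Unmarked text.</p>")
nested = ET.fromstring("<p><s>Outer <s>inner</s>.</s></p>")

print(is_tessellated(complete))  # True
print(is_tessellated(partial))   # False
print(is_tessellated(nested))    # False
```

Under this reading, "partial" segmentation fails because some text has no enclosing <s>, and "recursive" segmentation fails because some text has more than one.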

     
  • James Cummings
    2013-11-09

    What is needed to close this ticket? More clarity in the guidelines?

     
  • Kevin Hawkins
    2013-11-10

    We need to decide whether we think my proposed wording in my comment above ( https://sourceforge.net/p/tei/bugs/578/#5f95 ) is actually clearer or just raises more questions. If it's clearer, we need to decide whether to accept it and who will implement.

     
  • Lou Burnard
    2013-11-10

    Sorry Kevin, but I find your rewording confusing. You can use <seg> for any kind of segmentation, not simply end-to-end segmentation. In fact it is quite plausible to have an end-to-end segmentation (a tessellation, if you prefer) done with <s> and then to nest <seg>s within them. And, as Piotr suggests, it makes little sense to talk about "partial" end-to-end segmentation.

     
  • Martin Holmes
    2013-11-12

    Council 2013-11-12: Action on MH either to revise the content model of s so that it doesn't nest (copying macro.phraseSeq and removing s), or to remove s from macro.phraseSeq and add it back manually everywhere macro.phraseSeq would put it.

     
  • Martin Holmes
    2013-11-12

    • assigned_to: Martin Holmes
    • Group: AMBER --> GREEN
    • Priority: 5 --> 1(low)
     
  • Martin Holmes
    2013-11-27

    • status: open --> closed-fixed
     
  • Martin Holmes
    2013-11-27

    I've implemented this at rev 12668, although I must say I don't like the results at all; the content model of <s> is now truly horrible, and will get out of sync with macro.phraseSeq if we're not careful. I would actually recommend reversing this decision and letting the Schematron do the job.

     
  • Lou Burnard
    2015-06-30

    • status: closed-fixed --> open
     
  • Lou Burnard
    2015-06-30

    I'm reopening this because, like Martin who applied it, I think the fix has considerably more defects than the situation it is trying to improve. We just cannot have arbitrary lists of elements in content models which need to be kept in step with class definitions. Either we have to use a new class definition (which seems really silly), or we continue to rely on the (perfectly reasonable) schematron rule to implement the desired additional constraint that <s> cannot self-nest. This ticket is a consequence of someone failing to understand what is (ipso facto) poorly expressed in the Guidelines, so the way to resolve it is to improve that expression, not to cobble together a ridiculous content model which will come back to bite us every other day. For example: suppose we remove an element from model.phrase: it will still appear here. Suppose we define a new model.phrase element: it will not appear here. Finding out/remembering why this idiosyncratic behaviour occurs is a waste of everyone's time.
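
The drift described above can be sketched with plain sets. This is an illustration only: the member list is a hypothetical, heavily shortened stand-in for the real class, and `newElement` is an invented name.

```python
# Hypothetical stand-in for the members of model.phrase (not the real list).
model_phrase = {"s", "seg", "w", "phr", "hi"}

# Hand-copied content model: a frozen snapshot of the class minus <s>.
copied_s_content = set(model_phrase) - {"s"}


def class_based_s_content(cls):
    """Content allowed in <s> when the class is referenced and a separate
    rule forbids <s> itself."""
    return set(cls) - {"s"}


# Later the class changes: one member is removed, a new one is added.
model_phrase.discard("hi")
model_phrase.add("newElement")

current = class_based_s_content(model_phrase)

# Still allowed by the copied list although no longer in the class.
print(sorted(copied_s_content - current))  # ['hi']
# In the class now, but missing from the copied list.
print(sorted(current - copied_s_content))  # ['newElement']
```

The copied list goes stale in both directions, whereas the class reference plus a rule follows the class automatically.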

     
  • Martin Holmes
    2015-06-30

    I'm glad to see this. I don't believe any backward-compatibility issues will arise out of reversing this. The original fix made s-nesting invalid; we'll just be enforcing it through Schematron instead of messing up our content models. +1 from me.

     
  • Martin Holmes
    2015-07-27

    Let's put this on the agenda for the meeting tomorrow and get people's agreement on whether to reverse the original decision.

     
  • Martin Holmes
    2015-07-28

    Council meeting 2015-07-28 says:

    1. Undo the previous change to restore the original content model.

    2. Raise a separate ticket and go back to the linguistic community to ask whether s-units should be allowed to self-nest; if not, a Schematron rule should be created.

     
    Last edit: Martin Holmes 2015-07-28
  • Piotr Banski
    2015-07-28

    Nice, thanks! I'll make sure to put this on the agenda for the upcoming LingSIG meeting -- or do you need action earlier (not likely --> summer...)?

     
  • Martin Holmes
    2015-07-29

    @Piotr: the sooner you can get us some feedback from the Ling folks the better, I think. Restoring the original content model will allow nesting again per the schema, but leave the issue of the prose constraint unaddressed; that should be either reinforced with a Schematron rule or (if we believe s-units should be able to nest) deleted, as Kevin suggested.

     
  • Lou Burnard
    2015-07-29

    It is rather naive to assume that "the linguistic community" is a single entity which can be consulted and which will reach a single conclusion. This particular case is a very good example. The reason we have both <s> (non-self-nesting) and <seg> (arbitrary segmentation) is that corpus linguists require the former, since they like to segment their corpora end to end irrespective of any other kind of structure, whereas more analytic linguists need to represent more complex hierarchic (or non-hierarchic) segmentation. Yes, one is arguably a special case of the other, and yes, perhaps Occam's razor should have been wielded more effectively, but there are countless millions of words of TEI-conformant corpora out there which rely on this distinction and this requirement (bequeathed to us by the late Stig Johansson, I think). I really cannot see any argument in favour of allowing <s> to self-nest, if that is what is being proposed: and it would be a Birnbaum-breaking change to the conceptual model to permit it.

     
  • Martin Holmes
    2015-07-29

    It surely wouldn't be a Birnbaum issue, would it? We wouldn't be rendering any existing conformant documents invalid; we'd just be allowing something that wasn't allowed before. Although actually it was, by the schema; it was just disallowed by the prose. What we did in changing the content model before was arguably Birnbaum-breaking, in that it made invalid documents which were using nested s.
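
The compatibility argument above can be checked mechanically: a change invalidates existing documents only if something accepted by the old constraint is rejected by the new one. A small sketch, with hypothetical validators standing in for "schema allows nesting" and "nesting forbidden", and samples that omit the TEI namespace:

```python
import xml.etree.ElementTree as ET


def has_nested_s(elem, inside_s=False):
    """True if any <s> occurs inside another <s>."""
    is_s = elem.tag == "s"
    if is_s and inside_s:
        return True
    return any(has_nested_s(child, inside_s or is_s) for child in elem)


def permissive(doc):
    # Stand-in for the content model alone: <s> may contain <s>.
    return True


def strict(doc):
    # Stand-in for a content model or rule that forbids <s> inside <s>.
    return not has_nested_s(doc)


def newly_invalid(docs, old, new):
    """Names of documents accepted by `old` but rejected by `new`."""
    return [name for name, doc in docs.items() if old(doc) and not new(doc)]


docs = {
    "flat": ET.fromstring("<p><s>One.</s><s>Two.</s></p>"),
    "nested": ET.fromstring("<p><s>Outer <s>inner</s>.</s></p>"),
}

# Loosening the constraint invalidates nothing.
print(newly_invalid(docs, old=strict, new=permissive))  # []
# Tightening it invalidates documents that used nested <s>.
print(newly_invalid(docs, old=permissive, new=strict))  # ['nested']
```

This only models schema validity. Whether permitting nesting changes the conceptual model, as argued in the previous comment, is a separate question that a check like this cannot answer.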