#578 partial and recursive segmentation of s-units

GREEN
open
None
1(low)
2015-07-29
2013-06-05
No

The content model of <s> allows for this element to nest inside itself. However, the definition has a note that says, "For segmentation which is partial or recursive, the seg should be used instead." In this case, it seems that <s> should not be allowed to nest inside itself, or we should drop that note.

Discussion

  • Lou Burnard

    Lou Burnard - 2013-06-05

    The distinction between <s> and <seg> is precisely that the former
    may not self-nest. In P3 and earlier SGML-based versions of the
    Guidelines this was enforced by means of an inclusion exception. In P4
    it was not enforced, and the note you refer to was added. In P5 there is
    a schematron rule to enforce this constraint, so I would question your
    assertion that <s> can self-nest. The original intention btw was also
    that <s> should provide an end-to-end segmentation of a text, but we
    have not yet added a constraint to that effect.
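
A minimal sketch of what the non-nesting constraint amounts to, written in Python rather than Schematron. This is not the rule shipped with P5; the function name `find_nested_s` and the two sample fragments are made up for illustration. It reports every `<s>` that has another `<s>` somewhere among its ancestors, including indirectly via `<seg>`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
S_TAG = f"{{{TEI_NS}}}s"


def find_nested_s(xml_text):
    """Return the xml:id (or "?") of every <s> that has an <s> ancestor."""
    root = ET.fromstring(xml_text)
    offenders = []

    def walk(elem, inside_s):
        is_s = elem.tag == S_TAG
        if is_s and inside_s:
            offenders.append(elem.get(XML_ID, "?"))
        for child in elem:
            walk(child, inside_s or is_s)

    walk(root, False)
    return offenders


# Hypothetical sample fragments, not taken from the Guidelines.
flat = f'<p xmlns="{TEI_NS}"><s xml:id="s1">One.</s> <s xml:id="s2">Two.</s></p>'
nested = (
    f'<p xmlns="{TEI_NS}"><s xml:id="s1">One '
    f'<seg><s xml:id="s2">two</s></seg>.</s></p>'
)

print(find_nested_s(flat))    # []
print(find_nested_s(nested))  # ['s2']
```

Both fragments satisfy a content model based on macro.phraseSeq; only an extra check of this kind tells them apart.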

     

    Last edit: Kevin Hawkins 2013-06-05
  • Piotr Banski

    Piotr Banski - 2013-06-05

    If <s> is used together with <phr> and <w> to directly reflect the underlying syntactic constituent structure, it makes every sense to let <s> self-nest. It makes no sense not to let it self-nest, in fact.

     

    Last edit: Piotr Banski 2013-11-10
  • Piotr Banski

    Piotr Banski - 2013-06-05

    I think we're looking at an unfortunate mixture of two interpretations of <s>: as a span within running text, and as a syntactic node in a syntactic representation. The note on <seg> that Kevin quotes might make some sense on the former interpretation. It doesn't make any sense whatsoever on the latter interpretation.

     
  • Kevin Hawkins

    Kevin Hawkins - 2013-06-05

    I was looking at the content model, not the presence of any Schematron constraints. I see now that the content model of <s> uses macro.phraseSeq, so I assume that we decided it was more elegant to keep that and add a Schematron constraint rather than set up a content model that includes all of macro.phraseSeq except for <s>.

    I suggest revising the note from:

    For segmentation which is partial or recursive, the seg should be used instead.

    to:

    For end-to-end segmentation which is partial or recursive, seg should be used instead.

     
  • Piotr Banski

    Piotr Banski - 2013-06-05

    What does "end-to-end" mean, please? Especially in connection with "partial".

     
  • Kevin Hawkins

    Kevin Hawkins - 2013-06-05

    I took that language from the previous sentence in the note in the element spec:

    The s element may be used to mark orthographic sentences, or any other segmentation of a text, provided that the segmentation is end-to-end, complete, and non-nesting.

    Lou used it as well, and I think it is being used the way we sometimes use "tessellating": that is, encoding all of the character data in exactly one instance of the element in question.
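
To make the "tessellating" reading concrete, here is a rough Python sketch. The function name `is_tessellated` and the sample fragments are hypothetical, and it assumes that whitespace between s-units does not count as character data. It accepts a fragment only if every non-whitespace text node sits inside exactly one `<s>`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
S_TAG = f"{{{TEI_NS}}}s"


def is_tessellated(xml_text):
    """True if every non-whitespace text node has exactly one <s> ancestor."""
    root = ET.fromstring(xml_text)

    def walk(elem, depth):
        # inner = number of <s> elements enclosing the content of elem
        inner = depth + (1 if elem.tag == S_TAG else 0)
        if elem.text and elem.text.strip() and inner != 1:
            return False
        for child in elem:
            if not walk(child, inner):
                return False
            # child.tail is the text following the child, still inside elem
            if child.tail and child.tail.strip() and inner != 1:
                return False
        return True

    return walk(root, 0)


# Hypothetical sample fragments.
complete = f'<p xmlns="{TEI_NS}"><s>One.</s> <s>Two <hi>words</hi>.</s></p>'
partial = f'<p xmlns="{TEI_NS}"><s>One.</s> Two.</p>'
nested = f'<p xmlns="{TEI_NS}"><s>One <s>two</s>.</s></p>'

print(is_tessellated(complete))  # True
print(is_tessellated(partial))   # False: " Two." is outside any <s>
print(is_tessellated(nested))    # False: "two" is inside two <s> elements
```

On this reading, "partial" fails because some text is in no `<s>`, and "recursive" fails because some text is in more than one.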

     
  • James Cummings

    James Cummings - 2013-11-09

    What is needed to close this ticket? More clarity in the guidelines?

     
  • Kevin Hawkins

    Kevin Hawkins - 2013-11-10

    We need to decide whether we think my proposed wording in my comment above ( https://sourceforge.net/p/tei/bugs/578/#5f95 ) is actually clearer or just raises more questions. If it's clearer, we need to decide whether to accept it and who will implement.

     
  • Lou Burnard

    Lou Burnard - 2013-11-10

    Sorry Kevin, but I find your rewording confusing. You can use <seg> for any kind of segmentation, not simply end-to-end segmentation. In fact it is quite plausible to have an end-to-end segmentation (a tessellation, if you prefer) done with <s> and then to nest <seg>s within them. And, as Piotr suggests, it makes little sense to talk about "partial" end-to-end segmentation.

     
  • Martin Holmes

    Martin Holmes - 2013-11-12

    Council 2013-11-12: Action on MH either to revise the content model of s so that it doesn't nest (copying macro.phraseSeq and removing s), or to remove s from macro.phraseSeq and replace it manually everywhere macro.phraseSeq would put it.

     
  • Martin Holmes

    Martin Holmes - 2013-11-12
    • assigned_to: Martin Holmes
    • Group: AMBER --> GREEN
    • Priority: 5 --> 1(low)
     
  • Martin Holmes

    Martin Holmes - 2013-11-27
    • status: open --> closed-fixed
     
  • Martin Holmes

    Martin Holmes - 2013-11-27

    I've implemented this at rev 12668, although I must say I don't like the results at all; the content model of <s> is now truly horrible, and will get out of sync with macro.phraseSeq if we're not careful. I would actually recommend reversing this decision and letting the Schematron do the job.

     
  • Lou Burnard

    Lou Burnard - 2015-06-30
    • status: closed-fixed --> open
     
  • Lou Burnard

    Lou Burnard - 2015-06-30

    I'm reopening this because, like Martin who applied it, I think the fix has considerably more defects than the situation it is trying to improve. We just cannot have arbitrary lists of elements in content models which need to be kept in step with class definitions. Either we have to use a new class definition (which seems really silly), or we continue to rely on the (perfectly reasonable) schematron rule to implement the desired additional constraint that <s> cannot self-nest. This ticket is a consequence of someone failing to understand what is (ipso facto) poorly expressed in the Guidelines, so the way to resolve it is to improve that expression, not to cobble together a ridiculous content model which will come back to bite us every other day. For example: suppose we remove an element from model.phrase: it will still appear here. Suppose we define a new model.phrase element: it will not appear here. Finding out/remembering why this idiosyncratic behaviour occurs is a waste of everyone's time.
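
The drift described above can be sketched with plain Python sets. The element names below are a made-up, heavily abbreviated stand-in for the real class membership (`oldElement` and `newElement` do not exist in TEI); the point is only how a hand-copied list falls out of step with the class it was copied from.

```python
# Hypothetical, abbreviated stand-in for the members of the phrase class.
model_phrase = {"hi", "seg", "s", "w", "oldElement"}

# Hand-written content model for <s>: copied once, with s removed.
hardcoded_content = frozenset(model_phrase - {"s"})

# Later the class changes: one element leaves, another joins.
model_phrase.discard("oldElement")
model_phrase.add("newElement")

# What the content of <s> ought to be now, derived from the class.
expected = model_phrase - {"s"}

stale = hardcoded_content - expected    # removed from the class, still allowed in <s>
missing = expected - hardcoded_content  # added to the class, not allowed in <s>

print(sorted(stale))    # ['oldElement']
print(sorted(missing))  # ['newElement']
```

A content model that references the class, with the self-nesting ban left to a separate rule, never produces a non-empty `stale` or `missing`.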

     
  • Martin Holmes

    Martin Holmes - 2015-06-30

    I'm glad to see this. I don't believe any backward-compatibility issues will arise out of reversing this. The original fix made s-nesting invalid; we'll just be enforcing it through Schematron instead of messing up our content models. +1 from me.

     
  • Martin Holmes

    Martin Holmes - 2015-07-27

    Let's put this on the agenda for the meeting tomorrow and get people's agreement on whether to reverse the original decision.

     
  • Martin Holmes

    Martin Holmes - 2015-07-28

    Council meeting 2015-07-28 says:

    1. Undo the previous change to restore the original content model.

    2. Raise a separate ticket and go back to the linguistic community to ask whether s-units should be allowed to self-nest; if not, a Schematron rule should be created.

     

    Last edit: Martin Holmes 2015-07-28
  • Piotr Banski

    Piotr Banski - 2015-07-28

    Nice, thanks! I'll make sure to bring this up in the agenda for the upcoming LingSIG meeting -- or do you need action earlier (not likely --> summer...)

     
  • Martin Holmes

    Martin Holmes - 2015-07-29

    @Piotr: the sooner you can get us some feedback from the Ling folks the better, I think. Restoring the original content model will allow nesting again per the schema, but leave the issue of the prose constraint unaddressed; that should be either reinforced with a Schematron rule, or (if we believe s-units should be able to nest), deleted, as Kevin suggested.

     
  • Lou Burnard

    Lou Burnard - 2015-07-29

    It is rather naive to assume that "the linguistic community" is a single entity which can be consulted and which will reach a single conclusion. This particular case is a very good example. The reason we have both <s> (non-self-nesting) and <seg> (arbitrary segmentation) is that corpus linguists require the former, since they like to segment their corpora end to end irrespective of any other kind of structure, whereas more analytic linguists need to represent more complex hierarchic (or non-hierarchic) segmentation. Yes, one is arguably a special case of the other, and yes, perhaps Occam's razor should have been wielded more effectively, but there are countless millions of words of TEI conformant corpora out there which rely on this distinction and this requirement (bequeathed to us by the late Stig Johansson, I think). I really cannot see any argument in favour of allowing <s> to self-nest, if that is what is being proposed: and it would be a Birnbaum-breaking change to the conceptual model to permit it.

     
  • Martin Holmes

    Martin Holmes - 2015-07-29

    It surely wouldn't be a Birnbaum issue, would it? We wouldn't be rendering any existing conformant documents invalid; we'd just be allowing something that wasn't allowed before. Although actually it was, by the schema; it was just disallowed by the prose. What we did in changing the content model before was arguably Birnbaum-breaking, in that it made invalid documents which were using nested s.