Text Encoding Initiative / Bugs / #578 partial and recursive segmentation of s-units

Lou Burnard - 2013-06-05

The distinction between <s> and <seg> is precisely that the former
may not self-nest. In P3 and earlier SGML-based versions of the
Guidelines this was enforced by means of an inclusion exception. In P4
it was not enforced, and the note you refer to was added. In P5 there is
a schematron rule to enforce this constraint, so I would question your
assertion that <s> can self-nest. The original intention btw was also
that <s> should provide an end-to-end segmentation of a text, but we
have not yet added a constraint to that effect.
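
As a rough illustration of the non-nesting constraint described above, the check can be sketched in Python. This is not the P5 Schematron rule itself, only an approximation of what such a rule tests; the function name find_nested_s and both sample fragments are made up for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
S_TAG = f"{{{TEI_NS}}}s"


def find_nested_s(root):
    """Return every <s> element that has another <s> as an ancestor."""
    nested = []

    def walk(elem, inside_s):
        is_s = elem.tag == S_TAG
        if is_s and inside_s:
            nested.append(elem)
        for child in elem:
            # Once inside an <s>, every descendant is "inside" it too,
            # even when other elements such as <phr> intervene.
            walk(child, inside_s or is_s)

    walk(root, False)
    return nested


# Hypothetical fragments: <seg> may nest freely, <s> may not.
sample_ok = ET.fromstring(
    f'<p xmlns="{TEI_NS}"><s>One <seg>nested <seg>seg</seg></seg>.</s><s>Two.</s></p>'
)
sample_bad = ET.fromstring(
    f'<p xmlns="{TEI_NS}"><s>Outer <phr><s>inner</s></phr> text.</s></p>'
)

print(len(find_nested_s(sample_ok)))
print(len(find_nested_s(sample_bad)))
```

The second fragment is flagged even though the inner <s> is wrapped in a <phr>, since the constraint concerns any <s> ancestor, not only a direct parent.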

Last edit: Kevin Hawkins 2013-06-05

Piotr Banski - 2013-06-05

If <s> is used together with <phr> and <w> to directly reflect the underlying syntactic constituent structure, it makes every sense to let <s> self-nest. It makes no sense not to let it self-nest, in fact.

Last edit: Piotr Banski 2013-11-10

Piotr Banski - 2013-06-05

I think we're looking at an unfortunate mixture of two interpretations of <s>: as a span within running text, and as a syntactic node in a syntactic representation. The note on <seg> that Kevin quotes might make some sense on the former interpretation. It doesn't make any sense whatsoever on the latter interpretation.

Kevin Hawkins - 2013-06-05

I was looking at the content model, not the presence of any Schematron constraints. I see now that the content model of <s> uses macro.phraseSeq, so I assume that we decided it was more elegant to keep that and add a Schematron constraint rather than set up a content model that includes all of macro.phraseSeq except for <s>.

I suggest revising the note from:

For segmentation which is partial or recursive, the seg should be used instead.

to:

For end-to-end segmentation which is partial or recursive, seg should be used instead.

Piotr Banski - 2013-06-05

What does "end-to-end" mean, please? Especially in connection with "partial".

Kevin Hawkins - 2013-06-05

I took that language from the previous sentence in the note in the element spec:

The s element may be used to mark orthographic sentences, or any other segmentation of a text, provided that the segmentation is end-to-end, complete, and non-nesting.

Lou used it as well, and I think it is being used the way we sometimes use "tessellating": that is, encoding all of the character data in exactly one instance of the element in question.
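
One way to read "encoding all of the character data in exactly one instance of the element" is as a check that can be run over a container such as a paragraph. The following is a minimal sketch under that reading, not anything taken from the Guidelines; the function name uncovered_text is hypothetical, and the samples omit the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def uncovered_text(container, tag="s"):
    """Return (text, depth) pairs for character data not inside exactly one <tag>."""
    problems = []

    def record(text, depth):
        # Whitespace between elements is ignored; anything else must sit
        # inside exactly one enclosing <tag>.
        if text and text.strip() and depth != 1:
            problems.append((text.strip(), depth))

    def walk(elem, depth):
        inner = depth + 1 if elem.tag == tag else depth
        record(elem.text, inner)
        for child in elem:
            walk(child, inner)
            record(child.tail, inner)  # tail text belongs to elem, not to child

    walk(container, 0)
    return problems


tessellated = ET.fromstring(
    "<p><s>First sentence.</s> <s>Second <seg>one</seg>.</s></p>"
)
partial = ET.fromstring(
    "<p><s>Marked.</s> Unmarked text. <s>Marked again.</s></p>"
)

print(uncovered_text(tessellated))
print(uncovered_text(partial))
```

A depth of 0 signals partial segmentation (text outside any s), and a depth greater than 1 signals nesting, so the same walk covers both halves of the "end-to-end, complete, and non-nesting" wording.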

James Cummings - 2013-11-09

What is needed to close this ticket? More clarity in the guidelines?

Kevin Hawkins - 2013-11-10

We need to decide whether we think my proposed wording in my comment above ( https://sourceforge.net/p/tei/bugs/578/#5f95 ) is actually clearer or just raises more questions. If it's clearer, we need to decide whether to accept it and who will implement.

Lou Burnard - 2013-11-10

Sorry Kevin, but I find your rewording confusing. You can use <seg> for any kind of segmentation, not simply end-to-end segmentation. In fact it is quite plausible to have an end-to-end segmentation (a tessellation, if you prefer) done with <s> and then to nest <seg>s within them. And, as Piotr suggests, it makes little sense to talk about "partial" end-to-end segmentation.

Martin Holmes - 2013-11-12

Council 2013-11-12: Action on MH to either revise the content model of s so that it doesn't nest (copying macro.phraseSeq and removing s), or remove s from macro.phraseSeq and replace it manually everywhere macro.phraseSeq would put it.

Martin Holmes - 2013-11-12

assigned_to: Martin Holmes

Group: AMBER --> GREEN

Priority: 5 --> 1(low)

Martin Holmes - 2013-11-27

status: open --> closed-fixed

Martin Holmes - 2013-11-27

I've implemented this at rev 12668, although I must say I don't like the results at all; the content model of <s> is now truly horrible, and will get out of sync with macro.phraseSeq if we're not careful. I would actually recommend reversing this decision and letting the Schematron do the job.

Lou Burnard - 2015-06-30

status: closed-fixed --> open

Lou Burnard - 2015-06-30

I'm reopening this because, like Martin who applied it, I think the fix has considerably more defects than the situation it is trying to improve. We just cannot have arbitrary lists of elements in content models which need to be kept in step with class definitions. Either we have to use a new class definition (which seems really silly), or we continue to rely on the (perfectly reasonable) schematron rule to implement the desired additional constraint that <s> cannot self-nest. This ticket is a consequence of someone failing to understand what is (ipso facto) poorly expressed in the Guidelines, so the way to resolve it is to improve that expression, not to cobble together a ridiculous content model which will come back to bite us every other day. For example: suppose we remove an element from model.phrase: it will still appear here. Suppose we define a new model.phrase element: it will not appear here. Finding out/remembering why this idiosyncratic behaviour occurs is a waste of everyone's time.
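
The drift described above can be shown with a toy model. This is only a sketch: the element names are illustrative and are not the real membership of any TEI class, and the functions derived_for_s and drift are invented for the example.

```python
# Toy class membership; the names are illustrative, not the real
# membership of any TEI class.
phrase_class = {"hi", "seg", "s", "w", "phr"}

# Approach A: a hand-copied snapshot of the class minus <s>,
# frozen at the moment it was copied into the content model.
hand_copied_for_s = set(phrase_class) - {"s"}


def derived_for_s(current_class):
    """Approach B: reference the class as-is; self-nesting is left to a separate rule."""
    return set(current_class)


def drift(snapshot, current_class):
    """Report how a frozen snapshot disagrees with the class as it is now."""
    expected = set(current_class) - {"s"}
    return {"stale": snapshot - expected, "missing": expected - snapshot}


# The class evolves: one member is removed and a new one is added.
evolved_class = (phrase_class - {"phr"}) | {"newElem"}
report = drift(hand_copied_for_s, evolved_class)

print(report)
```

The removed member lingers in the snapshot ("stale") and the new member never reaches it ("missing"), whereas the derived model follows the class automatically.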

Martin Holmes - 2015-06-30

I'm glad to see this. I don't believe any backward-compatibility issues will arise out of reversing this. The original fix made s-nesting invalid; we'll just be enforcing it through Schematron instead of messing up our content models. +1 from me.

Martin Holmes - 2015-07-27

Let's put this on the agenda for the meeting tomorrow and get people's agreement on whether to reverse the original decision.

Martin Holmes - 2015-07-28

Council meeting 2015-07-28 says:

Undo the previous change to restore the original content model.

Raise a separate ticket and go back to the linguistic community to ask whether s-units should be allowed to self-nest; if not, a Schematron rule should be created.

Last edit: Martin Holmes 2015-07-28

Piotr Banski - 2015-07-28

Nice, thanks! I'll make sure to bring this up in the agenda for the upcoming LingSIG meeting -- or do you need action earlier (not likely --> summer...)

Martin Holmes - 2015-07-29

@Piotr: the sooner you can get us some feedback from the Ling folks the better, I think. Restoring the original content model will allow nesting again per the schema, but leave the issue of the prose constraint unaddressed; that should be either reinforced with a Schematron rule, or (if we believe s-units should be able to nest), deleted, as Kevin suggested.

Lou Burnard - 2015-07-29

It is rather naive to assume that "the linguistic community" is a single entity which can be consulted and which will reach a single conclusion. This particular case is a very good example. The reason we have both <s> (non-self-nesting) and <seg> (arbitrary segmentation) is that corpus linguists require the former, since they like to segment their corpora end to end irrespective of any other kind of structure, whereas more analytic linguists need to represent more complex hierarchic (or non-hierarchic) segmentation. Yes, one is arguably a special case of the other, and yes, perhaps Occam's razor should have been wielded more effectively, but there are countless millions of words of TEI conformant corpora out there which rely on this distinction and this requirement (bequeathed to us by the late Stig Johansson, I think). I really cannot see any argument in favour of allowing <s> to self-nest, if that is what is being proposed: and it would be a Birnbaum-breaking change to the conceptual model to permit it.

Martin Holmes - 2015-07-29

It surely wouldn't be a Birnbaum issue, would it? We wouldn't be rendering any existing conformant documents invalid; we'd just be allowing something that wasn't allowed before. Although actually it was, by the schema; it was just disallowed by the prose. What we did in changing the content model before was arguably Birnbaum-breaking, in that it made invalid documents which were using nested s.
