#346 @cRef is a mess

Milestone: GREEN
Status: closed-accepted
Owner: Lou Burnard
Priority: 5
Updated: 2012-09-23
Created: 2012-01-27
Creator: Martin Holmes
Private: No

@cRef occurs on <gloss>, <term>, <ptr>, and <ref>. There are several issues with it:

1. It is defined separately on each element, rather than being supplied by an attribute class.

2. The definitions and datatypes vary: <gloss> and <term> have a single data.pointer, while <ptr> and <ref> have unbounded data.word.

3. The <valDesc>s are out of sync with the datatypes. For instance, the <valDesc> on <ref>/@cRef says "Currently these Guidelines only provide for a single canonical reference to be encoded on any given ref element," whereas the datatype has @maxOccurs="unbounded".

I would suggest:

- A single attribute class containing this attribute, so we can standardize the datatype and description.

- The datatype should either be one-to-many data.word or one-to-many data.pointer, whichever is more appropriate for this usage. Either will do, I think, but in the case of the latter, the canonical reference would have to comply with pointer syntax, presumably using a private prefix. Pointers would also be easier for validators to enforce.

Discussion

  • Lou Burnard
    2012-01-29

    @cRef is indeed a mess, and not just for the reasons given. Leaving aside whether or not it can take multiple values (and what it would mean if it did), it certainly isn't a data.pointer -- if it were, it would be indistinguishable from @ref. Its function is to identify some location or span in a document in terms of a "canonical referencing scheme", which is (more or less) an early attempt by the TEI to invent XPath in more user-friendly terms. See the example in the element spec for refsDecl. I agree it would be better defined by means of a class, but I am not convinced there's much of a case for retaining it at all till we've thought through what exactly it's for in this day and age.

     
  • I can see the point of multiple occurrences: some worlds may have competing canonical schemes (mad though that sounds). But I suspect it is multiple data.word to allow for schemes where there is a space in the value ("John 22 3").

     
  • James Cummings
    2012-02-01

    I disagree with Lou that it shouldn't necessarily be data.pointer. My reason is that it ALREADY is data.pointer on <term> and <gloss>, although it is data.word on <ptr> and <ref>.

    But I do agree with him that it needs to be looked at again to see if it is necessary, and what purpose it is really serving in modern TEI documents.

     
  • Laurent Romary
    2012-02-02

    Well, on term, the guidelines actually provide no example, and I am just wondering for what purpose I would actually use this when @corresp on a gloss would just work.
    How many people actually use @cRef?....

     
  • I wasn't going to bring this up until I was further in my project, but maybe I should say something now.

    My project is to encode and then display parallel corpora. The trick is to deal with the countless sorts of misalignment that can occur. My current approach is working well so far. For each corpus, I adopt a canonical segmentation. This usually exists already, but if not, the encoder picks one. Then all DIVs (or DIV-like things) in each text get tagged with a starting cRef if the beginning of the subject DIV lines up with the beginning of the canonical DIV, an ending cRef if the ends line up, or both, or neither. The stylesheet uses these to align the texts. No single encoded text need exactly use the canonical segmentation. The element refsDecl tells encoders of subsequent text (another translation, another edition, etc.) what canonical segmentation to use.

    At first, I put the starting cRef in <head> and the ending cRef in <tail>, but semantically that's bad. Then I learned how to use Roma and so I added cRefStart and cRefEnd as attributes to DIV and DIV-like things. I now think I should also add cRefMiddle, but I haven't done that yet.

    I have written enough of the XSLT alignment code (a lot of graph algorithms in XSLT!) to convince myself the model is pretty good. I can successfully handle many kinds of misalignment. And the few cases I cannot handle could get covered by addition of a cRefMiddle.

    A cRefStart and a cRefEnd can be any string, and it can include spaces, so "Editor's Preface", "Translator's Preface", "John 3:16", "Aphorism XII" are all valid and all indicate just one cRef. There cannot be more than one cRefStart or cRefEnd for each DIV. I do not allow more than one canonical segmentation per corpus.
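
As a rough illustration of the model described above, the sketch below reads the start and end markers off each division. The attribute names cRefStart and cRefEnd are the poster's own customization; the sample markup, the function name division_spans, and the omission of the TEI namespace are assumptions made for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: three divisions tagged against a canonical segmentation.
SAMPLE = """
<body>
  <div cRefStart="Editor's Preface" cRefEnd="Editor's Preface"><p>text</p></div>
  <div cRefStart="chap 1" cRefEnd="chap 2"><p>text</p></div>
  <div cRefStart="chap 3"><p>text</p></div>
</body>
"""

def division_spans(xml_text):
    """Return one (start, end) pair per div; a missing marker becomes None."""
    root = ET.fromstring(xml_text.strip())
    return [(div.get("cRefStart"), div.get("cRefEnd")) for div in root.iter("div")]
```

Each value is kept as a single string, spaces included, so "chap 1" stays one reference. The second division spans two canonical chapters, and the third lines up only at its start.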

    As I said, I don't have it all working yet, but I figured that if it does work, I would propose adding these markers to the standard or at least let people know about them and share my stylesheets. I know there is a lot of interest in parallel corpora and it really is a gnarly problem. I'm finding something in the spirit of cRef is helping me solve it.

     
  • Martin Holmes
    2012-02-03

    @John: I wonder if @corresp would do instead of @cRef in this context? What do your @cRef values look like?

     
  • Martin Holmes
    2012-02-03

    I should point people at my survey from here too:

    http://www.surveymonkey.com/s/PW5YYH6

    Anyone interested in this topic could help by filling in the survey.

     
  • The description of @corresp says that it "points to elements that correspond to the current element in some way." But in my case, the correspondence need not be to anything in any element or any part of the encoded text at all. The correspondence is to some canonical segmentation that exists outside the encoded corpus.

    My @cRef values could be "Book blurb", "Author's Introduction", "Editor's Preface", "chap 1", "chap 2". They could be "lines 1-9", "lines 10-19", "lines 20-29". They could be "Introduction", "aI", "aII", "aIV", "aVI". They could be "Acts 4:20", "Acts 4:21", "4:25".

    Notice that the order cannot be determined just by looking at the cRefs. They cannot be sorted alphabetically or numerically. And they cannot be sorted by reference to a complete, encoded canonical text (where @corresp would make sense), since such a text might not be in the encoded corpus. The references need to be sorted by treating each text as providing a sorted list and then merging sorted lists. The result also needs to retain ambiguities. That is, the canon might be ordered alpha, beta, gamma, delta, but one encoded text might include just divisions alpha, beta, and delta and the other include just alpha, gamma, and delta.
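
The merging described above can be sketched as a layered topological sort. This is only one possible reading, not the poster's XSLT: the function name merge_orderings and the layer representation are assumptions. Each text contributes its own sequence of references, and references that no text orders relative to each other end up in the same layer, so the ambiguity is retained.

```python
from collections import defaultdict

def merge_orderings(texts):
    """Merge per-text reference sequences into layers of a partial order."""
    successors = defaultdict(set)
    indegree = {}
    for sequence in texts:
        for ref in sequence:
            indegree.setdefault(ref, 0)
        # Adjacent references in one text give a "comes before" constraint.
        for earlier, later in zip(sequence, sequence[1:]):
            if later not in successors[earlier]:
                successors[earlier].add(later)
                indegree[later] += 1

    layers = []
    placed = 0
    ready = sorted(ref for ref, count in indegree.items() if count == 0)
    while ready:
        layers.append(ready)
        placed += len(ready)
        next_ready = []
        for ref in ready:
            for later in successors[ref]:
                indegree[later] -= 1
                if indegree[later] == 0:
                    next_ready.append(later)
        ready = sorted(next_ready)

    if placed != len(indegree):
        raise ValueError("texts disagree on the order of some divisions")
    return layers
```

For the example in the text, merge_orderings([["alpha", "beta", "delta"], ["alpha", "gamma", "delta"]]) gives three layers, with beta and gamma sharing the middle one because neither text says which comes first.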

     
  • Martin Holmes
    2012-02-03

    @John: I take your point that your @cRef values are not pointers; but in the form you've quoted them, they're not strictly speaking correct @cRef values either. "chap 1" is in fact two values, "chap" and "1", which would be expected to point to different things (where @cRef values are one-to-infinity instances of data.word, space-separated).
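
The difference between the two readings can be shown in a few lines. This is a minimal sketch with hypothetical function names: the first treats the attribute value as a whitespace-separated list of words, the second as one string.

```python
def as_word_list(attribute_value):
    """Read the value as a whitespace-separated list of word tokens."""
    return attribute_value.split()

def as_single_text(attribute_value):
    """Read the value as one string, collapsing runs of whitespace."""
    return " ".join(attribute_value.split())
```

Under the list reading, "chap 1" yields the two values "chap" and "1"; under the single-string reading it stays one reference.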

    But another point is that these perhaps ought to point to something. @cRef is supposed to be used with a <cRefPattern> element which explains how to turn the @cRef value(s) into a meaningful pointer. If yours can't be "dereferenced" in this way, they really amount to annotation, don't they? On the other hand, if your XSLT is able to read and interpret these values, and turn them into something meaningful, then they are surely pointing at something, aren't they?
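
A minimal sketch of that dereferencing step, assuming the pattern supplies a regular expression and a replacement template with $1-style group references, in the manner of the matchPattern and replacementPattern attributes of <cRefPattern>. The sample pattern, the XPath-style target, and the function name resolve_cref are hypothetical, and Python's re dialect stands in for whatever regex flavour a TEI processor would use.

```python
import re

def resolve_cref(cref, match_pattern, replacement_pattern):
    """Turn a canonical reference into a pointer, or None if it does not match."""
    match = re.fullmatch(match_pattern, cref)
    if match is None:
        return None
    # Replace each $n in the template with the n-th captured group.
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), replacement_pattern)

# Hypothetical pattern for references of the form "Book chapter:verse".
MATCH = r"(\w+) (\d+):(\d+)"
REPLACEMENT = "#xpath(//div[@n='$1']/div[@n='$2']/ab[@n='$3'])"
```

With these sample values, "John 3:16" resolves to a pointer, while a free-text label such as "Editor's Preface" does not match and returns None, which is the case where the value behaves more like an annotation.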

     
  • They are references to canonical divisions, but you are right, my @cRefStart and @cRefEnd values would not be valid @cRef values.

    You could say they "point" to something, but what they point to does not exist (or is not presumed to exist) in any XML that the XSLT can see. It exists on paper or a computer screen in the encoder's office.

    I was using refsDecl to record, just with a textual description, which segmentation is canonical (e.g., "Segmentation follows the edition by McKechnie, Glasgow, 1914"). I haven't used cRefPattern at all.

     
  • Martin Holmes
    2012-02-03

    I think I'd be inclined to build an external file in which all of these external canonical divisions were modelled (or put them in the header). But I think what you're encoding, if I understand correctly, is a set of milestones:

    "<milestone> marks a boundary point separating any kind of section of a text, typically but not necessarily indicating a point at which some part of a standard reference system changes, where the change is not represented by a structural element. [3.10.3 Milestone Elements]"

    Among <milestone>'s attributes is @ed, which "supplies an arbitrary identifier for the source edition in which the associated feature (for example, a page, column, or line break) occurs at this point in the text", which would allow you to point at a specific witness or edition; and @corresp, which would allow you to specify a location in that edition. A basic model of the edition itself (just its major structural divisions) could be used to provide targets for @corresp.

    But all this may be making your life much more complicated than you want it to be. :-)

     
  • After considering the possible uses and looking at what others have tried to do with parallel texts, I decided against an external cross-reference file. Instead I use the canonical division (which is likely already established) and say to encoders, "Just tag your divisions with reference to this standard, and your texts will align with the texts encoded by others."

    I considered milestones. But I concluded that the atomic unit for alignment would best be a structural division as a unit and not a point in the text. Bekker numbers (page, column, and line number) in Aristotle, a perfect example of a milestone, would not be used for alignment, but book and chapter numbers would be. I display Bekker numbers but I don't use them for alignment.

    This choice to use structural units for alignment rather than textual milestones promised to solve several alignment problems, and so far, it's working out well. I'm optimistic that I'll succeed where others have been stymied.

    Separately then, could I use <milestone> to encode the structural boundaries? I could but decided against it. If the marks are really attributes of the division, they should be attributes of the DIV.

    Said differently, I found it better to have a DIV announce to the world, "I begin where canonical chapter 1 begins and I end where canonical chapter 2 ends," than for a DIV to stay silent and have those divisions marked in the text.

    The structural alignment also seems to match the way users think of parallel texts: "This edition combined chapters 2 and 3 into one chapter." "This translation left out the prologue of each section." "This manuscript skipped paragraph 3 but included the epilogue."

     
  • Lou Burnard
    2012-03-13

    I find John's proposed use case for @cRef on <div> vel sim quite persuasive, but that's not really relevant to Martin's comments on its use on <ref> or <ptr>. I also agree that if we retain cRef in any form it should be supplied by a class, should not be a pointer, and should be consistently applied. FWIW I think the current usage on <gloss> and <term> is wrong, and I also think that the datatype "unbounded word" does not imply that a cRef can have multiple targets (it doesn't imho have targets anyway). Quite a lot to sort out here, as previously noted.

     
  • Lou Burnard
    2012-03-13

    • milestone: --> 871212
     
  • James Cummings
    2012-06-29

    • assigned_to: nobody --> gabrielbodard
     
  • James Cummings
    2012-09-21

    Council accepts this at the face-to-face meeting of 2012-09, except for the datatype, which should be data.text.

     
  • James Cummings
    2012-09-21

    • milestone: 871212 --> GREEN
    • status: open --> open-accepted
     
  • Lou Burnard
    2012-09-21

    Agreed that we need to define a new attribute class to supply @cRef with a datatype of data.text, whose members are only the elements that currently carry @cRef.

     
  • Lou Burnard
    2012-09-21

    • assigned_to: gabrielbodard --> louburnard
     
  • Lou Burnard
    2012-09-23

    At rev. 10861, implemented new class att.cReferencing, which supplies @cRef with a datatype of data.text; added ptr, ref, gloss, and term to this class.

     
  • Lou Burnard
    2012-09-23

    • status: open-accepted --> closed-accepted