Text Encoding Initiative / Feature Requests / #378 Encoding of Standoff annotations

Lou Burnard - 2012-09-16

TEI already provides many elements for adding various kinds of standoff annotations (<link>, <certainty>, <join>, <alt>, <fs>, etc.). It doesn't provide any particular place for storing all such annotations, though this has been proposed at various times (in the days of SGML there was a proposal for something called a "LinkDataBlock" or <ldb>, which I quite liked). I think the basic question would be: what advantage is there in creating such a special block? What does it provide that simply putting a <div type="links"> inside the <front>, <body> or <back> doesn't? What use cases are there?


Lou Burnard - 2012-09-16

milestone: --> AMBER

assigned_to: nobody --> bansp

Sebastian Rahtz - 2012-09-16

the massive arguments against
<div type="links"> inside the <front> <body> or <back>
are that
a) it relies on assumptions about values of @type, which must be entirely abhorrent to us. We cannot expect processors to "just know" things like that. "No Magic Here" must be our mantra.
b) a <div> is a "subdivision of the front, body, or back of a text", NOT an arbitrary container as HTML's <div> is. This bunch of standoff stuff is not a subdivision of the text we are encoding.
If you add a <div> full of links, and then ask "so how many subdivisions of the text are there", the answer will be a spurious 1 more than expected.

So I am much in favour of an entirely unambiguous freestanding container for this stuff. It costs us nothing, makes life much easier for processors, and provides part of the much-needed better guidance and support for standoff-ish people.
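Point (b) can be illustrated with a minimal Python sketch. The sample documents, the `type="links"` value and the `count_divs` helper are made up for illustration, and the TEI namespace is omitted to keep things short; it only shows how a naive processor would over-count subdivisions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a body with two genuine subdivisions of the text.
PLAIN = """<text><body>
  <div type="chapter"><p xml:id="p1">First.</p></div>
  <div type="chapter"><p xml:id="p2">Second.</p></div>
</body></text>"""

# The same text, with the links parked in an extra <div type="links">.
WITH_LINKS = PLAIN.replace(
    "</body>",
    '  <div type="links"><link target="#p1 #p2"/></div>\n</body>',
)


def count_divs(xml_source: str) -> int:
    """Count the <div> children of <body>, as a naive processor would."""
    body = ET.fromstring(xml_source).find("body")
    return len(body.findall("div"))


plain_count = count_divs(PLAIN)
links_count = count_divs(WITH_LINKS)
print(plain_count, links_count)
```

To get the right answer for the second document, the processor would have to "just know" that `type="links"` is special, which is exactly the magic objected to in point (a).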


Javier Pose - 2012-09-17

I would also be against the idea of having <div type="..."> as a place for the stand-off annotations.
Stand-off annotations are becoming more and more an essential part of the encoding of any piece of textual information. Following the same philosophy as used for <facsimile> and for <sourceDoc>, the stand-off annotations (which are not part of the text themselves) must be stored in a "separate" place, distinct from the text. This makes the nature of the information much clearer and helps to encode and process it.


Laurent Romary - 2012-09-17

I tend to be with Sebastian and Javier and would definitely support the introduction of a new element. It is in line with similar mechanisms for embedding representations external to the text proper, and it would bring so much fresh air for corpus linguistics people, who are constantly striving to find a decent solution as to where to put such data.


Lou Burnard - 2012-09-20

OK, I agree that Sebastian's arguments are persuasive. Do we want this element to be another sibling of <text>, or would it be more plausible to put it inside <encodingDesc> or elsewhere in the <teiHeader>?


Laurent Romary - 2012-09-20

I see this as a child of text, since it is not metadata but an additional layer on the data represented in the body (a little like a facsimile is preliminary data to the transcribed content).


Javier Pose - 2012-09-20

I would tend to put it as a new element between the <teiHeader> and the <text>, similarly to <facsimile> or <sourceDoc>, for the following reasons:
1) I wouldn't include the stand-off annotations as part of the teiHeader, because the teiHeader must supply the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. The stand-off annotations comprise "extra" information, like the annotations themselves, that is neither part of the text being described nor metadata associated with it.
Therefore, in order to make this difference clear, I would suggest putting the annotations outside the <teiHeader>.

2) I wouldn't include the stand-off annotations as part of the text, because the <text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample. The stand-off annotations are text, but they are not the text being encoded; they are metadata and text related to the text being encoded. Therefore, in order to make this difference clear, I would suggest putting the annotations outside the <text>.

Taking these reasons into account, I would tend to think that the best place for the standoff annotations (metadata + text associated with the text being encoded) would be in a new "area" between the <teiHeader> and the <text>. This is similar to the idea of, for example, <facsimile>, which contains information that is neither directly the text being encoded nor its metadata, but a new "object" related to said text, i.e. the information about the facsimile of the text.
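A minimal Python sketch of this placement, under stated assumptions: the element name `standoff` was still undecided at this point in the thread, the sample document is invented, and the TEI namespace is left out. It only illustrates that annotations held between header and text leave the subdivisions of the text untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the annotation container sits between
# <teiHeader> and <text>, in the way <facsimile> does.
DOC = """<TEI>
  <teiHeader><fileDesc/></teiHeader>
  <standoff>
    <spanGrp type="topic"><span from="#p1" to="#p2">weather</span></spanGrp>
  </standoff>
  <text><body>
    <div><p xml:id="p1">It rained.</p><p xml:id="p2">It poured.</p></div>
  </body></text>
</TEI>"""

root = ET.fromstring(DOC)

# Order of the top-level parts of the document.
top_level = [child.tag for child in root]

# The text keeps exactly the subdivisions it really has ...
divs_in_text = len(root.findall("./text/body/div"))

# ... while the annotations live outside both header and text.
spans_outside_text = len(root.findall("./standoff//span"))

print(top_level, divs_in_text, spans_outside_text)
```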


Laurent Romary - 2012-09-21

For the record, I meant the same as Javier: an element similar to <facsimile>, hence between header and text, namely a member of model.resourceLike.


Sebastian Rahtz - 2012-09-21

The Council meeting of 2012-09 agreed to the underlying request, to create this sibling of <teiHeader>, but without a decision about what to call it (<standoff>?) or what the content model should be. The latter needs a working party to agree a detailed spec.


Laurent Romary - 2012-09-21

<annotations>?
The thing should have a very simple content model based on a model class, so that external vocabularies can be included easily as a customization.
From a TEI internal point of view we should have a couple of examples where we gather typical annotation examples from the guidelines (spanGrp's etc.)


Javier Pose - 2012-09-28

Hi Sebastian (or the responsible person),
regarding the working party that is to agree a detailed specification:
would it be possible for me to take part in such a working team?
I would be very interested.
Regards,


Sebastian Rahtz - 2012-09-28

Yes indeed, this will be an open group, I think. We didn't decide at the meeting who would convene it.


Javier Pose - 2012-09-28

Thanks !!!
Then I will wait for some feedback about the final composition of the team.
Do you have an idea how long it will take to build such a group and start working?


Piotr Banski - 2012-10-24

Dear Javier, thanks for bringing the idea to the fore. I should have done that after publishing the paper, but I waited for a "good moment", i.e. for an official opening of a LingSIG space at Sourceforge. It's much better to have run this past the Council earlier and got the green light.

In a broad sense, the group that you might want to join to elaborate on the content of <standOff> [1] is the LingSIG:

http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists

I will update the page with new info, at the latest during the upcoming TEI Conference, possibly earlier.


Piotr Banski - 2012-10-24

As for naming, I vaguely recall that we may have decided to go for <standoff> (no camel case), after Sebastian pointed out that it functions as a single word.

I have now become convinced that I must have applied a lot of wishful thinking when I interpreted our decision as "go ahead and put it in, we'll worry about the exact content later". Now I believe that I may have been the only one who was prepared to see this in the upcoming release, possibly because others were actually using their brains.

OK, so now I interpret our decision as, basically, "take this to LingSIG, possibly under the supervision of the Council group enumerated in the minutes[1]". I will -- after tomorrow's release, I'll open the LingSIG space on SF.

[1]: http://www.tei-c.org/Activities/Council/Meetings/tcm52.xml#body.1_div.2_div.5_div.1 (group B)

(Everyone OK with that?)


Laurent Romary - 2012-10-25

Independently of how fast LingSIG moves ahead with this, I feel that we should implement this <standoff> step by step and be very pragmatic. For a start I would just define <standoff> with a simple content model based on a class model.standoffPart. The next step is to feed the class with the TEI elements that are relevant there, namely things like spanGrp, interpGrp, linkGrp. Then see how the community implements this and requires further content (prediction: we'll have to deal with internal organisation of the thing, à la <div>; I promise a good can of worms, but that is not a reason not to move ahead now).
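The step-by-step idea can be sketched in Python as a class-based content check. This is only an illustration under assumptions: the set below mirrors the three elements named above as the initial members of the proposed model.standoffPart, the `unexpected_children` helper and the sample are hypothetical, and a real implementation would live in the ODD/schema rather than in code like this.

```python
import xml.etree.ElementTree as ET

# Assumed initial membership of the proposed class model.standoffPart,
# taken from the elements named in the comment above.
MODEL_STANDOFF_PART = frozenset({"spanGrp", "interpGrp", "linkGrp"})


def unexpected_children(standoff_xml, allowed=MODEL_STANDOFF_PART):
    """Return the names of direct children that are not class members."""
    standoff = ET.fromstring(standoff_xml)
    return [child.tag for child in standoff if child.tag not in allowed]


# Hypothetical sample; <listApp> stands in for "further content"
# that the community might ask for later.
SAMPLE = """<standoff>
  <linkGrp type="translation"><link target="#s1 #s2"/></linkGrp>
  <interpGrp><interp xml:id="n">noun</interp></interpGrp>
  <listApp/>
</standoff>"""

# Step one: only the initial class members are accepted.
step_one = unexpected_children(SAMPLE)

# Later step: the class is extended; the <standoff> rule itself is unchanged.
step_two = unexpected_children(SAMPLE, MODEL_STANDOFF_PART | {"listApp"})

print(step_one, step_two)
```

The design point is that growth happens by adding members to the class, so customizations can bring in external vocabularies without redefining the container.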


Piotr Banski - 2012-10-25

My only concern is putting this can of worms *open* into the upcoming release. Can't recall anything like this done before (though I bet you might!).


Laurent Romary - 2012-10-25

It will not be a can of worms at this stage and we badly need the mechanism. I would suggest implementing it for the next release as suggested in my previous post. General agreement?


Piotr Banski - 2012-10-25

But a single element is not a mechanism...

I'm going to open LingSIG space on SF after this release, basically copying the tagged branch, to be experimented on.
How about focusing on the mechanism there, first, and proposing a sketch of the contents at the next F2F (which I may be present at or not, depending on the elections, but will still inform as a LingSIG convener).

For one thing, I very strongly disagree with Javier's idea of putting metadata into <annotation> located under <standoff>. This is what the header is for...

(Incidentally, I'm really glad you're saying that we need this badly, would you read the paper, too? I'm citing Nancy and you like crazy there, you're gonna like it, I'm calling you there my favourite French angel, for example)


Javier Pose - 2012-10-25

Hi Piotr,
regarding your last comment, what type of metadata do you not want to have under <standoff>? What are you thinking of having in <standoff>?
As far as I understood, the annotation should have information like the pointer to what it refers to, the data of the annotation, and other information like author, date...
I guess (correct me if I am wrong) that you are of the opinion that information like author, date, and possibly other items (?) that you refer to as metadata should not be under <standoff>. If this is the case, I don't agree with it.
As far as I understand, the <teiHeader> supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. This element should contain all the metadata referring to the text itself. Now, when we consider the <annotation>, this is another kind of "object" related to the text (like <facsimile> in some sense), and (in my opinion) it would be better to keep the metadata associated with each one separately. If we put the metadata in the <teiHeader> we will end up mixing different information (metadata) under the same structure, which can be very confusing.
Since the metadata of the annotations is basically author, date and a couple more elements, I would be more in favour of keeping it all encapsulated under the annotation.


Piotr Banski - 2012-10-25

Hi Javier (and Laurent),

You're making my point, in a way: we're still far from a uniform vision of this. Part of the issue is that we are making different initial assumptions. In the article that Javier has quoted, my starting point (well, one of them) is simplifying the overall handling of annotation documents, by allowing <teiHeader, standoff> structures ('<>' for an ordered pair, this time). I mentioned that, for full flexibility and parallelism, we could also allow for <teiHeader, standoff, text>.
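The ordered-pair idea can be sketched in Python as a check on the sequence of top-level children. This is a simplified illustration: the `ALLOWED_SEQUENCES` table and the `is_allowed` helper are hypothetical, the name `standoff` is provisional, and other existing top-level parts such as <facsimile> or <sourceDoc> are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed set of permitted top-level sequences:
# the existing <teiHeader, text> plus the two structures discussed above.
ALLOWED_SEQUENCES = {
    ("teiHeader", "text"),
    ("teiHeader", "standoff"),
    ("teiHeader", "standoff", "text"),
}


def is_allowed(doc_xml):
    """True if the children of the root form one of the allowed sequences."""
    root = ET.fromstring(doc_xml)
    return tuple(child.tag for child in root) in ALLOWED_SEQUENCES


# An annotation-only document: header plus standoff, no text.
annotation_only = "<TEI><teiHeader/><standoff/></TEI>"
# Full parallelism: header, standoff and text in one document.
full_document = "<TEI><teiHeader/><standoff/><text/></TEI>"
# Standoff after the text is not one of the sequences sketched here.
wrong_order = "<TEI><teiHeader/><text/><standoff/></TEI>"

results = [is_allowed(d) for d in (annotation_only, full_document, wrong_order)]
print(results)
```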

Your concern regarding the header is partly valid. I don't see the problem with a 'doubled' author information as forcing a solution whereby some formal metadata is squeezed below the header -- one solution could be: if there's a conflict, keep them separated and linked virtually, in a fully standoff manner (after all, standoff was a solution to, among others, information container overlap).

This is just a sample of the kind of discussion that we may have on this issue, and the possible compromises that we can come up with. Which is, as I said at the very beginning of this note, something that I see as a strong argument against implementing this fast and worrying later.

I don't buy the argument that "we need it NOW". We have needed it NOW for the past, roughly counting, 17 years -- since the CES, if not before. Well, we will have it in a few days, in the LingSIG part of SF, open to experimentation and discussion. I believe that this *does* mean progress, without putting the entire architecture at risk, also from the point of view of the community's response to the Council's doings.


Martin Holmes - 2012-10-25

For the content model of <standoff>, don't forget the recently-added <listApp>.


Javier Pose - 2012-10-30

Hi,
has the Working Group been created?
How can I join it? Is there a mailing list?
