#378 Encoding of Standoff annotations

Milestone: AMBER
Status: open
Owner: Peter Stadler
Priority: 5 (default)
Updated: 2015-03-17
Created: 2012-08-26
Creator: Javier Pose
Private: No

The annotation of documents using standoff annotations is a very useful and flexible methodology. Nevertheless, TEI does not have any specific elements for encoding this information.
In most cases, standoff annotations are stored as external TEI files linked to the text being annotated. However, this way of storing them is very rigid and causes numerous problems, for example when indexing or searching a corpus of documents using the information in the annotations. In these cases, it would be very useful to have the standoff annotations INSIDE the TEI documents being annotated (!!!).

Therefore, it is suggested to define a new set of TEI elements specifically dedicated to the encoding of standoff annotations.

The idea would be to store the standoff annotations between the <teiHeader> and the <text>, following the same philosophy as used for the <facsimile> and for <sourceDoc> (in some way these two elements could also be considered as a "type" of annotation).

For the standoff annotation, the structure could be:

<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
[information of the annotations]
</standoff>
<text>
...
</text>
</TEI>

This structure would have the additional advantage of allowing information to be annotated at different TEI levels in a natural manner. For more complex TEI documents with several hierarchical levels, the standoff annotations could be encoded as follows:

<teiCorpus>
<teiHeader>
...
</teiHeader>
<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
...
</standoff>
<text>
...
</text>
</TEI>
<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
...
</standoff>
<text>
...
</text>
</TEI>
</teiCorpus>

This structure would also make it possible to annotate not only the text of the document, but also the metadata at the different hierarchical levels of the TEI document.

The specific encoding of the annotations inside <standoff> could be as follows:

<standoff>
<annotation type="..." subtype="...">
<author>...</author>
<date>...</date>
<ptr>...</ptr>
[other data needed]
</annotation>
</standoff>
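
To make the skeleton above more concrete, here is a purely illustrative filled-in instance. <standoff> and <annotation> are the elements proposed in this request, not existing TEI elements; the attribute values, the author name, the date, the target "#p1" and the <note> child are all hypothetical placeholders. The only detail taken from TEI as it stands is that <ptr> is an empty element that points by means of @target.

```xml
<standoff>
  <!-- hypothetical annotation; every value here is invented for illustration -->
  <annotation type="comment" subtype="editorial">
    <author>A. N. Annotator</author>
    <date when="2012-08-26"/>
    <!-- assumes an element with xml:id="p1" exists inside <text> -->
    <ptr target="#p1"/>
    <!-- stands in for the "other data needed" slot of the skeleton -->
    <note>Example remark about the paragraph.</note>
  </annotation>
</standoff>
```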

As a final remark, it is also suggested that the TEI element <figure> be allowed inside <annotation>, in order to facilitate the annotation not only of textual information, but also of images and formulas.

Conclusion: the proposed structure for the encoding of standoff annotations in TEI provides the following advantages:

- allows standoff annotations to be encoded in TEI in a natural manner, which is not the case at the moment
- allows the standoff annotations to be stored INSIDE the TEI document being annotated, in a specific location, facilitating the indexing and searching of such documents
- is naturally integrated into the hierarchical structure of the TEI (<teiCorpus>, <TEI>)
- allows both textual and non-textual information to be annotated, as well as the metadata in the <teiHeader>
- facilitates the exchange of annotations, because they are already stored in the original TEI document being annotated

Remark: this idea has already been suggested by Piotr Bański in his article "Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless", at http://www.balisage.net/Proceedings/vol5/html/Banski01/BalisageVol5-Banski01.html

Discussion

1 2 > >> (Page 1 of 2)
  • Lou Burnard
    2012-09-16

    TEI already provides many elements for adding various kinds of standoff annotations (<link>, <certainty>, <join>, <alt>, <fs>, etc.). It doesn't provide any particular place for storing all such annotations, though this has been proposed at various times (in the days of SGML there was a proposal for something called a "LinkDataBlock" or <ldb>, which I quite liked). I think the basic question would be: what advantage is there in creating such a special block? What does it provide that simply putting a <div type="links"> inside the <front>, <body> or <back> doesn't? What use cases are there?
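
    A sketch of the <div type="links"> alternative mentioned above, for illustration only: the xml:id values, the @type value on <linkGrp> and the link target are hypothetical, and placing the <div> in <back> is just one of the three options named.

    ```xml
    <text>
      <body>
        <p xml:id="p1">...</p>
        <p xml:id="p2">...</p>
      </body>
      <back>
        <!-- links kept in an ordinary <div>, recognisable only by its @type value -->
        <div type="links">
          <linkGrp type="correspondence">
            <!-- associates the two hypothetical paragraphs above -->
            <link target="#p1 #p2"/>
          </linkGrp>
        </div>
      </back>
    </text>
    ```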

     
  • Lou Burnard
    2012-09-16

    • milestone: --> AMBER
    • assigned_to: nobody --> bansp
     
  • the massive arguments against
    <div type="links"> inside the <front> <body> or <back>
    are that
    a) it relies on assumptions about values of @type, which must be entirely abhorrent to us. we cannot expect processors to "just know" things like that. "No Magic Here" must be our mantra.
    b) a <div> is a "subdivision of the front, body, or back of a text", NOT an arbitrary container as HTML's <div> is. This bunch of standoff stuff is not a subdivision of the text we are encoding.
    If you add a <div> full of links, and then ask "so how many subdivisions of the text are there", the answer will be a spurious 1 more than expected.

    So I am much in favour of the entirely unambiguous freestanding container for this stuff. It costs us nothing, makes life much easier for processors, and provides part of the much-needed better guidance and support for standoff-ish people.

     
  • Javier Pose
    2012-09-17

    I would also be against the idea of having <div type="..."> as a place for the stand-off annotations.
    Stand-off annotations are becoming more and more an essential part of the encoding of any piece of textual information. Following the same philosophy as used for <facsimile> and for <sourceDoc>, the stand-off annotations (which are not part of the text themselves) must be stored in a "separate" place, distinct from the text. This makes the nature of the information much clearer and helps to encode and process it.

     
  • Laurent Romary
    2012-09-17

    I tend to be with Sebastian and Javier and would definitely support the introduction of a new element. It is in line with similar mechanisms for embedding representations external to the text proper, and would bring so much fresh air for corpus linguistics people constantly striving to find a decent solution as to where to put such data.

     
  • Lou Burnard
    2012-09-20

    OK, I agree that Sebastian's arguments are persuasive. Do we want this element to be another sibling of <text>, or would it be more plausible to put it inside <encodingDesc> or elsewhere in the <teiHeader>?

     
  • Laurent Romary
    2012-09-20

    I see this as a child of text, since it is not metadata, but an additional layer to the data represented in the body (a little like a facsimile is preliminary data to the transcribed content).

     
  • Javier Pose
    2012-09-20

    I would tend to put it as a new element between the <teiHeader> and the <text>, similarly to <facsimile> or <sourceDoc>, for the following reasons:
    1) I wouldn't include the stand-off annotations as part of the teiHeader, because the teiHeader must supply the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. The stand-off annotations comprise "extra" information, like the annotations themselves, that is neither part of the text being described nor metadata associated with it.
    Therefore, in order to make this difference clear, I would suggest putting the annotations outside the <teiHeader>.

    2) I wouldn't include the stand-off annotations as part of the text, because the <text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample. In this case the stand-off annotations are text, but not the text being encoded; they are metadata and text related to the text being encoded. Therefore, in order to make this difference clear, I would suggest putting the annotations outside the <text>.

    Taking these reasons into account, I would tend to think that the best place for the standoff annotations (metadata + text associated with the text being encoded) would be in a new "area" between the <teiHeader> and the <text>. This is similar to the idea of, for example, <facsimile>, which contains information that is neither directly the text being encoded nor its metadata, but a new "object" related to said text, i.e. the information about the facsimile of the text.

     
  • Laurent Romary
    2012-09-21

    For the record, I meant the same as Javier: an element similar to <facsimile>, hence between header and text, namely a member of model.resourceLike.

     
  • The Council meeting of 2012-09 agreed to the underlying request, to
    create this sibling of <teiHeader>; but without a decision about what to call it (<standoff>?) or what the content model should be. The latter needs a working party to agree a detailed spec.

     
  • Laurent Romary
    2012-09-21

    <annotations>?
    The thing should have a very simple content model based on a model class, so that external vocabularies can be included easily as a customization.
    From a TEI internal point of view we should have a couple of examples where we gather typical annotation examples from the guidelines (spanGrp's etc.)

     
  • Javier Pose
    2012-09-28

    Hi Sebastian (or the responsible person),
    regarding the working party that is to agree a detailed specification:
    would it be possible for me to take part in it?
    I would be very interested.
    Regards,

     
  • yes indeed, this will be an open group, I think. we didn't decide at the meeting who would convene it

     
  • Javier Pose
    2012-09-28

    Thanks !!!
    Then I will wait for feedback about the final composition of the team.
    Do you have an idea how long it will take to build such a group and start working?

     
  • Piotr Banski
    2012-10-24

    Dear Javier, thanks for bringing the idea to the fore, I should have done that after publishing the paper, but waited for a "good moment", i.e. for an official opening of a LingSIG space at Sourceforge, but it's much better to have run this across the Council earlier, and get the green light.

    In a broad sense, the group that you might want to join to elaborate on the content of <standOff> [1] is the LingSIG:

    http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists

    I will update the page with new info, at the latest during the upcoming TEI Conference, possibly earlier.

     
  • Piotr Banski
    2012-10-24

    As for naming, I vaguely recall that we may have decided to go for <standoff> (no camel case), after Sebastian pointed out that it functions as a single word.

    I have now become convinced that I must have applied a lot of wishful thinking when I interpreted our decision as "go ahead and put it in, we'll worry about the exact content later". Now I believe that I may have been the only one who was prepared to see this in the upcoming release, possibly because others were actually using their brains.

    OK, so now I interpret our decision as, basically, "take this to LingSIG, possibly under the supervision of the Council group enumerated in the minutes[1]". I will -- after tomorrow's release, I'll open the LingSIG space on SF.

    [1]: http://www.tei-c.org/Activities/Council/Meetings/tcm52.xml#body.1_div.2_div.5_div.1 (group B)

    (Everyone OK with that?)

     
  • Laurent Romary
    2012-10-25

    Independently of how fast LingSIG will move ahead with this, I have the feeling that we should implement this <standoff> step by step and be very pragmatic. For a start I would just define <standoff> with a simple content model based on a class model.standoffPart. The next step is to feed the class with TEI elements that are relevant there, namely things like spanGrp, interpGrp, linkGrp. Then see how the community implements this and requires further content (prediction: we'll have to deal with internal organisation of the thing, à la <div>; I promise a good can of worms, but not a reason to move ahead now).
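
    A minimal sketch of what the first two steps could produce, for illustration only: <standoff> is the element under discussion, and every xml:id, @type value and pointer target below is a hypothetical placeholder. The three children are existing TEI elements of the kind that would be fed into model.standoffPart.

    ```xml
    <standoff>
      <!-- a span over word elements assumed to carry xml:id="w1" to "w3" in <text> -->
      <spanGrp type="chunks">
        <span from="#w1" to="#w3">noun phrase</span>
      </spanGrp>
      <!-- a reusable interpretation that elements in <text> could reference via @ana -->
      <interpGrp type="pos">
        <interp xml:id="pos.noun">noun</interp>
      </interpGrp>
      <!-- an alignment between two hypothetical segments s1 and s2 -->
      <linkGrp type="alignment">
        <link target="#s1 #s2"/>
      </linkGrp>
    </standoff>
    ```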

     
  • Piotr Banski
    2012-10-25

    My only concern is putting this can of worms *open* into the upcoming release. Can't recall anything like this done before (though I bet you might!).

     
  • Laurent Romary
    2012-10-25

    It will not be a can of worms at this stage, and we badly need the mechanism. I would suggest implementing it for the next release as suggested in my previous post. General agreement?

     
  • Piotr Banski
    2012-10-25

    But a single element is not a mechanism...

    I'm going to open LingSIG space on SF after this release, basically copying the tagged branch, to be experimented on.
    How about focusing on the mechanism there, first, and proposing a sketch of the contents at the next F2F (which I may be present at or not, depending on the elections, but will still inform as a LingSIG convener).

    For one thing, I very strongly disagree with Javier's idea of putting metadata into <annotation> located under <standoff>. This is what the header is for...

    (Incidentally, I'm really glad you're saying that we need this badly, would you read the paper, too? I'm citing Nancy and you like crazy there, you're gonna like it, I'm calling you there my favourite French angel, for example)

     
  • Javier Pose
    2012-10-25

    Hi Piotr,
    regarding your last comment, what type of metadata do you not want to have under <standoff>? What do you envisage having in <standoff>?
    As far as I understood, the annotation should have information like the pointer to what it refers to, the data of the annotation, and other information like author, date...
    I guess (correct me if I am wrong) that you are of the opinion that information like author, date, and possibly other things (?) that you refer to as metadata should not be under <standoff>. If this is the case, I don't agree with it.
    As far as I understand, the <teiHeader> supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. This element should contain all the metadata referring to the text itself. Now, when we consider the <annotation>, this is another kind of "object" related to the text (like <facsimile> in some sense), and (in my opinion) it would be better to keep the metadata associated with each one separately. If we put the metadata in the <teiHeader>, we will end up mixing different information (metadata) under the same structure, which can be very confusing.
    Since the metadata of the annotations is basically author, date and a couple more elements, I would be more in favour of keeping it all encapsulated under the annotation.

     
  • Piotr Banski
    2012-10-25

    Hi Javier (and Laurent),

    You're making my point, in a way: we're still far from a uniform vision of this. Part of the issue is that we are making different initial assumptions. In the article that Javier has quoted, my starting point (well, one of them) is simplifying the overall handling of annotation documents, by allowing <teiHeader, standoff> structures ('<>' for an ordered pair, this time). I mentioned that, for full flexibility and parallelism, we could also allow for <teiHeader, standoff, text>.
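
    For clarity, the <teiHeader, standoff> pairing described above would amount to something like the following sketch. It assumes that <text> becomes optional when <standoff> is present, which is part of the proposal and not current TEI; the content of both children is left elided.

    ```xml
    <!-- annotation-only document: a header plus standoff, with no <text> -->
    <TEI>
      <teiHeader>
        ...
      </teiHeader>
      <standoff>
        ...
      </standoff>
    </TEI>
    ```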

    Your concern regarding the header is partly valid. I don't see the problem with a 'doubled' author information as forcing a solution whereby some formal metadata is squeezed below the header -- one solution could be: if there's a conflict, keep them separated and linked virtually, in a fully standoff manner (after all, standoff was a solution to, among others, information container overlap).

    This is just a sample of the kind of discussion that we may have on this issue, and the possible compromises that we can come up with. Which is, as I said at the very beginning of this note, something that I see as a strong argument against implementing this fast and worrying later.

    I don't buy the argument that "we need it NOW". We have needed it NOW for the past, roughly counting, 17 years -- since the CES, if not before. Well, we will have it in a few days, in the LingSIG part of SF, open to experimentation and discussion. I believe that this *does* mean progress, without putting the entire architecture at risk, also from the point of view of the community's response to the Council's doings.

     
  • Martin Holmes
    2012-10-25

    For the content model of <standoff>, don't forget the recently-added <listApp>.

     
  • Javier Pose
    2012-10-30

    Hi,
    has the Working Group been created?
    How can I join it? Is there a mailing list?

     