#378 Encoding of Standoff annotations

Milestone: AMBER
Status: open
Priority: 5(default)
Updated: 2015-05-30
Created: 2012-08-26
Creator: Javier Pose
Private: No

Annotating documents with standoff annotations is a very useful and flexible methodology. Nevertheless, TEI does not have any specific elements for encoding this information.
In most cases, standoff annotations are stored as external TEI files linked to the text being annotated. This way of storing them is very rigid and causes numerous problems, for example when indexing or searching a corpus of documents using the information in the annotations. In these cases, it would be very useful to have the standoff annotations INSIDE the TEI documents being annotated (!!!).

Therefore, I suggest defining a new set of TEI elements specifically dedicated to encoding standoff annotations.

The idea would be to store the standoff annotations between the <teiHeader> and the <text>, following the same philosophy used for <facsimile> and <sourceDoc> (in a way, these two elements could also be considered a "type" of annotation).

For the standoff annotation, the structure could be:

<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
[information of the annotations]
</standoff>
<text>
...
</text>
</TEI>

This structure would have the extra advantage of making it natural to annotate information at different TEI levels. For more complicated TEI documents with several hierarchical levels, the standoff annotations could be encoded as follows:

<teiCorpus>
<teiHeader>
...
</teiHeader>
<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
...
</standoff>
<text>
...
</text>
</TEI>
<TEI>
<teiHeader>
...
</teiHeader>
<standoff>
...
</standoff>
<text>
...
</text>
</TEI>
</teiCorpus>

This structure would also make it possible to annotate not only the text of the document, but also the metadata at the different hierarchical levels of the TEI document.

The specific encoding of the annotations inside <standoff> could be as follows:

<standoff>
<annotation type="..." subtype="...">
<author>...</author>
<date>...</date>
<ptr>...</ptr>
[other data needed]
</annotation>
</standoff>
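
To make the skeleton above more concrete, here is a minimal sketch of what one filled-in annotation might look like. It is only an illustration under stated assumptions: <standoff> and <annotation> are the elements proposed in this ticket (not existing TEI), and the attribute values, the author name, the date, the xml:id "p1", the sample sentence, and the use of <note> for the "other data needed" are all hypothetical.

```xml
<TEI>
  <teiHeader>
    <!-- header content omitted in this sketch -->
  </teiHeader>
  <standoff>
    <!-- Proposed element; every value below is a made-up example -->
    <annotation type="editorial" subtype="comment">
      <author>A. N. Annotator</author>
      <date when="2012-08-26"/>
      <!-- Points at the annotated passage by its xml:id -->
      <ptr target="#p1"/>
      <!-- Assumption: the annotation body is carried by a <note> -->
      <note>This paragraph needs to be checked against the manuscript.</note>
    </annotation>
  </standoff>
  <text>
    <body>
      <p xml:id="p1">Sample paragraph being annotated.</p>
    </body>
  </text>
</TEI>
```

Because the annotation refers to the text only through the pointer, the content of <text> stays untouched, while a search over the file can still find both the paragraph and the annotation attached to it.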

As a last remark, it is also suggested to allow the TEI element <figure> inside <annotation>, in order to facilitate the annotation not only of textual information, but also of images and formulas.

Conclusion: the proposed structure for the encoding of standoff annotations in TEI provides the following advantages:

- it allows standoff annotations to be encoded in TEI in a natural manner, which is not the case at the moment
- it allows the standoff annotations to be stored INSIDE the TEI document being annotated, in a specific location, which facilitates indexing and searching those documents
- it is naturally integrated into the hierarchical structure of the TEI (<teiCorpus>, <TEI>)
- it allows annotating both textual and non-textual information, as well as the metadata of the <teiHeader>
- it facilitates the exchange of annotations, because they are already stored in the original TEI document being annotated

Remark: this idea has already been suggested by Piotr Bański in his article "Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless", available at http://www.balisage.net/Proceedings/vol5/html/Banski01/BalisageVol5-Banski01.html

Discussion

  • Lou Burnard

    Lou Burnard - 2013-06-18

    I entirely agree that this new element doesn't belong inside <text>. However, I am less convinced that it doesn't belong inside <teiHeader>. Javier says above "The stand-off annotations comprise extra information, ... that are not part of the text being described or metadata associated to it" But surely that extra information is precisely "metadata" -- information about the text? And we already have a lot of "standoffish" elements in the header, which are nothing to do with the "titlepage" aspects e.g. <particDesc> <listPlace> etc. I think there's a big difference between things like <facsimile> or <sourceDoc> which are different views of the same textual object, and the header which is at a different "meta" level.

     

    Last edit: Lou Burnard 2013-06-18
  • Martin Holmes

    Martin Holmes - 2013-06-18

    I don't think ancillary textual content is the same thing as metadata at all. Editorial annotations, prosopographical information etc. are not part of the core source document, but they are part of the text in another sense. Such material would not normally be put in a library catalogue, and it would traditionally be printed either on the page (as footnotes etc.) or in appendices. So I don't think it belongs in teiHeader.

     
  • Javier Pose

    Javier Pose - 2013-06-18

    As I already indicated, I think that the annotations shouldn't be part of the teiHeader. It is important to establish a conceptual distinction between the metadata of the text (i.e. information about the core nature of the source text) and the annotations (i.e. added information NOT "naturally" linked to the original text, which has been created, probably later on, with some specific meaning not directly related to the specific nature of the text). In a way, the annotations could be seen as "post-it" markers that provide information about the source text or parts of it. It would be somewhat strange to put the information from these post-its in the teiHeader of the source document (!!!). Following this simile, one could think of the annotations as very small kind-of-documents linked to the source text. Now, one interesting question would be: when does an "annotation" have enough "entity" to be considered an independent document? This issue could be clarified later on. For the moment, and in view of the previous reasons, I think that the annotations shouldn't be part of the teiHeader.

     
  • Lou Burnard

    Lou Burnard - 2013-06-18

    A related question in my mind is that of other kinds of descriptive data
    or annotation, such as RDF triples defining properties of the text
    content in some ontology. Would those be included within the proposed
    element?

     

  • Javier Pose

    Javier Pose - 2013-07-29

    Hello,
    I have been thinking lately about a possible structure for the "standoff" element and a general framework for encoding standoff annotations in TEI.

    In order to have a clearer proposal, I wrote a working document explaining in detail a proposed structure for encoding standoff annotations in <standoff>. I attach the document to this message.
    I hope this can help further discussion.

    Regards,

     
  • Sebastian Rahtz

    Sebastian Rahtz - 2013-07-29

    Can I suggest you publish this more widely, Javier, and tell people on TEI-L about it? This really deserves careful reading by many people.

     
  • Piotr Banski

    Piotr Banski - 2013-07-29

    Got a notification of Sebastian's post, but not of Javier's. Earlier today, I received Javier's document and I find it impressive. We'll try to open it for discussion from the SIG pages, if that is OK with everyone.

     
  • Martin Holmes

    Martin Holmes - 2013-07-29

    Thanks Javier for your very detailed description of the proposal. I have a couple of immediate responses to it:

    1. I think it should be called <standoff>. There's no reason to abbreviate, because this is not an element that's going to crop up dozens of times in a document, and it's really not clear what <stf> might mean if you don't already know.

    2. I think it should also be available as a sibling of <teiHeader> inside <teiCorpus>. I would imagine that a lot of the kind of data appearing in this element would be applicable to all the documents in a corpus.

     
  • Javier Pose

    Javier Pose - 2013-07-31

    Hi Martin, regarding your comments:

    1. I agree, the main standoff annotations block should be called "standoff" (the document I released also agrees with this, see section 5). The elements "stf" and "stfGrp" are children of this main element; they implement hierarchical standoff annotations and allow them to be grouped.

    2. I am still not sure whether the metadata of the standoff annotations should appear in the teiHeader. In principle I like the idea of treating the standoff annotations as independent entities, so I would like the annotations to be as "atomic" as possible, i.e. self-contained. This approach is also helped by the fact that "fsdDecl" can be independent of teiHeader, so if further metadata about the annotations is needed (in the case of feature structures) it can be defined outside teiHeader. In any case, I am still not 100% convinced of this, so another option would be to have part of the standoff metadata in the teiHeader (the document also lists this as an Open Issue, see section 8).

     
  • Piotr Banski

    Piotr Banski - 2013-10-10
    • labels: TEI: New or Changed Element --> TEI: New or Changed Element, LingSIG
    • Priority: 5 --> 1(low)
     
  • Piotr Banski

    Piotr Banski - 2013-10-10
    • Priority: 1(low) --> 5(default)
     
  • Piotr Banski

    Piotr Banski - 2013-10-10

    (Resetting the priority back to "5", with apologies for the incident, while pointing my finger at the SF maintainers.)

     
  • Piotr Banski

    Piotr Banski - 2013-11-09

    There is going to be a meeting devoted to this issue, in Jan/Feb 2014 in Berlin, with, minimally, Javier, Laurent and Piotr, hopefully also Andreas Witt, hopefully a Council representative, and surely several other colleagues interested in the topic from various angles.

    The Council is obviously going to be informed about the results of that meeting, either via its representative, or with a report from us. We hope that these results will be taken into consideration when the content of <standOff> (or <standoff>, or whatever it ends up called) is decided on.

     

    Last edit: Piotr Banski 2013-11-09
  • Syd Bauman

    Syd Bauman - 2013-11-13

    Some quick comments …

    I very much like the idea of a new linked-data-block kind of container element, although I'm not sure that <standoff> is sufficiently generic. One can imagine that a lot of useful stuff would get tucked into this space. The already mentioned <spanGrp>, <interpGrp>, <linkGrp>, of course; but also contextual information (<listPerson>, <listPlace>, etc.), specialized annotations (<listChange>, <witDetail>), and “phantom” text (e.g., <castItem>s that do not appear in the source text, the words spelled out by an acrostic).

    I mildly prefer making this new container an optional child of <text> before <front>, but could easily be talked into the “between <teiHeader> and <text>” idea. I shy away from putting it in the <teiHeader>.

    I have not figured out what the difference is between the suggested <annotation> element and the TEI <note> element. At first blush, they look the same.

    I have not paid attention to the OAC for some time now, but we should take a peek at what they're doing and also brush up on XStandoff before settling on anything.

     
  • Piotr Banski

    Piotr Banski - 2014-01-31

    This is just to note that we have just finished a very successful (imnsho) meeting devoted to these issues, hosted and chaired by Laurent Romary at HUB, with life-maintaining infrastructural support from Carolin Odebrecht. The results will be reported on to the Council by Peter Stadler, who has been of great help, by keeping the minutes and editing the ODD, and raising valid issues all at the same time. Members of the LingSIG, which was happy (well, eager) to provide the small-scale "institutional" umbrella for the relevant post-meeting activities, will be notified and queried as well. The same goes for the TEI community at large, in due time (which the Council will probably determine for us).

    The ball is rolling and we will be hoping for the Council's comments and hints after the upcoming F2F meeting, if not earlier.

     
  • Peter Hinkelmanns

    Dear all, I'm very interested in using TEI standoff for linguistic annotation of historical data (manuscripts, 15th–17th century). I'd like to ask what the current status of that proposal is.

    <standoff> in my opinion is neither <header> nor <text> content. I would differentiate between meta elements like a manuscript description and actual metadata on the text. The element <standoff> could be used for segmentation information like word tokenization and for statements about those segments like POS tagging, at least to my understanding.

    Thank you!

     
  • Laurent Romary

    Laurent Romary - 2015-03-14

    I put together a Github project with ODD spec and examples for review and discussion: https://github.com/laurentromary/stdfSpec
    It would be good to move ahead slowly with this.

     
  • Laurent Romary

    Laurent Romary - 2015-03-14

    @PeterHinkelmanns: please provide examples that could be used as possible applications of the proposed element.

     
  • Hugh A. Cayless

    Hugh A. Cayless - 2015-03-17
    • assigned_to: Piotr Banski --> Peter Stadler
     
  • Hugh A. Cayless

    Hugh A. Cayless - 2015-03-17

    Assigning to Peter to get this moving again.

     
  • Lou Burnard

    Lou Burnard - 2015-05-30

    Standoff has moved to a proposed implementation now available at https://github.com/laurentromary/stdfSpec

     
  • Peter Stadler

    Peter Stadler - 2015-05-30

    Council working group (PFS, LB, MH, FC, SM, PWS) created an alternative proposal as the "Ann Arbor" branch at https://github.com/laurentromary/stdfSpec/tree/AnnArbor

     