Australian National Corpus Wiki

An ongoing project to collate and provide access to language data

Brought to you by: gweis, mfallu, rebollor, shirren, stevecassidy

Annotations as RDF

Annotation Schema

The annotation schema should be based on the ISO LAF standard, since that is as close as we have to a standard model in this space. However, LAF has never been tried on multimodal data (with timestamps rather than character offsets), so it needs some adaptation. In addition, the LAF standards document is XML-centric, and some changes are needed to move it to a more abstract model that can be realised in RDF.

Entities

Corpus - has no analogue in LAF since they talk exclusively about individual graphs corresponding to single items in the corpus. However, we want to attach metadata to each corpus and know which corpus each item belongs to so we need this entity.
CorpusItem - corresponds to a LAF graph, has associated metadata (graf:cesHeader - uses the CES vocabulary which is derived from TEI).

Node - a node in the graph which can be linked to the data via an anchor and can have zero or more annotations attached to it. Corresponds to a LAF node.

Edge - an edge in the graph which can have associated annotations. I'm not convinced that this is needed having never seen a use-case for annotations on edges that wasn't just a typed relation. These are better modelled as real relations. So, while we might keep this in the schema, we will use typed relations between nodes instead.

Annotation - a container for a feature-value structure attached to a Node. While there may be more than one annotation attached to a Node I've not yet seen a use case that requires it. Annotations may also be attached to Edges but see the note there about how that's not really needed.

Locator - a means of locating a point, segment or region in source data using some referencing mechanism. In GrAF there are only Regions, which use character offsets. We need to generalise this to cover both character offsets and time-based offsets.

UTF8Region - a kind of Locator defined by a start and end offset in UTF-8 characters
UTF16Region - a kind of Locator defined by a start and end offset in UTF-16 characters
MillisecondRegion - a Locator with start/end in milliseconds
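
As a rough illustration of the generalisation from Region to Locator, the three kinds above could be modelled as one base type whose start/end values are interpreted according to the subtype. This is only a sketch: the Python class names mirror the entities listed here, but the `unit` attribute and the `describe` helper are assumptions made for illustration and are not part of LAF, GrAF or the AusNC schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """A start/end pair; the unit of the values depends on the subclass."""
    start: int
    end: int

    # Plain class attribute (no type annotation), so it is not a dataclass field.
    unit = "unspecified"

    def length(self):
        """Extent of the located span, in this Locator's own unit."""
        return self.end - self.start


class UTF8Region(Locator):
    unit = "UTF-8 characters"


class UTF16Region(Locator):
    unit = "UTF-16 characters"


class MillisecondRegion(Locator):
    unit = "milliseconds"


def describe(loc):
    """Hypothetical helper: render a Locator together with its unit."""
    return f"{type(loc).__name__} {loc.start}-{loc.end} ({loc.unit})"


print(describe(UTF8Region(1047, 1084)))
print(describe(MillisecondRegion(0, 1500)))
```

The point of the sketch is that the same start/end pair carries no meaning until the Locator type is known, which is why the type has to be recorded on each Locator.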

Relations

graf:type -

domain: Annotation
range: domain-specific annotation types, e.g. xces:vchunk
Corresponds to the label property of annotations in LAF, but also uses the annotation set name as a namespace.

graf:annotates -

domain: Annotation
range: Node
This annotation relates to this node.

graf:targets -

domain: Node
range: Locator
Corresponds to graf:link in the XML; relates a Node to the region of the source document that it describes. In GrAF the range of this relation is Region, but we generalise that to Locator.

graf:from, graf:to -

domain: Edge
range: Node
If we make Edges into explicit resources, then the from/to relations point to each end of the relation. I propose we don't use these in AusNC unless there is a demonstrated need.

graf:child -

domain: Node
range: Node
Instead of using an Edge with graf:type graf:child, we use an explicit relationship of this name to denote the parent/child link between nodes. This is parent/child in the sense of hierarchical containment, e.g. a Sentence node with children that are Word nodes.

graf:start, graf:end -

domain: Locator
range: numerical value or other reference type
Defines the start or end of a region. Interpretation of the value is dependent on the type of Locator, so for UTF8Region they are characters but for MillisecondRegion they are milliseconds.

domain-specific relations denoting features:
for example xces:tense, xces:voice. These are defined by a kind of annotation schema that might be shared between different corpora to make them compatible. AusNC should look at common annotation properties that might cover more than one corpus.

Some examples and how they will translate to RDF

Monash:

Speaker turn annotations:

<annotation>
<type>speaker</type>
<val>K2YF1</val>
<start>1047</start>
<end>1084</end>
</annotation>

:n123 a graf:Node ;
    graf:targets :l123 .
:a123 a graf:Annotation ;
    graf:annotates :n123 ;
    graf:type ausnc:speakerturn ;
    ausnc:speaker monash:K2YF1 .
:l123 a graf:UTF8Region ;
    graf:start 1047 ;
    graf:end 1084 .
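
The translation above is mechanical enough to sketch in code. The following is a minimal illustration, using only the Python standard library, of how one Monash speaker annotation might be turned into the Turtle shown. The function name `speaker_turn_to_turtle`, the `ident` argument used to build the :n/:a/:l names, and the choice of `monash` as the prefix for speaker identifiers are all assumptions; prefix declarations are left out, as in the examples on this page.

```python
import xml.etree.ElementTree as ET


def speaker_turn_to_turtle(xml_text, ident, speaker_prefix="monash"):
    """Render one <annotation> of type 'speaker' as Turtle (hypothetical helper)."""
    ann = ET.fromstring(xml_text)
    if ann.findtext("type") != "speaker":
        raise ValueError("expected an annotation of type 'speaker'")
    speaker = ann.findtext("val").strip()
    start = int(ann.findtext("start"))
    end = int(ann.findtext("end"))
    lines = [
        f":n{ident} a graf:Node ;",
        f"    graf:targets :l{ident} .",
        f":a{ident} a graf:Annotation ;",
        f"    graf:annotates :n{ident} ;",
        "    graf:type ausnc:speakerturn ;",
        f"    ausnc:speaker {speaker_prefix}:{speaker} .",
        f":l{ident} a graf:UTF8Region ;",
        f"    graf:start {start} ;",
        f"    graf:end {end} .",
    ]
    return "\n".join(lines)


example = """<annotation>
<type>speaker</type>
<val>K2YF1</val>
<start>1047</start>
<end>1084</end>
</annotation>"""

print(speaker_turn_to_turtle(example, 123))
```

String formatting is used here only to keep the sketch dependency-free; a real converter would more likely build the triples with an RDF library and serialise them.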

Again from Monash:

<annotation>
<type>overlap</type>
<val>overlap</val>
<start>325</start>
<end>329</end>
</annotation>

<annotation>
<type>speaker</type>
<val>RF3</val>
<start>281</start>
<end>333</end>
</annotation>

<annotation>
<type>overlap</type>
<val>overlap</val>
<start>334</start>
<end>337</end>
</annotation>

Here the two instances of overlap relate to each other, although there is no explicit relationship in the original. So we need to infer the relationship from the location of each overlap: they should come in pairs and be in different speaker turns.

:n124 a graf:Node ;
    graf:targets :l124 .

:a124 a graf:Annotation ;
    graf:annotates :n124 ;
    graf:type ausnc:overlap ;
    graf:overlapswith :n125 . # do we overlap with a node? I think so, since an annotation isn't a real thing

:l124 a graf:UTF8Region ;
    graf:start 325 ;
    graf:end 329 .

:n125 a graf:Node ;
    graf:targets :l125 .

:a125 a graf:Annotation ;
    graf:annotates :n125 ;
    graf:type ausnc:overlap ;
    graf:overlapswith :n124 . # this might not be needed - or might be a bad idea

:l125 a graf:UTF8Region ;
    graf:start 334 ;
    graf:end 337 .
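
One way the pairing inference could work is sketched below. It takes the overlaps in document order and pairs each with the next one, provided the two fall inside different speaker turns. The `Span` record, the function names, and the containment test (an overlap belongs to the turn that fully encloses it) are all assumptions, as is the second speaker turn `S2` in the usage example, which does not appear in the Monash excerpt above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One source annotation: its type, value and start/end offsets."""
    kind: str
    val: str
    start: int
    end: int


def enclosing_turn(span, turns):
    """Return the first speaker turn that fully contains span, or None."""
    for turn in turns:
        if turn.start <= span.start and span.end <= turn.end:
            return turn
    return None


def pair_overlaps(annotations):
    """Pair consecutive overlaps that sit in different speaker turns."""
    turns = [a for a in annotations if a.kind == "speaker"]
    overlaps = sorted(
        (a for a in annotations if a.kind == "overlap"),
        key=lambda a: a.start,
    )
    pairs = []
    i = 0
    while i + 1 < len(overlaps):
        first, second = overlaps[i], overlaps[i + 1]
        t1 = enclosing_turn(first, turns)
        t2 = enclosing_turn(second, turns)
        if t1 is not None and t2 is not None and t1 != t2:
            pairs.append((first, second))
            i += 2  # both overlaps are used up
        else:
            i += 1  # leave the first unpaired and try the next candidate
    return pairs


annotations = [
    Span("overlap", "overlap", 325, 329),
    Span("speaker", "RF3", 281, 333),
    Span("speaker", "S2", 334, 360),  # hypothetical following turn
    Span("overlap", "overlap", 334, 337),
]

for first, second in pair_overlaps(annotations):
    print(f"{first.start}-{first.end} overlaps with {second.start}-{second.end}")
```

Overlaps that cannot be paired under these rules are simply left out of the result; whether they should instead be reported as errors in the source data is an open question.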

An example from ACE, a correction:

<annotation>
<type>correction</type>
<val>our</val>
<start>32</start>
<end>34</end>
</annotation>

:n123 a graf:Node ;
    graf:targets :l123 .

:a123 a graf:Annotation ;
    graf:annotates :n123 ;
    graf:type ausnc:correction ;
    ausnc:correctedtext "our" . # need to define a property name for this

:l123 a graf:UTF8Region ;
    graf:start 32 ;
    graf:end 34 .