A TEI Project

Guidelines for the creation of TEI documents that will map well to ontologies such as the CIDOC-CRM

1. Background: about mappings of TEI documents to CIDOC-CRM

1.1. History

In the last few years, the TEI Ontologies SIG has been working on how world knowledge is expressed in TEI documents, in connection to other standards such as CIDOC-CRM and FRBR. This work has been reported in the meetings of the SIG as documented on the SIG Wiki (http://wiki.tei-c.org/index.php/SIG:Ontologies).

As agreed at the SIG meeting in November 2007, an important next step will be to “start on the development of guidelines for how to create TEI documents that easily may be mapped to ontologies such as the CIDOC-CRM”. Many theoretical considerations are discussed in (Ore, 2009). This document comprises a draft for such a set of practical guidelines.

1.2. About this document

By mapping we mean the process of creating an ontology [Note: Ontology in the computer science meaning of the expression: “a formally defined model resulting from an analysis of a specific domain and not necessarily a data model in the computer science sense.” (ibid., p. 162)] based on a TEI document. For the sake of simplicity, all examples of such ontologies will be CIDOC-CRM models, but the method will be applicable to other formalisms as well.

The ontology thus created will not contain all information from the TEI document, so there will be no way to re-create the original TEI document from the CIDOC-CRM model. The process is therefore not reversible, but a one-way extraction of parts of the information present in the TEI document.

The CIDOC-CRM model will contain a subset of the information from the TEI document. Further, these guidelines discuss only mappings based on the XML structure of the TEI document, not mappings based on a reading and interpretation of the textual (PCDATA) part of the document.

This means that the mappings discussed here may include a rule that will create a CIDOC-CRM E82 Actor Appellation for each occurrence of a TEI persName element, but there will be no discussion of mappings where a text is extracted and set to represent a person name unless the name is tagged as one in TEI.

There are three different processes to be considered when a document is supposed to be mapped from TEI to CIDOC-CRM:

  1. Do a conceptual mapping from each of the TEI elements we want to map to a class in CIDOC-CRM, or from one TEI element to a set of connected CIDOC-CRM classes, or from a structure of several TEI elements to one or more CIDOC-CRM classes.
  2. Decide on the syntax of the resulting document. This will typically be an RDF file.
  3. Implement these rules in a computer system that can do the actual mapping.

This document discusses the first of these three processes only. It is in the interest of the SIG that work will be done in the future on the two other processes, in tool development as well as in documentation. It is also an aim to produce a set of examples of full processes, including the three steps described above.

Mapping, as the word is used here, corresponds to the concept of building a CIDOC-CRM model based on the TEI document(s). Thus, a modelling process is inherently included in our idea of a mapping. If one sees the development of the TEI document as a modelling of a pre-existing entity, such as an analog document, this modelling process can then be seen as a re-modelling. A fundamental idea is to keep the TEI document available from the CIDOC-CRM model at node level, using identifiers (xml:id). This foresees a use where the CIDOC-CRM model does not need to replace the TEI document, but can be used to connect information from the TEI document into a two-way relationship with other CIDOC-CRM models and cultural heritage information in general.

In the process of mapping, the word extract is used for the process of creating a part of a mapping for one specific TEI element.

1.3. Intended use

The work in the TEI Ontologies SIG has always been based on the interests of people working on the border between text encoding and cultural heritage informatics in general, and mostly in the area of museum information. There has been no decision in principle to exclude other types of ontological modelling, such as ontologies of literary characters, but people interested in such work have not been active in the SIG. This bias can also be seen in the choice of CIDOC-CRM as the example target for the mappings.

These guidelines are based on the same body of work and interests. This means that the intended audience will be people working in or for museums or other cultural heritage institutions. They will have a body of texts in TEI, or intend to create such a body. These texts will be interesting partly or only because they say something about the real world, typically of historical interest. Examples of such texts include:

The use of these mappings can enable a closer relationship between texts and other cultural heritage information, which is necessary in order to enable the cultural semantic web. The method will open up for inclusion of TEI data in the semantic web: not just the documents as items, but the world information described inside them.

We hope that these guidelines will eventually be expanded to cover other areas, by people interested in other types of mappings.

2. Elements of common use

There is a general tendency that elements that are interesting to extract into CIDOC-CRM models have a stronger relationship to the world being the subject of the text in the TEI document than to the text itself. As an example of this, hi tends to be of little interest, placeName is usually more interesting, and place is generally a key element.

We will here describe a number of elements that will be good to use in a TEI document one plans to map to the CIDOC-CRM. The list is not exhaustive, and in many cases, types of textual entities that are not described in the TEI Guidelines would be good to encode in the TEI document in order to facilitate better mappings. An example of this would be strings referring to physical objects in an archaeological or anthropological sense, which could be encoded like this:
<rs type="object">the stone axe</rs>
One can also use elements not presently available in TEI but that can be added using the ODD system, in passages such as:
<objectName>The Rosetta Stone</objectName>
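
As a rough sketch of what such an addition might look like in an ODD customisation, the fragment below declares a hypothetical objectName element. The namespace URI, the description, and the choice of model.nameLike as the class the element joins are assumptions made for illustration only; they are not prescribed by these guidelines.

```xml
<!-- Hypothetical ODD fragment, to be placed inside a schemaSpec:
     adds an objectName element to a TEI customisation. -->
<elementSpec ident="objectName" ns="http://example.org/ns/tei-crm" mode="add">
  <desc>contains a proper name referring to a physical object.</desc>
  <classes>
    <!-- Assumption: the new element is allowed wherever names are allowed. -->
    <memberOf key="model.nameLike"/>
    <memberOf key="att.global"/>
    <memberOf key="att.typed"/>
  </classes>
  <content>
    <macroRef key="macro.phraseSeq"/>
  </content>
</elementSpec>
```

Because the sketch places the new element in a namespace of its own, documents using it would have to declare that namespace on the element or on one of its ancestors.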

2.1. Actors, places and events

TEI elements    CIDOC-CRM
person          E21 Person
org             E74 Group
place           E53 Place
event           E5 Event

2.1.1. Comments

These elements are closely connected to similar elements in CIDOC-CRM, and their main intended use is in the TEI header as a model connecting various strings referring to them. It may be wise to use CIDOC-CRM at an early stage of the development of the TEI document in order to make sure this model complies with CIDOC-CRM.

The typical types of information to be stored in the header include person lists with person elements, place lists with place elements, and possibly event lists; the events can be included in the person or place elements as well. There will be a one-to-one relationship between a person element and a real-life or fictional person, and the same principle applies to places and events.

If one wants to record information about acting entities that are not persons (storms, animals, washing machines), e.g., in fiction or in ethnographic texts, one may want to adjust TEI by adding an element agent similar to person.

2.2. Names

TEI elements                  CIDOC-CRM
name                          E41 Appellation
variants such as placeName    E82 Actor Appellation and other specialisations of E41

2.2.1. Comments

These elements will be mapped differently according to two different situations. In the first, there is a set of elements somewhere else in the TEI document describing the entities they refer to (as seen in the section ‘Actors, places and events’ above). In that case, xml:id-based references can be used to make the necessary connection in the CIDOC-CRM model, through a property such as P131 is identified by. In the second, no such descriptive elements exist in the TEI document, and the names will have to be modelled as free-standing appellations in the CIDOC-CRM model. One may, of course, imply the existence of the entities referred to by the names, but one must be aware that, as no co-reference information will be available in this case, the quality of the model will be lower.

It is clear from this that the inclusion of a set of elements describing actors, places and events in the TEI header will assist the modelling considerably. On the other hand, the time saved by not creating such elements can be spent on making similar additions to the CIDOC-CRM model, e.g., to reflect the existence of co-reference sets.
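
The first situation can be sketched as follows. The person and the name are borrowed from the example document later in this text; the use of the ref attribute with a local pointer is an assumption about how the link is encoded, and other mechanisms (such as key) may serve the same purpose.

```xml
<!-- In the TEI header: one person element per real-world person. -->
<listPerson>
  <person xml:id="person1">
    <persName>Charles Atlas Diggey</persName>
  </person>
</listPerson>

<!-- In the running text: the name string points back to the person element,
     so a mapping can connect the appellation to the person it identifies. -->
<p>The excavation was performed by <persName ref="#person1">Dr. Diggey</persName>.</p>
```

Without the person element and the pointer, the same persName could only be extracted as a free-standing appellation.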

Types of information in the text to be encoded typically include names:

  • Person names and other referring strings denoting concepts of person-type.
  • Place names and other referring strings denoting concepts of place-type.

Referring strings ("he", "that place") may be as interesting as names in some applications.

A high level of explicitness is needed in order to interpret encoded texts so that the information they express can be modelled in a conceptual model. This explicitness can be added to the markup, or it can be built into the extraction algorithm. Which is the best trade-off remains an open question.

Events are more difficult than persons and places, as they cannot generally be identified by names. Some have names, like Second World War, but most do not. So the approach of marking up all events in a running text is seldom possible. One has to have a very precise idea of what to mark up based on the aim of the work. As events are seldom identified by names, it is also a question what in the text should be marked up. The exact words identifying the event are often hard to find, and maybe not very important.

This means that in running texts, marking up a reference to an event could be done by <rs type="event"> or <milestone type="event"/>. One could even use the paragraph as the smallest unit. In that case, each paragraph needs to have an xml:id, and the pointers from event elements in the header will be to one or more paragraphs.
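
A minimal sketch of the paragraph-level alternative is given below. The paragraph texts, the identifiers p12 and p13, and the use of ptr inside the event description are hypothetical and only meant to illustrate the pointing mechanism.

```xml
<!-- In the text: each paragraph carries its own identifier. -->
<p xml:id="p12">The trench was opened on the first day of the season.</p>
<p xml:id="p13">By the end of the week the sword had been found.</p>

<!-- In the header: the event points to the paragraphs that describe it. -->
<listEvent>
  <event xml:id="event1">
    <head>The Wasteland excavation</head>
    <p>Described in <ptr target="#p12 #p13"/>.</p>
  </event>
</listEvent>
```

This keeps the running text free of event markup, at the cost of a coarser link between the event and the words that describe it.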

Possible criteria for deciding which event references to mark up could be:

  • The event makes material changes to a person or a place.
  • A date element is present.
  • The event connects two or more of person, place and date; any event that functions to attach two or more of these should be marked up.

If one is working with or looking for events in TEI documents, one should be aware of the events to be found in manuscript descriptions and transcriptions of oral sources.

2.3. Properties (linking categories from the previous section)

TEI elements    CIDOC-CRM
relation        Property

2.3.1. Comments

A relation can in general be mapped into a CIDOC-CRM property. The choice of which property is the correct one will be based upon an analysis of attribute values as well as the source and target of the relation. It will be wise to be explicit about the type of relation and express this as a typology stored in the TEI header, with references to it as attributes of the relation elements. One should avoid basing mappings on an analysis of the textual content (the PCDATA) of the elements.

In many cases, relations will be another way to express a situation established through an event. The fact that two people are married is the result of a marriage event. When extraction of CIDOC-CRM information is planned, it may be better to encode the event rather than the relationship.
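
The advice on explicit relation types can be sketched as below. The taxonomy, the identifier rel.spouse and the second person are hypothetical, and pointing to the typology with the ref attribute is one possible choice among several.

```xml
<!-- Typology of relations, declared once in the header (inside classDecl). -->
<taxonomy xml:id="relationTypes">
  <category xml:id="rel.spouse">
    <catDesc>Two persons are married to each other.</catDesc>
  </category>
</taxonomy>

<!-- A relation whose type is given by reference, not by textual content. -->
<listPerson>
  <person xml:id="person1">
    <persName>Charles Atlas Diggey</persName>
  </person>
  <person xml:id="person2">
    <persName>(hypothetical second person)</persName>
  </person>
  <listRelation>
    <relation name="spouse" ref="#rel.spouse" mutual="#person1 #person2"/>
  </listRelation>
</listPerson>
```

A mapping rule can then select the CIDOC-CRM property from the referenced category alone, without reading any PCDATA.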

2.4. Timespan

TEI attributes CIDOC-CRM
when
notBefore
notAfter
etc.

2.4.1. Comments

These attributes should be connected to the entities they refer to in a clear and consistent way. In the case of variations of use, this should be expressed in the TEI structure, possibly using references to taxonomies.
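
As an illustration, the sketch below attaches dating attributes directly to the element they describe. The event is taken from the example document later in this text, while the interval given for the summer of 2005 is invented for the purpose of the example.

```xml
<!-- A known year, given in normalised form on the event it dates. -->
<event xml:id="event1" when="2005">
  <head>The Wasteland excavation</head>
  <p>Excavation carried out in Wasteland.</p>
</event>

<!-- An imprecise date in running text, expressed as an interval. -->
<date notBefore="2005-05-01" notAfter="2005-09-30">the summer of 2005</date>
```

Placing the attribute on the event itself leaves no doubt about which entity the date belongs to; a date element in running text needs its context, or an explicit pointer, to make that connection.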

2.5. Types

TEI elements CIDOC-CRM
taxonomy
thesauri

2.5.1. Comments

These elements can be used to express a great variety of structures. They can in general be mapped to CRM ..., but it is important to consider if later use of the CIDOC-CRM model will be easier if they are mapped into more specialised systems.

This is further connected to the use of feature structures in TEI and their possible mappings to CIDOC-CRM, a topic not covered in this document.

2.6. References to web resources

TEI elements CIDOC-CRM
ptr

2.6.1. Comments

Mappings of references to web resources are unproblematic, but references to other parts of the TEI structure may be problematic to map.

2.7. References to intellectual content on the manifestation level

TEI elements CIDOC-CRM
biblref

2.7.1. Comments

These structures are closer to FRBR than to CIDOC-CRM, and if they dominate the TEI structure, one may consider a mapping to FRBRoo instead of to CIDOC-CRM.

That said, they may very well be mapped into document references in CIDOC-CRM as long as they are clearly encoded in TEI.

3. Other considerations

4. A suggested data model

5. An example document with a mapping

In order to show how this can be done in practice, we will work through a small example. It is taken from a hypothetical archaeological excavation report. The document will be available as a TEI document in order to be published on the web as a first-class document, and to be exchanged with other users in the TEI format. The information expressed in the text, based on a specific reading, is also going to be imported into a CIDOC-CRM based database. Therefore, we will need a mapping of the information in the TEI document into the CIDOC-CRM database.

The text is as follows:

The excavation in Wasteland in 2005 was performed by Dr. Diggey. He had the misfortune of breaking the beautiful sword (C50435) into 30 pieces.

A typical TEI encoding of this example would be:

<p xml:id="p1">The excavation in <name type="place" xml:id="n1" key="place1">Wasteland</name> in <date xml:id="d1">2005</date> was performed by <name type="person" xml:id="n2" key="person1">Dr. Diggey</name>. He had the misfortune of breaking the beautiful sword (C50435) into 30 pieces.</p>

This encoding will provide the source for web and print publications, and it will give the necessary markup for creating name indexes. In order to make better indexes, taking persons and places, and not only their names, into consideration, one needs to introduce persons and places as entities in their own right. This can be done using person and place elements in the TEI header (step 2):

<sourceDesc>
  <listPerson>
    <person xml:id="person1">
      <persName>Charles Atlas Diggey</persName>
      <!-- ...more about the doctor... -->
    </person>
    <!-- ...more persons... -->
  </listPerson>
  <listPlace>
    <place xml:id="place1">
      <placeName>Wasteland</placeName>
      <!-- ...more about Wasteland... -->
    </place>
    <!-- ...more places... -->
  </listPlace>
</sourceDesc>

But there is more information we want to deduce from this little text, namely two events and the information about the object. The information we want to record is all related to the two important events in the text, the excavation and the breaking of the sword.

A possible mark up including this, as a revisit to step 1, would be:

<p xml:id="p1">
  <rs type="event" xml:id="e1">The excavation in <name type="place" xml:id="n1">Wasteland</name>
    in <date xml:id="d1">2005</date></rs> was performed by
  <name type="person" xml:id="n2">Dr. Diggey</name>. He had the misfortune of
  <rs type="event" xml:id="e2">breaking <rs type="object" xml:id="o1">the beautiful sword
    <rs type="object" xml:id="o_id1">(C50435)</rs></rs> into 30 pieces</rs>.
</p>

Here, the descriptions of the events and of the object are seen as referring strings, for which a typology is created, including event and object. This typology should be stored in the TEI header. The two events could be seen as parts of either the person or the place element in the header, or they could be free-standing. We choose the latter alternative, as it highlights the events as key elements in the data analysis. Thus, the header section will be:

<sourceDesc>
  <listPerson>
    <!-- ...Dr. Diggey and other persons... -->
  </listPerson>
  <listPlace>
    <!-- ...Wasteland and other places... -->
  </listPlace>
  <listEvent>
    <event xml:id="event1">
      <head>The Wasteland excavation</head>
      <!-- ...more about the excavation... -->
    </event>
    <event xml:id="event2">
      <head>The sword incident</head>
      <!-- ...more about the breaking of the sword... -->
    </event>
    <!-- ...more events... -->
  </listEvent>
</sourceDesc>

In order to formalise information about the objects mentioned in the text and encoded as rs of type object, which would be useful, e.g., when connecting the document to museum catalogues at artefact level, the following addition could be made to the TEI header. This would imply an extension of TEI in which two new elements would be added: listObject and object. [Note: Just adding these two elements to TEI is a simple task, but this will lead to an inconsistency, as the already existing elements used to describe manuscripts as objects, as seen in section 10.7 Physical Description, should be connected to the new elements.] The full structure of the TEI header section would then be:

<sourceDesc>
  <listPerson>
    <person xml:id="person1">
      <persName>Charles Atlas Diggey</persName>
      <!-- ...more about the doctor... -->
    </person>
    <!-- ...more persons... -->
  </listPerson>
  <listPlace>
    <place xml:id="place1">
      <placeName>Wasteland</placeName>
      <!-- ...more about Wasteland... -->
    </place>
    <!-- ...more places... -->
  </listPlace>
  <listEvent>
    <event xml:id="event1">
      <head>The Wasteland excavation</head>
      <!-- ...more about the excavation... -->
    </event>
    <event xml:id="event2">
      <head>The sword incident</head>
      <!-- ...more about the breaking of the sword... -->
    </event>
    <!-- ...more events... -->
  </listEvent>
  <listObject>
    <object>
      <inventory>Wasteland museum C50435</inventory>
      <desc>The sword fragments...</desc>
      <!-- ...more about the sword... -->
    </object>
    <!-- ...more objects... -->
  </listObject>
</sourceDesc>

6. Conclusions

7. Bibliography

Ore, Christian-Emil Smith and Eide, Øyvind (2009). TEI and Cultural Heritage Ontologies: Exchange of Information? Literary & Linguistic Computing, 24(2), pp. 161-172.


This page is copyrighted