A TEI Project

Guidelines for conceptual mapping of TEI documents

1. Background

In the last few years, the TEI Ontologies SIG has been working on how world knowledge is expressed in TEI documents, in relation to other standards such as CIDOC-CRM and FRBR. This work has been reported at the meetings of the SIG, as documented on the SIG Wiki (http://wiki.tei-c.org/index.php/SIG:Ontologies). In addition to the papers listed on the Wiki, an article titled “TEI and cultural heritage ontologies: Exchange of information?” was recently published in Literary and Linguistic Computing.

As agreed at the SIG meeting in November 2007, an important next step will be to “start on the development of guidelines for how to create TEI documents that easily may be mapped to ontologies such as the CIDOC-CRM”. This document is a draft of such a set of guidelines.

2. Introduction

Write: This document is intended for... From reading it, you will learn...

This document is meant as a practical set of guidelines. We will not discuss the theoretical implications in any depth. The bibliography at the end will contain pointers to more theoretical discussions.

In this document, we assume a workflow consisting of the following three steps:

  1. TEI encoding of a document based on a reading and understanding of a text. This will include structural markup (e.g. chapters, paragraphs and italics) as well as semantic markup (e.g. names and dates).
  2. Extracting semantic information based on the TEI encoding in step 1. One example of this is a person element with some information about the person and references to all name elements in which this specific person is referred to.
  3. Mapping the semantic information extracted in step 2 to another formalism, e.g. CIDOC-CRM, FRBR or a literary ontology.

The main work of the Ontologies SIG up until this year has been on the second and third of these steps. This document gives advice on how to perform the first two steps in order to succeed in the third.

3. Rationale

In most cases, the reason for wanting to do step 3 is to enable exchange of information. We want to be able to connect information extracted from TEI encoded texts to other sources of information without losing the links between the extracted information and the encoded text. This implies, among other things, connecting the TEI standard to the various initiatives and formats grouped under the label "semantic web".

An index can be created based on steps 1 and 2 only. But in order to make an extended index, including data expressed in formalisms other than TEI, step 3 is needed as well.

Practical examples of this use would be:

This is far too general. You do not describe the motives researchers may have to choose this or that approach — and that is necessary, because these four examples are very different indeed. Are we able to rephrase some or all of these so we can keep them, or should they be removed?

Whatever the use may be, the method will open the way for including TEI data in the semantic web: not just the documents as items, but the world information described inside them. This can be done at different levels of complexity and at different stages. The whole process can be quite simple. Alternatively, one can make a complex mapping and still export simpler versions from it when requested. Converting from CIDOC-CRM to Dublin Core, for instance, is a well-defined process. This will also enable mappings to future standards, e.g. in connection with Google or Yahoo.

4. Definition

Not everyone in the TEI community agrees that TEI projects should have clearly articulated ontologies, because to some people ontologies are taxonomic and reductive. On the other hand, there are conceptual models or even world views implicit in all TEI projects as they exist now. Vestiges and fragments of these conceptual models are present in the current TEI tag set, to the degree that they are systematic. It seems to us important to ask people designing TEI projects and encoding individual documents to articulate and publish the conceptual model that defines their assumptions and that enables their encoding to be mapped. Furthermore, it seems obvious that articulating the world view underlying text encoding is fundamental to good analysis.

The term ontologies has been described and defined already: “The term ontology means literally the study of being and was until recently the name of a branch of philosophy and a term used in the singular only. During the last ten years the term has been adopted by computer and information sciences and the scope of term has been expanded significantly. Today, it may denote everything from data models to classification systems and explanatory models in natural sciences.” For these guidelines, we use ontology to mean conceptual model, “a formally defined model resulting from an analysis of a specific domain and not necessarily a data model in the computer science sense.” (Ore & Eide: “TEI and cultural heritage ontologies: Exchange of information?” LLC 24(2) 2009.)

5. History

Write: History of TEI

Write: History of ontologies and of some models, e.g. CIDOC-CRM, FRBR, Dublin Core and literary ontologies.

All these standards have been developed in a context, for a purpose. Their histories are important for understanding that context, which in turn helps in understanding why the standards are as they are, and why things seen as shortcomings are there for a reason.

6. Local ontologies in TEI

There are silently assumed world views in the TEI, but no clearly defined ontology. The most explicit world view is expressed in the TEI header and in the bibliographic reference module, with a complete set of elements to encode common library practice. In addition, the manuscript module provides a set of elements reflecting the normal content of manuscript catalogues encountered by the MASTER project.

In the early days of the TEI Ontologies SIG, we tried without success to state clearly the aim of the SIG. It was originally stated as "Identifying TEI elements of special ontological interest". This initiated a discussion, because the group rejected the wording. Several other phrases were suggested to replace "special ontological interest", among them "extra-textual (ontological) interest" and "references to the physical world". None of these truly covers what we want to express, though. Nevertheless, there was a common general understanding of what kind of elements we were talking about: elements such as names, dates and performances of plays are within the scope of this SIG, while elements such as italics, stanzas and paragraphs are outside it.

Every element in the TEI is the expression of a claim about some portion of text, and that claim has some position in a larger set of claims, whether expressed or not. While this is true, the scope of this document is to discuss ontologies as descriptions of a world described by, or created by, the text, that is, a world external to the text.

Besides these two domain-specific areas, the ontological schema of TEI consists of:

Any ontology or data model that can be expressed by instances of the above elements can be expressed in TEI. For example, in the FOAF example in Wikipedia (see below), all person-person predicates/properties can easily be expressed in TEI tags. The remaining properties in the example are name, address and about. TEI has a web-pointer element that can be typed to take care of these, if desired.

7. Mapping

Let us now reconsider the three steps from the introduction and discuss each of them in more detail. In doing so, we will add another phase that is relevant to the process after the three steps, namely a short discussion of how to do the job at the level of practical tools.

The methods described in this document are based on how things have been done in the TEI tradition. One example of using this approach would be to take a set of legacy documents in TEI P3 in which names and dates have been encoded. After an upgrade to P5, the documents can be said to have been through step 1, and steps 2 and 3 are ready to be carried out. Based on the renewed analysis to be performed, one may nevertheless wish to add some extra encoding at the level of step 1.

There is no one-to-one relationship between the steps. It is perfectly possible to do step 2 in several different ways for the same step 1 document, and the same holds for step 3 in relation to step 2.

Conceptual models should be part of the analysis of a text for the application of markup. The purpose of the markup will then include the target ontology or ontologies. To which formalism do you plan to map the structures created in step 2? This leads to the question of what markup is required in order to serve these needs.

The perspective of a person is important in fiction, but also in some types of non-fiction: who is the speaker, thinker or creator of views inside the text?

One objective which I would like to see stated more explicitly, is that the intention is to describe a generic mechanism for indexing content, i.e. for defining the relationship between structured metadata (Feature Structure; RDF; Topic Map; whatever) representing real-world entities, and the spans of text in the TEI-encoded document which underpin/justify/expand upon that metadata. I would see this as a significant development from, say, a Linked Data perspective. There are billions of triples out there in the LD cloud, but typically nowt to read once you have resolved your SPARQL query. This linked text would, in effect, be the "O" in the TAO of Topic Maps[2].

Or maybe this isn't part of the plan: there is much talk of "mapping" and of import/export in the current draft, which suggests that the objective may be more to allow the extraction of ontological data from its TEI source, and its use in other contexts, relatively divorced from that source.

7.1. Step 1

At this level, the world outside the text is not taken into consideration: a text can be marked up without reference to it. That is, only structures in the text are marked up in this step, although the next level has, of course, been considered in the planning. In order to use information in a formalised way at a later step, it has to be marked up.

It is not simply about "not taking into consideration the world outside of the text". It is more about "we know an ontology is implicit in this text that we are encoding, but we prefer to refer to it in strictly the same way as the text itself is referring to it (rather than re-construct the conceptual model behind the text ourselves)" hence, the importance of persName and rs (referring string) rather than person.

The modules and elements of TEI most important for this work are described above. Here, we will consider some elements that will often be important in mappings: names and other referring strings denoting persons, places and events.

7.1.1. Names and referring strings

Types of information in the text to be encoded typically include names:

  • Person names and other referring strings denoting concepts of person-type.
  • Place names and other referring strings denoting concepts of place-type.

Referring strings ("he", "that place") may be just as interesting as names in some applications.

A high level of explicitness is needed in order to interpret encoded texts so that the information they express can be modelled in a conceptual model. This explicitness can be added to the markup, or it can be built into the extraction algorithm. What is the best trade-off?

7.1.2. Events

Events are more difficult than persons and places, as they cannot generally be identified by names. Some have names, like the Second World War, but most do not. So the approach of marking up all events in a running text is seldom possible. One has to have a very precise idea of what to mark up, based on the aim of the work. As events are seldom identified by names, it is also a question of what in the text should be marked up. The exact words identifying the event are often hard to find, and may not be very important.

This means that in running texts, marking up a reference to an event could be done by <rs type="event"> or <milestone type="event"/>. One could even use the paragraph as the smallest unit. In that case, each paragraph needs to have an xml:id, and the pointers from event elements in the header will be to one or more paragraphs.
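The two options described above can be sketched as follows. This is only an illustration under stated assumptions: the sample sentences, the identifiers (p7, p8, e7, ev1) and the use of the corresp attribute to point from the event element to paragraphs are all invented for this sketch, and a project may prefer another linking mechanism.

```xml
<!-- Option 1: the words referring to the event are wrapped in rs -->
<p xml:id="p7">The village was rebuilt after
  <rs type="event" xml:id="e7">the great fire</rs>.</p>

<!-- Option 2: the paragraph is the smallest unit. An event element
     stored elsewhere (e.g. in the header) points to one or more
     paragraphs. The use of corresp here is an assumption. -->
<p xml:id="p8">Flames spread from the mill to the church within an hour.</p>

<event xml:id="ev1" corresp="#p7 #p8">
  <label>The great fire</label>
</event>
```

With the second option no decision has to be made about which exact words identify the event, at the cost of less precise pointers.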

Possible criteria for events to mark up the reference to could be:

  • Makes material changes in a person or a place.
  • The existence of a date element.
  • Persons, places and dates all have possible connections between them: any event that functions to attach two or more of (person, place, date) should be marked up.

If one is working with/looking for events in TEI documents, one should be aware of the events to be found in manuscript descriptions and transcriptions of oral sources.

7.2. Step 2

7.2.1. What to store

The typical types of information to be stored in the header include person lists with person elements, place lists with place elements, and possibly event lists; the events can be included in the person or place elements as well. There will be a one-to-one relationship between a person element and a real-life or fictional person, and the same principle applies to places and events.
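As a rough sketch of what such header content might look like, the fragment below stores one person and one place, with an event nested inside the person it concerns. All names, dates and identifiers are invented for illustration, and the choice to nest the event in the person rather than keep it in a separate list is just one of the possibilities mentioned above.

```xml
<listPerson>
  <person xml:id="person7">
    <persName>Anna Example</persName>
    <!-- an event stored inside the person it concerns;
         where points to the place element below -->
    <event type="marriage" when="1850-06-01" where="#place7">
      <label>Marriage</label>
    </event>
  </person>
</listPerson>
<listPlace>
  <place xml:id="place7">
    <placeName>Exampletown</placeName>
  </place>
</listPlace>
```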

If one wants to record information about acting entities that are not persons (storms, animals, washing machines), e.g. in fiction or in ethnographical texts, one may want to adjust TEI by adding an element agent similar to person.

7.2.2. Where and how to store

The question of storage is really two different questions, along different axes. The first question is where the information is stored. Should the model created in step 2 be stored in the TEI document header or somewhere else, e.g. in another TEI document or in a database? In our opinion, this is a question that each project should answer in a pragmatic way. For smaller projects, it may be best to keep everything in one file, whereas in larger projects with many people involved, an XML or relational database may be needed in order to organise the information in a secure and useful way.

The other question is which formalism to store the data in: TEI, RDF/OWL, or a database format? We advise storing the data in the TEI format, using the tools developed in P5 for this purpose. If the data are stored in another formalism, we advise including a method for converting the stored data to TEI. If some smaller changes are needed in order to do this, we recommend that an ODD be created for the purpose. If the data are not covered at all by TEI as it is today, and another formalism exists, then one could use that instead.

Whichever choice is made regarding this second question, we assume there will be links between the names in the texts and the modelled objects in the ontology. And regardless of the answer to the second question, we suggest it would be good to include in the system a method for exporting the data into TEI.

One can also choose to store the person type information in a formalism other than TEI, without doing step 2. If this solution is chosen, it is important to make sure that the references between the elements in the two formalisms are kept. Name type elements will always be encoded in TEI. Person type elements may be encoded in TEI, typically in the header or in a separate section in the body, or they may not be encoded in TEI at all and only be stored in an external conceptual model. In the former case, links will go from name type elements via person type elements to the conceptual model (and backwards); in the latter case, the links will go directly between the name type elements and the conceptual model.
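The two linking patterns just described might be sketched as follows. The example.org URIs, the names and the identifiers are hypothetical, and the use of the ref attribute and of an idno element inside person are assumptions made for this sketch, not practices prescribed by these guidelines.

```xml
<!-- Variant A: name -> person -> external conceptual model -->
<name type="person" xml:id="n10" ref="#person10">Dr. Example</name>

<person xml:id="person10">
  <persName>Dr. Example</persName>
  <!-- hypothetical URI of the corresponding object in the external model -->
  <idno type="URI">http://example.org/model/actor/10</idno>
</person>

<!-- Variant B: no person element in TEI; the name points
     directly to the object in the external model -->
<name type="person" xml:id="n11"
      ref="http://example.org/model/actor/11">Ms. Sample</name>
```

In both variants the xml:id on the name element gives the external model something stable to point back to, which is what keeps the link usable in both directions.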

7.2.3. Relations

More than nesting is needed in order to connect values, as in the marriage example. In very simple cases nesting may do, but cases will soon appear in which the relationships are too complex, such as "this also applies to the persons discussed in the last paragraph".

Will relations always be between place/person type elements, and not name type elements?

No: a relation cannot simply be placed between person elements. Relations have to be connected to the context, often the name and commonly the event (e.g. a marriage). This does not mean that two strings of characters marry; it means that a marriage cannot be seen as a relationship taken out of time and place. It has to be connected to a place in the text.
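One possible way of anchoring a relation in its textual context is sketched below. The sentence, the persons and the identifiers are invented, and pointing from the relation to the passage with corresp is an assumption of this sketch; other mechanisms, such as a separate link element, would serve the same purpose.

```xml
<p xml:id="p20"><name type="person" ref="#person20">Anna</name> and
  <name type="person" ref="#person21">Karl</name>
  <rs type="event" xml:id="e20">were married in the spring</rs>.</p>

<listPerson>
  <person xml:id="person20"><persName>Anna Example</persName></person>
  <person xml:id="person21"><persName>Karl Sample</persName></person>
  <listRelation>
    <!-- the relation holds between the two persons, but it is tied
         to the passage of text that asserts it -->
    <relation xml:id="rel20" name="spouse"
              mutual="#person20 #person21" corresp="#e20"/>
  </listRelation>
</listPerson>
```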

7.3. Step 3

There are no differences in principle between fiction and non-fiction when step 2 is finished and the mapping process is about to start. The type of ontology one will want to map to will differ, though, and this will influence the process of mapping. In some cases, it will be a two-way mapping, where the contents can be translated from TEI to the external ontology and then back again, with a result identical to the original. In other situations, the mapping will need to include changes or simplifications, making a mapping back to the original impossible.

Certain TEI elements may contain information that could be hard to map because some necessary information, e.g. the reasons why something is asserted and who is responsible for the assertion, may be hidden. Some will also be based on a "point zero" that may not be formally available, as is the case with age. If this is detected early enough, it will be possible to feed the extra information needed back into steps 1 and 2.

Three levels of document collections:

Conclusion/suggestions: Ontology mapping cannot be defined for 1, but could be defined for 2 and 3. If mappings are done on these levels, publish them, including the ODD defining the TEI version that is the source of the mapping. Should a library be built up in connection with these guidelines?

7.4. Doing the work

Any project that is going to use these methods will need some sort of application to do the actual extraction of information from the TEI documents. Such applications are often written in XSLT, but any scripting or programming language could be used, e.g. Perl or Python.

Even if it is impossible to make tools for mapping all possible TEI documents, it would still be a good idea to develop applications that can be used to extract conceptual models from specified groups of TEI documents. Such applications could be used as is by some users, whereas others can use them as a base for developing their own tailor-made systems. This would be similar to the XSLT stylesheets available for transformation into HTML and PDF.

It is also important to store mappings already done, including the ODD documents describing the TEI source documents. This could take the form of a library handled by the TEI Ontologies SIG.

As we see it, the area in which good tools are most urgently needed is step 2. Storing the semantic information in the TEI header will involve handling a large number of id-idref pairs, which is not easy using an XML editor only.

One solution is to connect the encoded document from step 1 to a database in which the step 2 information is entered, then exporting from the database into a TEI formalism to include in the header of the TEI file or in a separate TEI file in connection to the others.

This may also overlap with the TEI Tools SIG.

One of the good things about this approach is that once your data is stored in a conceptual model, converting to simpler structures (Dublin Core, Google-friendly models) will be easy. Should we suggest that tools be developed for this?

8. Examples

3-4 examples in here: TEI to CRM, TEI to FRBR?, TEI to literary ontologies, TEI to DC. Mappings to systems developed by Yahoo and Google as well?

Discuss the marriage example (TEI P5 sec. 13.3.2.3 Personal Events) in detail.

8.1. Example 1

In order to show how this can be done in practice, we will work through a small example. It is taken from a hypothetical archaeological excavation report. The document will be available as a TEI document in order to be published on the web as a first class document, and to be exchanged with other users in the TEI format. The information expressed in the text, based on a specific reading, is also going to be imported into a CIDOC-CRM based database. Therefore, we will need a mapping of the information in the TEI document into the CIDOC-CRM database.

The text is as follows:

The excavation in Wasteland in 2005 was performed by Dr. Diggey. He had the misfortune of breaking the beautiful sword (C50435) into 30 pieces.

A typical step 1 encoding of this example would be:

<p xml:id="p1">The excavation in <name type="place" xml:id="n1" key="place1">Wasteland</name>
  in <date xml:id="d1">2005</date> was performed by
  <name type="person" xml:id="n2" key="person1">Dr. Diggey</name>. He had the misfortune
  of breaking the beautiful sword (C50435) into 30 pieces.</p>

This encoding will provide the source for web and print publications, and it will give the necessary markup for creating name indexes. In order to make better indexes, taking persons and places, and not only their names, into consideration, one needs to introduce the idea of persons. This can be done using person and place elements in the TEI header (step 2):

<sourceDesc>
  <listPerson>
    <person xml:id="person1">
      <persName>Charles Atlas Diggey</persName>
      <!-- ...more about the doctor... -->
    </person>
    <!-- ...more persons... -->
  </listPerson>
  <listPlace>
    <place xml:id="place1">
      <placeName>Wasteland</placeName>
      <!-- ...more about Wasteland... -->
    </place>
    <!-- ...more places... -->
  </listPlace>
</sourceDesc>

But there is more information we want to deduce from this little text, namely two events and the information about the object. A possible markup including this, as a revisit to step 1, would be:

<p xml:id="p1">
  <rs type="event" xml:id="e1">The excavation in <name type="place" xml:id="n1">Wasteland</name>
    in <date xml:id="d1">2005</date></rs> was performed by
  <name type="person" xml:id="n2">Dr. Diggey</name>. He had the misfortune of
  <rs type="event" xml:id="e2">breaking
    <rs type="object" xml:id="o1">the beautiful sword
      <rs type="object" xml:id="o_id1">(C50435)</rs></rs>
    into 30 pieces</rs>.
</p>

Here, the descriptions of the events and of the object are seen as referring strings, for which a typology is created, including event and object. This typology should be stored in the TEI header.
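One way to store such a typology, sketched here under assumptions, is a taxonomy in the classification declarations of the header. The identifiers (rsTypes, rs.event, rs.object) and the wording of the category descriptions are invented for this illustration, and the link between a type value on rs and a category is only a project convention that would need to be documented.

```xml
<encodingDesc>
  <classDecl>
    <!-- typology of the values used in the type attribute of rs -->
    <taxonomy xml:id="rsTypes">
      <category xml:id="rs.event">
        <catDesc>A referring string that denotes an event</catDesc>
      </category>
      <category xml:id="rs.object">
        <catDesc>A referring string that denotes a physical object
          or its identifier</catDesc>
      </category>
    </taxonomy>
  </classDecl>
</encodingDesc>
```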

To sum up the information we want to record, it is all related to the two important events in the text:

Most of this information could be stored in the TEI header using what is already available. But the object would need an addition to TEI.



TEI Ontologies SIG. Date:
This page is copyrighted