XML-TXM

XML-TXM Format Specification V2

Lexical Units

tei:w element

Each tei:w element encodes one lexical unit (token) of a textual unit.

  • attribute xml:id : w_XXX_xxxxxx (XXX = text id, xxxxxx = word number) IMPLEMENTED
    • xml:id is unique within a corpus
    • the xml:id of a word never changes IMPLEMENTED
    • to add a token, build its id from lastNum + 1
    • use the lastNum value recorded in the tei:TEI/tei:teiHeader element to build the id of a new word
  • attribute n : the number of the word (may change)
  • attribute xml:lang (= ISO language code, optional?)
    • xml:lang is used only if the language of the token differs from that of its context; otherwise it is inherited from ancestor elements
  • may contain multiple txm:form elements IMPLEMENTED
  • may contain multiple txm:ana elements IMPLEMENTED
    • the number and order of txm:form and txm:ana elements may vary from token to token
    • the import modules must receive the list of forms and annotations to import, written as ((type1+resp1 name1 mandatory|optional) (type2+resp2 name2 mandatory|optional) ...) [import parameter - use txm:ana@type + txm:ana@resp]
  • may have tei:note child elements
<w xml:id="w_fro_000666">
 <txm:form type="default">Lancelot</txm:form>
 <txm:form type="dipl">lanc<ex>elot</ex></txm:form>
 <txm:form type="facs">lan<bfm:mdvAbbr>c̅.</bfm:mdvAbbr></txm:form>
 <txm:ana resp="#editor" type="#pos">NOMpro</txm:ana>
 <txm:ana resp="#editor" type="#q">0</txm:ana>
 <txm:ana resp="#editor" type="#aggl">no</txm:ana>
</w>
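
The id rule above can be illustrated with a short sketch. Assuming the header of text "fro" recorded lastNum = 1234 (the value used in the tei:teiHeader section of this page), a token inserted anywhere in the text receives number 1235, whatever its position, and existing ids are left untouched. The word forms, the zero-padding to six digits, the reading of @n as a running position, the update of lastNum after the insertion and the txm namespace URI are all assumptions made for illustration.

```xml
<!-- Sketch only: forms, padding, @n values and the lastNum update are assumptions -->
<tei:TEI xmlns:tei="http://www.tei-c.org/ns/1.0"
         xmlns:txm="http://textometrie.org/1.0">
  <tei:teiHeader>
    <txm:applicationDesc>
      <!-- was 1234 before the insertion; assumed to be updated afterwards -->
      <txm:metadata name="lastNum" value="1235"/>
    </txm:applicationDesc>
  </tei:teiHeader>
  <tei:text>
    <tei:body>
      <tei:p>
        <tei:w xml:id="w_fro_000001" n="1">
          <txm:form type="default">Li</txm:form>
        </tei:w>
        <!-- new token: id built from lastNum + 1, not from its position -->
        <tei:w xml:id="w_fro_001235" n="2">
          <txm:form type="default">bons</txm:form>
        </tei:w>
        <!-- existing token keeps its id; only @n is renumbered -->
        <tei:w xml:id="w_fro_000002" n="3">
          <txm:form type="default">rois</txm:form>
        </tei:w>
      </tei:p>
    </tei:body>
  </tei:text>
</tei:TEI>
```

Keeping ids stable in this way means that references to existing words (for example standoff links) remain valid after the text is edited.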

txm:form element

This element encodes all surface forms of lexical units.

  • contains one of the forms of a word IMPLEMENTED
  • if there are multiple txm:form elements, the 'type' attribute distinguishes them. Values: default, ... IMPLEMENTED

txm:ana element

This element encodes all annotations of a lexical unit.

  • contains one annotation of the word (msd, lemma...) IMPLEMENTED
  • attribute type : type of annotation IMPLEMENTED
    • type identifies repeated annotations (repetition, not ambiguity) and forms part of the name of the property
  • attribute tei:key : available
  • attribute tei:resp : reference to the id of the responsible party IMPLEMENTED
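
As a sketch of how these attributes combine, the token below carries two annotations produced by the same tool and one produced by an editor. The values #txm, #ttpos and #ttlemma reuse the resp and tagset references of the TreeTagger example in the tei:teiHeader section; pairing them this way, the tag and lemma values, and the txm namespace URI are assumptions rather than documented output.

```xml
<!-- Sketch only: attribute values and annotation values are illustrative -->
<tei:w xmlns:tei="http://www.tei-c.org/ns/1.0"
       xmlns:txm="http://textometrie.org/1.0"
       xml:id="w_fro_000666">
  <txm:form type="default">Lancelot</txm:form>
  <!-- @type names the annotation, @resp points to who or what produced it -->
  <txm:ana resp="#txm" type="#ttpos">NAM</txm:ana>
  <txm:ana resp="#txm" type="#ttlemma">Lancelot</txm:ana>
  <!-- a part-of-speech layer from another source uses another @type -->
  <txm:ana resp="#editor" type="#pos">NOMpro</txm:ana>
</tei:w>
```

Because each annotation layer has its own @type, two part-of-speech layers can coexist on one token without being read as alternative analyses.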

Textual Units

tei:TEI (tei:teiHeader + tei:text)

Each tei:TEI element encodes one textual unit.

  • tei:TEI element is mandatory
    • if a text has no id, generate one from "text" followed by an integer, starting from 1
      • then update the word ids accordingly
  • if 2 texts have the same id, add an integer suffix to one of them, starting from 1
  • @type = "standoff" (on tei:text) if the file contains standoff annotations
<text type="standoff">
 <body>
   <linkGrp type="fropos">
    <link targets="#w_fro_000001 #NOMpro"/>
    <link targets="#w_fro_000002 #VERcjg"/>
...

tei:teiHeader

This element encodes all the information needed for text processing.

  • tei:teiHeader is mandatory
  • technical metadata are encoded in txm:metadata milestone sub-elements of the txm:applicationDesc sub-element (a sibling of tei:encodingDesc, tei:fileDesc, tei:profileDesc and tei:revisionDesc)
    • the version number of the TXM release that produced the corpus is encoded in '<txm:metadata name="version" value="0.6"/>'
    • the 'lastNum' of the text is encoded in '<txm:metadata name="lastNum" value="1234"/>'
    • the main language of the text is encoded in '<txm:metadata name="lang" value="en"/>'
    • external metadata, imported from CSV, are encoded with one '<txm:metadata/>' element per metadata item
      • the 'name' attribute comes from the first line of the CSV file (the column header)
      • the 'value' attribute comes from the cell of that column in the line describing the text
  • with lexical annotations :
    • the tei:fileDesc contains a tei:titleStmt with one tei:respStmt per annotation tool
      • Example : *TODO*
    • the tei:encodingDesc contains a tei:appInfo/txm:application per annotation tool
      • Example :
<txm:application ident="TreeTagger" version="3.2" resp="txm">
  <txm:commandLine>
    <txm:os name="Linux" arch="x86"/>
    <txm:osSeparator file="/" path=";" line="&#13;"/>
    <txm:program name="tree-tagger" path="~/TXM/treetagger/bin" desc="TreeTagger is the name of the project TreeTagger" version="3.2"/>
    <txm:args>
      <txm:arg name="token" type="none"/>
      <txm:arg name="lemma" type="none"/>
      <txm:arg name="sgml" type="none"/>
      <txm:arg name="no-unknown" type="none"/>
      <txm:arg name="eos-tag" type="String">&lt;s&gt;</txm:arg>
      <txm:arg name="model" type="String">treetagger/models/fr.par</txm:arg>
      <txm:arg name="input" type="String">TXM/corpora/test/treetagger/txtbrut.xml.tt</txm:arg>
      <txm:arg name="output" type="String">TXM/corpora/test/treetagger/txtbrut.xml.tt-out.tt</txm:arg>
    </txm:args>
  </txm:commandLine>
  <ab type="annotation">
    <list>
      <item>
        <ref type="tagset" target="#ttpos"/>
      </item>
      <item>
        <ref type="tagset" target="#ttlemma"/>
      </item>
    </list>
  </ab>
</txm:application>
    • the tei:encodingDesc contains a tei:classDecl/tei:taxonomy per annotation type
      • the name of the annotation for CQL is encoded in tei:classDecl/tei:taxonomy@xml:id
      • Example : *TODO*
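
The header rules above can be pictured with the following skeleton. It is a sketch and not the missing official examples: it is not schema-complete, the tei:appInfo part is omitted, and the xml:id values, the wording of the tei:resp and tei:bibl content, the CSV columns ("author", "date") and their values, and the txm namespace URI are all hypothetical.

```xml
<!-- Sketch only: not schema-complete; ids and CSV values are hypothetical -->
<tei:teiHeader xmlns:tei="http://www.tei-c.org/ns/1.0"
               xmlns:txm="http://textometrie.org/1.0">
  <tei:fileDesc>
    <tei:titleStmt>
      <tei:title>Sample text</tei:title>
      <!-- one respStmt per annotation tool; a txm:ana with resp="#txm" would point here -->
      <tei:respStmt xml:id="txm">
        <tei:resp>part-of-speech tagging and lemmatisation</tei:resp>
        <tei:name>TreeTagger</tei:name>
      </tei:respStmt>
    </tei:titleStmt>
  </tei:fileDesc>
  <tei:encodingDesc>
    <tei:classDecl>
      <!-- one taxonomy per annotation type; its xml:id is the name used in CQL -->
      <tei:taxonomy xml:id="ttpos">
        <tei:bibl>TreeTagger part-of-speech tagset</tei:bibl>
      </tei:taxonomy>
      <tei:taxonomy xml:id="ttlemma">
        <tei:bibl>TreeTagger lemmas</tei:bibl>
      </tei:taxonomy>
    </tei:classDecl>
  </tei:encodingDesc>
  <txm:applicationDesc>
    <!-- technical metadata, values taken from the list above -->
    <txm:metadata name="version" value="0.6"/>
    <txm:metadata name="lastNum" value="1234"/>
    <txm:metadata name="lang" value="en"/>
    <!-- external metadata from a CSV whose first line is: id,author,date -->
    <!-- and whose line for this text is: sample,Anonymous,1225 -->
    <txm:metadata name="author" value="Anonymous"/>
    <txm:metadata name="date" value="1225"/>
  </txm:applicationDesc>
</tei:teiHeader>
```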

Intermediate text structures

  • Any XML element (TEI or not) with text content is indexed and can be used in queries and to build subcorpora
  • <tei:s> is used for sentences and can optionally be added automatically during the import process

Out-of-text elements

  • Unless otherwise specified, the content of <tei:note> is considered out-of-text
  • see also <tei:teiHeader> for the metadata
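
A minimal sketch of the two previous sections taken together: a sentence that is indexed as a structure, with an editorial note that is left out of the text by default. The word forms, the note content and the txm namespace URI are assumptions made for illustration.

```xml
<!-- Sketch only: forms and note content are illustrative -->
<tei:p xmlns:tei="http://www.tei-c.org/ns/1.0"
       xmlns:txm="http://textometrie.org/1.0">
  <!-- tei:s has text content, so it is indexed and usable in queries -->
  <tei:s n="1">
    <tei:w xml:id="w_fro_000001">
      <txm:form type="default">Lancelot</txm:form>
    </tei:w>
    <!-- out-of-text by default -->
    <tei:note>Editorial remark about the preceding word.</tei:note>
    <tei:w xml:id="w_fro_000002">
      <txm:form type="default">vint</txm:form>
    </tei:w>
  </tei:s>
</tei:p>
```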

Parallel Corpora

tei:teiCorpus

This element encodes sets of texts and parallel corpora.

  • if a corpus has no id, generate one from "corpus" followed by an integer, starting from 1
  • if 2 corpora have the same id, add an integer suffix to one of them, starting from 1
  • contains one or several tei:TEI elements
  • if an 'align.xml' file is present in the same directory
    • it must be a tei:TEI element containing a tei:text with a tei:linkGrp element whose tei:link elements encode the relations between 'corpus1' and 'corpus2', 'corpus1' and 'corpus3', etc. in the following way :
...
<tei:linkGrp type="align">
 <tei:link target="#corpus1 #corpus2" txm:alignElement="div" txm:alignLevel="2"/>
 <tei:link target="#corpus1 #corpus3" txm:alignElement="p"/>
 <tei:link target="#corpus2 #corpus3" txm:alignElement="s"/>
</tei:linkGrp>
...
...
  • aligned elements must have a txm:alignId attribute that is unique within a tei:teiCorpus and shared across one or more pairs of corpora
  • nesting of the aligned element can be encoded with a txm:alignLevel attribute. Example : txm:alignLevel="2" if the div element is contained in another div.
  • corpora must be strictly aligned (the behaviour of TXM in other cases is not defined) TEST REQUIRED
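
The following sketch shows what two strictly aligned corpora could look like for a link with txm:alignElement="div" and txm:alignLevel="2". The wrapper element, the placement of the corpus id in xml:id on tei:teiCorpus, the alignId values, the untokenised sample text and the txm namespace URI are assumptions; the specification above only requires that txm:alignId be unique within each tei:teiCorpus and shared between the aligned corpora.

```xml
<!-- Sketch only. The sketch wrapper is not part of the format; it only
     keeps the two fragments in one well-formed document. -->
<sketch xmlns:tei="http://www.tei-c.org/ns/1.0"
        xmlns:txm="http://textometrie.org/1.0">
  <!-- assumed: the corpus id is carried by xml:id on tei:teiCorpus -->
  <tei:teiCorpus xml:id="corpus1">
    <tei:TEI>
      <tei:text>
        <tei:body>
          <tei:div>
            <!-- nested divs, depth 2: the elements designated by
                 txm:alignElement="div" with txm:alignLevel="2" -->
            <tei:div txm:alignId="a1"><tei:p>First section.</tei:p></tei:div>
            <tei:div txm:alignId="a2"><tei:p>Second section.</tei:p></tei:div>
          </tei:div>
        </tei:body>
      </tei:text>
    </tei:TEI>
  </tei:teiCorpus>
  <tei:teiCorpus xml:id="corpus2">
    <tei:TEI>
      <tei:text>
        <tei:body>
          <tei:div>
            <!-- same alignId values, same order: strict alignment -->
            <tei:div txm:alignId="a1"><tei:p>Première section.</tei:p></tei:div>
            <tei:div txm:alignId="a2"><tei:p>Deuxième section.</tei:p></tei:div>
          </tei:div>
        </tei:body>
      </tei:text>
    </tei:TEI>
  </tei:teiCorpus>
</sketch>
```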

tei:linkGrp

  • @type value is "align"
  • contains one link element per alignment

tei:link

  • @targets contains a list of corpus names
  • @txm:alignElement the aligned element name
  • @txm:alignLevel the depth of the aligned element

Editions

  • <tei:pb/> element is used by default to paginate the editions
    • its @n will be used in references and will be shown at the top of the page
    • this setting can be overridden by naming any other empty element as the "paginationElement" in the import specification
    • in addition to, or instead of, the pagination element, the maximum number of words per page can be set in the import specification
  • <tei:head> element is transformed into <html:h2>
  • <tei:w> is transformed into <html:span> with @id (used to highlight query hits in the "back-to-text from concordance" functionality)
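
As an illustration, the source fragment below contains the three elements that drive the default edition. The page numbers, the word forms, the tokenisation of the heading and the txm namespace URI are assumptions; the comments only restate the transformations listed above and do not describe the exact HTML that is produced.

```xml
<!-- Sketch only: page numbers and forms are illustrative -->
<tei:div xmlns:tei="http://www.tei-c.org/ns/1.0"
         xmlns:txm="http://textometrie.org/1.0">
  <!-- starts a page of the edition; "12" is used in references and
       shown at the top of the page -->
  <tei:pb n="12"/>
  <!-- rendered as an html:h2 -->
  <tei:head>
    <tei:w xml:id="w_fro_000001">
      <txm:form type="default">Prologue</txm:form>
    </tei:w>
  </tei:head>
  <tei:p>
    <!-- rendered as an html:span carrying an @id, so that a concordance
         hit can be highlighted in the edition -->
    <tei:w xml:id="w_fro_000002">
      <txm:form type="default">Lancelot</txm:form>
    </tei:w>
  </tei:p>
  <!-- the next page of the edition starts here -->
  <tei:pb n="13"/>
</tei:div>
```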

Comments

Other Versions

TODO V3

  • remove '#' from txm:ana/@type

Other points to examine when time permits.

  • w
    • [AL] as stated in current ODD, may also have @ref (as member of att.canonical), @subtype & @cert (as member of att.responsibility). If we remove these attributes, we lose potential TEI compatibility
  • form
    • [AL] as stated in current ODD, may also have @ref (as member of att.canonical), @subtype. If we remove these attributes, we lose potential TEI compatibility
  • questions
    • how should elements inside the w element be handled? (supplied, choice, corr, sic ... ?)
      • [AL] they should be allowed within txm:form, but maybe with restrictions (e.g. create multiple txm:form instead of choice)
      • these elements should probably be ignored in CQP indexes but used in building editions
    • Allow multiple text elements?
      • [AL] within a corpus ???


V1

Until TXM 0.5 : XML-TXM V1 format specifications are not public.

V2

From TXM 0.6 to ...

Implementation

The XML-TXM format is defined as an extension to the TEI standard schema through ODD specifications.

The ODD specifications sources are hosted on Sourceforge : http://txm.svn.sourceforge.net/viewvc/txm/trunk/doc/tei-txm