The Unicode and Texteme Ontologies Documentation

Ontologies of Unicode characters, textemes, and text properties

Status: Alpha

Brought to you by: gbella

Home of the Unicode® and Texteme Ontologies

The Unicode ontology is a set of RDF/OWL files that implement much of the content of the Unicode Character Database, as well as some further aspects of the Unicode Standard.

**Want a quick example? Here is LATIN CAPITAL LETTER A: http://purl.org/textontology/unicode/char/0041**

The added value with respect to other implementations of the UCD is that character properties are also formally modelled and further described through meta-properties.

The Texteme ontology defines, among others, the following concepts:

- texteme, a generalised text element: a Unicode character is a kind of texteme, as is a glyph from a font;
- sequence, which is simply a text string;
- text property, which describes textemes, characters, and sequences.

The Unicode ontology is based on and imports the Texteme ontology.

Questions? Check out the FAQ.

Note: the ontologies, as well as these pages, are still being modified on a daily basis and remain work in progress. Check back regularly for updates.

Disclaimer: this project is neither affiliated with nor officially endorsed by the Unicode Consortium.

What’s New?

Version 0.6.2 (21 September 2012):
- Improved Unihan support. Unihan properties are now typed:
  - "radical" properties point to the appropriate radical characters,
  - "variant" properties point to CJKVariant instances that fully describe variants,
  - the rest are implemented either as integer or as string datatype properties.
- A new Unicode sequence interface has been created for the dynamic generation of Unicode sequences (or strings). To obtain the resource URI, join the character codes (without the "u" prefix) with underscores and prepend "http://purl.org/textontology/unicode/sequence/Seq_" to the result. For example, the sequence "AB" is generated by the URI http://purl.org/textontology/unicode/sequence/Seq_0041_0042 (you may try this link out).
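The sequence URI rule above can be sketched in a few lines of Python. This is only an illustration, not part of the ontology distribution: the function name `sequence_uri` is hypothetical, and the choice of uppercase hexadecimal codes padded to at least four digits is an assumption based on the "AB" example.

```python
SEQUENCE_PREFIX = "http://purl.org/textontology/unicode/sequence/Seq_"


def sequence_uri(text: str) -> str:
    """Build the sequence URI for a non-empty string (hypothetical helper)."""
    if not text:
        raise ValueError("text must contain at least one character")
    # Assumption: each code is the code point in uppercase hex,
    # zero-padded to at least four digits, without the "u" prefix.
    codes = [format(ord(ch), "04X") for ch in text]
    # Codes are joined with underscores and appended to the fixed prefix.
    return SEQUENCE_PREFIX + "_".join(codes)


print(sequence_uri("AB"))
# http://purl.org/textontology/unicode/sequence/Seq_0041_0042
```

The printed URI matches the "AB" example given in the release note.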
Version 0.6.1 (7 August 2012):
- A prototype version of the Unihan database is now available. It can be downloaded as a ZIP, and individual Han characters are accessible under http://purl.org/textontology/unicode/char/CODE (where CODE should be replaced by the character code). For the moment, all Unihan character properties are implemented as string datatype properties and are only minimally described. This will change in the near future.
- Consequently, the whole character set of Unicode version 6.1 is now covered.
- Various bugfixes.
- The FAQ section has been updated.
Version 0.6.0 (31 July 2012):
- Major update: RDF/OWL descriptions of specific characters are now downloadable individually. The URL is http://purl.org/textontology/unicode/char/CODE, where CODE is the character code, either with the "u" prefix (u0041, u10000) or without it (0041, 10000). Try it out: http://purl.org/textontology/unicode/char/u0041 (this should be LATIN CAPITAL LETTER A).
- Accordingly, namespaces for Unicode characters have been set to the very same URLs.
- Character classes that define script-wide common properties are now available as separate OWL files under http://purl.org/textontology/unicode/uo_SCRIPT_common.owl (e.g., http://purl.org/textontology/unicode/uo_Latn_common.owl).
- The ontology ZIPs have been updated accordingly and are available for download.
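The URL patterns described in the 0.6.0 notes can be sketched as follows. The helper names `char_uri` and `script_common_url` are hypothetical, and two details are assumptions rather than documented behaviour: codes are written in uppercase hexadecimal padded to at least four digits, and SCRIPT is a script code written exactly like "Latn" in the example.

```python
BASE = "http://purl.org/textontology/unicode/"


def char_uri(ch: str, with_prefix: bool = False) -> str:
    """Return the URL of the RDF/OWL description of one character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Assumption: uppercase hex, zero-padded to at least four digits.
    code = format(ord(ch), "04X")
    # Both the "u"-prefixed and the bare form are listed as accepted.
    prefix = "u" if with_prefix else ""
    return f"{BASE}char/{prefix}{code}"


def script_common_url(script: str) -> str:
    """Return the URL of the OWL file with script-wide common properties."""
    # Assumption: the script code is passed exactly as it appears in the
    # file name, e.g. "Latn".
    return f"{BASE}uo_{script}_common.owl"


print(char_uri("A"))                    # .../char/0041
print(char_uri("A", with_prefix=True))  # .../char/u0041
print(script_common_url("Latn"))        # .../uo_Latn_common.owl
```

Whether the server also accepts lowercase hexadecimal digits is not stated in these notes, so the sketch sticks to the forms shown in the examples.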