[TM4J-users] tm4j-backed website launched

The website of the New Zealand Electronic Text Centre has been
re-launched, with TM4J as a key software component.
http://www.nzetc.org/

The website is a digital library, providing access to a couple of
hundred digitised books and manuscripts. The site has been running for
about 3 years, but this week we've upgraded it significantly, putting it
on a new foundation - a topic map.

The topic map presently contains 46807 topics, 192492 associations, and
43942 occurrences; roughly 150 MB of XTM. We are using TM4J's "in-memory"
back-end, running on Java 1.4.1 on Windows 2000. The topic map consumes
approximately 1.3GB of RAM.
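
To put those figures in proportion, here is a back-of-envelope
calculation in Python. It is only a sketch: it assumes the sizes are
binary megabytes and gigabytes, and it counts only topics, associations
and occurrences (not names or other constructs), so the per-construct
numbers are rough upper bounds.

```python
# Figures quoted in the message above.
topics = 46807
associations = 192492
occurrences = 43942

# Assumption: both sizes are binary units (1 MB = 1024**2 bytes).
xtm_bytes = 150 * 1024**2
ram_bytes = 1.3 * 1024**3

constructs = topics + associations + occurrences
ram_per_construct = ram_bytes / constructs
xtm_per_construct = xtm_bytes / constructs
expansion = ram_bytes / xtm_bytes  # in-memory size relative to serialised XTM

print(f"{constructs} constructs in total")
print(f"about {xtm_per_construct:.0f} bytes of XTM per construct")
print(f"about {ram_per_construct:.0f} bytes of RAM per construct")
print(f"in-memory form is about {expansion:.1f} times the XTM size")
```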

The front end of the site uses Cocoon to render pages (each of which
represents a topic, and some "neighbouring" topics). We use Cocoon's
templating system "jxtemplate" to render each topic. JXTemplate is
designed to be very like XSLT, with an expression language called
"JXPath" which is more-or-less a superset of XPath, but which also
allows for traversal of Java objects via path expressions, e.g.
"$topic/occurrences[type=$ontology/html]". This avoids the conceptual
mismatch that can occur when using XSLT, which is tree-oriented, to
style XTM, which really represents a cyclic graph. We had to write a few
Java functions to add JXPath support for topic sorting, traversal of the
type hierarchy, and a few other features, but nothing too hard. We use
several different templates to render the different types of topics.
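
For readers unfamiliar with JXPath, the following Python sketch shows
roughly what the path expression above selects. It is an illustration
only, not the site's code and not the TM4J API: the Topic and Occurrence
classes, the locators, and the helper function are all hypothetical
stand-ins.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Topic:
    """Hypothetical stand-in for a topic object."""
    name: str
    occurrences: list = field(default_factory=list)

@dataclass(eq=False)
class Occurrence:
    """Hypothetical stand-in for an occurrence; note the back-reference."""
    type: Topic
    locator: str
    topic: Topic = None

# A tiny "ontology" holding the typing topics, like $ontology in the template.
html = Topic("html")
pdf = Topic("pdf")
ontology = {"html": html, "pdf": pdf}

book = Topic("book-1")
book.occurrences.append(Occurrence(html, "book-1.html", book))
book.occurrences.append(Occurrence(pdf, "book-1.pdf", book))

def occurrences_of_type(topic, occ_type):
    """Rough equivalent of $topic/occurrences[type=$ontology/html]."""
    return [o for o in topic.occurrences if o.type is occ_type]

selected = occurrences_of_type(book, ontology["html"])
print([o.locator for o in selected])

# The object graph is cyclic: an occurrence leads back to its topic,
# which is awkward to express in a strictly tree-shaped model.
print(selected[0].topic is book)
```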

The source material for the site is a collection of TEI (Text Encoding
Initiative) XML files, each of which is an encoding of a source
object (e.g. a book). Most of the topic map is harvested from these
files using XSLT. Each book, chapter, subsection, figure, author,
publisher, etc, is represented by a topic; names are harvested from
headings and captions in the text; and the containment hierarchy is
represented by associations. These associations are used to generate
tables of contents, as well as to provide "next" and "previous" links
between web pages.
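
The harvesting itself is done in XSLT, but the idea can be sketched in
Python. Everything below is a simplified assumption: the sample markup
omits TEI namespaces, the ids and headings are invented, and deriving
"previous" and "next" from siblings in the same container is just one
plausible reading of how the links could be computed.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified TEI-like sample (no namespaces).
SAMPLE = """
<text id="book1">
  <body>
    <div type="chapter" id="ch1"><head>Arrival</head>
      <div type="section" id="ch1-s1"><head>The Harbour</head></div>
      <div type="section" id="ch1-s2"><head>The Town</head></div>
    </div>
    <div type="chapter" id="ch2"><head>Departure</head></div>
  </body>
</text>
"""

def harvest(root):
    """Return (topics, contains): topic names by id, and containment pairs."""
    topics = {root.get("id"): "Sample book"}
    contains = []  # (container_id, contained_id), in document order

    def walk(elem, parent_id):
        for div in elem.findall("div"):
            div_id = div.get("id")
            # The topic name comes from the heading of the division.
            topics[div_id] = div.findtext("head", default="").strip()
            contains.append((parent_id, div_id))
            walk(div, div_id)

    walk(root.find("body"), root.get("id"))
    return topics, contains

def prev_next(contains, topic_id):
    """Previous and next siblings within the same container, or None."""
    parent = next(p for p, c in contains if c == topic_id)
    siblings = [c for p, c in contains if p == parent]
    i = siblings.index(topic_id)
    prev = siblings[i - 1] if i > 0 else None
    nxt = siblings[i + 1] if i + 1 < len(siblings) else None
    return prev, nxt

topics, contains = harvest(ET.fromstring(SAMPLE))

# A table of contents for the book is just its contained topics.
toc = [topics[c] for p, c in contains if p == "book1"]
print(toc)
print(prev_next(contains, "ch1-s2"))
```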

For each fragment of TEI text, we harvest two HTML occurrences which are
alternative representations of that piece of text. One is a "scholarly"
(fussy) view, in which page numbers, errors, deletions and corrections
(in manuscripts), etc are all rendered, and the other is a "basic"
(simplified) view, in which spelling errors are silently corrected, page
numbers are not displayed, etc. These alternatives are distinguished
with "basic" and "scholarly" scoping topics. At present only the
scholarly view is visible on the public website, but we plan to make the
basic view visible next week. Cocoon XSLT pipelines are used to
transform the TEI into HTML (and some other formats).
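
Choosing between the two scoped occurrences can be pictured as follows.
This Python sketch is illustrative only: representing an occurrence as
a (scope, locator) pair, the file names, and the function name are all
assumptions, not how TM4J or the site models it.

```python
def pick_occurrence(occurrences, view):
    """Return the locator of the first occurrence whose scope includes view."""
    for scope, locator in occurrences:
        if view in scope:
            return locator
    return None  # no occurrence is valid in the requested scope

# Two alternative renderings of the same fragment of text,
# distinguished only by their scoping topic (hypothetical locators).
fragment_occurrences = [
    ({"scholarly"}, "ch1-scholarly.html"),
    ({"basic"}, "ch1-basic.html"),
]

print(pick_occurrence(fragment_occurrences, "scholarly"))
print(pick_occurrence(fragment_occurrences, "basic"))
```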

Names of people, places, etc, are also marked up in the TEI, and these
are also harvested as topics, with associations linking each person to
the places in the texts where they are mentioned, the figures in which
they are depicted, and to the texts which they wrote. We use a MADS XML
file to maintain an authoritative list of names, from which we also
harvest some biographical notes and links to external websites.
Consequently, the system can generate a web page to represent each
person, with links to all the places in the library where they are
mentioned, all the texts they wrote, a thumbnail gallery of the
pictures in which they appear, and links to relevant external sites,
e.g. http://www.nzetc.org/tm/scholarly/name-207418.html
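
Building such a page amounts to grouping a person's associations by
type. The Python sketch below shows the idea under loud assumptions:
the association type names, the target ids, and the triple layout are
invented for illustration and are not taken from the site's ontology.

```python
from collections import defaultdict

# Hypothetical associations as (type, person_id, target_id) triples.
associations = [
    ("mentioned-in", "name-207418", "ch1-s2"),
    ("depicted-in", "name-207418", "fig-12"),
    ("author-of", "name-207418", "book1"),
    ("mentioned-in", "name-000001", "ch2"),
    ("mentioned-in", "name-207418", "ch2"),
]

def person_page(person_id, associations):
    """Group the targets of one person's associations by association type."""
    page = defaultdict(list)
    for assoc_type, person, target in associations:
        if person == person_id:
            page[assoc_type].append(target)
    return dict(page)

page = person_page("name-207418", associations)
for section, targets in page.items():
    print(section, targets)
```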

The ontology used is a subset of the CIDOC CRM (a museum ontology).

In future we plan to harvest dates from the texts, and provide
timeline-based access to the texts. Our main technical concern is to
replace the in-memory back-end with a database, since we are running out
of memory and will need to scale up the topic map as our collection
grows, and as we add more semantic markup to the TEI.

As the lead developer on this project, I want to take this opportunity
to publicly express our gratitude to Kal Ahmed, and to all the TM4J
contributors, for enabling us to get this project off the ground. Thanks
heaps!!

Regards

Con

--
Conal Tuohy
Senior Programmer
+64-4-463-6844
+64-21-237-2498
co...@nz...
New Zealand Electronic Text Centre
www.nzetc.org

----

  "I believe we were all glad to leave New Zealand. It is not a
  pleasant place. Amongst the natives there is absent that
  charming simplicity which is found in Tahiti; and the greater
  part of the English are the very refuse of society. Neither
  is the country itself attractive. I look back but to one
  bright spot, and that is Waimate, with its Christian
  inhabitants."

         - Charles Darwin