From: Conal T. <Con...@vu...> - 2005-04-29 03:31:21
The website of the New Zealand Electronic Text Centre has been re-launched, with TM4J as a key software component.

http://www.nzetc.org/

The website is a digital library, providing access to a couple of hundred digitised books and manuscripts. The site has been running for about 3 years, but this week we've upgraded it significantly, putting it on a new foundation: a topic map.

The topic map presently contains 46807 topics, 192492 associations, and 43942 occurrences; roughly 150 MB of XTM. We are using TM4J's "in-memory" back-end, running on Java 1.4.1 on Windows 2000. The topic map consumes approximately 1.3 GB of RAM.

The front end of the site uses Cocoon to render pages (each of which represents a topic, and some "neighbouring" topics). We use Cocoon's templating system "jxtemplate" to render each topic. JXTemplate is designed to be very like XSLT, with an expression language called "JXPath" which is more-or-less a superset of XPath, but which also allows for traversal of Java objects via path expressions, e.g. "$topic/occurrences[type=$ontology/html]". This avoids the conceptual mismatch that can occur when using XSLT, which is tree-oriented, to style XTM, which really represents a cyclic graph. We had to write a few Java functions to add JXPath support for topic sorting, traversal of the type hierarchy, and a few other features, but nothing too hard. We use several different templates to render the different types of topics.

The source material for the site is a collection of TEI (Text Encoding Initiative) XML files, each of which is an encoding of a source object (e.g. a book). Most of the topic map is harvested from these files using XSLT. Each book, chapter, subsection, figure, author, publisher, etc. is represented by a topic; names are harvested from headings and captions in the text; and the containment hierarchy is represented by associations.
These associations are used to generate tables of contents, as well as to provide "next" and "previous" links between web pages.

For each fragment of TEI text, we harvest two HTML occurrences which are alternative representations of that piece of text. One is a "scholarly" (fussy) view, in which page numbers, errors, deletions and corrections (in manuscripts), etc. are all rendered, and the other is a "basic" (simplified) view, in which spelling errors are silently corrected, page numbers are not displayed, etc. These alternatives are distinguished with "basic" and "scholarly" scoping topics. At present only the scholarly view is visible on the public website, but we plan to make the basic view visible next week. Cocoon XSLT pipelines are used to transform the TEI into HTML (and some other formats).

Names of people, places, etc. are also marked up in the TEI, and these are also harvested as topics, with associations linking each person to the places in the texts where they are mentioned, the figures in which they are depicted, and the texts which they wrote. We use a MADS XML file to maintain an authoritative list of names, from which we also harvest some biographical notes and links to external websites. Consequently, the system can generate a web page to represent each person, providing links to all the places in the library where they are mentioned, all the texts they wrote, a thumbnail gallery of the pictures in which they appear, and links to relevant external sites, e.g.

http://www.nzetc.org/tm/scholarly/name-207418.html

The ontology used is a subset of the CIDOC CRM (a museum ontology). In future we plan to harvest dates from the texts, and provide timeline-based access to the texts.
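
As a rough illustration of two mechanisms described above (containment associations driving the tables of contents and the "next"/"previous" links, and scoped occurrences selecting the "basic" or "scholarly" view), here is a minimal sketch in Python. It is not TM4J code: the TopicMap class, its method names, the topic ids and the URLs are all made up for illustration, and the idea that "next" and "previous" follow a pre-order walk of the hierarchy is an assumption rather than something stated above.

```python
from collections import defaultdict


class TopicMap:
    """Hypothetical in-memory model; illustrative only, NOT the TM4J API."""

    def __init__(self):
        self.names = {}                       # topic id -> display name
        self.children = defaultdict(list)     # container id -> contained ids, in document order
        self.occurrences = defaultdict(dict)  # topic id -> {scope: resource URL}

    def add_topic(self, topic_id, name):
        self.names[topic_id] = name

    def contain(self, parent_id, child_id):
        """Record a containment association (parent contains child)."""
        self.children[parent_id].append(child_id)

    def add_occurrence(self, topic_id, scope, url):
        self.occurrences[topic_id][scope] = url

    def reading_order(self, root_id):
        """Flatten the hierarchy into document order (pre-order traversal)."""
        order = [root_id]
        for child in self.children[root_id]:
            order.extend(self.reading_order(child))
        return order

    def toc(self, root_id, depth=0):
        """Return (depth, name) rows by walking containment associations."""
        rows = [(depth, self.names[root_id])]
        for child in self.children[root_id]:
            rows.extend(self.toc(child, depth + 1))
        return rows

    def neighbours(self, root_id, topic_id):
        """Return the (previous, next) topic ids, with None at either end."""
        order = self.reading_order(root_id)
        i = order.index(topic_id)
        prev_id = order[i - 1] if i > 0 else None
        next_id = order[i + 1] if i + 1 < len(order) else None
        return prev_id, next_id

    def occurrence(self, topic_id, scope):
        """Pick the occurrence whose scope matches the requested view."""
        return self.occurrences[topic_id].get(scope)


# Hypothetical sample data: one book with two chapters and one subsection.
tm = TopicMap()
for tid, name in [("book", "A Book"), ("ch1", "Chapter 1"),
                  ("s1", "Section 1.1"), ("ch2", "Chapter 2")]:
    tm.add_topic(tid, name)
tm.contain("book", "ch1")
tm.contain("ch1", "s1")
tm.contain("book", "ch2")
# The URL layout below is an assumption made for this example.
tm.add_occurrence("ch1", "scholarly", "/tm/scholarly/ch1.html")
tm.add_occurrence("ch1", "basic", "/tm/basic/ch1.html")

for depth, name in tm.toc("book"):
    print("  " * depth + name)          # indented table of contents
print(tm.neighbours("book", "s1"))      # previous/next around Section 1.1
print(tm.occurrence("ch1", "basic"))    # the simplified view of Chapter 1
```

The point of the sketch is that both the table of contents and the page-to-page links fall out of the same containment associations, while the choice between views is only a lookup on the scope of an occurrence.
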

Our main technical concern is to replace the in-memory back-end with a database, since we are running out of memory and will need to scale up the topic map as our collection grows, and as we add more semantic markup to the TEI.

As the lead developer on this project, I want to take this opportunity to publicly express our gratitude to Kal Ahmed, and to all the TM4J contributors, for enabling us to get this project off the ground. Thanks heaps!!

Regards

Con

--
Conal Tuohy
Senior Programmer
+64-4-463-6844
+64-21-237-2498
co...@nz...
New Zealand Electronic Text Centre
www.nzetc.org
----
"I believe we were all glad to leave New Zealand. It is not a
pleasant place. Amongst the natives there is absent that
charming simplicity which is found in Tahiti; and the greater
part of the English are the very refuse of society. Neither
is the country itself attractive. I look back but to one
bright spot, and that is Waimate, with its Christian
inhabitants." - Charles Darwin