From: Joe W. <jo...@gm...> - 2011-11-02 16:48:26
Hi Chris,

I'm glad you found the workshop valuable. Sorry for my slow reply!

The structure for your data makes sense to me. Admittedly I haven't done an analogous project with annotated bibliographies. I'd invite others who may have suggestions about ways to structure annotated bibliographies in TEI to chime in. Also, you might consult http://tei-l.markmail.org/search/?q=bibliographies for recent discussions on doing bibliographies in TEI. And TEI-L would be another good place to ask.

Let me focus, then, on the question of how eXist's indexing technologies might factor into the way you store and query your TEI; along the way I'll touch on the elements vs. attributes question.

Your example entry is very well suited for rapid searches using eXist's structural, range, and Lucene full text indexes. Indexes speed up queries, because rather than "brute-force" scanning through your documents (potentially many gigabytes on disk) to find the answer, eXist can find the answer in its indexes (much smaller, and likely to fit in RAM; in-memory index searches are the fastest).

Structural Index: eXist's structural index keeps track of every element and attribute and their structural placements in the database. This makes pure XPath-based queries very fast, e.g.

    //tei:bibl

would quickly return all of the bibl elements in the database. Or

    //tei:date[not(@when)]

would quickly return all of the date elements that do not have a @when attribute. The structural index is always on. Other indexes must be manually configured and applied by you:

Range Index: A range index stores all of the values of a specific element or attribute, and it greatly speeds up queries on the values of that element or attribute. Range indexes are typed as string, integer, date, year, etc.
For example, if you wanted to search for entries in the 1990s, you could apply a year-based range index to the @when attribute, speeding up queries like:

    //tei:bibl[tei:date/@when ge 1990 and tei:date/@when lt 2000]

If you wanted to query all entries from a specific journal, you could apply string-based range indexes to (1) the @level attribute, since that distinguishes between different title types, and (2) the title element:

    //tei:title[@level eq 'j'][. eq 'Some Journal']

Notice a pattern here: whenever you filter an expression with a predicate (the bit in square brackets) that uses comparisons (equals, less than, greater than), you can most likely apply a range index. Think of a range index as a very literal dictionary of the exact values.

Notice also that we have been using plain, pure XPath and XQuery here. Your queries simply get faster by virtue of the built-in structural index and the user-specified range indexes.

Lucene Full Text Index: Whereas a range index is very literal and enables queries of the full value of an element or attribute, a full text index does a lot more work behind the scenes: it identifies "words" in the text in an element or attribute (typically, by treating a space as the thing that separates words) and stores those words in an index; this process is called "tokenization". Full text indexes also let you search with wildcards like * or ?. So it makes sense to apply full text indexes in cases where you have many words. I wouldn't apply a full text index to the @level attribute in your example, since that contains codes (e.g., "a" and "j"); nor would I apply it to the @when attribute, since that contains years. But I would think about applying one to the title elements, or to the note elements.
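
A wildcard search of this kind might look like the following sketch. It assumes the data sits in /db/myapp (a placeholder path), that tei:title carries a Lucene index, and that the ft prefix is already bound to eXist's Lucene module, as it normally is; the search term 'reflect*' and the <hit> wrapper element are made up for illustration.

```xquery
xquery version "1.0";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Assumption: documents live in /db/myapp and tei:title has a Lucene index. :)
(: ft:query takes the nodes to search and a Lucene query string;             :)
(: the trailing * is a Lucene wildcard standing for any run of characters.   :)
for $title in collection('/db/myapp')//tei:title[ft:query(., 'reflect*')]
return
    (: Report each matching title together with its @level code, if any :)
    <hit level="{ $title/@level }">{ string($title) }</hit>
```

Without a Lucene index on tei:title, I would not expect this to return any hits, so it also serves as a quick check that the index configuration has been applied.
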
For example, if you want to search for annotations that contain certain words, you can apply a Lucene index to the "note" element and use this kind of query:

    //tei:item[ft:query(tei:note, 'emotion')]

I'll leave it as an exercise for you to think about whether you might ever want a range index and a full text index on the same element, e.g., title. (There's no reason you can't.)

To configure and apply the indexes above, you would need to create a collection.xconf file and place it in the /db/system/config/db... collection corresponding to your own data's collection in the database (e.g., for data stored in /db/myapp, you would put the index configuration file in /db/system/config/db/myapp/collection.xconf):

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:tei="http://www.tei-c.org/ns/1.0">
            <!-- Disable the legacy full text index -->
            <fulltext default="none" attributes="false"/>
            <!-- Lucene index configuration -->
            <lucene>
                <!-- The standard analyzer will ignore stopwords like 'the', 'and' -->
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <!-- Whitespace analyzer includes stopwords like 'the', 'and' -->
                <!--analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/-->
                <text qname="tei:note"/>
                <text qname="tei:title"/>
            </lucene>
            <!-- Range index configuration -->
            <create qname="@level" type="xs:string"/>
            <create qname="tei:title" type="xs:string"/>
            <create qname="@when" type="xs:date"/>
            <!-- Note on @when: "eXist can only use a range index if all values
                 within that index are valid instances of the defined index type.
                 So every date has to be an xs:date and if there's just one
                 exception, the index will no longer be used." -->
        </index>
    </collection>

There's quite a bit that you can do to customize your full text search indexes with eXist, including using Lucene query syntax, expressing your query in XML, and selecting various Lucene analyzers (with purpose-built and/or language-specific features). Here are some links about these topics:

1. Configuring Database Indexes: http://exist-db.org/indexing.html
2. Tips on Writing Queries: http://exist-db.org/tuning.html#d1973e562
3. Lucene-based Full Text Index: http://exist-db.org/lucene.html
4. Lucene query syntax: http://lucene.apache.org/java/2_9_3/queryparsersyntax.html

Nothing I've discussed mandates that you store your data in elements vs. attributes. That said, I believe that there is currently a limitation on applying Lucene full text indexes to attribute data in eXist (no such restriction on range indexes though). Perhaps one of the core eXist developers will comment on that.

I hope this answer, belated as it may be, helps.

Cheers,
Joe

On Tue, Oct 18, 2011 at 5:59 PM, Christopher Thomson <chr...@ca...> wrote:
> Hello,
>
> First of all I wanted to say Joe's introductory workshop on eXist at
> Oxford this year was really valuable, and although it's taken me a while
> to post something to this list, I think it's a great idea and I hope
> there are others out there interested in getting started with eXist. For
> my own part, I'd like to gather some feedback on a project I'm working
> on, a digital edition of an annotated bibliography that has existed for
> a few years in print form. It's a modest project, and I'm using it to
> learn XQuery and half a dozen other things as I go along :)
>
> At present I have created some sample TEI XML files, and have used the
> <particDesc> in the TEI header to record biographical information about
> each author. The body of each entry contains bibliographic references
> in a number of categories, and I've marked these up using divs for the
> categories and <bibl> elements sitting as list items within.
>
> I'm aiming to use elements rather than attributes to hold the data
> wherever possible, as my limited understanding of full text search
> suggests this is a good idea for creating indexes. However, some data
> is held as attributes, as in the sample div below. Does this make it
> any more difficult to search/index?
>
> Using eXist, I've managed to produce a very basic local hosted website
> to view some of my sample TEI. Any advice on this general approach to
> creating an annotated bibliography would be most welcome, as would any
> resources or examples I should consult.
>
> Best regards,
> Chris
>
>
> <div type="autobiographical">
>   <head>Autobiographical articles</head>
>   <list>
>     <item>
>       <bibl>
>         <author>Person, Ann</author>
>         <title level="a">Reflections</title>
>         <title level="j">Some Journal</title>
>         <biblScope type="vol">1</biblScope>
>         <date when="1992">1992</date>
>         <biblScope type="pp">3-4</biblScope>
>       </bibl>
>       <note>Annotation and summary of text ...</note>
>     </item>
>   </list>
> </div>