From: Joe W. <jo...@gm...> - 2011-11-02 16:48:26
Hi Chris, I'm glad you found the workshop valuable. Sorry for my slow reply! The structure for your data makes sense to me. Admittedly I haven't done an analogous project with annotated bibliographies, so I'd invite others who may have suggestions about ways to structure annotated bibliographies in TEI. Also, you might consult http://tei-l.markmail.org/search/?q=bibliographies for recent discussions on doing bibliographies in TEI. And TEI-L would be another good place to ask.

Let me focus, then, on the question of how eXist's indexing technologies might factor into the way you store and query your TEI; along the way I'll touch on the elements vs. attributes question. Your example entry is very well suited to rapid searches using eXist's structural, range, and Lucene full text indexes. Indexes speed up queries because, rather than "brute-force" scanning through your documents (potentially many gigabytes on disk) to find the answer, eXist can find the answer in its indexes (which are much smaller and likely to fit in RAM; in-memory index searches are the fastest).

Structural index: eXist's structural index keeps track of every element and attribute and their structural placement in the database. This makes pure XPath-based queries very fast: e.g., //tei:bibl would quickly return all of the bibl elements in the database, and //tei:date[not(@when)] would quickly return all of the date elements that do not have a @when attribute. The structural index is always on by default. The other indexes must be manually configured and applied by you.

Range index: A range index stores all of the values of a specific element or attribute, and it greatly speeds up queries on those values. Range indexes are typed as string, integer, date, year, etc.
For example, if you wanted to search for entries in the 1990s, you could apply a year-based range index to the @when attribute, speeding up queries like:

//tei:bibl[tei:date/@when ge 1990 and tei:date/@when lt 2000]

If you wanted to query all entries from a specific journal, you could apply string-based range indexes to (1) the @level attribute, since that distinguishes between different title types, and (2) the title element:

//tei:title[@level eq 'j'][. eq 'Some Journal']

Notice a pattern here: whenever you filter an expression with a predicate (the bit in square brackets) that uses comparisons (equals, less than, greater than), you can most likely apply a range index. Think of a range index as a very literal dictionary of the exact values. Notice also that we have been using plain, pure XPath and XQuery here; your queries simply get faster by virtue of the built-in structural index and the user-specified range indexes.

Lucene full text index: Whereas a range index is very literal and enables queries on the full value of an element or attribute, a full text index does a lot more work behind the scenes: it identifies "words" in the text of an element or attribute (typically by treating a space as the thing that separates words) and stores those words in an index; this process is called "tokenization". Full text indexes also let you search with wildcards like * or ?. So it makes sense to apply full text indexes in cases where you have many words. I wouldn't apply a full text index to the @level attribute in your example, since that contains codes (e.g., "a" and "j"); nor would I apply it to the @when attribute, since that contains years. But I would think about applying one to the title elements, or to the note elements.
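To make these one-line snippets runnable as-is, here is a complete query sketch with the TEI namespace declared (the /db/myapp collection path is just an assumed example; substitute your own collection):

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: entries dated in the 1990s; fast when a range index is configured on @when :)
for $bibl in collection("/db/myapp")//tei:bibl[tei:date/@when ge 1990 and tei:date/@when lt 2000]
order by $bibl/tei:date/@when
return $bibl
```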
For example, if you want to search for notes that contain certain words, you can apply a Lucene index to the note element and use this kind of query:

//tei:bibl[ft:query(tei:note, 'emotion')]

I'll leave it as an exercise for you to think about whether you might ever want a range index and a full text index on the same element, e.g., title. (There's no reason you can't.)

To configure and apply the indexes above, you would need to create a collection.xconf file and place it in the /db/system/config/db... collection corresponding to your own data's collection in the database (e.g., for data stored in /db/myapp, you would put the index configuration file in /db/system/config/db/myapp/collection.xconf):

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- Disable the legacy full text index -->
        <fulltext default="none" attributes="false"/>
        <!-- Lucene index configuration -->
        <lucene>
            <!-- The standard analyzer will ignore stopwords like 'the', 'and' -->
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- The whitespace analyzer includes stopwords like 'the', 'and' -->
            <!--analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/-->
            <text qname="tei:note"/>
            <text qname="tei:title"/>
        </lucene>
        <!-- Range index configuration -->
        <create qname="@level" type="xs:string"/>
        <create qname="tei:title" type="xs:string"/>
        <create qname="@when" type="xs:date"/>
        <!-- Note on @when: eXist can only use a range index if all values within
             that index are valid instances of the defined index type. So every date
             has to be an xs:date, and if there's just one exception, the index will
             no longer be used. -->
    </index>
</collection>

There's quite a bit that you can do to customize your full text search indexes with eXist, including using Lucene query syntax, expressing your query in XML, and selecting various Lucene analyzers (with purpose-built and/or language-specific features). Here are some links about these topics:

1.
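Since ft:query() accepts Lucene query syntax, the same index also supports wildcard searches. A small sketch (assuming the hypothetical Lucene index on tei:note configured above):

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: matches notes containing 'emotion', 'emotions', 'emotional', etc. :)
//tei:bibl[ft:query(tei:note, 'emot*')]
```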
Configuring Database Indexes: http://exist-db.org/indexing.html
2. Tips on Writing Queries: http://exist-db.org/tuning.html#d1973e562
3. Lucene-based Full Text Index: http://exist-db.org/lucene.html
4. Lucene query syntax: http://lucene.apache.org/java/2_9_3/queryparsersyntax.html

Nothing I've discussed mandates that you store your data in elements vs. attributes. That said, I believe that there is currently a limitation on applying Lucene full text indexes to attribute data in eXist (no such restriction on range indexes though). Perhaps one of the core eXist developers will comment on that.

I hope this answer, belated as it may be, helps.

Cheers, Joe

On Tue, Oct 18, 2011 at 5:59 PM, Christopher Thomson <chr...@ca...> wrote:
> Hello,
>
> First of all I wanted to say Joe's introductory workshop on eXist at
> Oxford this year was really valuable, and although it's taken me a while
> to post something to this list, I think it's a great idea and I hope
> there are others out there interested in getting started with eXist. For
> my own part, I'd like to gather some feedback on a project I'm working
> on, a digital edition of an annotated bibliography that has existed for
> a few years in print form. It's a modest project, and I'm using it to
> learn XQuery and half a dozen other things as I go along :)
>
> At present I have created some sample TEI XML files, and have used the
> <particDesc> in the TEI header to record biographical information about
> each author. The body of each entry contains bibliographic references
> in a number of categories, and I've marked these up using divs for the
> categories and <bibl> elements sitting as list items within.
>
> I'm aiming to use elements rather than attributes to hold the data
> wherever possible, as my limited understanding of full text search
> suggests this is a good idea for creating indexes. However, some data
> is held as attributes, as in the sample div below. Does this make it
> any more difficult to search/index?
> Using eXist, I've managed to produce a very basic local hosted website
> to view some of my sample TEI. Any advice on this general approach to
> creating an annotated bibliography would be most welcome, as would any
> resources or examples I should consult.
>
> Best regards,
> Chris
>
> <div type="autobiographical">
>   <head>Autobiographical articles</head>
>   <list>
>     <item>
>       <bibl>
>         <author>Person, Ann</author>
>         <title level="a">Reflections</title>
>         <title level="j">Some Journal</title>
>         <biblScope type="vol">1</biblScope>
>         <date when="1992">1992</date>
>         <biblScope type="pp">3-4</biblScope>
>       </bibl>
>       <note>Annotation and summary of text ...</note>
>     </item>
>   </list>
> </div>
From: Christopher T. <chr...@ca...> - 2011-10-18 22:00:13
Hello,

First of all I wanted to say Joe's introductory workshop on eXist at Oxford this year was really valuable, and although it's taken me a while to post something to this list, I think it's a great idea and I hope there are others out there interested in getting started with eXist. For my own part, I'd like to gather some feedback on a project I'm working on, a digital edition of an annotated bibliography that has existed for a few years in print form. It's a modest project, and I'm using it to learn XQuery and half a dozen other things as I go along :)

At present I have created some sample TEI XML files, and have used the <particDesc> in the TEI header to record biographical information about each author. The body of each entry contains bibliographic references in a number of categories, and I've marked these up using divs for the categories and <bibl> elements sitting as list items within.

I'm aiming to use elements rather than attributes to hold the data wherever possible, as my limited understanding of full text search suggests this is a good idea for creating indexes. However, some data is held as attributes, as in the sample div below. Does this make it any more difficult to search/index?

Using eXist, I've managed to produce a very basic locally hosted website to view some of my sample TEI. Any advice on this general approach to creating an annotated bibliography would be most welcome, as would any resources or examples I should consult.

Best regards,
Chris

<div type="autobiographical">
  <head>Autobiographical articles</head>
  <list>
    <item>
      <bibl>
        <author>Person, Ann</author>
        <title level="a">Reflections</title>
        <title level="j">Some Journal</title>
        <biblScope type="vol">1</biblScope>
        <date when="1992">1992</date>
        <biblScope type="pp">3-4</biblScope>
      </bibl>
      <note>Annotation and summary of text ...</note>
    </item>
  </list>
</div>
From: Ron V. d. B. <ron...@ka...> - 2011-09-09 09:00:36
Hi Wolfgang,

On Thursday 8 September 2011 17:47:06, Wolfgang Meier wrote:
> Instead of just highlighting the match, we could tag all preceding and
> following tokens without much additional cost. This could probably
> also include the relative position of the token to the match, so you
> would end up with something like <context pos="-1">...</context>
> <match>...</match> <context pos="+1">...</context>.

This sounds great! Such a 'native' segmentation would undoubtedly perform much faster. Additionally, I guess it would facilitate further interaction with such collocation data as well. For example, if a collocation table shows that "great" occurs at position 3 after the search term "eXist", I can imagine that users would want a link from there to "exact proximity searches", where "eXist" occurs exactly 3 words before "great". That's something the Lucene search syntax doesn't support, does it?

> I suppose Lucene does store the total number of words per indexed
> document somewhere (it should be relevant for computing weights), so
> we could add a function to retrieve it.

Ditto: that would be very useful!

> P.S.: I plan to integrate your improved version of the kwic module. I
> just wanted to test it on some of my existing apps first to see if it
> breaks backwards compatibility or not.

That's nice to hear. Please make sure to test the version at <http://www.kantl.be/ctb/download/kwic.xql>, which has some improvements and fixes some dumb errors compared to the one I posted on eXist-open.

Kind regards, Ron
From: Wolfgang M. <wol...@ex...> - 2011-09-08 15:47:22
Hi,

> With eXist's util:expand() function for 'materializing' search results
> in their context nodes, it is possible to build nice KWIC displays (as
> aptly illustrated in eXist's own KWIC display module). Recently [1] I
> explored ways of linguistically exploiting such KWIC searches (mostly by
> adding sorting possibilities for the left and right contexts). One step
> further would be the construction of collocation tables for all words
> occurring in certain contexts of search words. Yet, I think I am
> stumbling on eXist's current limitations here, since what would make
> such data really useful, information on the (relative) occurrence of
> words at certain context positions, is currently not available in eXist
> (and would be prohibitively expensive to compute).

After reading your post recently, I thought about how we could better support collocation tables and similar features. One possibility would be to provide additional information when expanding the search results: to find the full text match position, eXist always needs to tokenize the text again by passing it through Lucene's analyzer. Instead of just highlighting the match, we could tag all preceding and following tokens without much additional cost. This could probably also include the relative position of the token to the match, so you would end up with something like <context pos="-1">...</context> <match>...</match> <context pos="+1">...</context>.

> Actually, this is also the case for a simple 'frequency list': while the
> util:index-keys() function does allow one to construct a list of all
> indexed terms, with the number of their absolute occurrences, I think it
> is ratios linguists are interested in most. That would require
> additional information on the total number of words occurring in the
> collection being queried.

I suppose Lucene does store the total number of words per indexed document somewhere (it should be relevant for computing weights), so we could add a function to retrieve it.

Wolfgang

P.S.: I plan to integrate your improved version of the kwic module. I just wanted to test it on some of my existing apps first to see if it breaks backwards compatibility or not.
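To illustrate the idea of tagging context tokens around a match: an expanded search result might come back looking something like the sketch below. This output format is hypothetical, reflecting the proposal under discussion rather than an existing eXist feature.

```xml
<p>
  <context pos="-1">with</context>
  <match>eXist</match>
  <context pos="+1">it</context>
  <context pos="+2">works</context>
  <context pos="+3">great</context>
</p>
```

A collocation table could then be built by grouping context elements on their pos attribute.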
From: Ron V. d. B. <ron...@ka...> - 2011-09-08 15:02:26
Hi all,

Inspired by Joe's nice summary of eXist's openness, I would like to take the opportunity both to tell a bit of how I would *like* to use eXist, and to ask how other people are using eXist for linguistic applications. I'm very fond of eXist's comprehensive scope for developing powerful web XML/XSLT/XQuery applications in one single environment. (Though I still live in the Cocoon+eXist days, I have experimented with eXist's MVC framework and was delighted to see how well and painlessly it can replace most of Cocoon's sitemap pipelining functionality.) I am using eXist for a number of projects, ranging from critical edition interfaces for TEI encoded texts to interfaces for TEI encoded linguistic corpora. In doing so, I sometimes feel myself pushing the limits of what's possible/sensible with XQuery, though mostly eXist does scale up well for my not-too-large text collections.

There's one such more 'experimental' use of eXist I would like to focus on here. With eXist's util:expand() function for 'materializing' search results in their context nodes, it is possible to build nice KWIC displays (as aptly illustrated in eXist's own KWIC display module). Recently [1] I explored ways of linguistically exploiting such KWIC searches (mostly by adding sorting possibilities for the left and right contexts). One step further would be the construction of collocation tables for all words occurring in certain contexts of search words. Yet, I think I am stumbling on eXist's current limitations here, since what would make such data really useful, information on the (relative) occurrence of words at certain context positions, is currently not available in eXist (and would be prohibitively expensive to compute). Actually, this is also the case for a simple 'frequency list': while the util:index-keys() function does allow one to construct a list of all indexed terms, with the number of their absolute occurrences, I think it is ratios linguists are interested in most. That would require additional information on the total number of words occurring in the collection being queried.

I hope I am not being too specific (or too vague, on the contrary) in sketching this particular use of XQuery/eXist. Are there good use cases of employing eXist for more linguistically-oriented tasks out there? Or is there perhaps a possibility in specific Lucene analyzers for linguistic / statistical analyses (I'm not too Lucene-proficient, but I appreciate that eXist's pluggable indexer architecture might allow this)?

Kind regards,
Ron

[1] <http://rvdb.wordpress.com/2011/07/20/from-kwic-display-to-kwicer-processing-with-exist/>

--
Ron Van den Branden
Wetenschappelijk attaché / Senior Researcher
Reviews Editor, LLC. The Journal of Digital Scholarship in the Humanities
Centrum voor Teksteditie en Bronnenstudie - CTB (KANTL)
Centre for Scholarly Editing and Document Studies
Koninklijke Academie voor Nederlandse Taal- en Letterkunde
Royal Academy of Dutch Language and Literature
Koningstraat 18 / b-9000 Gent / Belgium
tel: +32 9 265 93 51 / fax: +32 9 265 93 49
E-mail: ron...@ka...
http://www.kantl.be/ctb
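As a concrete reference for the frequency-list point, a minimal sketch along these lines in eXist 1.4-era XQuery might look like the following (the /db/myapp collection path and the configured index on tei:note are assumptions):

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: callback invoked once per indexed term:
   $data[1] = total frequency, $data[2] = number of documents :)
declare function local:term($term as xs:string, $data as xs:int+) as element(term) {
    <term name="{$term}" freq="{$data[1]}" docs="{$data[2]}"/>
};

(: list up to 100 indexed terms from note elements, with absolute occurrence counts :)
util:index-keys(collection("/db/myapp")//tei:note, "",
    util:function(xs:QName("local:term"), 2), 100)
```

As Ron notes, this yields absolute counts only; turning them into ratios would require the total token count for the collection, which eXist does not currently expose.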
From: Joe W. <jo...@gm...> - 2011-09-08 12:47:23
Hi all,

One thing worth highlighting about Peter's work is how he has used eXist-db as a hub for his research data: he is importing data from SQL databases (MediaWiki), RDF data from Zotero, and MS Access databases. By importing this data into eXist-db, he is able to enrich what he can do with his TEI data. And he is able to "export" his combined data into useful forms: HTML for web views, PDF for dissertation publishing, and NodeXL for data visualization.

eXist's ability to import and house data from a variety of sources is really remarkable. You can ignore this ability and use it for querying just your TEI, of course, but there are many hooks for integrating data from lots of other sources. A lot of this flexibility is due to eXist's openness to incorporating libraries from other open source projects that facilitate import and export:

- eXist can generate charts, using the built-in JFreeChart library. See http://atomic.exist-db.org/blogs/dizzzz/JFreeChart
- eXist can generate PDFs, using the built-in Apache FOP framework (commercial XSL-FO libraries can be supplied instead of FOP)
- eXist can scale images and get metadata about them - see http://demo.exist-db.org/exist/functions/image
- eXist can make SQL queries and can connect to FTP and SVN servers
- eXist is adding the Apache Tika framework for importing PDF and Microsoft Office files (and the formats listed at http://tika.apache.org/0.9/formats.html) -- making all of these formats full text searchable and queryable with XQuery

Of course, with the growth in RESTful API services such as Google Charts, Captcha, and OpenCalais, not all features need to come out of the box in eXist -- you can just use eXist's HTTP client to send data off to the API and get your results back.

A word on Jens' mention of eXist version 1.5: the current release version of eXist is 1.4.1, but eXist is adding new features to its "bleeding edge" development version, 1.5. Such features include the ability to install "app"-like packages such as Jens's Tamboti project, the Apache Tika module, more sophisticated Lucene indexers, etc. This new version isn't available as a convenient installer yet, but it can be downloaded and used -- if you're willing to tinker with Subversion. If you haven't heard of Subversion, then it's probably worth waiting for an official release version, but you're welcome to dive in. See http://exist.sourceforge.net/building.html#svn.

I, for one, am eager for the release of the current development version, because I have already adapted the Punch demo from the Oxford Summer School eXist workshop into an easily installable "app" package. It'll make the installation step a one-click process, instead of the manual process it is now. (For more info on eXist's package and repository format, see http://demo.exist-db.org/repo/repo.xml.) So there's exciting stuff coming down the pike!

Again, you can just stick with querying your TEI -- but it's nice to know that when you need to reach out to integrate other data sources or generate new forms of output, lots of solutions are there for you to make use of.

Cheers, Joe
From: Jens Ø. P. <oe...@gm...> - 2011-08-31 13:44:47
Hi Peter,

Quite amazing what one week in a summer school can do for your research! From within an installation of eXist 1.5, you can download and install a bibliographical tool built on the MODS format from the Library of Congress <http://www.loc.gov/standards/mods/>. The app is called the Tamboti Metadata Framework, and it is installed through eXist's Package Repository. It is actually more than a bibliographical database - we are now enlarging it for display of images. It should be easy to set up links from your database to Tamboti.

Tamboti also comes with Zotero unAPI support, i.e. you can import to Zotero from an icon in the navigation toolbar. Since Zotero exports in MODS format, it should be easy for you to integrate your bibliography. You might want to wait until Zotero version 3 is released (which will happen soon - the beta is out), since it contains a lot of fixes to MODS export. Tamboti also contains a full MODS editor.

Visualisation is also something I have begun to think about. There are so many different SVG- and Javascript-based solutions to choose between. Perhaps I will get together a summary of the options. It really should not be necessary to export your data to Excel to generate a visualisation! Are you thinking of possibilities to edit your XML files while they are inside the eXist database?

I would also like to inform you about another summer school, also held in Oxford, the XML Summer School, September 18-23, 2011; see <http://xmlsummerschool.com/>.

Cheers, Jens

On Aug 31, 2011, at 11:58 AM, Peter Watson wrote:
> Hi Joe and Others
>
> As a graduate of the 2010 TEI Summer School this list is a really
> welcome development for me. My own project is a personal one supporting
> my DPhil research on a set of twelfth and thirteenth century land deeds
> for a family living in the depths of the English Midlands. I looked for
> software that would allow me to place the deeds in a database using a
> single copy of the transcribed texts for all purposes.
> TEI plus eXist-db does this for me. I have developed the following
> functions, apart from the HTML rendering which came from the summer
> school material that Joe provided:
> * TEI tagging of names, dates and places allows me to list and search
>   the database in a variety of structured ways.
> * PDF printing from an FO transform based on XQuery, quite similar to
>   rendering in HTML.
> * Visualization of social network data is under development using
>   NodeXL, externally at present, though long term I would like to be
>   able to integrate visualization into eXist.
> * Integration with my bibliography based on Zotero. At present this
>   uses an RDF copy of my Zotero database saved into eXist, so it is not
>   completely real time. From eXist I can also gain access to the Zotero
>   data via SQLite, on which Zotero is based, though again this uses a
>   copy of the Zotero data.
> * I have also transformed and imported an XML copy of a much larger
>   database of C12 & C13 charters from an Access database, which allows
>   these to be accessed much more easily whilst also retaining the
>   ability to search the data that Access allows.
> * I have the utilities to enable me to search an earlier database I
>   developed based on MediaWiki, which sits on an SQL database.
> I'd be very keen to learn of anyone else who is doing something similar
> to this. At the moment all of this is on my laptop. Providing access to
> what is a prototype is some way down the list of priorities, unless
> there is a simple solution to getting the data hosted.
>
> Best wishes
>
> Peter
>
> On 30/08/2011 23:40, Joe Wicentowski wrote:
> [...]
>
> --
> Peter Watson
>
> ------------------------------------------------------------------------------
> eXist-TEIXML mailing list
> eXi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-teixml
From: Peter W. <plw...@bl...> - 2011-08-31 10:17:21
Hi Joe and Others
As a graduate of the 2010 TEI Summer School this list is a really
welcome development for me. My own project is a personal one supporting
my DPhil research on a set of twelfth and thirteenth century land deeds
for a family living in the depths of the English Midlands. I looked for
software that would allow me to place the deeds in a database using a
single copy of the transcribed texts for all purposes. TEI plus
eXist-db does this for me. I have developed the following functions,
apart from the html rendering which came from the summer school material
that Joe provided:
* TEI tagging of names, dates and places allows me to list and search
the database in a variety of structured ways.
* PDF printing from an fo transform based on XQuery, quite similar to
rendering in HTML.
* Visualization of social network data is under development using
NodeXL, externally at present, though long term I would like to be
able to integrate visualization into eXist.
* Integration with my bibliography based on Zotero. At present this
uses an RDF copy of my Zotero database saved into eXist so is not
completely real time. From eXist I can also gain access to the
Zotero data via SQLite, on which Zotero is based, though again this
uses a copy of the Zotero data.
* I have also transformed and imported an XML copy of a much larger
database of C12 & C13 charters from an Access database which allows
these to be accessed much more easily whilst also retaining the
ability to search the data that Access allows.
* I have the utilities to enable me to search an earlier database I
developed based on MediaWiki, which sits on an SQL database.
I'd be very keen to learn of anyone else who is doing something similar
to this. At the moment all of this is on my laptop. Providing access to
what is a prototype is some way down the list of priorities, unless
there is a simple solution to getting the data hosted.
Best wishes
Peter
On 30/08/2011 23:40, Joe Wicentowski wrote: Hello all, Today I sent out
an announcement about this list to the broader TEI community via TEI-L.
Welcome to all who have just joined! To kick the discussion here off, I
think it might be useful for those new to eXist-db to learn about some
current TEI-based projects that employ eXist-db in some way. There is a
list of some eXist-db projects on the TEI wiki:
http://wiki.tei-c.org/index.php/EXist#Sample_implementations. The
project I work on (also listed on the wiki) is the Office of the
Historian website (http://history.state.gov/). The entire website is
driven by eXist-db, it has been in production since March 2009, and it
is hosted on Amazon EC2. The functions of eXist-db are used throughout
the site, and all article- and book-length content on the site is
encoded in TEI. All pages are rendered into HTML by eXist-db on the fly
as they are requested. Perhaps one of the most interesting portions from
a TEI perspective is how we use TEI person and term tags to drive
filtered glossary listings for the documentary edition as the heart of
our site, the Foreign Relations of the United States series. See, for
example, the right sidebar labelled "Persons" and "Abbreviations &
Terms" on
http://history.state.gov/historicaldocuments/frus1969-76v18/d1. The
information shown in these sidebars comes from each book's glossaries.
The site-wide full text search is driven by eXist-db's lucene index. We
also use eXist-db to drive our content management system (accessible
only to internal users). Please feel free to respond with the URL of
your eXist-db / TEI project, and a brief description of your project.
Feel free to add your projects to the wiki link above too if it isn't
already there. Cheers, Joe Sent from my iPad
--
Peter Watson
From: Joe W. <jo...@gm...> - 2011-08-30 22:40:54
Hello all,

Today I sent out an announcement about this list to the broader TEI community via TEI-L. Welcome to all who have just joined!

To kick the discussion here off, I think it might be useful for those new to eXist-db to learn about some current TEI-based projects that employ eXist-db in some way. There is a list of some eXist-db projects on the TEI wiki: http://wiki.tei-c.org/index.php/EXist#Sample_implementations.

The project I work on (also listed on the wiki) is the Office of the Historian website (http://history.state.gov/). The entire website is driven by eXist-db, it has been in production since March 2009, and it is hosted on Amazon EC2. The functions of eXist-db are used throughout the site, and all article- and book-length content on the site is encoded in TEI. All pages are rendered into HTML by eXist-db on the fly as they are requested. Perhaps one of the most interesting portions from a TEI perspective is how we use TEI person and term tags to drive filtered glossary listings for the documentary edition at the heart of our site, the Foreign Relations of the United States series. See, for example, the right sidebar labelled "Persons" and "Abbreviations & Terms" on http://history.state.gov/historicaldocuments/frus1969-76v18/d1. The information shown in these sidebars comes from each book's glossaries. The site-wide full text search is driven by eXist-db's Lucene index. We also use eXist-db to drive our content management system (accessible only to internal users).

Please feel free to respond with the URL of your eXist-db / TEI project, and a brief description of your project. Feel free to add your projects to the wiki link above too if it isn't already there.

Cheers, Joe

Sent from my iPad
|
From: Jens Ø. P. <oe...@gm...> - 2011-08-30 14:14:43
|
Hi, oXygen 13 has come out, with (as usual) a number of useful new features, this time focussed on user interface improvements, many of which are relevant to eXist users. See the full description at <http://www.oxygenxml.com/#new-version>. The Data Source Explorer view, which we use to see inside an eXist database, now features drag and drop capabilities between different databases. One can now drag files and folders from one's operating system (that is, from Windows Explorer or Mac OS Finder) into eXist, thereby copying resources into eXist (drag and drop does not work the other way round). One can also drag and drop files and collections between different collections inside eXist, even across different eXist instances. Note that this does not amount to copying, but to moving: the resource is deleted at the origin (a dialog pops up, asking you to confirm your decision to move). However, if you press Ctrl while dragging, the resource is copied, not moved. The distinction between copy and move is perhaps not implemented the way I would like, but these changes make working with eXist considerably easier. WebDAV (or eXist's Java Admin Client) is still needed to make backups, but had oXygen 13 been out during the Summer School, we could have skipped quite a few steps, just dragging Joe's files inside eXist! One can also drag and drop from the Project Tree. This also amounts to copying, not moving. In the Data Source Explorer one is also supposed to be able to use keyboard shortcuts to copy, cut and paste in the usual way, but this I cannot confirm on Mac OS (and I am happy that this is so, for I believe this is rather dangerous). There are also a few TEI-specific changes, among them built-in transformation scenarios to convert TEI documents to and from DOCX. I have not been able to determine who is eligible for upgrades, but I have a version 12 license from April and I am able to run version 13 with it. 
I assume there will be a mail about upgrades from oXygen soon. As usual, you can request a 30-day trial. There is also a 30% discount on upgrades until September 15, 2011: <http://www.oxygenxml.com/buy.html#_upgrade>, but note that this does not affect Academic Licenses, which is probably what most of us have. Cheers, Jens |
|
From: Joe W. <jo...@gm...> - 2011-08-24 12:57:38
|
Hi all, One important bit of news since the seminar at digital.humanities@oxford is that eXist 1.4.1 has been released. Compared to the 1.4.0 version of eXist that we used in our seminar, the new version is improved in many ways -- belying its small ".0.1" bump in version number. You can read more about the release at http://atomic.exist-db.org/blogs/eXist/Release141. I strongly recommend upgrading to the new version. Cheers, Joe |
|
From: Joe W. <jo...@gm...> - 2011-08-23 19:27:01
|
Hi all, Welcome to everyone who has joined since the announcement about this list went out today to the digital.humanities@oxford (#dhox) participants. Before long we'll also welcome participants via other lists and channels, but until then, for the first week or so, I wanted to ensure that all of you who attended #dhox are able to send any questions you have. As this morning's email from dhox said, the slides and sample files from my presentation have been posted at http://digital.humanities.ox.ac.uk/DHSS2011/sessions.html#xmldb. Please feel free to download the files and send any comments or questions. Cheers, Joe |
|
From: Joe W. <jo...@gm...> - 2011-08-09 03:06:12
|
Hello, and welcome to the eXist-TEIXML mailing list! Inspired by the "Introduction to XML Databases" seminar at the recent Digital.Humanities@Oxford Summer School 2011 (http://digital.humanities.ox.ac.uk/DHSS2011/sessions.html#xmldb), this mailing list aims to fill the need for a discussion list for the community of users of TEI and eXist-db. There is a growing community of digital humanists who have learned TEI to encode their documents. They want to create dynamic, searchable websites to display and interact with their documents and share them with others. eXist-db, a free, open source XML database, offers an excellent platform for doing this. In fact, eXist-db is already popular in the TEI community, and TEI is popular in the eXist-db community. Both communities have their own mailing lists, but they cover a broad range of topics. This list exists to cover the intersection of eXist-db, TEI, and digital humanities. Thanks to Wolfgang Meier and Sourceforge for graciously hosting this mailing list. Joseph Wicentowski and Jens Østergaard Petersen, moderators |