Diversity Engine Wiki

Status: Pre-Alpha

Brought to you by: mikemat

FirstAppIndex

Indexing and Search

Solr

The Diversity Engine testbed uses Solr for indexing and search. Solr is a popular enterprise search platform with several features that suit our purposes, chief among them a clear XML-based API for adding documents to the index. To add Diversity Engine analysis to the index, the testbed converts Living Knowledge (LK) XML to Solr XML. The conversion is controlled by the converter XML file named in the converter attribute of the indexer element in the config file:

<indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data"
    converter="conf/testbed/tutorial-lk2solr-final.xml"/>

The converter XML file has the following format:

<lktosolr>
        <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
            filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
        <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
            filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
        <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
        <field solr="pubdate" annotation="metainfo:lktext" value="date"
            type="date"/>
</lktosolr>
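
As a quick sanity check on a converter file, its field mappings can be listed with the JDK's DOM parser. This is only a sketch, not the testbed's own loader: the class name ConverterFileLister and the method describeFields are hypothetical, and treating a missing type attribute as text is an assumption based on the attribute notes in this section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ConverterFileLister {
    // Returns one summary line per <field> element in the converter XML.
    public static List<String> describeFields(String converterXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(converterXml)));
        NodeList fields = doc.getElementsByTagName("field");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Element f = (Element) fields.item(i);
            // Assumption: a field without a type attribute is indexed as text.
            String type = f.hasAttribute("type") ? f.getAttribute("type") : "text";
            out.add(f.getAttribute("solr") + " <- " + f.getAttribute("annotation")
                + " [" + f.getAttribute("value") + "] (" + type + ")");
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<lktosolr>"
            + "<field solr=\"keywords\" annotation=\"TOP_ENTITIES\" value=\"$text\"/>"
            + "<field solr=\"pubdate\" annotation=\"metainfo:lktext\" value=\"date\" type=\"date\"/>"
            + "</lktosolr>";
        for (String line : describeFields(xml)) {
            System.out.println(line);
        }
    }
}
```

Run against the two fields above, this prints one line per mapping, which makes a mistyped annotation name or a missing value attribute easy to spot before indexing.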

The important attributes of each field element are:
* solr - The name of the Solr field
* annotation - The LK XML annotation (or metainfo) that supplies the value
* value - The value to assign to the Solr field, one of:
    * $text = the text of the annotation
    * $doctext = the document text to which the annotation applies
    * [name] = the name associated with an annotation (metainfo, STEM, etc)
    * xpath:[xpath] = an XPath expression that evaluates to a string for the given annotation
* type - Only date is supported (all other fields are treated as text)
* filter - A Java class that can modify the value in some way (remove, add, change, etc)
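
To make the role of a filter more concrete, here is a minimal sketch of what a value filter could look like. The ValueFilter interface below is a hypothetical stand-in defined only for this example. This page does not show the interface the testbed's own filters (such as PerValueFilter) implement, so the method name and signature are assumptions. The sketch returns a list so that a filter can remove a value (empty list), change it (one entry), or add values (several entries).

```java
import java.util.ArrayList;
import java.util.List;

public class ValueFilterSketch {
    /** Hypothetical stand-in; NOT the testbed's real filter interface. */
    public interface ValueFilter {
        /** Returns the values to index: empty = remove, one = keep or change, several = add. */
        List<String> apply(String value);
    }

    /** Example filter: trims the value, replaces underscores with spaces, drops blanks. */
    public static final ValueFilter UNDERSCORE_TO_SPACE = value -> {
        List<String> out = new ArrayList<>();
        String cleaned = value.trim().replace('_', ' ');
        if (!cleaned.isEmpty()) {
            out.add(cleaned);
        }
        return out;
    };

    public static void main(String[] args) {
        // prints [Federal Constitutional Court of Germany]
        System.out.println(UNDERSCORE_TO_SPACE.apply("Federal_Constitutional_Court_of_Germany"));
        // prints [] because a blank value is removed
        System.out.println(UNDERSCORE_TO_SPACE.apply("   "));
    }
}
```

A real filter would be referenced by its fully qualified class name in the filter attribute, as in the converter file above.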

This is easiest to see by example. In the previous section, we added fact analysis. A snippet of the fact analysis XML is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<lk-annotation>
  <meta-info>
    <tag name="base">....20090729085530-00000.arc.15521713.lktext.xml</tag>
    <tag name="annotator">Advanced-Facts-Extractor-v3</tag>
  </meta-info>
  <annotation scope="....20090729085530-00000.arc.15521713.lktext.xml" 
        provides="yago-entities">
    <e id="1" start="#649" end="#668">Federal_Constitutional_Court_of_Germany</e>
    <e id="2" start="#151" end="#158">Slovakia</e>
    <e id="3" start="#638" end="#647">The_German</e>
    ....
  </annotation>
  <annotation provides="facts">
    ....
    <e id="26" on="#2">
      <entity-information>
        <id>Slovakia</id>
        <indices>
          <index start="#151" end="#158" />
        </indices>
        <facts>
          <hasPopulationDensity>111#/km^2</hasPopulationDensity>
          <hasGini>19.5</hasGini>
          <type>wikicategory_European_Union_member_economies</type>
          <type>wikicategory_European_Union_member_states</type>
          <type>wikicategory_European_sovereign_states</type>
          <type>wikicategory_Landlocked_countries</type>
          <type>wikicategory_OECD_member_economies</type>
          <type>wikicategory_Slavic_countries</type>
          <type>wikicategory_States_and_territories_established_in_1993</type>
          <type>wordnet_administrative_district_108491826</type>
          <type>wordnet_country_108544813</type>
          <type>wordnet_district_108552138</type>
          <type>wordnet_economy_108366753</type>
          <type>wordnet_group_100031264</type>
          ....
        </facts>
      </entity-information>
    </e>
  </annotation>
</lk-annotation>

Let us assume we want to index the yago-entities into one Solr field and then to index all yago entities that are of type country into another field. The following lines would be added to the converter XML file.

    <field solr="yago" annotation="yago-entities" value="$text" />
    <field solr="yago-country" annotation="facts"
        value="xpath:/entity-information[facts/type/text()
            ='wordnet_country_108544813']/id/text()" />

For the first field, every entity (<e>) in the annotation with provides="yago-entities" is extracted, and a Solr field named yago is created for each <e>, with the element's text as its value (Federal_Constitutional_Court_of_Germany, Slovakia, The_German, etc). For the second field, the XPath expression is evaluated against the XML under each <e> in the annotation with provides="facts". If the expression returns a value, a Solr field named yago-country is created with that value. The resulting Solr document is:

<doc boost="1.0">
.....
    <field name="yago-country" boost="1.0">Slovakia</field>
    <field name="yago-country" boost="1.0">France</field>
.....
    <field name="yago" boost="1.0">Federal_Constitutional_Court_of_Germany</field>
    <field name="yago" boost="1.0">Slovakia</field>
    <field name="yago" boost="1.0">The_German</field>
    <field name="yago" boost="1.0">Jens-Peter_Bonde</field>
.....
</doc>
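
The XPath step for the yago-country field can be reproduced outside the testbed with the JDK's standard javax.xml.xpath API. The sketch below is illustrative only. It assumes the XML under each <e> is evaluated as its own small document, so that the leading / resolves to entity-information; the testbed may apply the expression differently. The class name and the second, non-country entity (Example_Entity) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class YagoCountryXPathSketch {
    // Expression from the converter file, with the "xpath:" prefix removed.
    public static final String EXPR =
        "/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()";

    // Assumption: the XML under one <e> is parsed as a standalone document.
    public static String evaluate(String entityXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(entityXml)));
        // Yields the string value of the first matching node, or "" if nothing matches.
        return XPathFactory.newInstance().newXPath().evaluate(EXPR, doc);
    }

    public static void main(String[] args) throws Exception {
        // Trimmed copy of the Slovakia entity from the fact analysis snippet.
        String slovakia = "<entity-information><id>Slovakia</id><facts>"
            + "<type>wordnet_country_108544813</type>"
            + "<type>wordnet_district_108552138</type>"
            + "</facts></entity-information>";
        // Hypothetical entity that has no country type.
        String other = "<entity-information><id>Example_Entity</id><facts>"
            + "<type>wordnet_group_100031264</type>"
            + "</facts></entity-information>";

        System.out.println(evaluate(slovakia));        // prints Slovakia
        System.out.println(evaluate(other).isEmpty()); // prints true: no field would be created
    }
}
```

The predicate compares every type text node with the given string and succeeds if any one of them matches, which is why an entity with many type elements still yields a single id.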

After changing the file, you can index by running the following command:

    apps/testbed -run convert-solr,index conf/testbed/tutorial-application.xml

You can re-deploy the testbed by entering:

    ant deploy-testbed

The Solr file in devapps/example/data/output/solr and the Solr index now both contain the new fields.

Next: Customizing the web app

Wiki: FirstAppPipeline
Wiki: FirstAppWebapp
Wiki: Tutorial
