From: Lars S. <sch...@sa...> - 2024-09-03 11:46:49
Hey Claudius,

thank you for the insight. But with the solution you presented, I ask myself why I should use eXist-db at all. I could then just as well use an XML parser and read the fields directly from the files at file level, or use another document-based DB that can handle XML. I want to use the features of a complete system. The detour via a custom search engine for combined fields is good in its own way, but it also feels like thinking around the corner. Surely there must be a solution to this problem within eXist-db?

Best regards
Lars

On 03.09.24 at 11:27, Claudius Teodorescu wrote:
> Dear Lars,
>
> My name is Claudius Teodorescu, and I am currently working for the
> Academy of Mainz.
>
> For publishing dictionaries (around 8, see https://clre.solirom.ro), I
> have chosen the static-website approach, along with static indexes and
> a static, browser-based search engine written in Rust and compiled to
> WebAssembly.
>
> One can see some test data from the DWDS dictionary (around 250,000
> entries, and four indexes) at
> https://claudius-teodorescu.gitlab.io/dwds-site/.
>
> Maybe this is of some help to you.
>
> I probably do not have to mention that the indexing takes seconds, as
> I have written the indexing engine in Rust. :)
>
> Best regards,
> Claudius
>
> On Tue, 3 Sept 2024 at 11:30, Lars Scheideler <sch...@sa...> wrote:
>
> Dear Boris,
>
> thank you very much for the answer.
>
> The person data and the article data are in different collections: for
> every person we have one file, and for every article we have one file.
>
> We don't have a fixed order in which data is imported or written. As I
> understand it, what you are suggesting would require all the person or
> article data to always be available before the index of the other
> collection is built, so that it can be accessed from that index.
>
> Maybe we have to, as you pointed out, test the module functions for
> better performance.
> In the past, however, after several different approaches that we have
> already tried, we have realized that querying the data, however well
> it performs in eXide, for example, is significantly slower when it is
> used for the index.
> Could it be that the structural index is not used when re-indexing?
> Our assumption was that the data is iterated over differently, or that
> new blobs for structure and content are written in addition to the
> indexing; in other words, that re-indexing validates, saves, and
> indexes the data at the same time.
>
> Other approaches include the use of a cache or a helper file, where
> the fields are composed beforehand and then indexed accordingly, but
> this also takes a long time and unfortunately also blocks working on
> the files.
> If we write a helper file with all fields in eXide, the whole process
> takes about 90 s; if we work with the fields as we would build the
> index, i.e. with xml:id in the helper file, about 110 s. A re-index,
> however, takes 2-4 hours. Not really understandable.
>
> We are now primarily trying to improve this using the xml:id/id()
> function, but we do not have much hope of improving the re-index on a
> production scale. And if all data is re-indexed, i.e. the xml:id
> fields are not yet available, it is in vain anyway.
>
> Would love to learn more about this topic and continue to share
> experiences.
>
> Lars
>
> On 03.09.2024 at 00:49, Boris Lehečka wrote:
> > Dear Lars,
> >
> > I have similar issues with indexing dictionaries: my indexing
> > procedure asks for data from the taxonomy (like expansions of
> > abbreviations), and at times indexing a dictionary with about 36,000
> > entries took a whole day.
> >
> > I don't remember who (Juri Leino, I guess) pointed out to me that
> > the index is saved only after the whole document is parsed. After
> > moving each dictionary entry to a separate file, indexing took much
> > less time (several hours).
> >
> > However, this does not seem to be the cause of your problem.
> >
> > In my opinion, your indexing code (in the module) is very
> > complicated. It can sometimes be much simpler, for example without
> > the explicit conversion to string (as in tei:persName[string(@ref)
> > eq $identifier ...]), or without normalizing spaces in
> > ip:getFullText (full-text search usually uses only the parts
> > between spaces anyway).
> >
> > My suggestion is the following: first, create an index for persons
> > in a separate collection (with a separate collection.xconf), and
> > compute fields with the values you will query or want to return
> > when you index the articles (in a different collection). Second,
> > use Lucene and full-text search in your "index-persons" module to
> > find data in the index from the first phase.
> >
> > This is just an idea, not tested; I hope someone else is much more
> > experienced in the magic of indexing.
> >
> > Best,
> >
> > Boris Lehečka
> >
> > On 02.09.2024 at 16:57, Lars Scheideler wrote:
> >> Hey,
> >>
> >> we assume that we are not using the index in our project as
> >> intended, because when we try to build the index we have created,
> >> it takes a very long time.
> >>
> >> We have two collections: one with 687 XML files in which the
> >> person data is stored, and one with 400 XML files in which the
> >> articles are stored.
> >>
> >> For the person entries we want certain information from the
> >> articles, and vice versa.
> >>
> >> Person XML:
> >>
> >> <person xml:id="i0c9ab7e2-2e21-39ff-aea8-c56ad4702a7f"
> >>         status="safe" modified="2024-07-30T13:26:09.154+02:00">
> >>     <name>Marcanton Zimara</name>
> >>     <identifier preferred="YES">https://d-nb.info/gnd/120156784</identifier>
> >>     <alternateName>Marcusantonius Zimara</alternateName>
> >>     <alternateName>Marcus Anthonius Zimara</alternateName>
> >>     <alternateName>Antonius Zimara</alternateName>
> >>     <alternateName>M. Antonius Zimarra</alternateName>
> >>     <alternateName>Marc Antoine Zimara</alternateName>
> >>     <alternateName>M. Anto. Zimare</alternateName>
> >>     <alternateName>Marco A. Zimara</alternateName>
> >>     <alternateName>Marcus A. Zimara</alternateName>
> >>     <alternateName>Marc Ant. Zimara</alternateName>
> >>     <alternateName>Marcantonio Zimara</alternateName>
> >>     <alternateName>Marcus Antonius Zimara</alternateName>
> >>     <alternateName>Marcus Antonius Zimarra</alternateName>
> >>     <alternateName>Marcianto Zimare</alternateName>
> >>     <alternateName>Marco Antonio Zimarra</alternateName>
> >>     <alternateName>Marco Antonio Zimare</alternateName>
> >>     <birthDate>1460</birthDate>
> >>     <deathDate>1532</deathDate>
> >>     <description>JWO</description>
> >>     <sortableName>Zimara, Marcanton </sortableName>
> >> </person>
> >>
> >> Article XML:
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <TEI xmlns="http://www.tei-c.org/ns/1.0">
> >>     <teiHeader>
> >>         <fileDesc>
> >>             <titleStmt>
> >>                 <title>a nihilo nihil fit</title>
> >>                 <author>
> >>                     <persName ref="/db/projects/jwo/data/lists/personenListe.xml#BS_d1e509"
> >>                               xml:id="author_BS_d1e509">
> >>                         <forename>Marcanton</forename>
> >>                         <surname>Zimara</surname>
> >>                     </persName>
> >>                 </author>
> >>             </titleStmt>
> >>             <sourceDesc>
> >>                 <p xml:id="p_sourceDesc_igw_tvr_pzb">born digital</p>
> >>             </sourceDesc>
> >>         </fileDesc>
> >>     </teiHeader>
> >>     <text xml:lang="de-DE" type="main">
> >>         <body>
> >>             <div1 xml:id="div1_d1e23_2">
> >>                 <p xml:id="p_d1e27_1" n="1">Lorem ipsum dolor sit
> >>                 amet, consectetur adipiscing elit, sed do eiusmod
> >>                 tempor incididunt ut labore et dolore magna aliqua.
> >>                 Ut enim ad minim veniam, quis nostrud exercitation
> >>                 <persName xml:id="persName_sa123"
> >>                     ref="https://d-nb.info/gnd/120156784"
> >>                     rend="smallcaps">Zimara</persName> ullamco
> >>                 laboris nisi ut aliquip ex ea commodo consequat.
> >>                 Duis aute irure dolor in reprehenderit in voluptate
> >>                 velit esse cillum dolore eu <persName
> >>                     xml:id="persName_s123"
> >>                     ref="https://d-nb.info/gnd/120156784">Zimara</persName>
> >>                 fugiat nulla pariatur. Excepteur sint occaecat
> >>                 cupidatat non proident, sunt in culpa qui officia
> >>                 deserunt mollit anim id est laborum.</p>
> >>             </div1>
> >>         </body>
> >>     </text>
> >> </TEI>
> >>
> >> Collection.xconf:
> >>
> >> <collection xmlns="http://exist-db.org/collection-config/1.0">
> >>     <index xmlns:gndo="https://d-nb.info/standards/elementset/gnd#"
> >>            xmlns:owl="http://www.w3.org/2002/07/owl#"
> >>            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> >>            xmlns:xs="http://www.w3.org/2001/XMLSchema">
> >>         <lucene>
> >>             <module uri="http://place.sok.org/xquery/index-persons" prefix="ip"
> >>                     at="xmldb:exist:///db/apps/sok-application/modules/index-persons.xqm"/>
> >>             <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> >>             <analyzer class="org.exist.indexing.lucene.analyzers.NoDiacriticsStandardAnalyzer"
> >>                       id="nodiacritics"/>
> >>             <text qname="adcache">
> >>                 <field name="basicId" expression="//person/@id/string()"/>
> >>                 <field name="fullname" expression="string(./basic/person/name)"/>
> >>                 <field name="gndURI" expression="string(./basic/person/identifier[@preferred eq 'YES'])"/>
> >>                 <field name="gndID" expression="substring-after(./basic/person/identifier[@preferred eq 'YES']/string(), '/gnd/')"/>
> >>                 <field name="status" expression="./basic/person/@status/string()"/>
> >>                 <field name="articleID" expression="ip:getArticleFromPersonCache(.)"/>
> >>                 <field name="articleRole" expression="ip:getArticleRoleFromPersonCache(.)"/>
> >>                 <field name="fulltext" expression="ip:getFullText(.)"/>
> >>             </text>
> >>         </lucene>
> >>     </index>
> >> </collection>
> >>
> >> Module functions:
> >>
> >> module namespace ip = "http://place.sok.org/xquery/index-persons";
> >>
> >> declare namespace basic = "http://place.sok.org/xquery/basic";
> >> declare namespace xs = "http://www.w3.org/2001/XMLSchema";
> >> declare namespace tei = "http://www.tei-c.org/ns/1.0";
> >> declare namespace util = "http://exist-db.org/xquery/util";
> >>
> >> declare function ip:getArticleFromPersonCache($adcache as element()) as xs:string* {
> >>     let $parentCollectionPath as xs:anyURI? := ip:getParentCollection($adcache),
> >>         $basicId as xs:string := $adcache/basic/person/@id/string(),
> >>         $identifier as xs:string? := $adcache/basic/person/identifier[@preferred eq 'YES']/string(),
> >>         $listId as xs:string? := collection($variables:jwo-lists-path)/tei:TEI//tei:person[
> >>             basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
> >>         $foundInDocumentIds as xs:string* :=
> >>             if (matches($parentCollectionPath, 'prepublish'))
> >>             then
> >>                 collection($parentCollectionPath)/tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
> >>                     [matches(replace((normalize-space('||'||string-join(distinct-values(.//tei:persName[@ref]/@ref/string()) ! replace(.,'.*?#',''), '||')||'||')||normalize-space('||'||string-join(distinct-values(.//tei:persName[@source]/@source/string()) ! replace(.,'.*?#',''), '||')||'||')),'\|{4}',''),'\|{2}('||$basicId||'|'||$listId||'|'||$identifier||')\|{2}')]//tei:idno/string()
> >>             else
> >>                 collection($parentCollectionPath)/tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
> >>                     [matches(replace((normalize-space('||'||string-join(distinct-values(.//tei:persName[@ref][not(parent::editor)]/@ref/string()) ! replace(.,'.*?#',''), '||')||'||')||normalize-space('||'||string-join(distinct-values(.//tei:persName[@source][not(parent::editor)]/@source/string()) ! replace(.,'.*?#',''), '||')||'||')),'\|{4}',''),'\|{2}('||$basicId||'|'||$listId||'|'||$identifier||')\|{2}')]//tei:idno/string()
> >>     return $foundInDocumentIds
> >> };
> >>
> >> declare function ip:getAuthenticatedArticleCollection($collection-name as xs:string) as item()* {
> >>     if ($collection-name eq 'prepublish')
> >>     then xmldb:xcollection($variables:jwo-prepublish-path)
> >>     else xmldb:xcollection($variables:jwo-publish-path)
> >> };
> >>
> >> declare function ip:getPersNamesInCollectionFromCachedPerson($cached-person as element(),
> >>         $collection-name as xs:string) as element()* {
> >>     let $basicId := $cached-person/basic/person/@id/string()
> >>     let $identifier := $cached-person/basic/person/identifier[@preferred eq 'YES']/string()
> >>     let $listId := collection($variables:jwo-lists-path)/tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
> >>         basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string()
> >>     let $collection := ip:getAuthenticatedArticleCollection($collection-name)
> >>     return
> >>         $collection//tei:persName[
> >>             string(@ref) eq $identifier
> >>             or ip:getIdFromUri(string(@ref)) eq $listId
> >>             or substring-after(string(@ref), '#') eq $basicId
> >>             or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId]
> >> };
> >>
> >> declare function ip:getRoleFromPersName($persName as element(), $collection-name as xs:string) as xs:string? {
> >>     if ($persName/ancestor::*/local-name() = 'author')
> >>     then 'author'
> >>     else if ($persName/ancestor::*/local-name() = 'editor')
> >>     then (
> >>         if ($collection-name eq 'prepublish')
> >>         then 'editor'
> >>         (: ignore editors in the published case :)
> >>         else ()
> >>     )
> >>     else 'annotated'
> >> };
> >>
> >> declare function ip:getArticleRoleFromPersonCache($cached-person as element(),
> >>         $collection-name as xs:string) as xs:string* {
> >>     let $allPersNames :=
> >>         if ($collection-name ne 'prepublish')
> >>         then ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
> >>             [not(ancestor::*/local-name() = 'editor')]
> >>         else ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
> >>     return
> >>         for $articleGroup in $allPersNames
> >>         let $articleID := $articleGroup/ancestor::tei:TEI//tei:idno[1]
> >>         group by $articleID
> >>         return
> >>             $articleID || '@@' || string-join(distinct-values(
> >>                 for $persName in $articleGroup
> >>                 let $role := ip:getRoleFromPersName($persName, $collection-name)
> >>                 order by $role
> >>                 return $role
> >>             ), ' ')
> >> };
> >>
> >> declare function ip:getParentCollection($element as node()) as xs:anyURI? {
> >>     resolve-uri('../../', $element/base-uri())
> >> };
> >>
> >> declare function ip:getIdFromUri($uri as xs:string) as xs:string {
> >>     substring-after($uri, '#')
> >> };
> >>
> >> declare function basic:basic-id-from-url($url as xs:string) as xs:string? {
> >>     substring-after(substring-before($url, '?dataset'), 'persons/')
> >> };
> >>
> >> declare function ip:getFullText($element) as xs:string {
> >>     let $parentCollection as xs:anyURI? := ip:getParentCollection($element)
> >>     return
> >>         normalize-space(string-join(
> >>             let $basicId as xs:string := $element/basic/person/@id/string(),
> >>                 $identifier as xs:string* := $element/basic/person/identifier[@preferred eq 'YES']/string(),
> >>                 $listId as item()* := collection($variables:lists-path)/tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
> >>                     basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
> >>                 $element-string as xs:string* := string($element),
> >>                 $collections as item()* := collection($parentCollection)//tei:persName[
> >>                     string(@ref) eq $identifier
> >>                     or ip:getIdFromUri(string(@ref)) eq $listId
> >>                     or substring-after(string(@ref), '#') eq $basicId
> >>                     or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId][1],
> >>                 $element-cache-string as xs:string* :=
> >>                     string-join(for $found-element in $collections
> >>                         where count($found-element) > 0
> >>                         return $found-element, ' ')
> >>             return ($element-string, $element-cache-string),
> >>         ' '))
> >> };
> >>
> >> Please help.
> >
> > _______________________________________________
> > Exist-open mailing list
> > Exi...@li...
> > https://lists.sourceforge.net/lists/listinfo/exist-open
>
> --
> Lars Scheideler
> - scientific-technical staff member -
> Althochdeutsches Wörterbuch & Digital Humanities
>
> Sächsische Akademie der Wissenschaften zu Leipzig
> Karl-Tauchnitz-Straße 1
> 04107 Leipzig
>
> sch...@sa...
> www.saw-leipzig.de
>
> --
> Kind regards,
> Claudius Teodorescu

--
Lars Scheideler
- scientific-technical staff member -
Althochdeutsches Wörterbuch & Digital Humanities

Sächsische Akademie der Wissenschaften zu Leipzig
Karl-Tauchnitz-Straße 1
04107 Leipzig

sch...@sa...
www.saw-leipzig.de
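
Boris's two-phase suggestion from earlier in the thread could be sketched roughly as below. This is an untested sketch: the persons collection path, the field names, and the helper function ip:lookup-person-by-gnd are illustrative assumptions, not part of the project, and it presumes an eXist-db version (5.x+) where Lucene fields can be queried with the field:term syntax of ft:query.

```xquery
(: Untested sketch of the two-phase indexing idea. Paths and the
   function name ip:lookup-person-by-gnd are hypothetical. :)

(: Phase 1: a separate collection.xconf for the persons collection,
   with plain XPath field expressions and no module calls, so that
   indexing each person file stays cheap:

       <collection xmlns="http://exist-db.org/collection-config/1.0">
           <index>
               <lucene>
                   <text qname="person">
                       <field name="gndID"
                              expression="substring-after(identifier[@preferred eq 'YES'], '/gnd/')"/>
                       <field name="fullname" expression="name"/>
                   </text>
               </lucene>
           </index>
       </collection>
:)

(: Phase 2: while indexing an article, resolve a person through the
   pre-built persons index with a Lucene field query, instead of
   scanning and regex-matching the whole article collection for every
   person entry. :)
declare function ip:lookup-person-by-gnd($gndId as xs:string) as element()* {
    collection('/db/apps/sok-application/data/persons')
        /person[ft:query(., 'gndID:' || $gndId)]
};
```

If the field lookup holds up in practice, the expensive scan in ip:getArticleFromPersonCache would shrink to one indexed query per reference, which is the point of Boris's two-phase proposal.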