From: Lizzi, V. <Vin...@ta...> - 2024-09-03 16:13:24
Hi Lars,

I've had a few observations from using eXist-db that might be of interest:

* Queries that take a long time to run and access documents in different collections can be made to run faster by placing the documents into a single collection.
* Simultaneously reading from and writing to the same collection can cause blocking.
* The tracing function in Monex is useful for seeing which indexes a query is using.

It looks like your Person XML and Article XML are using different namespaces, so that would provide a way to differentiate the two kinds of documents if they are in the same collection.

I hope this is helpful in some way.

Kind regards,
Vincent
_____________________________________________
Vincent M. Lizzi
Head of Information Standards | Taylor & Francis Group
vin...@ta...

From: Lars Scheideler <sch...@sa...>
Sent: Tuesday, September 3, 2024 7:46 AM
To: Claudius Teodorescu <cla...@gm...>; exi...@li...
Subject: Re: [Exist-open] Fwd: Cross reference Index - why is so slow

Hey Claudius,

thank you for the insight. But with the solution you presented, I ask myself why I should use eXist-db at all. Then I could simply use an XML parser and read the fields directly from the files at file level, or use another document-based DB that can handle XML. But I want to use the features of a complete system. I think the workaround with a custom search engine for combined fields is good in a way, but it also feels like a roundabout solution. Surely there must be a solution to the problem within eXist-db?

Best regards,
Lars

On 03.09.24 at 11:27, Claudius Teodorescu wrote:

Dear Lars,

My name is Claudius Teodorescu, and I am currently working for the Academy of Mainz.
For the cases of publishing dictionaries (around 8, see https://clre.solirom.ro), I have chosen the static website approach, along with static indexes and a static search engine that is browser-based, written in Rust, and compiled to WebAssembly. You can see some test data from the DWDS dictionary (around 250,000 entries and four indexes) at https://claudius-teodorescu.gitlab.io/dwds-site/. Maybe this is of some help to you. I probably do not have to mention that the indexing takes seconds, as I wrote the indexing engine in Rust. :)

Best regards,
Claudius

On Tue, 3 Sept 2024 at 11:30, Lars Scheideler <sch...@sa...> wrote:

Dear Boris,

thank you very much for the answer. The person data and the article data are in different collections: for every person we have one file, and for every article we have one file. We don't have a fixed order in which data is imported or written. As I understand it, what you are suggesting would require all of the person or article data to always be available before the index of the other collection is built, so that it can be accessed from the index of the other collection.

Maybe we have to, as you pointed out, test the module functions for better performance. In the past, however, after several different approaches we have already tried, we have realized that querying the data, no matter how fast it is in eXide, for example, is significantly slower when it is used for the index. Could it be that the structural index is not used during re-indexing? Our assumption was that the data is iterated over differently, or that new blobs for structure and content are written in addition to the indexing. In other words, that re-indexing validates, saves, and indexes the data at the same time.
Other approaches include the use of a cache or a helper file, in which the fields are composed beforehand and then indexed accordingly, but this also takes a long time and unfortunately blocks working on the files. If we write a helper file with all fields in eXide, the whole process takes about 90 s; if we work with the fields as we would build the index, i.e. with xml:id in the helper file, about 110 s. A full re-index takes 2-4 hours. Not really understandable. We are now primarily trying to improve this using the xml:id/id() function, but we do not have much hope of improving the re-index at production scale. And if all data is re-indexed, i.e. the xml:id fields are not yet available, it is in vain anyway.

Would love to learn more about this topic and continue to share experiences.

Lars

On 03.09.24 at 00:49, Boris Lehečka wrote:

Dear Lars,

I have similar issues with indexing dictionaries: my indexing procedure asks for data from the taxonomy (like expansions of abbreviations), and at times indexing a dictionary with about 36,000 entries took a whole day.

I don't remember who (Juri Leino, I guess) pointed out to me that the index is saved only after the whole document is parsed. After moving each dictionary entry to a separate file, indexing took much less time (several hours).

However, this does not seem to be the cause of your problem.

In my opinion, your indexing code (in the module) is very complicated; sometimes it can be much simpler, for example without the explicit conversion to string (as in tei:persName[string(@ref) eq $identifier ...]), or without normalizing spaces in ip:getFullText (full-text search usually uses only the parts between spaces).

My suggestion is the following: first, create an index for persons in a separate collection (with a separate collection.xconf): compute fields with the values you will query or want to return when you index the articles (in a different collection).
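A minimal sketch of what such a first-phase person index could look like (not tested; the field names and expressions here are illustrative, based on the person documents shown further down in this thread):

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <!-- Index each cached person document on its own;
                 no cross-collection lookups happen at index time -->
            <text qname="person">
                <field name="fullname" expression="name"/>
                <field name="gndID"
                    expression="substring-after(identifier[@preferred eq 'YES'], '/gnd/')"/>
            </text>
        </lucene>
    </index>
</collection>
```

The point of this phase is that every field is computed from the person document alone, so re-indexing the persons collection stays cheap.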
And second, use Lucene and full-text search in your "index-persons" module to find data in the index from the first phase.

This is just an idea, not tested; I hope someone else is much more experienced in the magic of indexing.

Best,

Boris Lehečka

On 02.09.2024 at 16:57, Lars Scheideler wrote:

Hey,

we assume that we are not using the index in our project as intended, because when we try to build the index we have created, it takes a very long time.

We have two collections: one with a stack of 687 files in which person data is stored, and one with 400 XML files in which articles are stored.

For each person article we want certain information from the articles, and vice versa.

Person XML:

    <person xml:id="i0c9ab7e2-2e21-39ff-aea8-c56ad4702a7f" status="safe"
        modified="2024-07-30T13:26:09.154+02:00">
        <name>Marcanton Zimara</name>
        <identifier preferred="YES">https://d-nb.info/gnd/120156784</identifier>
        <alternateName>Marcusantonius Zimara</alternateName>
        <alternateName>Marcus Anthonius Zimara</alternateName>
        <alternateName>Antonius Zimara</alternateName>
        <alternateName>M. Antonius Zimarra</alternateName>
        <alternateName>Marc Antoine Zimara</alternateName>
        <alternateName>M. Anto. Zimare</alternateName>
        <alternateName>Marco A. Zimara</alternateName>
        <alternateName>Marcus A. Zimara</alternateName>
        <alternateName>Marc Ant. Zimara</alternateName>
        <alternateName>Marcantonio Zimara</alternateName>
        <alternateName>Marcus Antonius Zimara</alternateName>
        <alternateName>Marcus Antonius Zimarra</alternateName>
        <alternateName>Marcianto Zimare</alternateName>
        <alternateName>Marco Antonio Zimarra</alternateName>
        <alternateName>Marco Antonio Zimare</alternateName>
        <birthDate>1460</birthDate>
        <deathDate>1532</deathDate>
        <description>JWO</description>
        <sortableName>Zimara, Marcanton </sortableName>
    </person>

Article XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>a nihilo nihil fit</title>
                    <author>
                        <persName ref="/db/projects/jwo/data/lists/personenListe.xml#BS_d1e509"
                            xml:id="author_BS_d1e509">
                            <forename>Marcanton</forename>
                            <surname>Zimara</surname>
                        </persName>
                    </author>
                </titleStmt>
                <sourceDesc>
                    <p xml:id="p_sourceDesc_igw_tvr_pzb">born digital</p>
                </sourceDesc>
            </fileDesc>
        </teiHeader>
        <text xml:lang="de-DE" type="main">
            <body>
                <div1 xml:id="div1_d1e23_2">
                    <p xml:id="p_d1e27_1" n="1">Lorem ipsum dolor sit amet, consectetur
                        adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore
                        magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation
                        <persName xml:id="persName_sa123" ref="https://d-nb.info/gnd/120156784"
                            rend="smallcaps">Zimara</persName> ullamco laboris nisi ut aliquip
                        ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
                        voluptate velit esse cillum dolore eu <persName xml:id="persName_s123"
                            ref="https://d-nb.info/gnd/120156784">Zimara</persName> fugiat
                        nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
                        in culpa qui officia deserunt mollit anim id est laborum.</p>
                </div1>
            </body>
        </text>
    </TEI>

Collection.xconf:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:gndo="https://d-nb.info/standards/elementset/gnd#"
            xmlns:owl="http://www.w3.org/2002/07/owl#"
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:xs="http://www.w3.org/2001/XMLSchema">
            <lucene>
                <module uri="http://place.sok.org/xquery/index-persons" prefix="ip"
                    at="xmldb:exist:///db/apps/sok-application/modules/index-persons.xqm"/>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <analyzer class="org.exist.indexing.lucene.analyzers.NoDiacriticsStandardAnalyzer"
                    id="nodiacritics"/>
                <text qname="adcache">
                    <field name="basicId" expression="//person/@id/string()"/>
                    <field name="fullname" expression="string(./basic/person/name)"/>
                    <field name="gndURI"
                        expression="string(./basic/person/identifier[@preferred eq 'YES'])"/>
                    <field name="gndID"
                        expression="substring-after(./basic/person/identifier[@preferred eq 'YES']/string(), '/gnd/')"/>
                    <field name="status" expression="./basic/person/@status/string()"/>
                    <field name="articleID" expression="ip:getArticleFromPersonCache(.)"/>
                    <field name="articleRole" expression="ip:getArticleRoleFromPersonCache(.)"/>
                    <field name="fulltext" expression="ip:getFullText(.)"/>
                </text>
            </lucene>
        </index>
    </collection>

Module Functions:

    module namespace ip = "http://place.sok.org/xquery/index-persons";

    declare namespace basic = "http://place.sok.org/xquery/basic";

    declare namespace xs = "http://www.w3.org/2001/XMLSchema";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";
    declare namespace util = "http://exist-db.org/xquery/util";

    declare function ip:getArticleFromPersonCache($adcache as element()) as xs:string* {
        let $parentCollectionPath as xs:anyURI? := ip:getParentCollection($adcache),
            $basicId as xs:string := $adcache/basic/person/@id/string(),
            $identifier as xs:string? :=
                $adcache/basic/person/identifier[@preferred eq 'YES']/string(),
            $listId as xs:string? := collection($variables:jwo-lists-path)/tei:TEI//tei:person[
                basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
            $foundInDocumentIds as xs:string* :=
                if (matches($parentCollectionPath, 'prepublish'))
                then (
                    collection($parentCollectionPath)
                        /tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
                        [matches(
                            replace(
                                (normalize-space('||' || string-join(
                                    distinct-values(.//tei:persName[@ref]/@ref/string())
                                        ! replace(., '.*?#', ''), '||') || '||')
                                || normalize-space('||' || string-join(
                                    distinct-values(.//tei:persName[@source]/@source/string())
                                        ! replace(., '.*?#', ''), '||') || '||')),
                                '\|{4}', ''),
                            '\|{2}(' || $basicId || '|' || $listId || '|' || $identifier || ')\|{2}')]
                        //tei:idno/string()
                )
                else (
                    collection($parentCollectionPath)
                        /tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
                        [matches(
                            replace(
                                (normalize-space('||' || string-join(
                                    distinct-values(.//tei:persName[@ref][not(parent::editor)]/@ref/string())
                                        ! replace(., '.*?#', ''), '||') || '||')
                                || normalize-space('||' || string-join(
                                    distinct-values(.//tei:persName[@source][not(parent::editor)]/@source/string())
                                        ! replace(., '.*?#', ''), '||') || '||')),
                                '\|{4}', ''),
                            '\|{2}(' || $basicId || '|' || $listId || '|' || $identifier || ')\|{2}')]
                        //tei:idno/string()
                )
        return (
            $foundInDocumentIds
        )
    };

    declare function ip:getAuthenticatedArticleCollection($collection-name as xs:string)
            as item()* {
        if ($collection-name eq 'prepublish')
        then xmldb:xcollection($variables:jwo-prepublish-path)
        else xmldb:xcollection($variables:jwo-publish-path)
    };

    declare function ip:getPersNamesInCollectionFromCachedPerson($cached-person as element(),
            $collection-name as xs:string) as element()* {
        let $basicId := $cached-person/basic/person/@id/string()
        let $identifier := $cached-person/basic/person/identifier[@preferred eq 'YES']/string()
        let $listId := collection($variables:jwo-lists-path)
            /tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
                basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string()
        let $collection := ip:getAuthenticatedArticleCollection($collection-name)
        return (
            $collection//tei:persName[
                string(@ref) eq $identifier
                or ip:getIdFromUri(string(@ref)) eq $listId
                or substring-after(string(@ref), '#') eq $basicId
                or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId]
        )
    };

    declare function ip:getRoleFromPersName($persName as element(),
            $collection-name as xs:string) as xs:string? {
        if ($persName/ancestor::*/local-name() = 'author')
        then ( 'author' )
        else if ($persName/ancestor::*/local-name() = 'editor')
        then (
            if ($collection-name eq 'prepublish') then ( 'editor' )
            (: Ignore editors in the published case :)
            else ()
        )
        else ( 'annotated' )
    };

    declare function ip:getArticleRoleFromPersonCache($cached-person as element(),
            $collection-name as xs:string) as xs:string* {
        let $allPersNames :=
            if ($collection-name ne 'prepublish') then (
                ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
                    [not(ancestor::*/local-name() = 'editor')]
            ) else (
                ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
            )
        return (
            for $articleGroup in $allPersNames
            let $articleID := $articleGroup/ancestor::tei:TEI//tei:idno[1]
            group by $articleID
            return (
                $articleID || '@@' || string-join(distinct-values(
                    for $persName in $articleGroup
                    let $role := ip:getRoleFromPersName($persName, $collection-name)
                    order by $role
                    return $role
                ), ' ')
            )
        )
    };

    declare function ip:getParentCollection($element as node()) as xs:anyURI? {
        resolve-uri('../../', $element/base-uri())
    };

    declare function ip:getIdFromUri($uri as xs:string) as xs:string {
        substring-after($uri, '#')
    };

    declare function basic:basic-id-from-url($url as xs:string) as xs:string? {
        substring-after(substring-before($url, '?dataset'), 'persons/')
    };

    declare function ip:getFullText($element) as xs:string {
        let $parentCollection as xs:anyURI? := ip:getParentCollection($element)
        return (
            normalize-space(string-join(
                let $basicId as xs:string := $element/basic/person/@id/string(),
                    $identifier as xs:string* :=
                        $element/basic/person/identifier[@preferred eq 'YES']/string(),
                    $listId as item()* := collection($variables:lists-path)
                        /tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
                            basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
                    $element-string as xs:string* := string($element),
                    $collections as item()* := collection($parentCollection)//tei:persName[
                        string(@ref) eq $identifier
                        or ip:getIdFromUri(string(@ref)) eq $listId
                        or substring-after(string(@ref), '#') eq $basicId
                        or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId][1],
                    $element-cache-string as xs:string* :=
                        string-join(for $found-element in $collections
                            where count($found-element) > 0
                            return $found-element, ' ')
                return (
                    $element-string, $element-cache-string
                ), ' '))
        )
    };

Please help.

_______________________________________________
Exist-open mailing list
Exi...@li...
https://lists.sourceforge.net/lists/listinfo/exist-open

--
Lars Scheideler
- Scientific-Technical Staff Member -
Althochdeutsches Wörterbuch & Digital Humanities
Sächsische Akademie der Wissenschaften zu Leipzig
Karl-Tauchnitz-Straße 1
04107 Leipzig
sch...@sa...
www.saw-leipzig.de

--
Kind regards,
Claudius Teodorescu
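The second phase of the two-phase approach suggested in this thread (pre-compute person fields, then query them with Lucene while indexing articles) might look roughly like this. This is a sketch only, not tested against a real database: the collection path and the local function name are hypothetical; ft:query and ft:field are from eXist-db's Lucene module, and gndID/fullname are the field names used in the collection.xconf above:

```xquery
xquery version "3.1";

(: Instead of scanning all person documents while indexing an article,
   look the person up via the Lucene fields of the pre-built person index.
   '/db/data/persons-cache' is a hypothetical path; 'adcache' is the
   qname the thread's collection.xconf defines fields on. :)
declare namespace ft = "http://exist-db.org/xquery/lucene";

declare function local:person-name-for-gnd($gnd-id as xs:string) as xs:string* {
    for $hit in collection('/db/data/persons-cache')//adcache[
        ft:query(., <query><term field="gndID">{$gnd-id}</term></query>)]
    return ft:field($hit, 'fullname')
};

local:person-name-for-gnd('120156784')
```

The design point is the same one Boris makes: the expensive cross-collection joins move out of the index-time field expressions and into ordinary field queries, which the Lucene index can answer directly.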