From: Lars S. <sch...@sa...> - 2024-09-03 11:46:49
Hey Claudius,

thank you for the insight. But with the solution you presented, I ask myself why I should use eXist-db at all. I could then just as well use an XML parser and read the fields directly from the files at file level, or use another document-based DB that can handle XML. I want to use the features of a complete system. The detour via a custom search engine for combined fields is good in its own way, but it also feels like thinking around the corner. Surely there must be a solution to this problem within eXist-db?

Best regards
Lars

On 03.09.24 at 11:27, Claudius Teodorescu wrote:
> Dear Lars,
>
> My name is Claudius Teodorescu, and I am currently working for the
> Academy of Mainz.
>
> For publishing dictionaries (around 8, see https://clre.solirom.ro), I
> have chosen the static-website approach, along with static indexes and
> a static, browser-based search engine written in Rust and compiled to
> WebAssembly.
>
> One can see some test data from the DWDS dictionary (around 250,000
> entries, and four indexes) at
> https://claudius-teodorescu.gitlab.io/dwds-site/.
>
> Maybe this is of some help to you.
>
> I probably do not have to mention that the indexing takes seconds, as
> I have written the indexing engine in Rust. :)
>
> Best regards,
> Claudius
>
> On Tue, 3 Sept 2024 at 11:30, Lars Scheideler <sch...@sa...> wrote:
>
> Dear Boris,
>
> thank you very much for the answer.
>
> The person data and the article data are in different collections: for
> every person we have one file, and for every article we have one file.
>
> We don't have a fixed order in which data is imported or written. As I
> understand it, what you are suggesting would require all the person or
> article data to always be available before the index of the other
> collection is built, so that it can be accessed from that index.
>
> Maybe we have to, as you pointed out, test the module functions for
> better performance.
> In the past, however, after several different approaches that we have
> already tried, we have realized that querying the data, however well
> it performs in eXide, for example, is significantly slower when it is
> used for the index.
> Could it be that the structural index is not used when re-indexing?
> Our assumption was that the data is iterated over differently, or that
> new blobs for structure and content are written in addition to the
> indexing; in other words, that re-indexing validates, saves, and
> indexes the data at the same time.
>
> Other approaches include the use of a cache or a helper file, where
> the fields are composed beforehand and then indexed accordingly, but
> this also takes a long time and unfortunately also blocks working on
> the files.
> If we write a helper file with all fields in eXide, the whole process
> takes about 90 s; if we work with the fields as we would build the
> index, i.e. with xml:id in the helper file, about 110 s. A re-index,
> however, takes 2-4 hours. Not really understandable.
>
> We are now primarily trying to improve this using the xml:id/id()
> function, but we do not have much hope of improving the re-index on a
> production scale. And if all data is re-indexed, i.e. the xml:id
> fields are not yet available, it is in vain anyway.
>
> Would love to learn more about this topic and continue to share
> experiences.
>
> Lars
>
> On 03.09.2024 at 00:49, Boris Lehečka wrote:
> > Dear Lars,
> >
> > I have similar issues with indexing dictionaries: my indexing
> > procedure asks for data from the taxonomy (like expansions of
> > abbreviations), and at times indexing a dictionary with about 36,000
> > entries took a whole day.
> >
> > I don't remember who (Juri Leino, I guess) pointed out to me that
> > the index is saved only after the whole document is parsed. After
> > moving each dictionary entry to a separate file, indexing took much
> > less time (several hours).
> >
> > However, this does not seem to be the cause of your problem.
> >
> > In my opinion, your indexing code (in the module) is very
> > complicated. It can sometimes be much simpler, for example without
> > the explicit conversion to string (as in tei:persName[string(@ref)
> > eq $identifier ...]), or without normalizing spaces in
> > ip:getFullText (full-text search usually uses only the parts
> > between spaces anyway).
> >
> > My suggestion is the following: first, create an index for persons
> > in a separate collection (with a separate collection.xconf), and
> > compute fields with the values you will query or want to return
> > when you index the articles (in a different collection). Second,
> > use Lucene and full-text search in your "index-persons" module to
> > find data in the index from the first phase.
> >
> > This is just an idea, not tested; I hope someone else is much more
> > experienced in the magic of indexing.
> >
> > Best,
> >
> > Boris Lehečka
> >
> > On 02.09.2024 at 16:57, Lars Scheideler wrote:
> >> Hey,
> >>
> >> we assume that we are not using the index in our project as
> >> intended, because when we try to build the index we have created,
> >> it takes a very long time.
> >>
> >> We have two collections: one with 687 XML files in which the
> >> person data is stored, and one with 400 XML files in which the
> >> articles are stored.
> >>
> >> For the person entries we want certain information from the
> >> articles, and vice versa.
> >>
> >> Person XML:
> >>
> >> <person xml:id="i0c9ab7e2-2e21-39ff-aea8-c56ad4702a7f"
> >>         status="safe" modified="2024-07-30T13:26:09.154+02:00">
> >>     <name>Marcanton Zimara</name>
> >>     <identifier preferred="YES">https://d-nb.info/gnd/120156784</identifier>
> >>     <alternateName>Marcusantonius Zimara</alternateName>
> >>     <alternateName>Marcus Anthonius Zimara</alternateName>
> >>     <alternateName>Antonius Zimara</alternateName>
> >>     <alternateName>M. Antonius Zimarra</alternateName>
> >>     <alternateName>Marc Antoine Zimara</alternateName>
> >>     <alternateName>M. Anto. Zimare</alternateName>
> >>     <alternateName>Marco A. Zimara</alternateName>
> >>     <alternateName>Marcus A. Zimara</alternateName>
> >>     <alternateName>Marc Ant. Zimara</alternateName>
> >>     <alternateName>Marcantonio Zimara</alternateName>
> >>     <alternateName>Marcus Antonius Zimara</alternateName>
> >>     <alternateName>Marcus Antonius Zimarra</alternateName>
> >>     <alternateName>Marcianto Zimare</alternateName>
> >>     <alternateName>Marco Antonio Zimarra</alternateName>
> >>     <alternateName>Marco Antonio Zimare</alternateName>
> >>     <birthDate>1460</birthDate>
> >>     <deathDate>1532</deathDate>
> >>     <description>JWO</description>
> >>     <sortableName>Zimara, Marcanton </sortableName>
> >> </person>
> >>
> >> Article XML:
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <TEI xmlns="http://www.tei-c.org/ns/1.0">
> >>     <teiHeader>
> >>         <fileDesc>
> >>             <titleStmt>
> >>                 <title>a nihilo nihil fit</title>
> >>                 <author>
> >>                     <persName ref="/db/projects/jwo/data/lists/personenListe.xml#BS_d1e509"
> >>                               xml:id="author_BS_d1e509">
> >>                         <forename>Marcanton</forename>
> >>                         <surname>Zimara</surname>
> >>                     </persName>
> >>                 </author>
> >>             </titleStmt>
> >>             <sourceDesc>
> >>                 <p xml:id="p_sourceDesc_igw_tvr_pzb">born digital</p>
> >>             </sourceDesc>
> >>         </fileDesc>
> >>     </teiHeader>
> >>     <text xml:lang="de-DE" type="main">
> >>         <body>
> >>             <div1 xml:id="div1_d1e23_2">
> >>                 <p xml:id="p_d1e27_1" n="1">Lorem ipsum dolor sit
> >>                 amet, consectetur adipiscing elit, sed do eiusmod
> >>                 tempor incididunt ut labore et dolore magna aliqua.
> >>                 Ut enim ad minim veniam, quis nostrud exercitation
> >>                 <persName xml:id="persName_sa123"
> >>                     ref="https://d-nb.info/gnd/120156784"
> >>                     rend="smallcaps">Zimara</persName> ullamco
> >>                 laboris nisi ut aliquip ex ea commodo consequat.
> >>                 Duis aute irure dolor in reprehenderit in voluptate
> >>                 velit esse cillum dolore eu <persName
> >>                     xml:id="persName_s123"
> >>                     ref="https://d-nb.info/gnd/120156784">Zimara</persName>
> >>                 fugiat nulla pariatur. Excepteur sint occaecat
> >>                 cupidatat non proident, sunt in culpa qui officia
> >>                 deserunt mollit anim id est laborum.</p>
> >>             </div1>
> >>         </body>
> >>     </text>
> >> </TEI>
> >>
> >> Collection.xconf:
> >>
> >> <collection xmlns="http://exist-db.org/collection-config/1.0">
> >>     <index xmlns:gndo="https://d-nb.info/standards/elementset/gnd#"
> >>            xmlns:owl="http://www.w3.org/2002/07/owl#"
> >>            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> >>            xmlns:xs="http://www.w3.org/2001/XMLSchema">
> >>         <lucene>
> >>             <module uri="http://place.sok.org/xquery/index-persons" prefix="ip"
> >>                     at="xmldb:exist:///db/apps/sok-application/modules/index-persons.xqm"/>
> >>             <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> >>             <analyzer class="org.exist.indexing.lucene.analyzers.NoDiacriticsStandardAnalyzer"
> >>                       id="nodiacritics"/>
> >>             <text qname="adcache">
> >>                 <field name="basicId" expression="//person/@id/string()"/>
> >>                 <field name="fullname" expression="string(./basic/person/name)"/>
> >>                 <field name="gndURI" expression="string(./basic/person/identifier[@preferred eq 'YES'])"/>
> >>                 <field name="gndID" expression="substring-after(./basic/person/identifier[@preferred eq 'YES']/string(), '/gnd/')"/>
> >>                 <field name="status" expression="./basic/person/@status/string()"/>
> >>                 <field name="articleID" expression="ip:getArticleFromPersonCache(.)"/>
> >>                 <field name="articleRole" expression="ip:getArticleRoleFromPersonCache(.)"/>
> >>                 <field name="fulltext" expression="ip:getFullText(.)"/>
> >>             </text>
> >>         </lucene>
> >>     </index>
> >> </collection>
> >>
> >> Module functions:
> >>
> >> module namespace ip = "http://place.sok.org/xquery/index-persons";
> >>
> >> declare namespace basic = "http://place.sok.org/xquery/basic";
> >> declare namespace xs = "http://www.w3.org/2001/XMLSchema";
> >> declare namespace tei = "http://www.tei-c.org/ns/1.0";
> >> declare namespace util = "http://exist-db.org/xquery/util";
> >>
> >> declare function ip:getArticleFromPersonCache($adcache as element()) as xs:string* {
> >>     let $parentCollectionPath as xs:anyURI? := ip:getParentCollection($adcache),
> >>         $basicId as xs:string := $adcache/basic/person/@id/string(),
> >>         $identifier as xs:string? := $adcache/basic/person/identifier[@preferred eq 'YES']/string(),
> >>         $listId as xs:string? := collection($variables:jwo-lists-path)/tei:TEI//tei:person[
> >>             basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
> >>         $foundInDocumentIds as xs:string* :=
> >>             if (matches($parentCollectionPath, 'prepublish'))
> >>             then
> >>                 collection($parentCollectionPath)/tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
> >>                     [matches(replace((normalize-space('||'||string-join(distinct-values(.//tei:persName[@ref]/@ref/string()) ! replace(.,'.*?#',''), '||')||'||')||normalize-space('||'||string-join(distinct-values(.//tei:persName[@source]/@source/string()) ! replace(.,'.*?#',''), '||')||'||')),'\|{4}',''),'\|{2}('||$basicId||'|'||$listId||'|'||$identifier||')\|{2}')]//tei:idno/string()
> >>             else
> >>                 collection($parentCollectionPath)/tei:TEI[./tei:teiHeader//tei:idno[1]/string() ne '']
> >>                     [matches(replace((normalize-space('||'||string-join(distinct-values(.//tei:persName[@ref][not(parent::editor)]/@ref/string()) ! replace(.,'.*?#',''), '||')||'||')||normalize-space('||'||string-join(distinct-values(.//tei:persName[@source][not(parent::editor)]/@source/string()) ! replace(.,'.*?#',''), '||')||'||')),'\|{4}',''),'\|{2}('||$basicId||'|'||$listId||'|'||$identifier||')\|{2}')]//tei:idno/string()
> >>     return $foundInDocumentIds
> >> };
> >>
> >> declare function ip:getAuthenticatedArticleCollection($collection-name as xs:string) as item()* {
> >>     if ($collection-name eq 'prepublish')
> >>     then xmldb:xcollection($variables:jwo-prepublish-path)
> >>     else xmldb:xcollection($variables:jwo-publish-path)
> >> };
> >>
> >> declare function ip:getPersNamesInCollectionFromCachedPerson($cached-person as element(),
> >>         $collection-name as xs:string) as element()* {
> >>     let $basicId := $cached-person/basic/person/@id/string()
> >>     let $identifier := $cached-person/basic/person/identifier[@preferred eq 'YES']/string()
> >>     let $listId := collection($variables:jwo-lists-path)/tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
> >>         basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string()
> >>     let $collection := ip:getAuthenticatedArticleCollection($collection-name)
> >>     return
> >>         $collection//tei:persName[
> >>             string(@ref) eq $identifier
> >>             or ip:getIdFromUri(string(@ref)) eq $listId
> >>             or substring-after(string(@ref), '#') eq $basicId
> >>             or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId]
> >> };
> >>
> >> declare function ip:getRoleFromPersName($persName as element(), $collection-name as xs:string) as xs:string? {
> >>     if ($persName/ancestor::*/local-name() = 'author')
> >>     then 'author'
> >>     else if ($persName/ancestor::*/local-name() = 'editor')
> >>     then (
> >>         if ($collection-name eq 'prepublish')
> >>         then 'editor'
> >>         (: ignore editors in the published case :)
> >>         else ()
> >>     )
> >>     else 'annotated'
> >> };
> >>
> >> declare function ip:getArticleRoleFromPersonCache($cached-person as element(),
> >>         $collection-name as xs:string) as xs:string* {
> >>     let $allPersNames :=
> >>         if ($collection-name ne 'prepublish')
> >>         then ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
> >>             [not(ancestor::*/local-name() = 'editor')]
> >>         else ip:getPersNamesInCollectionFromCachedPerson($cached-person, $collection-name)
> >>     return
> >>         for $articleGroup in $allPersNames
> >>         let $articleID := $articleGroup/ancestor::tei:TEI//tei:idno[1]
> >>         group by $articleID
> >>         return
> >>             $articleID || '@@' || string-join(distinct-values(
> >>                 for $persName in $articleGroup
> >>                 let $role := ip:getRoleFromPersName($persName, $collection-name)
> >>                 order by $role
> >>                 return $role
> >>             ), ' ')
> >> };
> >>
> >> declare function ip:getParentCollection($element as node()) as xs:anyURI? {
> >>     resolve-uri('../../', $element/base-uri())
> >> };
> >>
> >> declare function ip:getIdFromUri($uri as xs:string) as xs:string {
> >>     substring-after($uri, '#')
> >> };
> >>
> >> declare function basic:basic-id-from-url($url as xs:string) as xs:string? {
> >>     substring-after(substring-before($url, '?dataset'), 'persons/')
> >> };
> >>
> >> declare function ip:getFullText($element) as xs:string {
> >>     let $parentCollection as xs:anyURI? := ip:getParentCollection($element)
> >>     return
> >>         normalize-space(string-join(
> >>             let $basicId as xs:string := $element/basic/person/@id/string(),
> >>                 $identifier as xs:string* := $element/basic/person/identifier[@preferred eq 'YES']/string(),
> >>                 $listId as item()* := collection($variables:lists-path)/tei:TEI/tei:text[1]/tei:body[1]/tei:listPerson[1]/tei:person[
> >>                     basic:basic-id-from-url(string(@sameAs)) eq $basicId]/@xml:id/string(),
> >>                 $element-string as xs:string* := string($element),
> >>                 $collections as item()* := collection($parentCollection)//tei:persName[
> >>                     string(@ref) eq $identifier
> >>                     or ip:getIdFromUri(string(@ref)) eq $listId
> >>                     or substring-after(string(@ref), '#') eq $basicId
> >>                     or substring-before(substring-after(string(@source), 'persons/'), '.xml') eq $basicId][1],
> >>                 $element-cache-string as xs:string* :=
> >>                     string-join(for $found-element in $collections
> >>                         where count($found-element) > 0
> >>                         return $found-element, ' ')
> >>             return ($element-string, $element-cache-string),
> >>         ' '))
> >> };
> >>
> >> Please help.
> >
> > _______________________________________________
> > Exist-open mailing list
> > Exi...@li...
> > https://lists.sourceforge.net/lists/listinfo/exist-open
>
> --
> Lars Scheideler
> - scientific-technical staff member -
> Althochdeutsches Wörterbuch & Digital Humanities
>
> Sächsische Akademie der Wissenschaften zu Leipzig
> Karl-Tauchnitz-Straße 1
> 04107 Leipzig
>
> sch...@sa...
> www.saw-leipzig.de
>
> --
> Kind regards,
> Claudius Teodorescu

--
Lars Scheideler
- scientific-technical staff member -
Althochdeutsches Wörterbuch & Digital Humanities

Sächsische Akademie der Wissenschaften zu Leipzig
Karl-Tauchnitz-Straße 1
04107 Leipzig

sch...@sa...
www.saw-leipzig.de
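
Boris's two-phase suggestion from earlier in the thread could be sketched roughly as below. This is an untested sketch: the persons collection path, the field names, and the helper function ip:lookup-person-by-gnd are illustrative assumptions, not part of the project, and it presumes an eXist-db version (5.x+) where Lucene fields can be queried with the field:term syntax of ft:query.

```xquery
(: Untested sketch of the two-phase indexing idea. Paths and the
   function name ip:lookup-person-by-gnd are hypothetical. :)

(: Phase 1: a separate collection.xconf for the persons collection,
   with plain XPath field expressions and no module calls, so that
   indexing each person file stays cheap:

       <collection xmlns="http://exist-db.org/collection-config/1.0">
           <index>
               <lucene>
                   <text qname="person">
                       <field name="gndID"
                              expression="substring-after(identifier[@preferred eq 'YES'], '/gnd/')"/>
                       <field name="fullname" expression="name"/>
                   </text>
               </lucene>
           </index>
       </collection>
:)

(: Phase 2: while indexing an article, resolve a person through the
   pre-built persons index with a Lucene field query, instead of
   scanning and regex-matching the whole article collection for every
   person entry. :)
declare function ip:lookup-person-by-gnd($gndId as xs:string) as element()* {
    collection('/db/apps/sok-application/data/persons')
        /person[ft:query(., 'gndID:' || $gndId)]
};
```

If the field lookup holds up in practice, the expensive scan in ip:getArticleFromPersonCache would shrink to one indexed query per reference, which is the point of Boris's two-phase proposal.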