From: Kenneth R. B. <krb...@gm...> - 2011-04-29 23:31:29
Newbie: splitting a dictionary into individual <entry> files, almost working.

I've got a little problem with collection names and Unicode.

Background: I've got eXist 1.4.0 installed successfully on OS X 10.6.7. Hoping to make a little XRX application. I'm trying to follow the instructions in "A Step-by-Step Introduction to XRX with eXist".

Context: The problem came up when trying to split a large monolithic XML dictionary file of the following structure

    <?xml version="1.0" encoding="UTF-8"?>
    <hd>
      <head>...</head>
      <body>
        <letters>
          <letter name="A">
            <entry>...</entry>
            <entry>...</entry>
            ...
          </letter>
          <letter name="E">
            <entry>...</entry>
            <entry>...</entry>
            ...
          </letter>
          <letter name="Ö"> <!-- N.B. the O with diaeresis here -->
            <entry>...</entry>
            <entry>...</entry>
            ...
            <!-- with more entry elements for the remaining letters -->
          </letter>
          ...
        </letters>
      </body>
    </hd>

into individual <entry> files. In the original big XML file, the entries starting with the same letter X are grouped into a <letter name="X">...</letter> element. I want to keep the entries for each letter in a separate sub-collection.

Why: The dictionary has over 32,000 entries. It seemed like a good idea to split them into individual entry files, but still grouped in multiple letter-based sub-collections, 1) to ease future editing and 2) to provide a manageable listing of entries (using a modification of list-items.xq that lists the entries starting with a specified letter).

Collection structure: In the eXist collection, I manually created the following structure

    /db/
      apps/
        hd/
          data/
            A/
            E/
            I/
            Ö/
            etc. for all the other letters

with one data/ subcollection for each letter. I created the data/ sub-collections A/, E/, etc. manually by going into http://localhost:8080/exist and selecting "Admin" and "Browse Collections", then "CreateCollection". No apparent problems; but to type the Ö (Unicode LATIN CAPITAL LETTER O WITH DIAERESIS), I had to use the OS X Unicode input method. The code point value is U+00D6.
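To confirm which code points the Ö in the source file actually consists of, a check along the following lines could be run. This is only a sketch: it assumes the dictionary is stored at /db/apps/hd/xml/hd_final.xml and that the standard XQuery 1.0 functions string-to-codepoints(), normalize-unicode() and encode-for-uri() are available in eXist 1.4.0. The <letter-check> wrapper and its attribute names are made up for illustration.

```xquery
xquery version "1.0";

(: Sketch: list every letter name with its code points, its NFC-normalized
   code points, and its percent-encoded form. The document path and the
   <letter-check> element are assumptions for illustration only. :)
for $letter in doc('/db/apps/hd/xml/hd_final.xml')/hd/body/letters/letter
let $name := string($letter/@name)
return
    <letter-check name="{$name}"
                  codepoints="{string-to-codepoints($name)}"
                  nfc-codepoints="{string-to-codepoints(normalize-unicode($name, 'NFC'))}"
                  uri-encoded="{encode-for-uri($name)}"/>
```

A precomposed Ö (U+00D6) shows up as the single code point 214, while a decomposed one (O followed by U+0308 COMBINING DIAERESIS) shows up as 79 776. If the file and the manually typed collection name differ in this respect, the two names would not match even though they look identical on screen.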
(The XML file is in UTF-8.)

I then used the following splitting script, which worked well (up to a point):

    xquery version "1.0";

    let $input-document := '/db/apps/hd/xml/hd_final.xml'
    return
    <SplitResults>{
        for $letter in doc($input-document)/hd/body/letters/letter
        let $lettername := $letter/@name (: A, E, H, I, K,L,M.N ... V, W, Y :)
        let $subcollection := concat('/db/apps/hd/data/', $lettername)
        let $output-collection := xmldb:login($subcollection, 'admin', 'myAdminPasswordHere')
        return
            for $entry in $letter/entry
            (: the unique id for each entry is the 'alph' attribute :)
            let $id := $entry/@alph
            let $docname := concat($id, '.xml')
            let $store-return := xmldb:store($subcollection, $docname, $entry)
            return
                <stored>{$id}</stored>
    }</SplitResults>

and launched it from a browser:

    http://localhost:8080/exist/rest/db/apps/hd/split/split.xq

It started working, but it choked when it got to the <letter name="Ö">...</letter> element, complaining that there was no /db/apps/hd/data/Ö/ collection to store the <entry>s in. Here's the actual error message:

    Could not locate collection: /db/apps/hd/data/Ö [at line 17, column 46]

(It's complaining at the line

    let $store-return := xmldb:store($subcollection, $docname, $entry)

in the script.)

Result: the splitting worked for other letters (e.g. the subcollection data/E/ is full of new individual entry files for words starting with E), but the script choked on Ö.

Questions

Question 1: Can collection names in eXist contain Unicode characters beyond the ASCII range?

Question 2: Is this a Unicode encoding problem? Perhaps the Ö is treated as UTF-8 in one environment/program but not in another?

Question 3: In the split.xq script, is there a way to create the subcollections under data/ programmatically (as opposed to manually creating them, as I did)? If I create the data/ subcollections inside the script, then the encoding of Ö is likely to be consistent.

Thanks,

Ken

******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT 84054
USA