This Wiki article concerns X-definition's support for Xpath 3.1. We'll need to proceed just a little differently than we have elsewhere in the Wiki, where we could afford to rely entirely on the X-definition user documentation's explanations when considering whether or not our code would likely work.
As when we learned how to call Java language methods from X-script, we'll require the classpath (-cp) option when starting X-definition. Because that is not new (in this project's context, at least), it is the lesser of two concerns. The other has entirely to do with XML namespaces and is best addressed first.
The XML source files used in this section of the Wiki are those available from Columbia University Libraries. The preceding link is the general introduction, and the download page is here. We already used files that deploy the very same XML vocabulary during the latter parts of the tutorial. The difference is that their sole purpose there was, practically speaking, to demonstrate that X-definition runs as expected irrespective of the size of its input in bytes. The files from Germany's national library were enormous and had by and large been chosen to make that point, whereas the files from the American college that we'll use here are much smaller. The decompressed size of the file that is currently the largest is something under 2 gigabytes, and the one that is currently the smallest has a decompressed size of about 190 megabytes.
On pages 17 and 18 of the tutorial, I took note of a peculiarity of the MARC21-XML document that was under consideration, which was that its top-level element declared a namespace where the XML file used in the preceding sections had not. When we next encountered top-level namespace declarations, in this Wiki's general introduction and also in the article on calling Java methods from X-script, we declared them just as they had occurred in our source XML documents and found that our code ran unhindered. The XML xmlns attribute can be used at most once on a tag without a qualifier (that is, as a default namespace declaration), and the authors of the XML that was at our disposal had declined to use it in that manner at all. As we have seen, and will see in perpetuity, with the MARC21-XML documents that is not so.
Once you consider the responses to the question on Stack Overflow that was posted here, I think that you will find yourself in as good a position as anyone to consider the practical implications of using the xmlns attribute to declare a namespace that descendant elements silently "inherit" (what is brought to light can depend on whether your application produces markup), and perhaps the implications of doing so in a part of the document where it will not necessarily stand out. (Here, it looked as though it could have been done partly with a view to validating the XML document's data should an application require it.) I think it will prove sufficient to point out that the following runs only partly as expected:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="collection" xmlns="http://www.loc.gov/MARC21/slim"> <xd:declaration> </xd:declaration> <collection> <record xd:script="occurs +;options ignoreOther;init{outln(xpath('\'oh no!\' => substring-before(\'!\')'));}forget;"> <leader xd:script="occurs +"> string; </leader> <controlfield tag="string;"> string; </controlfield> finally outln(xpath("controlfield[@tag='001']/text()")); </record> </collection> </xd:def>
(If you wish to run it [I've named the file invisible.xdef
], attending to the classpath will be necessary [more on which later, as promised]):
java -cp xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef invisible.xdef Columbia-extract-20210630-020.xml
The Xpath expression that depended on a literal (\'oh no!\' => substring-before(\'!\')
, adapted from an example found here) evaluated to a sequence of items as anticipated, but the expression that required data from the source XML in order to build one (controlfield[@tag='001']/text()
) did not.
This code is necessary instead:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <xd:declaration> </xd:declaration> <ls:collection> <ls:record xd:script="occurs +;options ignoreOther;init{outln(xpath('\'oh no!\' => substring-before(\'!\')'));}forget;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield tag="string;"> string; </ls:controlfield> finally outln(xpath("ls:controlfield[@tag='001']/text()")); </ls:record> </ls:collection> </xd:def>
The expedient to which we have resorted (to wit, a prefix [ls] assigned on an ad hoc, one-off basis, rather than assigned consequent on ordinary assumptions about XML authoring) is effective because the descendant elements all inherited the Uniform Resource Identifier (URI) no matter what. Our applications already have us covered inasmuch as every last element prospectively is in a namespace. For better or for worse, our share of the work is to assign a binding on which our applications will subsequently premise one-to-many relationships as needed. If you would like, look at our source XML as having availed itself of a feature: the authors were able to economize on bytes by limiting the elements' identifiers to local names, yet remained in a position to furnish a unique value (the URI) on which other authors coding in XML to similar ends could fall back in case it appeared advantageous that two or more vocabularies overlap.
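The same behaviour can be reproduced outside X-definition. The following is a minimal sketch that uses only the XPath 1.0 engine bundled with the JDK (not Saxon, and not X-definition's xpath() function); the class name DefaultNamespaceDemo and the tiny sample record are made up for illustration. It evaluates the unprefixed and the prefixed form of the expression against a document whose root element declares the MARC21 namespace as a default.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {
    static final String MARC_NS = "http://www.loc.gov/MARC21/slim";

    // Hypothetical stand-in for a MARC21-XML file: the namespace is declared
    // once, without a prefix, and silently inherited by the descendants.
    static final String XML =
        "<collection xmlns=\"" + MARC_NS + "\"><record>"
        + "<controlfield tag=\"001\">12345</controlfield>"
        + "</record></collection>";

    /** Evaluates an XPath 1.0 expression against XML and returns its string value. */
    public static String evaluate(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // without this the parser ignores namespaces
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the ad hoc prefix "ls" to the URI found in the source document.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "ls".equals(prefix) ? MARC_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return MARC_NS.equals(uri) ? "ls" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    String p = getPrefix(uri);
                    return p == null ? Collections.<String>emptyIterator()
                                     : Collections.singletonList(p).iterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names address elements in no namespace, so nothing is found.
        System.out.println("[" + evaluate("/collection/record/controlfield[@tag='001']") + "]");
        // Prefixed names address elements in the MARC21 namespace.
        System.out.println("[" + evaluate("/ls:collection/ls:record/ls:controlfield[@tag='001']") + "]");
    }
}
```

The prefix chosen in the source document (here, none at all) plays no part in the lookup: only the URI that the prefix is bound to matters, which is why a one-off prefix such as ls suffices.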
Before moving along, I think it is a good idea to point out that if you attempt to run an XSL transformation without even declaring the namespace at the top level of your stylesheet as you found it declared in the source MARC21-XML, there will be output, but it will not look remotely like what you might expect. If you do similarly in an X-definition (.xdef) file, X-definition will hang, and pressing CTRL + C will be necessary in order to shut down the Java Virtual Machine and thereby stop the application.
Furthermore, in either XSLT or X-definition you will need to take advantage of qualified names as described a moment ago for as long as we're on MARC21-XML (as we are here). In XSLT, you don't necessarily have to do so from the top on down: the binding is needed only where your Xpath expressions refer to what in the source document had looked just like local names. I got away with doing this:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"> <xsl:output method="text"/> <xsl:template match="/"> <xsl:for-each xmlns:ls="http://www.loc.gov/MARC21/slim" select="//ls:record"> <xsl:text>ok </xsl:text> </xsl:for-each> </xsl:template> </xsl:stylesheet>
That shall not prove workable when there's an xd:def
element in play rather than an xsl:stylesheet
element. You can take advantage of qualified names in the same manner as the above code, but the top-level xd:def
tag is, one way or another, going to require you to furnish it the URI (http://www.loc.gov/MARC21/slim
) found in your source XML. Again, X-definition will hang up unless you do; but you can put off repeating the URI and binding it to a prefix until the code you are writing is about to rely on you to ensure that it can evaluate your Xpath expressions meaningfully.
We can now revisit the Java classpath. In order to use Xpath 3.1, the Saxon XSLT processor absolutely must be on the classpath. I demonstrated as much earlier, when I furnished the command needed to run code confirming that Xpath 3.1 was supported. The following could be more appropriate:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="dummy" root="code"> <xd:declaration> external method { boolean org.xdef.xml.KXpathExpr.isXPath2(); } </xd:declaration> <code xd:script="occurs +;finally outln(isXPath2())"/> </xd:def>
Where we tried earlier to learn whether Xpath 3.1 was supported by inviting it to evaluate an expression that included a recent addition (the arrow =>
operator), we do so above by calling the X-definition API. I've named the file dummy.xdef
, and running it takes nothing more than this:
echo "<code/>" | java -cp .:xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef dummy.xdef /dev/stdin
It ought to print true
provided that nothing went wrong. Assuming it has worked, I still think one can do worse than play it safe and test the API call by removing Saxon from the classpath:
echo "<code/>" | java -cp .:xdef-beginner.jar xdef dummy.xdef /dev/stdin
When I ran that, false
was printed rather than true
.
The following initializer block from one of the files (org.xdef.xml.KXpathExpr) comprising X-definition's source code is what looks for Saxon's Xpath implementation on your classpath:
static {
    XPathFactory x;
    try {
        Class<?> cls = Class.forName("net.sf.saxon.xpath.XPathFactoryImpl");
        x = (XPathFactory) cls.getConstructor().newInstance();
    } catch (Exception ex) {
        x = null;
    } catch (Error ex) {
        x = null;
    }
    XPF = (x == null) ? XPathFactory.newInstance() : x;
    XP2 = x != null;
}
You can read about the background to code like the above on the Saxonica website.
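If you would like to check the classpath yourself before starting X-definition, the same kind of probe can be written as a standalone program. This is only a sketch of the technique (loading a class by name and treating failure as absence); the class SaxonProbe and its method are hypothetical names of my own and are not part of X-definition's API.

```java
public class SaxonProbe {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not present, or present but unusable: either way, report false.
            return false;
        }
    }

    public static void main(String[] args) {
        // The class name is the one that X-definition's initializer block looks for.
        System.out.println("Saxon XPathFactory found: "
            + isOnClasspath("net.sf.saxon.xpath.XPathFactoryImpl"));
    }
}
```

Run it once with and once without the Saxon jar after -cp, and the printed value should change in the same way that the output of dummy.xdef did.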
A glance at the pieces of X-definition code introduced up to this point could suggest that, when it suits you, you can use X-definition's API as a front end of sorts if you plan on studying Xpath with a view to going a little beyond things like axis specifiers and path operators. It is my understanding that .NET and C# only really support Xpath 1.0 and certain proprietary extensions, and that migrating can entail something of an effort, if only because of how many years separate Xpath 3.1 from Xpath's beginnings. You could already have used an XSLT stylesheet as a front end to Xpath in order to practice the syntax and to meet the more recent keywords, operators, and functions. Where X-definition is concerned, the code you were shown earlier ostensibly to that end unfortunately proves a little facile.
Instead, something more like this is in order:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="frontend" root="code"> <xd:declaration> String myxpathexpression = "let $b := 'away I go' return (0 to string-length($b)) ! (' ' || substring($b,1,string-length($b) - .))"; String valueofmyxpathexpression = "serialize(" + myxpathexpression + ")"; </xd:declaration> <code xd:script="occurs +;init {outln(xpath(valueofmyxpathexpression));}"/> <!--<code xd:script="occurs +;init {outln(xpath(myxpathexpression));}"/>--> </xd:def>
But it leaves the door open to coding that isn't very good. Note the link to Stack Overflow embedded as a comment in the following code, which one way or another is an improvement:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="frontend" root="code"> <xd:declaration> <!-- https://stackoverflow.com/questions/68171810/does-x-definition-possess-an-instruction-similar-to-xslvalue-of-in-xslt --> String myxpathexpression = "string-join(let $b := 'away I go' return (0 to string-length($b)) ! (' ' || substring($b,1,string-length($b) - .)), ' ')"; </xd:declaration> <code xd:script="occurs +;init {outln(xpath(myxpathexpression));}"/> </xd:def>
You can run either of the preceding examples this way (I'm using the filename frontend.xdef
):
echo "<code/>" | java -cp .:xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef frontend.xdef /dev/stdin
What has bearing on the results we find (they are the same, as long as the commenting in the first example isn't reversed) is that the xpath()
function only evaluates an Xpath expression in order to populate a Container
, which is putatively X-definition's version of what is known in Xpath as a sequence. The function itself will proceed to populate the Container
only as long as it's getting the data from what in the X-definition model is referred to as an Element
or something covariant with one. We passed the xpath()
function an Xpath expression as its only parameter, but a second parameter was implied, and it was the current element (code
). The xpath()
function is overloaded, so if a second parameter is specified, it will use that where it would otherwise have used the current element: but that particular second parameter does need to be roughly the same kind of data as the current element. If it is not, a Container
comprising one item, the first in the sequence produced when Saxon evaluated the Xpath expression, will be returned by xpath()
.
All that must mean that although X-definition supports Xpath 3.1, there is a catch? Not so. A certain perspective on the scenario that the above paragraph just presented can be won by recalling that a regular expression such as (^(\D.*?)$\n)+
won't usually give your code access to anything but the very last group of one or more. .NET will hand them over, but it doesn't follow that Python or Java have bugs.
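To see the regular-expression analogy in action, here is a small Java sketch. The class name and the three-line sample input are made up; the pattern is the one quoted above. A repeated capturing group keeps only its final repetition, so collecting every line takes a loop over a simpler pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // The pattern from the text: one or more lines that each start with a non-digit.
    static final Pattern REPEATED = Pattern.compile("(^(\\D.*?)$\\n)+", Pattern.MULTILINE);
    // The same line pattern without the repetition, for use in a find() loop.
    static final Pattern SINGLE = Pattern.compile("^(\\D.*?)$", Pattern.MULTILINE);

    /** Returns what group 2 holds after the repeated pattern has matched. */
    public static String lastCapture(String input) {
        Matcher m = REPEATED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    /** Collects every line by matching the single-line pattern repeatedly. */
    public static List<String> allCaptures(String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = SINGLE.matcher(input);
        while (m.find()) {
            lines.add(m.group(1));
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "alpha\nbeta\ngamma\n";
        System.out.println(lastCapture(input));  // only the final repetition survives
        System.out.println(allCaptures(input));  // all three lines
    }
}
```

Neither result is a bug. The first call asks for one value and gets one; the second asks for a list and does the iteration itself.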
It is perfectly true that in an XSLT 3.0 stylesheet, an instruction dedicated to producing output consequent on the evaluation of an Xpath expression always is available:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"> <xsl:output method="text"/> <xsl:template match="/"> <xsl:value-of select="let $b := 'away I go' return (0 to string-length($b)) ! (' ' || substring($b,1,string-length($b) - .))"/> <xsl:text> </xsl:text> </xsl:template> </xsl:stylesheet>
That is convenient. The home truth is that an Xpath expression evaluates to a sequence of zero or more items. If the method from your API that you wish to target handles a sequence of items, as X-definition will provided that the items are obtained from an XML document or a fragment thereof, your Xpath expression can produce most anything. If, however, you have reason to think that your target method will only work with one item, or will expect only one, your Xpath expression needs to incorporate a function that reduces the sequence to a single item (string-join(), for example), as the person who answered the question on Stack Overflow understood where I had not.
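The distinction between asking for one value and asking for a whole sequence can be sketched with the JDK's built-in XPath 1.0 engine, which behaves comparably in this one respect: requesting a string from an expression that selects several nodes yields the first node only. This is an analogy under stated assumptions, not a description of how X-definition or Saxon is implemented; the class name and the sample 300 field are made up.

```java
import java.io.StringReader;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SequenceVersusSingleValue {
    // Hypothetical physical-description field with two subfields (no namespace).
    static final String XML =
        "<datafield tag=\"300\">"
        + "<subfield code=\"a\">xii, 250 p. ;</subfield>"
        + "<subfield code=\"c\">24 cm.</subfield>"
        + "</datafield>";

    private static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
    }

    /** Asks for a single string: only the first selected node contributes. */
    public static String asSingleString(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, parse());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Asks for every selected node and joins their text explicitly. */
    public static String joined(String expression, String separator) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, parse(), XPathConstants.NODESET);
            StringJoiner joiner = new StringJoiner(separator);
            for (int i = 0; i < nodes.getLength(); i++) {
                joiner.add(nodes.item(i).getTextContent());
            }
            return joiner.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(asSingleString("/datafield/subfield"));
        System.out.println(joined("/datafield/subfield", " "));
    }
}
```

In Xpath 3.1 the joining step can live inside the expression itself, which is what wrapping the expression in string-join() accomplishes in the X-definition examples.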
Our use of the xpath() function, which is kind of ubiquitous in X-script (it also figures in code for generating XML markup in one's output), will be trouble-free (no further tedious explanations!) from here on in because we will be working with items obtained from our XML. Using an X-definition file to test your Xpath expressions before you begin using them on your XML's data remains an option now that you understand how to make sure that doing so works. The Saxonica documentation is nothing if not a ready reference where the functions are concerned, while the W3C recommendations can prove helpful in the same way for the syntax. The Wilfried Grupe website (in German) seems to me a promising means of rounding out one's knowledge, as does the Altova training page.
You may have noticed by now that in X-definition code, Xpath is not necessarily a hallmark the way that it is in the XSL context. This project is big and fat with code that served its purpose without using Xpath.
What does amount to a potential hallmark of X-definition is X-script's forget command, which I've used pervasively. Because there isn't exactly a prescribed order for reading this Wiki, I will note that this part of it is the first to incorporate Xpath, and also the first to consider scenarios where the use of forget could be obviated. The forget command constrains us to writing code that works on the current element or its descendants, whereas an Xpath expression used on the top-level element of the section of our X-definition code that follows markup like xd:def and xd:declaration (the element whose name is likely spelled exactly the same as the value of our xd:def element's root attribute) cannot but work on a whole lot more.
This time around, there genuinely is a catch. What your code ultimately can do depends on the heap memory it requires once you run it. It doesn't quite depend on the memory that your system makes available, because the Java Virtual Machine, as it has been run for these examples, does not take advantage of more than 2 gigabytes. The limitation is the same in kind as what we find when running XSL transformations. Unfortunately, the threshold on the file size looks to me to be lower when running X-definition without the forget command than would be expected when using XSLT instead.
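If you are unsure how much heap your own Java Virtual Machine is prepared to use, it can be asked directly. The following sketch relies only on the standard Runtime class; the class name HeapLimit is made up.

```java
public class HeapLimit {

    /** Returns the maximum heap this JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap available to this JVM: "
            + maxHeapMegabytes() + " MB");
    }
}
```

Running it with and without an option such as -Xmx2G shows what ceiling applies on your machine before you commit to processing a large file without forget.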
The following code is intended to replicate a typical use of XSLT, where a template is declared with a value beginning with /
assigned to the match
attribute. The respective subfields of each record that hold the title proper will be sorted and printed out. Our Xpath expression is located in the xd:script
attribute of the element identified as root
in our top-level xd:def
element, namely collection
, so the entire document is in effect in scope:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <ls:collection xd:script="finally outln(xpath('for-each(sort(ls:record/ls:datafield[@tag=\'245\'][1]/ls:subfield[@code=\'a\'][1]),function($a){$a/text()})'));"> <ls:record xd:script="occurs +;options ignoreOther;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="string;" ind2="string;" xd:script="occurs +;"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> </ls:record> </ls:collection> </xd:def>
If you do this (the XML source file is located here), you will run out of heap memory:
java -Xmx2G -cp .:xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef noforget.xdef columbia/Columbia-extract-20210630-020.xml
It is likely that there are still sufficient resources to be able to do something like this instead:
cat closingcollectiontag
# above command prints </collection>
head -n 2189793 columbia/Columbia-extract-202106300.xml | cat - closingcollectiontag | java -cp .:xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef noforget.xdef /dev/stdin
The head
command reduced the size of the source XML by about half (from about 189 megabytes to about 90 megabytes).
Our code as of now is using no fewer than two programming languages that are on a par with bash or C, which has implications. The =
operator cannot be used in X-script to effect comparisons as it can in Xpath. Furthermore, the Xpath expression we have been using, for-each(sort(ls:record/ls:datafield[@tag=\'245\'][1]/ls:subfield[@code=\'a\'][1]),function($a){$a/text()})
, incorporated the sort()
function introduced in Xpath 3.1, where our grasp of X-script's API could have moved us to consider using a function from it named, but by no means declared, identically to Xpath's. Although I have more or less promised that this article would be less tedious the further on that you read, X-script's sort()
method is overloaded to accept Xpath expressions, so at least some discussion of it need not be avoided.
The sort()
method as understood in X-Script is called by a Container
, a type of data peculiar to X-script. The xpath()
function, which we will often use, returns a Container
. Method chaining seems to me a more likely scenario for calling the sort()
function than does calling it by means of a variable bound to a Container
. However, with our MARC21-XML it is more likely than it is obvious:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <ls:collection xd:script="finally outln(xpath('ls:record/ls:datafield[@tag=\'245\'][1]/ls:subfield[@code=\'a\'][1]').sort('text()'));"> <ls:record xd:script="occurs +;options ignoreOther;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="string;" ind2="string;" xd:script="occurs +;"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> </ls:record> </ls:collection> </xd:def>
The above will appear to have worked if you run it as you ran the preceding example, but it worked only because the Xpath expression passed as a potential key to sort()
didn't depend on an XML identifier, which is to say, on either a local name or on a qualified name. That is less than wonderful: if I were interested in preserving the datafield
/ ls:datafield
elements, and not only the subfield
/ls:subfield
elements (in case I wanted the 245 subfields besides a
later on), and accordingly were to break up and subsequently chain back together the Xpath expression to that specific end, Saxon would present an error and X-definition would stop running. Above, the Container
built by the xpath()
function consequent on the Xpath expression passed it as a parameter having evaluated to a sequence of items consisted of document fragments, to wit, the subfield
/ls:subfield
elements. Each time that the Xpath expression passed to X-script's sort()
function was evaluated, the context had been the corresponding document fragment in the Container
. However, the binding of the ls
namespace to the http://www.loc.gov/MARC21/slim
URI that each of the document fragments still possessed owing to inheritance already was no longer in scope. Were I to have passed an Xpath expression to X-script's sort()
function that contained any of the qualified names found in the Xpath expression that had been passed to xpath()
earlier on in the chain, Saxon would have refused to run any further. Were I to subsequently remove the ls
prefix from the Xpath expression passed to X-script's sort()
function, the code would finish running but the Xpath expression still would not be evaluated, just as that passed to the xpath()
function would not have been evaluated had it not contained the ls
prefix. The document fragments would finally be output in document order rather than sorted.
As you could well already know, it is by no means impossible to do as I have just been proposing:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <ls:collection xd:script="finally outln(xpath('ls:record/ls:datafield[@tag=\'245\'][1]').sort('self::element()[name()=\'datafield\']/child::element()[name()=\'subfield\' and attribute::code=\'a\'][1]'));"> <ls:record xd:script="occurs +;options ignoreOther;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="string;" ind2="string;" xd:script="occurs +;"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> </ls:record> </ls:collection> </xd:def>
In contrast to the Xpath expression that was passed to the xpath()
function, that passed to X-script's sort()
function above is constructed to the exclusion of enhancements and abbreviated syntax:
self::element()[name()='datafield']/child::element()[name()='subfield' and attribute::code='a'][1]
(The abbreviation @ for the attribute axis still would have worked.) The way to demonstrate that it is effective is to replace the local names used in the predicates (datafield, subfield) with arbitrary values like toad or frog.
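The toad-and-frog test can also be tried in isolation. The sketch below uses the JDK's XPath 1.0 engine, so the element() kind test is replaced by the wildcard *, which is my substitution; the name() predicates are the technique shown above. No namespace binding is supplied at all. The class name and the sample record are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NameTestDemo {
    // Hypothetical record in the MARC21 namespace, declared as a default.
    static final String XML =
        "<collection xmlns=\"http://www.loc.gov/MARC21/slim\"><record>"
        + "<datafield tag=\"245\">"
        + "<subfield code=\"a\">Moby Dick</subfield>"
        + "<subfield code=\"b\">or, The whale</subfield>"
        + "</datafield></record></collection>";

    /** Selects the first code="a" child whose name() equals the given string. */
    public static String firstSubfieldA(String elementName) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XML)));
            // name() compares against the element name as written in the source,
            // so no prefix-to-URI binding is needed for the lookup.
            String expr = "//*[name()='datafield']/*[name()='" + elementName
                        + "' and @code='a'][1]";
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + firstSubfieldA("subfield") + "]");
        System.out.println("[" + firstSubfieldA("toad") + "]");
    }
}
```

The trade-off is that a name() test ignores the namespace URI altogether, so it would also match a like-named element from some other vocabulary.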
In circumstances like these, you also can still chain X-script's sort()
function to the xpath()
function with a view to using it as one sometimes might the <xsl:sort>
instruction familiar from XSLT. The idea is to pass the shorthand for the context item (.
), which below will be a text node rather than an element or an attribute, to sort()
:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <ls:collection xd:script="finally outln(xpath('ls:record/ls:datafield[@tag=\'245\'][1]/ls:subfield[@code=\'a\'][1]/text()').sort('.'));"> <ls:record xd:script="occurs +;options ignoreOther;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="string;" ind2="string;" xd:script="occurs +;"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> </ls:record> </ls:collection> </xd:def>
This does not amount to a workaround: the output is not of the same kind that it had been in our previous example. The difference is that the key is implied by the Xpath expression passed to xpath()
. Efforts to specify it more meaningfully than by .
when passing an Xpath expression to X-script's sort()
function have been completely abandoned. If you run the above code without passing any parameter at all to sort()
, the results should look the same. When X-script's sort()
method is called without a parameter, the output will depend on whether the items in the Container
have implemented the Java programming language's Comparable interface (please see the X-definition 4.0 User Manual, pp. 190-191).
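For readers who have not met the Comparable interface, the difference between sorting by natural order and sorting by an explicit key can be sketched in plain Java. This is an analogy only: it uses the standard collections API rather than X-script's Container, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class NaturalVersusKeyedSort {

    /** Sorts by natural order, which works because String implements Comparable. */
    public static List<String> naturalOrder(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    /** Sorts by an explicit key, here the lower-cased form of each item. */
    public static List<String> byLowerCaseKey(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(Comparator.comparing((String s) -> s.toLowerCase(Locale.ROOT)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("banana", "Cherry", "apple");
        System.out.println(naturalOrder(titles));   // upper case sorts before lower case
        System.out.println(byLowerCaseKey(titles)); // the key ignores case
    }
}
```

Passing an Xpath expression to X-script's sort() plays a role comparable to the explicit key, while calling it without a parameter leaves the ordering to the items themselves.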
Once we begin considering taking advantage of X-definition's support for Xpath 3.1 and also of the availability of a forget command in X-script, we can at the very least think about performing certain tasks where we might hesitate more were we writing C# or Java code. More of our imagination remains at our disposal: we don't have to ask ourselves again and again whether the element that holds the data we are seeking has perhaps already been "consumed". Making a copy of an element and binding the copy to a variable in order for our use of Xpath expressions to prove practical isn't necessary:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <xd:declaration> boolean a; boolean b; </xd:declaration> <ls:collection> <ls:record xd:script="occurs +;options ignoreOther;init{b = false;a = false;}forget;"> <ls:leader xd:script="occurs +"> string; onTrue if (getText().substring(6,7) EQ "a") a = true </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="ignore;" ind2="ignore;" xd:script="occurs +;init {if (@tag EQ '050') b = true;}"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> finally {if (b EQ true AND a EQ true) {out(xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='a']/text()").item(0).toString()); if (xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='b']/text()").item(0).toString().startsWith(".") EQ false) {out(".");} out(xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='b']/text()")); out(" ");outln(xpath("ls:datafield[@tag='245'][1]/ls:subfield[@code='a']/text()"));}} </ls:record> </ls:collection> </xd:def>
Our finally
block is executed provided that the system was able to infer consequent on a short analysis of the leader that the record was to do with a monograph (where the datasets in this case also contain records pertaining to media different in kind, like sound recordings), and also provided that a Library of Congress call number was included in the bibliographic record. The xpath()
function ensures that we have access to more than one of the 050 field's subfields, and also to a different field (245) in order to put the results in an interesting context; it ensures, too, that we're limited to working with the call number that appears first, where more than one may have been assigned. There also exists the possibility that the subfield a
, and not necessarily the 050 field itself, was repeated: here, I've permitted the xpath()
function to populate the Container
, and subsequently used the Container
's item()
function to access the contents of the first a
subfield that had been discovered, where I probably could have obtained the same data by using a predicate in the Xpath expression.
As to displaying the data, I'm afraid that I can best characterize that as something of a work in progress! I've figured that displaying it as a shelflist looks fun. It's necessary to pass the XML to the perl
command before running X-definition. As you learned in the tutorial introduction, X-definition does support regular expressions, but it didn't appear to me that the facilities were really there for replacing the space character that had inadvertently occurred from time to time in subfield a
. In Java, it can be necessary to stick with regular expressions that have been compiled in the enclosing scope when one must work with large files. Methods that are declared in the API and that look pretty convenient can cause a program that iterates over a large amount of input to take a very long time to finish running. The perl
command can be coded more readily and is guaranteed to work, although if a file is very large (the Columbia University Libraries files can be as large as 2 gigabytes when decompressed), it may prove necessary to use it by itself rather than pipe it to java
(it can be piped to java
thereafter by means of the cat
command), and use the rm
command later to remove the intermediate source XML's footprint on the physical media:
gunzip -c Columbia-extract-20210630-020.xml.gz | perl -0777 -pe 's/(^\s+<datafield tag="050".*?>\n^\s+<subfield code="a">[A-Z]+) +/$1/mg' | java -cp xdef-beginner.jar:/home/curt/SaxonHE10-3J/saxon-he-10.3.jar xdef browse.xdef /dev/stdin | sort -b -t . -k 1,1V -k 2.1,2.1 -k 2.2,2.2 -k 2.3,2.3 -k 2.4,2.4 -k 2.4,2n |perl -0777 -pe 's/^((([A-Z]+\d+)\.\d+\.).*?\n)((^\3\.\d+\..*?\n)*)((^\3\.[A-Z].*?\n)+)((^\3\. .*?\n)*)/$8$6$1$4/mg' | perl -0777 -pe 's/^((([A-Z]+\d+)\.\d+\.).*?\n)((^\3\.\d+\..*?\n)*)((^\3\.[A-Z].*?\n)+)/$6$1$4/mg' | perl -0777 -pe 's/^(([A-Z]+\d*\.) .*?\n)(((^\2\S.*?\n))+)^((\2 .*?\n)+)/$1$6$3/mg' | perl -0777 -pe 's/^(([A-Z]+\d*\.) .*?\n)(((^\2\S.*?\n))+)^((\2 .*?\n)+)/$1$6$3/mg' | perl -0777 -pe 's/^(([A-Z]+\d*\.) .*?\n)(((^\2\S.*?\n))+)^((\2 .*?\n)+)/$1$6$3/mg' | perl -0777 -pe 's/^(([A-Z]+\d*\.)[A-Z].*?\n)(((^\2[A-Z].*?\n))*)^((\2 .*?\n)+)/$6$1$3/mg' > shelflist
(I've given the code the filename browse.xdef
). The result ought to have the ring of plausibility, where the mistakes are few enough that a human being could conceivably have made them. I'm under the impression that each time the sort key is incremented, the perl
command becomes necessary in order to compensate for my computer's locale's consideration of the space character. The other thing that I noticed is that consecutive decimal points appear in the output: I suspect that in order to change that, code could be necessary that checks for a leading decimal point, which the instructions for the 050 field suggest will indicate the absence of a preceding Cutter in subfield a
, whilst subfield b
is being discovered.
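For comparison, here is roughly what the first perl substitution in the pipeline would look like in Java when the regular expression is compiled once in the enclosing scope, as recommended above for large inputs. It is a sketch under assumptions: the class name CallNumberCleaner and the sample input are made up, and it operates on a string held in memory, which would not suit a 2-gigabyte file as it stands.

```java
import java.util.regex.Pattern;

public class CallNumberCleaner {
    // Compiled once and reused. Calling String.replaceAll() inside a loop would
    // recompile the expression on every call.
    private static final Pattern STRAY_SPACE = Pattern.compile(
        "(^\\s+<datafield tag=\"050\".*?>\\n^\\s+<subfield code=\"a\">[A-Z]+) +",
        Pattern.MULTILINE);

    /** Removes the space between the class letters and the number in 050 $a. */
    public static String clean(String xml) {
        return STRAY_SPACE.matcher(xml).replaceAll("$1");
    }

    public static void main(String[] args) {
        String sample =
            "  <datafield tag=\"050\" ind1=\"0\" ind2=\"0\">\n"
            + "    <subfield code=\"a\">PS 3545</subfield>\n";
        System.out.print(clean(sample));
    }
}
```

The pattern is the perl one transliterated, with the /m flag expressed as Pattern.MULTILINE; the /g flag corresponds to replaceAll().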
Adding other fields, like the physical description area, is pretty easy. As well as the information printed on the book's spine that we're most likely to take notice of first, and the call number, we shall also have the book's dimensions and an indication as to its extent in pages:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <xd:declaration> boolean a; boolean b; </xd:declaration> <ls:collection> <ls:record xd:script="occurs +;options ignoreOther;init{b = false;a = false;}forget;"> <ls:leader xd:script="occurs +"> string; onTrue if (getText().substring(6,7) EQ "a") a = true </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="ignore;" ind2="ignore;" xd:script="occurs +;init {if (@tag EQ '050') b = true;}"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> finally {if (b EQ true AND a EQ true) {out(xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='a']/text()").item(0).toString()); if (xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='b']/text()").item(0).toString().startsWith(".") EQ false) {out(".");} out(xpath("ls:datafield[@tag='050'][1]/ls:subfield[@code='b']/text()")); out(" ");out(xpath("ls:datafield[@tag='245'][1]/ls:subfield[@code='a']/text()")); out(" ");outln(xpath("string-join(for-each(ls:datafield[@tag='300'][1]/ls:subfield,function($a){concat($a/text(),' ')}))"));}} </ls:record> </ls:collection> </xd:def>
The subfields didn't have to be sought out one at a time. What was necessary was to ensure that the sequence of items that Xpath's for-each
function returned was consolidated to a single result, because we are targeting X-script's outln()
function rather than planning to operate on a Container
object like the xpath()
function returns.
In the following code, we can forgo an xd:declaration element because we're certain that the MARC fields we are interested in will usually be represented in some form in each record, and because we can locate the text nodes and attribute values our output requires using Xpath rather than assigning each to a variable once our X-script code has discovered it, as we have done elsewhere:
<?xml version="1.0" encoding="UTF-8"?> <xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="marc" root="ls:collection" xmlns:ls="http://www.loc.gov/MARC21/slim"> <ls:collection> <ls:record xd:script="occurs +;forget;"> <ls:leader xd:script="occurs +"> string; </ls:leader> <ls:controlfield xd:script="occurs +" tag="string;"> string; </ls:controlfield> <ls:datafield tag="string;" ind1="ignore;" ind2="ignore;" xd:script="occurs +;"> <ls:subfield code="string;" xd:script="occurs +;"> string; </ls:subfield> </ls:datafield> finally {outln(xpath("string-join(for-each(ls:datafield[@tag='245'][1]/ls:subfield,function($a){$a/text() || ' '}))")); outln(xpath("string-join(for-each(ls:datafield[@tag='040'][1]/ls:subfield,function($a){$a/text() || ' '}))")); outln(xpath("string-join(for-each(ls:datafield[starts-with(@tag,'5')]/ls:subfield,function($a){$a/../@tag || ' ' || $a/text() || ' ' || ' '}))"));} </ls:record> </ls:collection> </xd:def>
The code first obtains each 245 subfield, which it didn't do the last time around. It then obtains the 040 subfields, where the contributors to the record (e.g. Columbia University Libraries, Yankee Book Peddler) are identified: the codes can be looked up at https://www.oclc.org/en/contacts/libraries.html. (Unchecking the box adjacent to "Show only OCLC members" is wise.) Finally, the tags for each note area, followed by the note itself, are printed.
What distinguishes our use of Xpath expressions in our X-definition code the most is that we don't have to be concerned about the size of our document, as we would need to be were we using XSLT. I would further observe that the learning curve where Xpath is concerned is potentially not as steep as it can be in XSLT, because X-definition code implies a context element where XSLT more or less requires that you assign one.