For at least the last decade, the British Library has made millions of its bibliographic records available as bulk downloads. One download, the British National Bibliography (BNB), comprises the whole of the British National Bibliography as we know it today (including records pertaining to monographs, serials, and Cataloguing-in-Publication data). The other, the British Library Integrated Catalogue (BLIC), contains records for monographs that the Library holds in its collections but that are not included in BNB; it is considerably larger (1.7 gigabytes as of July 2022, when the BNB download amounted to about 1 gigabyte).
The practical introduction that follows can be thought of as the "system requirements" section one might have looked for on the download page, as I'm reasonably certain I did the very first time around (and probably thereafter).
Although I'll explain below what the downloaded files are and which of your computer's resources can begin putting them to work for you, an alternative way to get started is to view most of the data found in the currently available downloads (each record is included, but as will sooner or later become apparent, not each and every XML tag was parsed) first in a dedicated GUI. The Java application I made for BNB can be obtained by clicking here, while clicking here will download the corresponding application for BLIC. Either application is huge, because it includes the data, but either is still smaller than the corresponding (decompressed) RDF/XML download from the Library. A much smaller version with the same GUI and functionality that's limited to new records and Cataloguing-in-Publication data offered on a weekly basis by the Library can be downloaded instead by clicking here. The Wiki demonstrating the applications and explaining how to use them is here.
Both BNB and BLIC, as downloaded from the Library, consist of a single .zip archive. Your system can probably extract all of the files at a stroke, or just a few at a time. On Linux, it's not always necessary to save the files to disc if you're certain what you plan to do with them, because the unzip command's output can be piped to another program instead. I shall not do that here (although I'll furnish the required code at the end), but it's worth considering if you're downloading a new archive each time an updated one is available.
The very first thing to know about the files each archive contains is that the overwhelming majority CANNOT be opened in a What You See Is What You Get (WYSIWYG) application such as Word or FrameMaker, because they are much too large. They certainly can't be opened in a web browser, which may try to parse the XML instead of displaying the markup as plain text, and hang. Excel won't necessarily hang, but is certain to produce nothing resembling a spreadsheet should it try to open any of the files. As I'll explain sooner rather than later, viewing the contents of the files practically always requires a terminal.
As of December 2022, the BNB download contained 70 XML files bearing the .rdf extension. None was larger than 375 megabytes. BLIC contained 216 files like the ones found in BNB (same format, same file extension, although a few XML tags may occur in BLIC that don't in BNB), the biggest of which was about 199 megabytes. Once decompressed, BNB occupied about 19 gigabytes on disc or in cloud storage, and BLIC about 25 gigabytes.
It's imperative that one understand the files' encoding and its implications.
The line separator is the type found on Windows, not that expected on macOS or Linux: to wit, it is the carriage return followed by the linefeed (\r\n).
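The line endings can be checked, and if need be normalised, from the terminal. What follows is a minimal sketch rather than part of the workflow described in this article: the function names count_crlf_lines and strip_cr are hypothetical, and a three-line stand-in file is created so that the snippet runs without a BNB or BLIC file to hand.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of a file that end in a carriage
# return, i.e. lines terminated by \r\n rather than by \n alone.
count_crlf_lines() {
    cr=$(printf '\r')            # a literal carriage return, portably
    grep -c "${cr}\$" "$1"
}

# Hypothetical helper: remove every carriage return from standard input.
strip_cr() {
    tr -d '\r'
}

# Stand-in for an .rdf file (assumption: the real files are far larger).
sample=$(mktemp)
printf '<rdf:RDF>\r\n<rdf:Description/>\r\n</rdf:RDF>\r\n' > "$sample"

count_crlf_lines "$sample"              # number of CRLF-terminated lines
strip_cr < "$sample" > "$sample.lf"     # same content, \n line endings only
rm -f "$sample" "$sample.lf"
```

Most Unix tools cope with the carriage returns, but they can show up as a trailing ^M in some editors and can defeat exact string comparisons at the end of a line.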
The encoding is UTF-8. Sometimes, XML files with this encoding (the standard encoding for XML documents) that you download or otherwise receive can cause certain applications to stop running. Error messages like the following subsequently appear:
Error on line 198370 column 21 of BLICBasicB_202102_f93.rdf:
SXXP0003 Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.: Invalid
byte 1 of 1-byte UTF-8 sequence.. Caused by
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1
of 1-byte UTF-8 sequence.
org.xml.sax.SAXParseException; systemId: file:/home/curt/Documents/blic/BLICBasicB_202102_f93.rdf; lineNumber: 198370; columnNumber: 21; Invalid byte 1 of 1-byte UTF-8 sequence.
As of July 2022, the releases of BNB and BLIC weren't causing that to happen on my system; previous releases (e.g. from the first half of 2021 and earlier) used to. Should you ever see messages like the foregoing on your computer, you can preempt the errors (in a Linux terminal) by piping your application's input through a standard command:
cat blic/BLICBasicB_202102_f93.rdf | iconv -c | java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:bnbtitles.xsl -s:/dev/stdin
The error messages would have appeared (again, highly unlikely if you've just downloaded BNB or BLIC) after an attempt to run this command directly against the file:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:bnbtitles.xsl -s:blic/BLICBasicB_202102_f93.rdf
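I think that such error messages come from the XML parser rather than from the XSLT processor, since xmlwf, a standalone well-formedness checker, objected to the same file: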
xmlwf blic/BLICBasicB_202102_f93.rdf
# complained: blic/BLICBasicB_202102_f93.rdf:198370:20: not well-formed (invalid token)
cat blic/BLICBasicB_202102_f93.rdf | iconv -c | xmlwf
# xmlwf returned silently without reporting any trouble
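To find out whether a file is affected before handing it to an XML parser, iconv itself can serve as a validator. The sketch below is an illustration under stated assumptions, not the procedure used above: it names the encodings explicitly with -f and -t (the commands above rely on iconv's defaults), the function names are hypothetical, and the two tiny files stand in for real downloads.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file is valid UTF-8.
# iconv exits with a non-zero status when it meets a byte sequence
# that is not valid in the source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: copy standard input to standard output, silently
# dropping invalid bytes (the -c option used earlier in this article).
# The exit status is ignored on purpose: some iconv builds report failure
# even though -c let them finish the conversion.
drop_invalid_utf8() {
    iconv -c -f UTF-8 -t UTF-8 2> /dev/null || true
}

good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"    # "café" encoded as UTF-8
printf 'caf\351\n' > "$bad"         # a lone 0xE9 byte (Latin-1 é): invalid UTF-8

for f in "$good" "$bad"; do
    if is_valid_utf8 "$f"; then
        echo "valid UTF-8"
    else
        echo "invalid UTF-8"
    fi
done

drop_invalid_utf8 < "$bad"          # the stray byte is gone from the output
rm -f "$good" "$bad"
```

Bear in mind that -c discards data: whatever character the invalid bytes were meant to represent is lost from the output.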
BNB and BLIC are certainly not UTF-8 in name only: throughout the files, tens of thousands of characters peculiar to many different languages are used. If one tries to view the files on a Windows terminal, question marks may appear instead of quite a few of the characters if the terminal's default encoding is a legacy code page rather than UTF-8. Unfortunately, that is to do with the platform and not with the files themselves: even once the settings are changed, the font the Windows terminal uses may not be able to display all of the characters used throughout BNB and (particularly) BLIC, and a different font may need to be assigned.
Still more unfortunately, that brings me to the differences among operating systems. Configuring Windows to display Unicode characters in the terminal can be annoying and entail trial and error. (Again, although Notepad or WordPad might in principle display the files correctly, the files are almost certainly too large for either application to open.) It is much better to avoid all of that, because the Windows terminal really isn't equipped to work with very large files, especially files that contain markup and thereby require it to do extra work. Microsoft offers an Ubuntu Linux terminal for Windows 10 that is identical to the terminal used in this article, and installing it is really wise. PowerShell can't work as quickly as the Linux terminal and has a steep learning curve: one ends up with an incentive to migrate one's scripts to C# so that the code runs considerably faster, and little if any incentive to do the same things again and again with PowerShell's built-in commands.
On a Linux terminal, the less command can be used to look at the BNB and BLIC files (you can download the file below here):
less BNBBasic_202105_f59.rdf
We opened a BNB file. I mentioned that the most recent BNB download contains about 70 files: all but the last consist of 75,000 records each. That's quite a few presses of the space bar (the usual means of advancing one page in less, although the Page Up and Page Down keys are likely to work too). less supports free-text searching (the manual page shows how the / operator and the n/Shift+n keys facilitate it; I also discuss it in this project's tutorial on X-definition), but there happen to be about 70 files. You could conceivably pass the names of every last one to less when opening it. Prepending a flag (an asterisk, according to the manual page) to the pattern will cause less to carry on looking for it in each subsequent file until the data is exhausted.
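When the aim is only to learn which files mention a string, a non-interactive search may be more convenient than paging through every file. The following is a sketch using grep rather than less, which is a substitution made here for illustration; the function name files_mentioning, the directory layout, and the two miniature files are likewise assumptions.

```shell
#!/bin/sh
# Hypothetical helper: list the .rdf files in a directory that contain a
# fixed string. -l prints only the file names; -F treats the pattern
# literally, so characters such as "." or "[" need no escaping.
files_mentioning() {
    grep -lF -- "$1" "$2"/*.rdf
}

# Two miniature stand-ins for downloaded files (assumption).
dir=$(mktemp -d)
printf '<dcterms:title>Faust</dcterms:title>\r\n' > "$dir/a.rdf"
printf '<dcterms:title>Hamlet</dcterms:title>\r\n' > "$dir/b.rdf"

files_mentioning 'Faust' "$dir"     # prints the path of a.rdf only
rm -rf "$dir"
```

Once the file of interest is known, it can be opened in less and searched with the / operator as described above.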
A more familiar means of looking for data in an XML file like those in BNB and BLIC is an XSL transformation. I think that you can do one against any file in BNB or BLIC, provided that your computer has at least 2 gigabytes of RAM installed:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:language/rdf:Description/rdfs:label[text()='ger']">
<!--There are 975 in all! -->
<xsl:value-of select="dcterms:title"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The code just printed the title of every record in the file that unequivocally stated that the language was German. I'll cover the XML markup (e.g., rdf:Description) later in this article. The point of furnishing the example is that it ran against the largest file in the May 2021 BNB download. As it happens, when I tried running it on a laptop computer with no more than 2 gigabytes of RAM installed, it did not work at first. The application reported that Java had run out of memory, and stopped. It ultimately did run once I had allocated more memory by adding the -Xmx option to the java command:
java -Xmx2G -jar saxon-he-10.3.jar -xsl:biggest.xsl -s:BNBBasic_202105_f59.rdf
I assigned 2 gigabytes based solely on the assumption that more was not really available. If you try to run XSL transformations using Java and error messages concerning the heap memory persist, the very first thing to do is make certain that you are using a 64-bit operating system and, especially, that your version of Java (you can learn it by typing java -version) is a 64-bit, and not a 32-bit, version.
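Both checks can be made from the terminal. The sketch below rests on assumptions: the function name is hypothetical, it relies on the common convention that 64-bit JVMs such as HotSpot/OpenJDK include the text "64-Bit" in their version banner, and getconf LONG_BIT is a Linux-oriented way of asking for the platform's word size that may not be available everywhere.

```shell
#!/bin/sh
# Hypothetical helper: read a JVM version banner on standard input and
# succeed if it advertises a 64-bit virtual machine.
is_64bit_banner() {
    grep -qi '64-bit'
}

# Word size of the operating system's userland (typically 32 or 64).
getconf LONG_BIT

# java -version writes its banner to standard error, hence 2>&1.
if java -version 2>&1 | is_64bit_banner; then
    echo "64-bit Java detected"
else
    echo "no 64-bit Java detected (32-bit, or java is not installed)"
fi
```

A 32-bit JVM cannot address a large heap no matter how much RAM is installed, which is why this check is worth making before raising -Xmx any further.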
With both BNB and BLIC, the number of records and the size of the file are very different considerations because the Library's bibliographic description has evolved through the years, and records that are recent are therefore potentially rather larger than ones that are old. As mentioned earlier, it would appear that the Library splits BNB up into files 75,000 records long each time that a new archive is assembled, and a new download offered. As of summer 2022, the maximum number of records in a BLIC file, pending a semi-annual refresh, appears to be 50,000.
Markup accounts for rather a lot of each file's size in bytes, and even for much of each record's extent. When examining the markup, one can note early on that it doesn't use mixed content: the XML elements contain either a text node or one or more child elements. Also of note is that the elements really do require qualified names: in at least one case (dcterms:type vs. rdf:type), two distinct elements would otherwise bear the same name. If you're accessing the files by means of other XML vocabularies, like XSL or X-definition, it is imperative to declare the namespaces at the earliest opportunity (i.e., the top level of your document), just as the BNB and BLIC files all do. Should you not, some applications may protest and refuse to run, where others, worse yet, may in effect ignore some or all of your code.
Unlike with MARC-XML, your applications will seldom need to take account of attributes: their content in BNB and BLIC is by and large boilerplate.
You may already have noticed that on the Library's download page, sample files for both BNB and BLIC are offered. It really isn't for me to say whether they are representative or whether they reflect everything that could have changed over the last ten years or more. The following are all of the elements found therein:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:isbd="http://iflastandards.info/ns/isbd/elements/"
xmlns:owlt="http://www.w3.org/2006/time#"
xmlns:rda="http://rdvocab.info/Elements/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<rdf:Description>
<dcterms:title>
<!-- MARC 245 subfields $a, $b, $n, $p -->
</dcterms:title>
<dcterms:alternative>
<!-- MARC 130 subfields $a, $d, $f, $g, $k, $l, $m, $n, $o, $p, $r, $s, $t -->
<!-- MARC 240 subfields $a, $d, $f, $g, $k, $l, $m, $n, $o, $p, $r, $s -->
</dcterms:alternative>
<dcterms:creator>
<!-- MARC 100 subfields $a, $b, $c, $d, $g, $j, $q, $u -->
<!-- @resource set to http://xmlns.com/foaf/0.1/Person -->
<!-- MARC 110 subfields $a, $b, $c, $d, $g, $n, $u -->
<!-- @resource set to http://xmlns.com/foaf/0.1/Organization -->
<!-- MARC 111 subfields $a, $c, $d, $e, $g, $n, $q, $u -->
<!-- @resource set to http://purl.org/ontology/bibo/Conference -->
<rdf:Description>
<rdf:type rdf:resource=""/>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:creator>
<dcterms:contributor>
<!-- MARC 700 subfields $a, $b, $c, $d, $g, $j, $q, $u -->
<!--@resource set to http://xmlns.com/foaf/0.1/Person -->
<!-- MARC 710 subfields $a, $b, $c, $d, $g, $n, $u -->
<!--@resource set to http://xmlns.com/foaf/0.1/Organization -->
<!-- MARC 711 subfields $a, $b, $c, $d, $g, $n, $q, $u -->
<!--@resource set to http://purl.org/ontology/bibo/Conference -->
<rdf:Description>
<rdf:type rdf:resource=""/>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:contributor>
<dcterms:relation>
<!-- MARC 700 subfields $a, $b, $c, $d, $f, $g, $h, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u -->
<!-- if there is $t or $k in 700 -->
<!-- MARC 710 subfields $a, $b, $c, $d, $f, $g, $h, $k, $l, $m, $n, $o, $p, $r, $s, $t, $u -->
<!-- if there is $t or $k in 710 -->
<!-- MARC 711 subfields $a, $c, $d, $e, $f, $g, $h, $k, $l, $n, $p, $q, $s -->
<!-- if there is $t in 711 -->
<!-- MARC 730 subfields $a, $d, $f, $g, $h, $k, $l, $m, $n, $o, $p, $r, $s, $t -->
<!-- MARC 76X-78X subfields $a, $b, $c, $d, $h, $k, $m, $n, $p, $s, $t, except 760, 773, 774, 775, 776, 780, 785 -->
<rdf:Description>
<bibo:isbn>
<!-- 76x-78x subfield $z except 760, 762, 777 (for the resource to which the described resource is related) -->
</bibo:isbn>
<bibo:issn>
<!-- 730 subfield $x (for the resource to which the described resource is related) -->
<!-- 76X-78X subfield $x (for the resource to which the described resource is related) -->
</bibo:issn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:relation>
<dcterms:type>
<!-- MARC LEADER/06 a or t (text), d,f,p or t (manuscript) LEADER/07 a or m (monographic), b, i or s (continuing), c (collection) -->
<!-- MARC 007/00 h and 01 d (microfilm reel), 007/00 h and 01 e (microfiche), 007/00 h and 01 | (microform) -->
<!-- MARC 008/23 o, q, s for BK, CR, CF, MU, MM , 008/29 o, q, s for MP, VM (electronic) -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:type>
<isbd:P1016>
<!-- MARC 260 subfield $a -->
<!-- MARC 264 indicator1=blank indicator2=1, subfield $a -->
<!-- (hasPlaceOfPublicationProductionDistribution) -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</isbd:P1016>
<rda:placeOfPublication>
<!-- MARC 008/15-17 -->
<rdf:Description>
<rdf:type rdf:resource=""/>
<rdfs:label>
</rdfs:label>
<skos:inScheme rdf:resource=""/>
</rdf:Description>
</rda:placeOfPublication>
<dcterms:description>
<!-- Forthcoming publication -->
<!-- MARC LEADER/17, 8 -->
<!-- MARC 040 $a StDuBDS -->
<!-- MARC 263 -->
<!-- MARC 500-599, except 505, 506, 510, 516, 520, 521, 538, 540, 546 -->
</dcterms:description>
<dcterms:publisher>
<!-- MARC 260 subfield $b -->
<!-- MARC 264 indicator1=blank indicator2=1, subfield $b -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:publisher>
<dcterms:issued>
<!-- concatenation of MARC 008/07-10 and 008/11-14 -->
<!-- MARC 008/06 e, LEADER/06 d, f, p or t, 008/07-10 -->
<!-- not the same as MARC 260$c -->
<!-- MARC 264 indicator1=blank indicator2=1, subfield $c -->
</dcterms:issued>
<dc:date>
<!-- MARC 008/06 c, d, i, k, m or u ; 008/06 q -->
<owlt:interval>
<owlt:hasBeginning>
<owlt:Instant>
<!-- MARC 008/07-10 -->
<owlt:inXSDDateTime>
<!-- MARC 008/07-10 -->
</owlt:inXSDDateTime>
</owlt:Instant>
</owlt:hasBeginning>
<owlt:hasEnd>
<owlt:Instant>
<!-- MARC 008/11-14 -->
<owlt:inXSDDateTime>
<!-- MARC 008/11-14 -->
</owlt:inXSDDateTime>
</owlt:Instant>
</owlt:hasEnd>
</owlt:interval>
</dc:date>
<dcterms:dateCopyrighted>
<!-- MARC 264 indicator1=blank indicator2=4, subfield $c -->
</dcterms:dateCopyrighted>
<dcterms:language>
<!-- MARC 008/35-37 -->
<!-- only mapped if there is no 041 -->
<!-- MARC 041 indicator2=blank subfield $a -->
<!-- ISO 639-2 -->
<rdf:Description>
<rdf:type rdf:resource=""/>
<rdfs:label>
</rdfs:label>
<skos:inScheme rdf:resource=""/>
</rdf:Description>
</dcterms:language>
<isbd:P1008>
<!-- MARC 250 subfield $a -->
<!-- (hasEditionStatement) -->
</isbd:P1008>
<dcterms:extent>
<!-- MARC 300 subfield $a -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:extent>
<isbd:P1038>
<!-- MARC 362 subfield $a -->
<!-- number in MARC 300 subfield $a -->
</isbd:P1038>
<isbd:P1073>
<!-- (not documented) -->
</isbd:P1073>
<bibo:numVolumes>
<!-- MARC 300 subfield $a -->
</bibo:numVolumes>
<dcterms:tableOfContents>
<!-- MARC 505 subfields $a, $q, $t, $r -->
</dcterms:tableOfContents>
<dcterms:abstract>
<!-- MARC 520 subfields $a, $b -->
</dcterms:abstract>
<dcterms:requires>
<!-- MARC 538, subfield $a -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:requires>
<dcterms:accessRights>
<!-- MARC 506 subfields $a, $d -->
<!-- MARC 540 subfields $a, $d -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:accessRights>
<rda:dissertationOrThesisInformation>
</rda:dissertationOrThesisInformation>
<dcterms:audience>
<!-- MARC 521 subfield $a -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:audience>
<dcterms:isReferencedBy>
<!--MARC 510 subfield $a -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:isReferencedBy>
<dcterms:subject>
<!-- MARC 082 subfields $a, $2 -->
<!-- + DDC edition number -->
<!-- MARC 600/610/611/630/650/651 indicator2 = 0 (LCSH), 2(MeSH), otherwise unspecified; all subfields -->
<!--MARC 653 subfield $a -->
<rdf:Description>
<rdf:type rdf:resource=""/>
<rdfs:label>
</rdfs:label>
<skos:inScheme rdf:resource=""/>
<skos:notation rdf:datatype="">
</skos:notation>
</rdf:Description>
</dcterms:subject>
<dcterms:spatial>
<!-- MARC 651 indicator2 = 0 subfield $a -->
<rdf:Description>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:spatial>
<rda:seriesStatement>
<!-- MARC 490 subfields $a, $x, $v -->
</rda:seriesStatement>
<dcterms:isPartOf>
<!-- MARC 490 subfield $a -->
<!-- MARC 760 subfields $a, $b, $c, $d, $h, $m, $n, $s, $t -->
<!-- MARC 773 subfields $a, $b, $d, $h, $k, $m, $n, $p, $q, $s, $t -->
<!-- MARC 800 subfields $a, $b, $c, $d, $f, $g, $k, $l, $m, $n, $o, $p, $q, $s, $t, $u, $v -->
<!-- MARC 810 subfields $a, $b, $c, $d, $f, $g, $j, $k, $l, $m, $n, $o, $p, $r, $s, $t, $u, $v -->
<!-- MARC 811 subfields $a, $c, $d, $e, $f, $g, $j, $k, $l, $n, $p, $q, $r, $s, $t, $u, $v -->
<!-- MARC 830 subfields $a, $b, $c, $d, $f, $g, $k, $l, $m, $n, $o, $p, $r, $s, $t, $v -->
<rdf:Description>
<bibo:issn>
<!-- MARC 490 subfield $x -->
<!-- (for the series of which the described resource is a part) -->
<!-- 760 subfield $x (for the resource to which the described resource is related) -->
<!-- 773 subfield $x (for the resource to which the described resource is related) -->
<!-- 830 subfield $x (for the series of which the described resource is a part) -->
</bibo:issn>
<bibo:isbn>
<!-- 773 subfield $z (for the resource to which the described resource is related) -->
</bibo:isbn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:isPartOf>
<dcterms:hasVersion>
<!-- MARC 775 subfields $a, $b, $c, $d, $e, $f, $h, $k, $m, $n, $s, $t -->
<rdf:Description>
<bibo:issn>
<!-- 775 subfield $x (for the resource to which the described resource is related) -->
</bibo:issn>
<bibo:isbn>
<!-- 775 subfield $z (for the resource to which the described resource is related) -->
</bibo:isbn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:hasVersion>
<dcterms:hasFormat>
<!-- MARC 776 subfields $a, $b, $c, $d, $h, $k, $m, $n, $s, $t -->
<rdf:Description>
<bibo:isbn>
<!-- 776 subfield $z (for the resource to which the described resource is related) -->
</bibo:isbn>
<bibo:issn>
<!-- 776 subfield $x (for the resource to which the described resource is related) -->
</bibo:issn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:hasFormat>
<bibo:isbn10>
<!-- MARC 020 subfield $a (qualifiers are excluded) -->
</bibo:isbn10>
<bibo:isbn13>
<!-- MARC 020 subfield $a (qualifiers are excluded) -->
</bibo:isbn13>
<bibo:issn>
<!-- MARC 022 subfield $a -->
</bibo:issn>
<rda:termsOfAvailability>
<!-- MARC 020 subfield $c (qualifiers are excluded) -->
</rda:termsOfAvailability>
<dcterms:replaces>
<!-- MARC 780 subfields $a, $b, $c, $d, $h, $k, $m, $n, $s, $t -->
<rdf:Description>
<bibo:isbn>
<!-- 780 subfield $z (for the resource to which the described resource is related) -->
</bibo:isbn>
<bibo:issn>
<!-- 780 subfield $x (for the resource to which the described resource is related) -->
</bibo:issn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:replaces>
<dcterms:isReplacedBy>
<!-- MARC 785 subfields $a, $b, $c, $d, $h, $k, $m, $n, $s, $t -->
<rdf:Description>
<bibo:issn>
<!-- 785 subfield $x (for the resource to which the described resource is related) -->
</bibo:issn>
<bibo:isbn>
<!-- 785 subfield $z (for the resource to which the described resource is related) -->
</bibo:isbn>
<rdfs:label>
</rdfs:label>
</rdf:Description>
</dcterms:isReplacedBy>
<dcterms:identifier>
<!-- MARC 001 -->
<!-- preceded by BL MARC organisation code (Uk) -->
<!-- MARC 015 subfield $a -->
<!-- MARC 020 subfield $a (expressed as urn:isbn) -->
<!-- MARC 022 subfield $a (expressed as urn:issn) -->
<!-- MARC 856 indicator1=4 indicator2=0 subfield $u -->
</dcterms:identifier>
<rdfs:seeAlso rdf:resource=""/>
<!-- MARC 856 indicator1=4 indicator2=1 or 2 subfield $u -->
</rdf:Description>
</rdf:RDF>
The comments are taken from the Library's document mapping the tags to their corresponding MARC fields.
The enclosing element (one per document/file) is rdf:RDF.
The rdf:Description element is polyvalent. It encloses each individual record, but it also occurs within each record as a child of such elements as dcterms:language or dcterms:subject.
The order in which you find the elements above might not always reflect the order in which they occur in either BNB or BLIC: I've come as close as I figured was possible. It appeared to me that the dcterms:description and dcterms:relation elements can occur as direct children of rdf:Description practically anywhere in BNB and BLIC where the context warrants it. The elements' document order can matter if your application isn't able to take advantage of XPath.
With that in mind, XSLT shall be used for the examples to follow. The application for running the stylesheets can be downloaded from SourceForge's Saxon page. The version I shall use is for Java.
Code earlier in this article used the dcterms:title element:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:value-of select="dcterms:title"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The above code simply prints the contents of each dcterms:title element, present in practically every record, to the terminal. The command on my system was:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:xslt.xsl -s:blic/BLICBasicB_202105_f94.rdf
where BLICBasicB_202105_f94.rdf is the name of the small file I chose against which to do the XSL transformation. I named the stylesheet xslt.xsl: the general idea is that when you encounter sample code during the discussion to follow, you can just copy it to your system's clipboard and replace the contents of xslt.xsl with it. As I mentioned earlier, the above command might not work consistently if you are looking at an older release of BNB or BLIC. The command ran against BLICBasicB_202105_f94.rdf last year, but not against the preceding file from the same archive:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:xslt.xsl -s:blic/BLICBasicB_202105_f93.rdf
Error messages would appear:
Error on line 198370 column 21 of BLICBasicB_202105_f93.rdf:
SXXP0003 Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.: Invalid
byte 1 of 1-byte UTF-8 sequence.. Caused by
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1
of 1-byte UTF-8 sequence.
org.xml.sax.SAXParseException; systemId: file:/home/curt/Documents/blic/BLICBasicB_202105_f93.rdf; lineNumber: 198370; columnNumber: 21; Invalid byte 1 of 1-byte UTF-8 sequence.
As a reminder, should that occur, it can be worked around:
cat blic/BLICBasicB_202105_f93.rdf | iconv -c | java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:xslt.xsl -s:/dev/stdin
That ran just as the transformation against BLICBasicB_202105_f94.rdf had.
Unlike in MARC-XML, the dcterms:title element can be expected to contain all of the title's parts. (There is a related field, dcterms:alternative, that I'll mention again later.) You're not required to poll more than one subfield in order to retrieve all of the punctuation and accompanying information.
Using this code, you can glimpse the authors' names:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:creator | dcterms:contributor">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The Library's mapping to the MARC fields suggests that the dcterms:creator tag is to do with the main entry, so it makes sense to take a glimpse only of these:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:creator">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The union operator (|) and the operand on its right-hand side are gone, though the rest is the same. The value of the xsl:value-of element's select attribute will work later with certain elements other than dcterms:creator.
The Library's RDF/XML retains the distinction between the 100, 110, and 111 fields by means of an attribute on rdfs:label's sibling element, rdf:type:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:creator[rdf:Description/rdf:type/@rdf:resource = 'http://xmlns.com/foaf/0.1/Person']">
<!--<xsl:for-each select="dcterms:creator[rdf:Description/rdf:type/@rdf:resource = 'http://xmlns.com/foaf/0.1/Organization']">-->
<!--<xsl:for-each select="dcterms:creator[rdf:Description/rdf:type/@rdf:resource = 'http://purl.org/ontology/bibo/Conference']">-->
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
To see the corporate bodies or conferences instead, uncomment the relevant line and comment out the line that was previously active.
Particularly when it's not qualified with its namespace's identifier, the name of the isbd:P1016 tag may seem cryptic. It refers to the name of a city or similar jurisdiction that can serve to disambiguate when two or more publishers have similar names. You've encountered it in some form in books about various formats for bibliographic citations, or in the Anglo-American Cataloguing Rules/International Standard Bibliographic Description:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="isbd:P1016">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The rda:placeOfPublication tag is different to isbd:P1016. It's the country where the monograph or other resource was published, as assigned by the Library, and consists of a code rather than of text copied from the exemplar's title page (or its verso):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="rda:placeOfPublication">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The above just prints out the codes that it finds, one after the other. To obtain more of an overview, you can pipe the output of the XSL transformation to other commands:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:xslt.xsl -s:blic/BLICBasicB_202105_f92.rdf | sort | uniq -c | less
The uniq
command's -c
option prefixes each code with the number of times it occurred in the file. You can find the codes at https://id.loc.gov/vocabulary/countries.html.
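To rank the codes by frequency rather than list them alphabetically, a second sort can follow uniq -c. What follows is only a minimal sketch: rank_codes is a hypothetical helper name, and the printf line stands in for the Saxon output with made-up sample values rather than values taken from the data.

```shell
#!/bin/bash
# rank_codes (hypothetical name): count the lines arriving on standard input
# and list the most frequent value first.
rank_codes() {
  sort | uniq -c | sort -rn
}

# Stand-in for the Saxon output: one code per line (sample values only).
printf '%s\n' enk xxu enk fr enk xxu | rank_codes
```

In practice the java command shown above would take the place of the printf line.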
One thing of note about dcterms:publisher
is that its child elements have a similar hierarchy to that of the elements we've already discussed, excepting dcterms:title
:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:publisher">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
It's also necessary to be aware that dcterms:publisher
and other elements can occur more than once in a bibliographic record:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:publisher[1]">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Once I included a positional predicate ([1]) after dcterms:publisher
, and thereby indicated that I was interested only in the first element, the output consisted of fewer lines.
When an element is repeated within a bibliographic record, the repetition appears to be an alternative to using the element just once and enclosing more than one rdfs:label element inside rdf:Description.
The dcterms:issued
element is where you obtain the date of publication. Unlike most of the elements discussed so far, it has no descendant elements:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:issued">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Its content is typically a year, or a year followed by a month. The dc:date
element appears to be applicable only to continuing resources/serials.
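If each dcterms:issued value begins with a four-digit year, as the description above suggests, the output can be reduced to a tally per year. This is only a sketch under that assumption: year_tally is a hypothetical name, and the sample values (including the separator between year and month) are invented.

```shell
#!/bin/bash
# year_tally (hypothetical name): keep a leading four-digit year from each
# input line, then count how often each year occurs.
# Assumption: a value starts with the year; lines that do not are skipped.
year_tally() {
  grep -oE '^[0-9]{4}' | sort | uniq -c
}

# Invented sample values standing in for the stylesheet's output.
printf '%s\n' 1999 2005-03 2005 'n.d.' | year_tally
```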
The dcterms:language
element is more similar than dcterms:issued
or dcterms:title
to some of the other elements where looking for its text content is concerned, and like rda:placeOfPublication
, it ultimately encloses a code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:value-of select="count(dcterms:language)"/>
<xsl:text>
</xsl:text>
<xsl:for-each select="dcterms:language">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
A potential distinction is that it's often repeated. Filtering the output of the above XSL transformation as follows shows that it was repeated up to five times within a single record in the input file:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:xslt.xsl -s:blic/BLICBasicB_202105_f88.rdf | grep -P '\d+' | sort | uniq
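Because the pattern \d+ matches any line that contains a digit, a stricter variant keeps only the lines that consist entirely of digits, and also reports how many records had each count. The following is a minimal sketch with invented input; language_count_tally is a hypothetical name.

```shell
#!/bin/bash
# language_count_tally (hypothetical name): keep only the lines made up
# entirely of digits (the per-record counts), sort them numerically, and
# show how many records had each count.
language_count_tally() {
  grep -E '^[0-9]+$' | sort -n | uniq -c
}

# Invented sample: counts interleaved with language codes, as the
# stylesheet above would print them.
printf '%s\n' 1 eng 2 ger fre 1 eng 0 | language_count_tally
```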
Instead of repeating dcterms:language
, some records may still use only one rdfs:label
(it always seems to be limited to one) and assign a concatenation of the codes as its content:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:language[string-length(rdf:Description/rdfs:label) gt 3]">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output from the above XSL transformation run as before against BLICBasicB_202105_f88.rdf
was:
gerspa
freger
dumdut
dutfre
grclat
spalat
spalat
The three-letter codes are at https://id.loc.gov/vocabulary/iso639-2.html.
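Since each code is exactly three letters long, a concatenated label can be split back into its individual codes before they are looked up. The sketch below assumes that every label is a plain run of three-letter codes with nothing in between; split_codes is a hypothetical name.

```shell
#!/bin/bash
# split_codes (hypothetical name): break each input line into chunks of
# three characters, one chunk per output line.
# Assumption: a label is a plain run of three-letter codes.
split_codes() {
  fold -w 3
}

# Two of the labels from the output shown above.
printf '%s\n' gerspa freger | split_codes
```

Appending sort and uniq -c to the pipeline would then count each language once per mention.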
For the edition statement, one locates isbd:P1008
, which has no child elements:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="isbd:P1008">
<!--<xsl:for-each select="isbd:P1038">-->
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The isbd:P1038
element appears to parallel dc:date
, did not appear to occur frequently, and is furnished as an alternative value for the xsl:for-each
tag's select
attribute in the above code but commented out.
The part of the physical description that covers the total extent of the item (e.g., the number of pages) is here the dcterms:extent
element:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="dcterms:extent">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The isbd:P1073
element is on a par with dcterms:description
, both in that it has no child elements, and also in that its content is proper to a conventional bibliographic record's note area. Specifically, the note enclosed by the isbd:P1073
tag is to do with an item's language:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:for-each select="isbd:P1073">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
For instance, the last line that the above code output when run against BLICBasicB_202105_f88.rdf
read In Latvian: summaries in English
.
Two elements that can make for interesting reading tend to occur in more recent records, and are dcterms:tableOfContents
and dcterms:abstract
:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:abstract | dcterms:tableOfContents">
<xsl:apply-templates select="dcterms:title | dcterms:publisher[1] | dcterms:creator[1] | dcterms:issued | dcterms:abstract | dcterms:tableOfContents"/>
</xsl:if>
</xsl:for-each>
</xsl:template>
<xsl:template match="dcterms:title | dcterms:issued">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="dcterms:publisher[1] | dcterms:creator[1]">
<xsl:value-of select="rdf:Description/rdfs:label"/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="dcterms:abstract | dcterms:tableOfContents">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
The above prints the title, first author's name, first publisher's name, date issued, abstract, and table of contents, in the order in which they occur in the record, provided that the record contains at least one of the latter two. It could even be worth keeping. The results after running against BLICBasicB_202105_f80.rdf
seemed representative where recent records are concerned.
It's also possible to check for an rda:seriesStatement
element (neither it nor dcterms:title
has child elements):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="rda:seriesStatement">
<xsl:value-of select="dcterms:title"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="rda:seriesStatement"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Things work similarly for the dcterms:alternative
element (an alternative form of the title):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:alternative">
<xsl:value-of select="dcterms:title"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="dcterms:alternative"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The above prints the title and any alternative forms of it for each bibliographic record that has at least one alternative title.
Regarding the dcterms:subject
element, the first thing to understand is that its only child element is a single rdf:Description
element. The only child element of rdf:Description
as found inside dcterms:subject
that has text content (also, the only descendant element of dcterms:subject
that has text content) is either rdfs:label
, or skos:notation
. If another rdfs:label
or skos:notation
element proves necessary, the dcterms:subject
element is repeated as a child element of the record's enclosing rdf:Description
. Neither rdfs:label
nor skos:notation
occurs more than once in succession: each has its own enclosing dcterms:subject
element.
The skos:notation
element is where you find the Dewey number, if one has been assigned:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="xml" indent="true"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:subject/rdf:Description[starts-with(skos:notation,'8')]">
<xsl:copy-of select="."/>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The above XSL transformation prints all the records (with the XML markup intact) that were assigned at least one Dewey number that was in the 800 (literature) class. If there is more than one Dewey number, each will be inside its own dcterms:subject
element.
Where a Dewey number is assigned, it looks as though one can count on the applicable edition of the Dewey Decimal Classification (DDC) being specified within the record:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:subject/rdf:Description/skos:notation">
<xsl:for-each select="dcterms:subject/rdf:Description/skos:notation">
<xsl:value-of select="."/>
<xsl:text> </xsl:text>
<xsl:value-of select="following-sibling::skos:inScheme[1]/@rdf:resource"/>
<xsl:text> </xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
What the XSL transformation used this time prints is the Dewey number (or series of same if more than one was assigned), and also the value of the following skos:inScheme
element's rdf:resource
attribute. These values (e.g. http://dewey.info/scheme/e19/
) can be characterized as identifiers, but not as locators. The dewey.info
URL was abandoned by the Online Computer Library Center rather long ago.
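Because these values serve only as identifiers, it can be convenient to reduce each one to its last path segment when tallying. The following sketch does so with shell parameter expansion; ddc_edition is a hypothetical name, and reading the segment as an edition label is my assumption.

```shell
#!/bin/bash
# ddc_edition (hypothetical name): print the last path segment of a scheme
# identifier such as http://dewey.info/scheme/e19/
ddc_edition() {
  local uri=${1%/}             # drop one trailing slash, if present
  printf '%s\n' "${uri##*/}"   # keep what follows the last remaining slash
}

ddc_edition 'http://dewey.info/scheme/e19/'
```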
The subject headings are obtained from the dcterms:subject
element's rdfs:label
child element. I'm afraid that I'm not especially well informed about the Library's subject indexing practices past and present. The following XSL transformation shows only those subject headings in either BNB or BLIC that the XML markup does not identify as Library of Congress Subject Headings (LCSH) or National Library of Medicine headings:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<!-- uncomment in order to print the records themselves instead, and comment out the redundant xsl:output tag: -->
<!--<xsl:output method="xml" indent="true"/>-->
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:subject/rdf:Description/rdfs:label[not(preceding-sibling::skos:inScheme[last()]/@rdf:resource)]">
<!-- uncomment in order to print the entire record with the XML intact (for context) instead, and comment out the for-each loop
(and also remember to effect the corresponding change to the xsl:output tag) -->
<!--<xsl:copy-of select="."/>-->
<!-- open the comment added to see the XML instead at the very beginning of the following line -->
<xsl:for-each select="dcterms:subject/rdf:Description/rdfs:label[not(preceding-sibling::skos:inScheme[last()]/@rdf:resource)]">
<xsl:value-of select="."/>
<xsl:text> </xsl:text>
</xsl:for-each>
<!-- close the comment added to see the XML instead at the very end of the above line -->
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I've included instructions for viewing the XML instead of viewing only the subject headings themselves. Really, I do not know whether or not the subject headings we find in this manner are legacies of the twentieth century, or could be more recent.
The following code is more or less the opposite of the preceding XSL transformation:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:subject/rdf:Description/rdfs:label[preceding-sibling::skos:inScheme[last()]/@rdf:resource]">
<xsl:for-each select="dcterms:subject/rdf:Description/rdfs:label[preceding-sibling::skos:inScheme[last()]/@rdf:resource]">
<xsl:value-of select="."/>
<xsl:text> </xsl:text>
<xsl:value-of select="preceding-sibling::skos:inScheme[last()]/@rdf:resource"/>
<xsl:text> </xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
You saw earlier how the skos:notation
element bearing the Dewey number had a corresponding skos:inScheme
tag that in effect identified the edition of the DDC presumably used to assign the code. Similarly, the rdfs:label
element's corresponding skos:inScheme
element bears an attribute the value of which in some way identifies a classification scheme:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:if test="dcterms:subject/rdf:Description/rdfs:label[preceding-sibling::skos:inScheme[last()]/@rdf:resource = 'http://www.nlm.nih.gov/mesh']">
<xsl:for-each select="dcterms:subject/rdf:Description/rdfs:label[preceding-sibling::skos:inScheme[last()]/@rdf:resource = 'http://www.nlm.nih.gov/mesh']">
<xsl:value-of select="."/>
<xsl:text> </xsl:text>
<xsl:value-of select="preceding-sibling::skos:inScheme[last()]/@rdf:resource"/>
<xsl:text> </xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The above XSL transformation shows the subject headings identified in the skos:inScheme
element's rdf:resource
attribute as being from the (American) National Library of Medicine's vocabulary. (Not every file that is part of either BNB or BLIC will necessarily contain any of these.) As far as I have been able to tell, if a heading is not taken from there, and if there is a skos:inScheme
tag, the skos:inScheme
tag's rdf:resource
attribute's value can be accepted on its face and it can be assumed that the subject heading is from LCSH.
Unlike the identifiers assigned to the rdf:resource
attribute in order to differentiate between consecutive editions of the DDC, those pertaining to the subject headings, http://id.loc.gov/authorities/subjects and http://www.nlm.nih.gov/mesh, work when used as URLs and can be used in the browser to find more information about the respective means of subject indexing.
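When post-processing the output of the transformations above, the two identifiers can be mapped to short names. This is a sketch only: scheme_name is a hypothetical helper, the short names are my own choice, and the fallback branches are assumptions about values I have not observed.

```shell
#!/bin/bash
# scheme_name (hypothetical name): map a skos:inScheme identifier to a
# short label for the subject vocabulary it names.
scheme_name() {
  case "$1" in
    http://id.loc.gov/authorities/subjects*) echo LCSH ;;
    http://www.nlm.nih.gov/mesh*)            echo MeSH ;;
    '')                                      echo 'no scheme' ;;
    *)                                       echo other ;;   # assumption: anything else
  esac
}

scheme_name 'http://www.nlm.nih.gov/mesh'
```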
The document that I showed you earlier that contained one of each of the tags that the Library appeared to have used in its sample XML data for both BNB and BLIC is potentially confusing where bibo:isbn10
and bibo:isbn13
are concerned because either tag can be enclosed by certain other elements. When one is looking for the International Standard Book Number (ISBN) pertinent to a specific record, one does so by looking for the bibo:isbn10
or bibo:isbn13
tag that is an immediate descendant (child) of the record's enclosing rdf:Description
element:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description/bibo:isbn10[starts-with(text(),'0') or starts-with(text(),'1')]
| rdf:Description/bibo:isbn13[starts-with(text(),'9780') or starts-with(text(),'9781')]">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
All that the above XSL transformation accomplishes is to print out the ISBNs encountered in the file that belong to the English-language registration group. While looking for an ISBN that you already know is easy, discovering ISBNs from code based on their relevance can be discouraging, just as when the records one has are in a format based on MARC. It certainly isn't guaranteed that every record will have either a bibo:isbn10
or a bibo:isbn13
element, because not everything has ever been assigned an ISBN. Where an ISBN is available, there's not a guarantee that a given record will have both a bibo:isbn10
and a bibo:isbn13
tag, nor that a given record will have either one or the other but not both. A record certainly can have more than one of either tag.
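The same prefix test can be applied outside the stylesheet, for instance to a list of ISBNs gathered from several files. The sketch below assumes that the ISBNs are stored without hyphens or spaces, which I have not verified for every record. The name is_english_group is hypothetical, and the sample ISBNs are commonly cited examples rather than values from BNB or BLIC.

```shell
#!/bin/bash
# is_english_group (hypothetical name): succeed when the ISBN belongs to the
# English-language registration group (0 or 1, or 978-0 or 978-1).
# Assumption: the ISBN contains no hyphens or spaces.
is_english_group() {
  local isbn=$1
  # Prefix the length so that a ten-character ISBN beginning with 978
  # is not mistaken for a thirteen-digit one.
  case "${#isbn}:$isbn" in
    10:0*|10:1*)       return 0 ;;
    13:9780*|13:9781*) return 0 ;;
    *)                 return 1 ;;
  esac
}

for isbn in 0306406152 9780306406157 9783161484100
do
  if is_english_group "$isbn"; then echo "$isbn yes"; else echo "$isbn no"; fi
done
```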
The tag one might habitually associate with either bibo:isbn10
or bibo:isbn13
, namely rda:termsOfAvailability
, also occurs as a child of rdf:Description
and is to be found both in BNB and BLIC:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="text"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description[rda:termsOfAvailability]">
<xsl:value-of select="rda:termsOfAvailability"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The results that you see could depend on the file against which the XSL transformation is run, and could be more distinctive when a file with many older records (e.g., records from the mid-twentieth century in BNB) is used.
As to the last of the tags that I intend to discuss, dcterms:identifier
, it is omnipresent. It looked to me like it consistently occurred at the very end of the record and would only ever be followed by an rdfs:seeAlso
tag (which appears rare) were it ever not the record's last tag of all.
While it is everywhere, the dcterms:identifier
tag is not polyvalent like the rdf:Description
tag is: it always is a child element of the record's enclosing rdf:Description
element. One way to glimpse the dcterms:identifier
tag in context is to use its contents to sort the records:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="xml" indent="true"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:sort select="dcterms:identifier[starts-with(text(),'urn')][1]"/>
<xsl:copy-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The XSL transformation (named identifier.xsl
below) sorts the records based on the first of the dcterms:identifier elements to repeat an ISBN or an International Standard Serial Number (ISSN) found earlier in the record in a bibo:isbn10
, bibo:isbn13
, or bibo:issn
element. Its output can be piped to the less pager for viewing:
java -jar /home/curt/SaxonHE10-3J/saxon-he-10.3.jar -xsl:identifier.xsl -s:BNBBasic_202105_f60.rdf | less
One can press shift
+ g
to page to the very end of the file, enter the /
operator at the less
prompt followed by dcterms:identifier
, and page backwards through the file after the first match is located by pressing shift
+ n
. One pages to the next match by just pressing n
, and returns to the very beginning of the file by just pressing g
.
Observation suggested to me that one will usually seek out the dcterms:identifier
tag either to locate a control number prefixed with (Uk)
, like one would expect to find in the Library's MARC records, or to locate the BNB number, an accession number prefixed with GB
and specifically used in BNB, the format of which is explained here. The BNB number basically indicates the year that the record was created. To sort a BNB file so that the records are in order by BNB number, one can use this XSL transformation:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xsl:output method="xml" indent="true"/>
<xsl:template match="/rdf:RDF">
<xsl:for-each select="rdf:Description">
<xsl:sort select="dcterms:identifier[starts-with(text(),'GB')][1]"/>
<xsl:copy-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
It follows, or very nearly follows, by an inductive leap of faith, that the document order of the bibliographic records in the BNB and BLIC files at our disposal is determined by the control number that is prefixed with (Uk)
. When I polled the files in the Library's release of BNB for May 2021, the records in each and every file were indeed in order by control number. In BLIC, the same did not hold for the files BLICBasicB_202105_f03.rdf
through BLICBasicB_202105_f10.rdf
, but it did for all the rest.
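A check of this kind can be sketched in shell. What follows is only a minimal sketch under stated assumptions, not necessarily how the polling reported above was carried out: the function names are hypothetical, each control number is assumed to be serialized as <dcterms:identifier>(Uk)...</dcterms:identifier> without intervening markup, and the numbers are assumed to be zero-padded to a fixed width, so that a plain byte-order comparison agrees with numeric order.

```shell
#!/bin/bash
# Hypothetical helper: print the (Uk) control numbers of one RDF/XML file,
# one per line, in document order.
# Assumption: each is serialized as <dcterms:identifier>(Uk)NNN...</dcterms:identifier>
uk_numbers() {
  grep -o '<dcterms:identifier>(Uk)[^<]*</dcterms:identifier>' "$1" |
    sed -e 's/<[^>]*>//g' -e 's/^(Uk)//'
}

# Hypothetical helper: succeed (exit status 0) only if those control numbers
# are already in ascending byte order. Assumes fixed-width, zero-padded numbers.
is_sorted_by_control_number() {
  uk_numbers "$1" | LC_ALL=C sort -c 2>/dev/null
}
```

One might then run something like: for f in BNBBasic_202105_f*.rdf; do is_sorted_by_control_number "$f" || echo "$f is not in order"; done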
So: what to do with the millions of records? You can do pretty much whatever XSL transformations from version 2 onward permit, though it won't necessarily be quick (the example below took fifteen minutes on my computer). Looping through all of BNB, for instance, is pretty easy (I've chosen the XSL transformation discussed earlier, the one I recommended saving for later use):
#!/bin/bash
# Location of the Saxon-HE jar; adjust the path for your own system
SAXON=/home/curt/SaxonHE10-3J/saxon-he-10.3.jar
# seq -w pads the counter with a leading zero (01, 02, ... 65), so one loop covers every file
for i in $(seq -w 1 65)
do
  # iconv -c discards characters that cannot be converted before Saxon reads the file
  iconv -c < "BNBBasic_202105_f${i}.rdf" | java -jar "$SAXON" -xsl:abstracts.xsl -s:/dev/stdin >> abstracts
done
Please note that the above loops through an older BNB release, which is why the input is passed through the iconv
command with the -c
option: discarding any characters that cannot be converted forestalls errors.
If you wish to run the same XSL transformation each time the Library makes a new release available, you don't really have to extract all of the files first. The following extracts each BNB file in turn but leaves nothing saved to disc once the script has finished:
#!/bin/bash
# Location of the Saxon-HE jar; adjust the path for your own system
SAXON=/home/curt/SaxonHE10-3J/saxon-he-10.3.jar
# seq -w pads the counter with a leading zero (01, 02, ... 68), so one loop covers every file
for i in $(seq -w 1 68)
do
  # unzip -p writes the archive member to standard output instead of to disc
  unzip -p -q BNBBasic_202207_rdf.zip "BNBBasic_202207_f${i}.rdf" | java -jar "$SAXON" -xsl:abstracts.xsl -s:/dev/stdin >> abstracts
done
For BNB, looping over the whole archive is a good idea whenever the release is one that you don't already have, because each file will differ from its counterpart in the previous releases, at least according to the timestamps.
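Because the number of files changes from one release to the next (65 in the May 2021 release looped through above, 68 in July 2022), it may be convenient to ask the archive itself which members it holds rather than to hard-code the range. The following is a minimal sketch of that idea: the function names and the argument order are hypothetical, and it assumes the Info-ZIP unzip command, whose -Z1 option lists member names one per line.

```shell
#!/bin/bash
# Hypothetical helper: list the .rdf members of a release archive,
# one per line, in name order.
rdf_members() {
  unzip -Z1 "$1" | grep '\.rdf$' | LC_ALL=C sort
}

# Hypothetical helper: stream every .rdf member through a stylesheet
# without saving the extracted files to disc.
# Arguments: archive, Saxon jar, stylesheet, output file (all chosen by the caller).
transform_archive() {
  local archive=$1 saxon=$2 xsl=$3 out=$4 member
  rdf_members "$archive" | while IFS= read -r member
  do
    unzip -p "$archive" "$member" |
      java -jar "$saxon" -xsl:"$xsl" -s:/dev/stdin >> "$out"
  done
}
```

A call might look like: transform_archive BNBBasic_202207_rdf.zip /home/curt/SaxonHE10-3J/saxon-he-10.3.jar abstracts.xsl abstracts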
When I was getting used to the BNB and BLIC downloads, the strategy for looking at the data that ultimately seemed to me to hold the most promise entailed serializing the data with a view to displaying it in a GUI application (mentioned in the present article's introductory remarks). The application obviously could not read all of the data in at once. I parsed the British National Bibliography numbers in order to group the records by year. (I'm not certain that I found a satisfactory means of doing similarly for BLIC: an upper threshold was set instead, which doesn't quite take account of the fact that more recent records run to a greater number of bytes.) It was still possible for the application to search all of either BNB or BLIC (there was one application for each) at a stroke and display all of the results.
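As a rough illustration of grouping by BNB number, the sketch below tallies the records in one or more files by the first four characters of the BNB number (GB plus the two characters that follow). It is only a sketch under assumptions, not the code used in the application: the function name is hypothetical, each BNB number is assumed to be serialized as <dcterms:identifier>GB...</dcterms:identifier>, and the assumption that those two characters carry the year information should be checked against the Library's explanation of the format.

```shell
#!/bin/bash
# Hypothetical helper: count records per BNB number prefix.
# Assumption: each BNB number is serialized as
#   <dcterms:identifier>GB.......</dcterms:identifier>
# Assumption: the two characters after "GB" carry the year information.
bnb_prefix_counts() {
  grep -oh '<dcterms:identifier>GB[^<]*</dcterms:identifier>' "$@" |
    sed -e 's/<[^>]*>//g' |   # strip the surrounding tags
    cut -c1-4 |               # keep "GB" plus the next two characters
    LC_ALL=C sort |
    uniq -c                   # one line per prefix: count, then prefix
}
```

Running bnb_prefix_counts BNBBasic_202105_f*.rdf would then print one line per prefix, which gives a feel for how unevenly the records are spread before deciding how to group them.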
I didn't serialize the XML markup (or have the GUI application subsequently parse any). Instead, I serialized each record as a single string, with a combination of unusual Unicode characters and repetitions of the space character serving as the field delimiters. That can be done in Java or C# by using APIs that "tunnel" through the XML: the code consists largely of branches for each of the tags one has identified. Where XML vocabularies were used here to retrieve data (XSLT in this article, X-definition elsewhere in this project), the GUI application instead relied on Java regular expressions.
One challenge posed by the data to hand was the enormous variety of Unicode characters represented. Not wanting interoperability to be limited to what the Java programming language makes possible, I looked for a free means of making PDF output and added it to the application. Each time there was a new release, it was necessary to poll all of the characters in order to make certain that the PDFs would not be full of placeholders.
The most I know to say in conclusion is that I certainly never had anything like all of this when I attended library college thirty years ago!