From: Peter W. <pet...@ke...> - 2012-03-15 10:49:38
Hi,

There is a very useful blog post by Joe Wicentowski on transforming text into XML, which you can find at http://joewiz.posterous.com/an-under-appreciated-use-for-xquery-wrangling. As I have quite a lot of this to do, I started experimenting, and my effort is shown below. It successfully achieves a basic transformation of nearly 200 pages of text, with footnotes at the bottom of each page and headers at the top. Incidentally, it is useful to look at the text with codes revealed, e.g. in Word, as this shows what is likely to work best when tokenizing.

I run into problems in two places. The first is when I try to replace the specially marked note numbers in the text with the associated notes (I tried using following-sibling::note and matching the numbers, without success). The second is what to do about amalgamating paragraphs that cross page breaks. Again, I can see what might work reasonably well, e.g. treating paragraphs that start with a lower-case letter after a page break as continuations, but I'm not sure how to set out the XQuery to get there.

Maybe I'll need to revert to manual editing from this point, but I would be grateful to hear from anyone who has suggestions as to how I might proceed.
Thanks
Peter

xquery version "1.0";

declare function local:transform-block($block) {
    (: To replace the note number in the text with a more identifiable marker :)
    let $text-with-marked-note-number := replace($block, '\s(\d)\s|\s(\d)$', ' XXX$1$2 ')
    (: The next two lines replace the page headers with <pb/> :)
    return
        if (matches($text-with-marked-note-number, "THE OKEOVERS OF OKEOVER.*")) then <pb/>
        else if (matches($text-with-marked-note-number, "\dTHE OKEOVERS OF OKEOVER")) then <pb/>
        (: To identify footnotes and mark up accordingly :)
        else if (matches($text-with-marked-note-number, '^\d\s')) then <note>{$text-with-marked-note-number}</note>
        (: Everything else marked up as element p :)
        else <p>{$text-with-marked-note-number}</p>
};

let $file := doc('chsokeover.txt')/root
(: Text tokenised on the basis of 3 carriage returns/spaces :)
let $content := tokenize($file, '\s{3,}')
return
    <root>
        <head>The Okeovers of Okeover.</head>
        {
            for $block in $content
            (: Sends the tokenized text to be transformed :)
            let $result := local:transform-block($block)
            return $result
        }
    </root>
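
For the two open problems, one possible approach is a second pass over the <root> produced above. The following is only a sketch under stated assumptions, not a tested solution for the real text: it assumes the footnotes for a page sit between the paragraph and the next <pb/>, that each note starts with its number followed by a space, and that a paragraph beginning with a lower-case letter directly after a <pb/> continues the paragraph before it. The function names (local:inline-notes, local:page-notes, local:inline-page-notes, local:is-continuation, local:tail, local:merge-pages), the n attribute, and the sample data are all hypothetical.

```xquery
xquery version "1.0";

(: Splits $text at each XXXn marker and swaps in the note numbered n, if one is found. :)
declare function local:inline-notes($text as xs:string, $notes as element(note)*) as node()* {
    if (not(contains($text, 'XXX'))) then text { $text }
    else
        let $before := substring-before($text, 'XXX')
        let $rest := substring-after($text, 'XXX')
        let $n := if (matches($rest, '^\d')) then replace($rest, '^(\d+).*$', '$1', 's') else ''
        let $after := substring($rest, string-length($n) + 1)
        let $note := $notes[matches(., concat('^\s*', $n, '\s'))][1]
        return (
            text { $before },
            if ($n ne '' and exists($note))
            then <note n="{$n}">{replace($note, '^\s*\d+\s+', '')}</note>
            (: No matching note: leave the marker in place for manual checking :)
            else text { concat('XXX', $n) },
            local:inline-notes($after, $notes)
        )
};

(: Notes after $p and before the next <pb/>, i.e. assumed to be on the same page. :)
declare function local:page-notes($p as element(p)) as element(note)* {
    let $next-pb := $p/following-sibling::pb[1]
    return
        if (exists($next-pb)) then $p/following-sibling::note[. << $next-pb]
        else $p/following-sibling::note
};

(: Pass 1: inline the notes. Standalone notes are dropped, including unreferenced ones. :)
declare function local:inline-page-notes($root as element(root)) as element(root) {
    <root>{
        for $node in $root/*
        return
            typeswitch ($node)
                case $p as element(p) return
                    <p>{ local:inline-notes(string($p), local:page-notes($p)) }</p>
                case element(note) return ()
                default return $node
    }</root>
};

(: True for a <p> that follows p, pb and starts with a lower-case letter. :)
declare function local:is-continuation($p as element()?) as xs:boolean {
    exists($p/self::p)
    and exists($p/preceding-sibling::*[1]/self::pb)
    and exists($p/preceding-sibling::*[2]/self::p)
    and matches(string($p), '^\s*\p{Ll}')
};

(: The <pb/> and continuation content to append to $p, followed recursively. :)
declare function local:tail($p as element()?) as node()* {
    let $pb := $p/following-sibling::*[1]/self::pb
    let $next := $pb/following-sibling::*[1]
    return
        if (local:is-continuation($next))
        then ($pb, text { ' ' }, $next/node(), local:tail($next))
        else ()
};

(: Pass 2: fold each continuation paragraph, and its <pb/>, into the paragraph before it. :)
declare function local:merge-pages($root as element(root)) as element(root) {
    <root>{
        for $node in $root/*
        return
            typeswitch ($node)
                case $p as element(p) return
                    if (local:is-continuation($p)) then ()
                    else <p>{ $p/node(), local:tail($p) }</p>
                case $pb as element(pb) return
                    if (local:is-continuation($pb/following-sibling::*[1])) then () else $pb
                default return $node
    }</root>
};

(: Invented sample standing in for the output of the first query :)
let $sample :=
    <root>
        <head>The Okeovers of Okeover.</head>
        <pb/>
        <p>First paragraph XXX1 that runs on to</p>
        <note>1 A footnote.</note>
        <pb/>
        <p>the next page.</p>
        <p>A new paragraph.</p>
    </root>
return local:merge-pages(local:inline-page-notes($sample))
```

On the sample, the first paragraph should come out as a single <p> holding the inlined <note n="1">, the <pb/> and the continuation text, with "A new paragraph." left as its own <p>. The notes are inlined first so that, by the time the merge runs, a broken paragraph is simply p, pb, p. Because the marker regex in the first query also catches ordinary single digits in running text, markers with no matching note are left untouched rather than removed.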