From: Tylman U. <t....@gm...> - 2005-12-08 16:57:14
(apologies for multiple copies - it seems the first got stuck)

Dear all!

XMLStarlet is an excellent tool - I ran across it recently and started to use it quite heavily soon afterwards. It provides the kind of glue that I was missing between line-based Unix Piping and XML.

I wonder, though, how you can utilise "xml sel" in the following scenario:

- one big XML file ("data.xml")
- a list of words, one per line, in a plain text file ("words")

task: select something from the XML file for each word, e.g.

    for WORD in `cat words`; do
        xml sel -t -m "//word[@form='$WORD']" -v 'text()' -n data.xml
    done

This is o.k., but very inefficient: you'll have to load data.xml once for each word. If data.xml is very large and the list of words is long, this can become tedious rather quickly. This line of argument also holds for inserting data with "xml ed" etc.

Alternatively, you could use "xargs". I often use it to speed up similar scenarios: feed one program with many ops to carry out at once, instead of restarting it over and over again. However, you run into a problem with "xml sel", because the input file has to be either on STDIN or the *last* argument on the command line.

With an option like "--input-file=<xml-input-file>" you could write (the rather cryptic perl one-liner constructs one select expression per line in "words"):

    cat words \
      | perl -pe 'chomp; s!^!-t -m "//word[@form="!; s!$!"]" -v "text()" -n\x00!' \
      | xargs -0 xml sel --input-file=data.xml

This would construct a command line consisting of as many operations as fit in one call, and only repeat the command when there are too many operations. It seems that XMLStarlet scales well with that many options internally according to some tests I did.

For the moment, I resort to a wrapper that just filters out "--input-file=<file>" and leaves the rest unchanged, which is not very elegant. What is more, it is problematic exactly for very long lines, where it is not clear if the filename will still fit on the line.
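
The restriction that the input file must come last can also be worked around without a new option, by letting xargs call a small "sh -c" wrapper that appends the file name after the batched arguments. The following is only a sketch under assumptions: the function names emit_sel_args and select_words are made up for illustration, the words are assumed to contain no single quotes, and "xargs -0" (available in GNU and BSD xargs, not in POSIX) is assumed to exist. Because "xargs -0" does no quote processing, each option is emitted as its own NUL-terminated argument, and "-n 600" keeps every batch a multiple of six arguments so that no template is split between two calls.

```shell
#!/bin/sh
# Sketch only: batch many "xml sel" templates per call, input file last.
# Function names are hypothetical; words must not contain single quotes.

# Read words on stdin and emit six NUL-terminated arguments per word:
#   -t -m //word[@form='WORD'] -v text() -n
emit_sel_args() {
    while IFS= read -r word || [ -n "$word" ]; do
        [ -n "$word" ] || continue          # skip empty lines
        printf '%s\0' -t -m "//word[@form='$word']" -v 'text()' -n
    done
}

# Usage: select_words FILE [COMMAND] < words
# COMMAND defaults to "xml"; passing another command (e.g. echo) is only
# meant for inspecting the command line that would be run.
select_words() {
    file=$1
    cmd=${2:-xml}
    # -n 600: at most 100 templates (6 arguments each) per invocation.
    # -x: abort instead of silently using a shorter batch if the size
    #     limit is hit, which could otherwise split a template.
    # The inline script moves COMMAND and FILE out of "$@" and then
    # places FILE after all batched arguments.
    emit_sel_args |
        xargs -0 -x -n 600 sh -c \
            'cmd=$1; file=$2; shift 2; exec "$cmd" sel "$@" "$file"' \
            sh "$cmd" "$file"
}
```

With these definitions, "select_words data.xml < words" would run "xml sel" once per batch of up to 100 words rather than once per word. The batch size of 600 is an arbitrary choice that should stay well below typical command-line length limits for short words.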

BTW I got the idea from the option "--target-directory=<dir>" of "mv" to specify the target dir, which I found useful for similar reasons.

I might be able to dig into the sources and provide a patch, but I was interested to hear your opinion first - and whether I have overlooked an obvious solution. I'd be happy to hear any thoughts on this.

Thanks,
Tylman