[Xmlstar-devel] option: "--input-file=<xml-input-file>"

(apologies for multiple copies - it seems the first got stuck)

Dear all!

XMLStarlet is an excellent tool - I ran across it recently and started 
to use it quite heavily soon afterwards.  It provides the kind of 
glue that I was missing between line-based Unix pipes and XML.

I wonder, though, how you can utilise "xml sel" in the following 
scenario:
- one big XML file ("data.xml")
- a list of words, one per line, in a plain text file ("words")

task: select something from the XML file for each word, e.g.

while IFS= read -r WORD; do
  xml sel -t -m "//word[@form='$WORD']" -v 'text()' -n data.xml
done < words

This works, but it is very inefficient: data.xml has to be loaded and 
parsed once for each word.  If data.xml is very large and the list of 
words is long, this becomes tedious rather quickly.  The same argument 
holds for inserting data with "xml ed" etc.
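
One possible way around the repeated loading, sketched here under 
assumptions rather than offered as a tested solution: fold all words 
into a single XPath predicate and call "xml sel" only once.  The 
helper name build_predicate is made up for illustration, and the 
sketch assumes that no word contains a single quote.

```shell
# Hypothetical helper: turn a word list (one word per line) into an
# XPath predicate body such as  @form='foo' or @form='bar'.
# Assumption: no word contains a single quote.
build_predicate() {
    # \047 is the octal escape for a single quote; blank lines are skipped.
    awk 'NF { printf "%s@form=\047%s\047", sep, $0; sep = " or " }
         END { print "" }' "$1"
}
```

It could then be used as
xml sel -t -m "//word[$(build_predicate words)]" -v 'text()' -n data.xml
The matches should come out in document order rather than in the order 
of the word list, and a very long list can still exceed the maximum 
command-line length, so this only helps for moderately sized lists.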

Alternatively, you could use "xargs".  I often use it to speed up 
similar scenarios: feed one program with many ops to carry out at 
once, instead of restarting it over and over again.  However, you run 
into a problem with "xml sel", because the input file has to be 
either on STDIN or the *last* argument on the command line.  
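
The ordering problem can be seen in a small sketch in which "echo" 
stands in for the real program, so that it runs without XMLStarlet: 
xargs appends the items it reads *after* the fixed arguments, so a 
fixed file name ends up in front of the operations, not behind them.

```shell
# "echo" prints the command line that xargs would have executed.
# The fixed argument data.xml lands before the appended operations.
printf '%s\0' -t -v 'text()' -n | xargs -0 echo xml sel data.xml
# prints: xml sel data.xml -t -v text() -n
```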

With an option like "--input-file=<xml-input-file>" you could write 
the following (the rather cryptic perl one-liner constructs one select 
expression per line in "words" and emits each argument as a separate 
NUL-terminated item, as "xargs -0" expects):

cat words \
| perl -ne 'chomp; printf "-t\0-m\0//word[\@form=&quot;%s&quot;]\0-v\0text()\0-n\0", $_' \
| xargs -0 xml sel --input-file=data.xml

This would construct a command line consisting of as many operations 
as fit in one call, and only repeat the command when there are too 
many operations.  According to some tests I did, XMLStarlet seems to 
cope well with that many options in a single call.

For the moment, I resort to a wrapper that just filters out 
"--input-file=<file>" and leaves the rest unchanged, which is not 
very elegant.  What is more, it is problematic precisely for very long 
command lines, where it is not clear whether the file name will still 
fit.  BTW I got the idea from the "--target-directory=<dir>" option 
of "mv", which I found useful for similar reasons.
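
A minimal sketch of what such a wrapper might look like follows.  It 
is not necessarily the wrapper described above: the function name and 
the XML variable are made up for illustration, and it assumes that the 
file name is meant to be moved to the end of the argument list.

```shell
# Hypothetical wrapper: strip "--input-file=<file>" from the arguments
# and pass <file> as the last argument instead.  XML names the real
# binary (default "xml"); the variable is an assumption of this sketch.
xml_with_input_file() {
    input=
    n=$#
    # Rotate through the original arguments exactly once, keeping order.
    while [ "$n" -gt 0 ]; do
        arg=$1
        shift
        case $arg in
            --input-file=*) input=${arg#--input-file=} ;;
            *) set -- "$@" "$arg" ;;
        esac
        n=$((n - 1))
    done
    # Append the input file, if one was given, as the last argument.
    if [ -n "$input" ]; then
        set -- "$@" "$input"
    fi
    "${XML:-xml}" "$@"
}
```

Since xargs cannot call a shell function directly, the body would have 
to live in a small script of its own to be used as in the pipeline 
above.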

I might be able to dig into the sources and provide a patch, but I was 
interested to hear your opinion first - and whether I have overlooked 
an obvious solution.  I'd be happy to hear any thoughts on this.

Thanks,
Tylman