Re: [Xml-coreutils-discuss] Where to start?

Hi Douglas,

Thanks for your questions. I'll answer each of them separately.
The sample xml document you provided was very helpful.

First, the following one:

> Each /response/listing element has a //listing_id and some, but not all
> listings have a floor_plan element.
> 
> Goal:
> I would like to extract only listings with floor plans, and only the
> selected elements I am interested in, into a new document as:
> <listings>
>  <listing>
>    <listing_id>1</listing_id>
>    <floor_plan>http://example.com/123456</floor_plan>
>  </listing>
> </listings>

You can extract the listing nodes as temporary XML files by using 
xml-find, e.g.

xml-find page1.xml :/response/* -exec echo {-} ';'

The {-} is replaced by the name of a temporary file which exists only 
as long as the echo command runs.
If you had a script cmd.sh instead of echo, the script could open the 
file named by {-} and process it.
Alternatively, you can run bash executing a string of commands, like so:

xml-find page1.xml :/response/* -exec bash -c 'if grep -q floor_plan {-}; then
  xml-printf "id %s\nagent %s\n" {-} ://listing_id ://agent_name
fi' ';'

(Note the single quotes around the final semicolon: they keep the shell 
from treating it as a command separator, so it reaches xml-find intact.) 
The above prints plain text if the listing_id and agent_name exist; 
otherwise you'll get some error messages on stderr and no output. You 
could replace the plain grep with

  if xml-grep '.*' {-} ://floor_plan >/dev/null ; then .....
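
If the inline bash string gets unwieldy, the cmd.sh idea mentioned 
earlier might look roughly like the sketch below. The function name 
keep_if_floor_plan and the choice to simply print the matching fragment 
are assumptions made for illustration; the xml-printf call shown above 
could go in place of cat.

```shell
#!/bin/sh
# cmd.sh: hypothetical helper, meant to be called once per listing,
# with $1 being the temporary file that holds one listing fragment.

# Print the fragment only if it mentions floor_plan (a plain-text match,
# the same test as the grep in the inline version).
keep_if_floor_plan() {
    if grep -q floor_plan "$1"; then
        cat "$1"
    fi
}

# Run only when a file name was passed on the command line.
if [ "$#" -gt 0 ]; then
    keep_if_floor_plan "$1"
fi
```

It would then be invoked along the lines of

xml-find page1.xml :/response/* -exec sh cmd.sh {-} ';'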

Also, if you prefer to have xml output, perhaps try

xml-find page1.xml ://response/* -exec xml-grep '.*' {-} ://floor_plan \
  ://listing_id ://agent_name ';' | xml-cat

Here xml-grep outputs an xml fragment for each temporary file, and 
xml-cat reassembles the fragments into a single xml file. These ideas 
are probably the closest to the workflow you suggest below.

> The workflow I would expect, based on the coreutils workflows I normally
> use, would be:
> cat source.xml | while read element; do
>    if echo $element | grep -q floor_plan ; then
>      echo $element >> target.xml
>    fi
> done
> 
> It would be nice if I could use a complement of the above technologies for
> processing the XML.  The below are imaginary pseudo-commands:
> 
> xml-cat source.xml :/response/listing | while xml-read listing; do
>  # $listing is now an xml fragment of one listing element's full content
>    if echo $listing | xml-grep -q ://floor_plan; then
>      cat $element | xml-egrep \
>        "://listing|://listing/listing_id|://listing/floor_plan" |
>        xml-insert target.xml :/listings/
>    fi
> done
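
For the exact target document in your goal, another route is to produce 
one plain-text line per listing and wrap the lines with ordinary shell. 
The following is only a sketch: wrap_listings is a hypothetical helper, 
not part of xml-coreutils, and it assumes each input line has the form 
"listing_id<TAB>floor_plan_url" with no XML special characters (&, <, >) 
in the values.

```shell
# wrap_listings: hypothetical helper, not an xml-coreutils command.
# Reads "listing_id<TAB>floor_plan_url" lines on stdin and writes a
# <listings> document on stdout. No XML escaping is done (assumption:
# the values contain no &, < or >).
wrap_listings() {
    echo '<listings>'
    while IFS="$(printf '\t')" read -r id url; do
        [ -n "$id" ] || continue   # skip blank lines
        printf ' <listing>\n   <listing_id>%s</listing_id>\n   <floor_plan>%s</floor_plan>\n </listing>\n' "$id" "$url"
    done
    echo '</listings>'
}
```

The input lines could presumably come from an xml-printf call like the 
one above with a tab-separated format string; I have not tested that 
combination, so treat it as an assumption.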

Cheers,
Laird Breyer