Thread: [Xml-coreutils-discuss] Where to start?
Status: Alpha
Brought to you by:
lbreyer
From: Douglas H. <do...@do...> - 2015-02-11 19:00:15
Attachments:
page1.xml
I need some help getting my thinking into the xml-coreutils world.

I have some prepared data in the format:

    <response>
      <listing>
        <foo>foo</foo>
        <bar>bar</bar>
        <floor_plan>http://example.com/123456</floor_plan>
        <listing_id>1</listing_id>
      </listing>
      <listing>
        <listing_id>2</listing_id>
        <foo>foo</foo>
        <bar>bar</bar>
      </listing>
    </response>

Each /response/listing element has a //listing_id, and some, but not all, listings have a floor_plan element.

Goal:
I would like to extract only listings with floor plans, and only the selected elements I am interested in, into a new document as:

    <listings>
      <listing>
        <listing_id>1</listing_id>
        <floor_plan>http://example.com/123456</floor_plan>
      </listing>
    </listings>

I have tried commands such as:

    # Create the target file
    xml-echo -e "[listings@updated=20150210]" >listings.xml
    # Copy selected elements into target
    xml-cp page1.xml :/response/listing/listing_id[/response/listing/floor_plan != null] listings.xml :/listings/

I have read all the man pages and experimented with many of the xml-* commands. Very seldom do they work as I am hoping.

The workflow I would expect, based on the coreutils workflows I normally use, would be:

    cat source.xml | while read element; do
      if echo $element | grep -q floor_plan ; then
        echo $element >> target.xml
      fi
    done

It would be nice if I could use a complement of the above technologies for processing the XML. The below are imaginary pseudo-commands:

    xml-cat source.xml :/response/listing | while xml-read listing; do
      # $listing is now an xml fragment of one listing element's full content
      if echo $listing | xml-grep -q ://floor_plan; then
        cat $element | xml-egrep "://listing|://listing/listing_id|://listing/floor_plan" | xml-insert target.xml :/listings/
      fi
    done

Perhaps by following my pseudo-logic, you can explain how I can carry out these operations with xml-coreutils.

I have attached one of my source files.

Regards,
Doug
--
Douglas Held
do...@do...
+447775733093
From: Douglas H. <do...@do...> - 2015-02-11 20:19:22
Maybe I am just criminally insane, but I could solve my problem with the following Java program. I would prefer, however, to learn to quickly use the xml-coreutils command line utilities...

    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class Search {

        public static void main(String[] args) throws Exception {
            // Build the empty output document with a <listings> root
            Document output = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            output.appendChild( output.createElement( "listings" ) );
            System.out.println( output.getDocumentElement().toString() );

            Document input = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                    new File( "/Users/douglasheld/zoopla/listings.xml" ) );
            NodeList listings = input.getElementsByTagName( "listing" );
            for ( int i=0; i<listings.getLength(); i++ ){
                Node listing = listings.item( i );
                boolean insert = false;
                Element newListing = output.createElement( "listing" );
                for ( int j=0; j < listing.getChildNodes().getLength(); j++){
                    // NOTE: == compares String references here; see the follow-up message about .equals()
                    if ( listing.getChildNodes().item(j).getNodeName() == "listing_id" ){
                        newListing.setAttribute("id",
                                listing.getChildNodes().item(j).getFirstChild().getNodeValue() );
                    }
                    if ( listing.getChildNodes().item(j).getNodeName() == "floor_plan" ){
                        insert = true;
                        Element newFloorPlan = output.createElement( "floor_plan" );
                        newFloorPlan.setTextContent(
                                listing.getChildNodes().item(j).getFirstChild().getNodeValue() );
                        newListing.appendChild( newFloorPlan );
                    }
                }
                // Only listings that had a floor_plan are written to the output
                if ( insert ){
                    output.getDocumentElement().appendChild( newListing );
                    System.err.print( '.' );
                }
            }
            printDocument( output, System.out );
        }

        /* copy/paste from
           http://stackoverflow.com/questions/2325388/java-shortest-way-to-pretty-print-to-stdout-a-org-w3c-dom-document */
        public static void printDocument(Document doc, OutputStream out)
                throws IOException, TransformerException {
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

            transformer.transform(new DOMSource(doc),
                    new StreamResult(new OutputStreamWriter(out, "UTF-8")));
        }
    }

--
Douglas Held
do...@do...
+447775733093
From: <la...@lb...> - 2015-02-12 14:21:16
On 2015-02-12 07:18, Douglas Held wrote:
> Maybe I am just criminally insane, but I could solve my problem with the
> following Java program. I would prefer however to learn to quickly use the
> xml-coreutils command line utilities...

If you're not afraid of complexity, I would probably try an XSLT processor instead...

Laird Breyer
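The XSLT route suggested here could look roughly like the sketch below. It uses the JDK's built-in javax.xml.transform API, not xml-coreutils, and it assumes the document shape from the original question (a response root with listing children). The stylesheet itself, the class name FloorPlanXslt and the helper transform are illustrative assumptions, not something posted in the thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FloorPlanXslt {

    // Hypothetical stylesheet: keep only listings that have a floor_plan child,
    // and copy just listing_id and floor_plan from each of them.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/response'>"
        + "<listings>"
        + "<xsl:for-each select='listing[floor_plan]'>"
        + "<listing>"
        + "<xsl:copy-of select='listing_id'/>"
        + "<xsl:copy-of select='floor_plan'/>"
        + "</listing>"
        + "</xsl:for-each>"
        + "</listings>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result as a string.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample data in the shape given in the original question
        String sample =
              "<response>"
            + "<listing><foo>foo</foo><bar>bar</bar>"
            + "<floor_plan>http://example.com/123456</floor_plan>"
            + "<listing_id>1</listing_id></listing>"
            + "<listing><listing_id>2</listing_id><foo>foo</foo><bar>bar</bar></listing>"
            + "</response>";
        System.out.println(transform(sample));
    }
}
```

The same stylesheet could equally be saved to a file and run with any standalone XSLT 1.0 processor; embedding it in a Java string only keeps the sketch self-contained.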
From: Douglas H. <do...@do...> - 2015-02-11 20:20:32
For posterity: both of my == operators are erroneous and should be replaced with .equals().

--
Douglas Held
do...@do...
+447775733093
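The correction matters because == on Java strings compares object references, while .equals() compares contents. A minimal sketch of the safer node-name check follows; the class name NodeNameCheck and the helpers isNamed and emptyDocument are made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class NodeNameCheck {

    // Compares by content. Calling equals on the known-non-null name
    // also avoids a NullPointerException if getNodeName() returned null.
    static boolean isNamed(Node node, String name) {
        return name.equals(node.getNodeName());
    }

    // Creates an empty DOM document to build test nodes in.
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    public static void main(String[] args) {
        Node id = emptyDocument().createElement("listing_id");

        // A string built at run time is a distinct object from any other string
        String wanted = new String("listing_id");

        System.out.println(wanted == id.getNodeName()); // false: different objects
        System.out.println(isNamed(id, wanted));        // true: same characters
    }
}
```

Whether the original == comparisons happen to work depends on whether the parser hands back interned strings, which is an implementation detail and not something to rely on.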
From: <la...@lb...> - 2015-02-12 14:03:21
On 2015-02-12 07:20, Douglas Held wrote:
>>> Goal:
>>> I would like to extract only listings with floor plans, and only the
>>> selected elements I am interested in, into a new document as:
>>> <listings>
>>> <listing>
>>> <listing_id>1</listing_id>
>>> <floor_plan>http://example.com/123456</floor_plan>
>>> </listing>
>>> </listings>
>>>
>>> I have tried commands such as:
>>> # Create the target file
>>> xml-echo -e "[listings@updated=20150210]" >listings.xml
>>> # Copy selected elements into target
>>> xml-cp page1.xml
>>> :/response/listing/listing_id[/response/listing/floor_plan != null]
>>> listings.xml :/listings/

I would not expect this to succeed, as the condition in [] requires more advanced processing than xml-cp is currently coded to do.

Perhaps an extension of xml-grep could handle this. I will have to think about it.

Cheers,
Laird Breyer
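For comparison, the bracketed condition can be written as a predicate in standard XPath 1.0, where listing[floor_plan] selects only listings that have a floor_plan child. The sketch below evaluates it with the JDK's javax.xml.xpath API. This is plain XPath, not xml-coreutils path syntax, and the class name FloorPlanXPath and the method idsWithFloorPlan are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FloorPlanXPath {

    // Returns the listing_id text of every listing that has a floor_plan child.
    static List<String> idsWithFloorPlan(String xml) {
        try {
            NodeList ids = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/response/listing[floor_plan]/listing_id",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < ids.getLength(); i++) {
                result.add(ids.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample data in the shape given in the original question
        String sample =
              "<response>"
            + "<listing><foo>foo</foo><bar>bar</bar>"
            + "<floor_plan>http://example.com/123456</floor_plan>"
            + "<listing_id>1</listing_id></listing>"
            + "<listing><listing_id>2</listing_id><foo>foo</foo><bar>bar</bar></listing>"
            + "</response>";
        System.out.println(idsWithFloorPlan(sample)); // prints [1]
    }
}
```

Note that the predicate is relative to each listing. The absolute path used in the attempted xml-cp command, /response/listing/floor_plan, would in XPath be true for every listing as soon as any one listing had a floor plan.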
From: <la...@lb...> - 2015-02-12 13:48:42
Hi Douglas,

Thanks for your questions. I'll answer them each separately. The sample xml document you provided was very helpful. First, the following one:

> Each /response/listing element has a //listing_id and some, but not all
> listings have a floor_plan element.
>
> Goal:
> I would like to extract only listings with floor plans, and only the
> selected elements I am interested in, into a new document as:
> <listings>
> <listing>
> <listing_id>1</listing_id>
> <floor_plan>http://example.com/123456</floor_plan>
> </listing>
> </listings>

You can extract the listing nodes as temporary xml files by using xml-find, e.g.

    xml-find page1.xml :/response/* -exec echo {-} ';'

The {-} is the name of a temporary file which exists only as long as the echo command runs. If you had a script cmd.sh instead of echo, the script could open the file name {-} and process it.

Alternatively, you can run bash executing a string of commands, like so:

    xml-find page1.xml :/response/* -exec bash -c 'if grep -q floor_plan {-} ; then xml-printf "id %s\nagent %s\n" {-} ://listing_id ://agent_name; fi' ';'

(note the single quotes around the semicolon).

The above will print plain text if the listing_id and agent_name exist, otherwise you'll get some error messages on stderr and no output.

You could replace the plain grep with

    if xml-grep '.*' {-} ://floor_plan >/dev/null ; then .....

Also, if you prefer to have xml output, perhaps try

    xml-find page1.xml ://response/* -exec xml-grep '.*' {-} ://floor_plan ://listing_id ://agent_name ';' | xml-cat

Here xml-grep outputs an xml fragment for each temporary file, and xml-cat reassembles the fragments into a single xml file.

These ideas are probably the closest to the workflow you suggest below.

> The workflow I would expect, based on the coreutils workflows I normally
> use, would be:
> cat source.xml | while read element; do
> if echo $element | grep -q floor_plan ; then
> echo $element >> target.xml
> fi
> done
>
> It would be nice if I could use a complement of the above technologies for
> processing the XML. The below are imaginary pseudo-commands:
>
> xml-cat source.xml :/response/listing | while xml-read listing; do
> # $listing is now an xml fragment of one listing element's full content
> if echo $listing | xml-grep -q ://floor_plan; then
> cat $element | xml-egrep
> "://listing|://listing/listing_id|://listing/floor_plan" | xml-insert
> target.xml :/listings/
> fi
> done

Cheers,
Laird Breyer