From: Ming C. <cim...@ya...> - 2011-09-26 10:22:15
|
Hi,

I recently ran the XmlStarlet tool on my Win7 OS with different Unicode XML files and found some strange things:

1. XmlStarlet supports UTF-16BE (no BOM, encoding is UTF-16BE). With a command line like "xml ed -d //d UTF-16BE.xml > d-UTF-16BE.xml", the output file has UTF-16BE format, but the line endings become 00 0D 0A whatever the original line endings are (00 0D 00 0A, 00 0A or 00 0D).
2. XmlStarlet supports UTF-16LE (no BOM, encoding is UTF-16LE). With a command line like "xml ed -d //d UTF-16LE.xml > d-UTF-16LE.xml", the output file has UTF-16LE format, but the line endings become 0D 0A 00 whatever the original line endings are (0D 00 0A 00, 0A 00 or 0D 00).
3. XmlStarlet supports UTF-16LE-BOM (BOM FF FE, encoding is UTF-16). With a command line like "xml ed -d //d UTF-16LE-BOM.xml > d-UTF-16LE-BOM.xml", the output file has UTF-16LE-BOM format, but the line endings become 0D 0A 00 whatever the original line endings are (0D 00 0A 00, 0A 00 or 0D 00).
4. XmlStarlet supports UTF-16BE-BOM (BOM FE FF, encoding is UTF-16). With a command line like "xml ed -d //d UTF-16BE-BOM.xml > d-UTF-16BE-BOM.xml", the output file has UTF-16LE-BOM format, and the line endings become 0D 0A 00 whatever the original line endings are (00 0D 00 0A, 00 0A or 00 0D).

Please note case #4, in which the output file has a reversed byte order.

In all output files, the line endings have a strange format: the carriage return is a single byte (0D) while the newline is a two-byte UTF-16 unit. I think they should be 00 0D 00 0A or 0D 00 0A 00.

I am using the latest Windows version (1.2.1). My OS is Win7 64bit.

Could someone help to check this?

Thanks,
Ming |
From: Ming C. <cim...@ya...> - 2011-10-10 05:13:25
|
Hi Noam,

Thanks for the new release, 1.3.0. It unified the line endings to 0A 00 (or 00 0A) for all UTF-16 files. That meets my current requirements, although it would be better to have the 00 0D 00 0A (or 0D 00 0A 00) line ending on Windows (I understand that is controlled by the C runtime).

Best regards,
Ming

________________________________
From: Noam Postavsky <npo...@us...>
To: Ming Chen <cim...@ya...>; xml...@li...
Sent: Sunday, October 2, 2011 12:08 AM
Subject: Re: [Xmlstar-devel] Line ending issue for unicode XML files

Ming Chen <cim...@ya...> writes:

> I basically understood what has happened. So is there a good solution for this?
> I didn't get the picture of what you said about mingw and binary
> mode...

There is a semi-good solution: switch stdout to binary (as opposed to text) mode so that the ascii carriage returns won't be added. Ideally I would want to switch stdout to UTF16-text mode but this is only possible with a newer version of the C runtime. I don't want to use this newer version because it is not guaranteed to be installed on Windows so it would have to be packaged with XMLStarlet, and also it's not supported by mingw (gcc for Windows).

Noam |
From: chen m. <cim...@ya...> - 2012-01-11 07:24:20
|
I used "xml ed -d /descendant::node()/rec/child::node()[1] test.xml > result.xml"

Thanks,
Ming

Connected via MOTOBLUR™

-----Original Message-----
From: Noam Postavsky <npo...@us...>
To: Ming Chen <cim...@ya...>
Cc: "xml...@li..." <xml...@li...>
Sent: Tue, 10 Jan 2012 16:26:59 GMT+0000
Subject: Re: Should the first node be text node for xpath: /descendant::node()/rec/child::node()[1]

On Tue, Jan 10, 2012 at 12:05 AM, Ming Chen <cim...@ya...> wrote:

> While my original expected outputs (which really is XmlStar and
> XPathBuilder's output) are: "<para type="error" position="1"/><para
> type="warning" position="1"/><para type="info" position="1"/>", the
> XmlLint and XPathTester give some empty lines actually.

I get blank lines from XMLStarlet as well, which is not surprising since it's based on the same code as xmllint. What command line arguments did you use for XMLStarlet? I used

xml sel -t -c "/descendant::node()/rec/child::node()[1]" test.xml

Noam |
From: Noam P. <npo...@us...> - 2012-01-12 03:49:26
|
"chen ming" <cim...@ya...> writes: > I used "xml ed -d /descendant::node()/rec/child::node()[1] test.xml Okay, I see what's happening. The behaviour depends on whether libxml's keepBlanks option is set. xml ed has it off by default, but xml sel and xmllint have it on. You can get the other behaviour by passing the appropriate options: xml ed --pf (or -P) ... xml sel --noblanks (or -B) ... xmllint --noblanks ... Noam |
From: Ming C. <cim...@ya...> - 2012-01-12 05:15:05
|
Yeah! That's it. Thank you!

Ming

________________________________
From: Noam Postavsky <npo...@us...>
To: chen ming <cim...@ya...>
Cc: "xml...@li..." <xml...@li...>
Sent: Thursday, January 12, 2012 11:48 AM
Subject: Re: Should the first node be text node for xpath: /descendant::node()/rec/child::node()[1]

"chen ming" <cim...@ya...> writes:

> I used "xml ed -d /descendant::node()/rec/child::node()[1] test.xml

Okay, I see what's happening. The behaviour depends on whether libxml's keepBlanks option is set. xml ed has it off by default, but xml sel and xmllint have it on. You can get the other behaviour by passing the appropriate options:

xml ed --pf (or -P) ...
xml sel --noblanks (or -B) ...
xmllint --noblanks ...

Noam |
From: Noam P. <npo...@us...> - 2011-09-29 18:49:21
|
On Mon, Sep 26, 2011 at 6:22 AM, Ming Chen <cim...@ya...> wrote:

> In all output files, the line endings have strange format. I think they should be 00 0D 00 0A or 0D 00 0A 00.
>
> I am using the latest windows version (1.2.1). My OS is Win7 64bit.
>
> Could someone help to check this?

Yes, I think I see what is happening: the libxml output routine writes just a newline character and expects the c-runtime to do the conversion to the system-specific line ending. The problem is that nobody told the c-runtime that the output is in UTF16, so it just waits until it sees an ascii newline byte (0A) and then inserts a single carriage return byte (0D) in front of it.

Unfortunately, it seems there is no way to tell Windows to switch stdout to UTF16 before the VS2005 c-runtime. XMLStarlet uses mingw, so the best I can do is change to binary mode, meaning the line endings will be 00 0A.
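The byte patterns reported at the start of this thread can be reproduced with a small simulation. This is a hedged Python sketch, not the actual C runtime code: crt_text_mode is a hypothetical stand-in for byte-oriented text-mode output, which puts a 0D byte in front of every 0A byte without knowing the data is UTF-16.

```python
def crt_text_mode(data: bytes) -> bytes:
    """Hypothetical model of text-mode stdout: each 0A byte becomes 0D 0A."""
    return data.replace(b"\n", b"\r\n")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# A single newline character as the serializer would emit it.
newline_be = "\n".encode("utf-16-be")  # bytes 00 0A
newline_le = "\n".encode("utf-16-le")  # bytes 0A 00

text_be = hexdump(crt_text_mode(newline_be))
text_le = hexdump(crt_text_mode(newline_le))
binary_be = hexdump(newline_be)  # binary mode: bytes pass through unchanged
binary_le = hexdump(newline_le)

print(text_be)    # 00 0D 0A
print(text_le)    # 0D 0A 00
print(binary_be)  # 00 0A
print(binary_le)  # 0A 00
```

The text-mode results match the broken line endings in the original report, and the binary-mode results match the line endings that switching stdout to binary mode is expected to give.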
From: Noam P. <npo...@us...> - 2011-10-01 16:09:03
|
Ming Chen <cim...@ya...> writes:

> I basically understood what has happened. So is there a good solution for this?
> I didn't get the picture of what you said about mingw and binary
> mode...

There is a semi-good solution: switch stdout to binary (as opposed to text) mode so that the ascii carriage returns won't be added. Ideally I would want to switch stdout to UTF16-text mode but this is only possible with a newer version of the C runtime. I don't want to use this newer version because it is not guaranteed to be installed on Windows so it would have to be packaged with XMLStarlet, and also it's not supported by mingw (gcc for Windows).

Noam |
From: Ming C. <cim...@ya...> - 2012-01-10 05:05:50
|
Hi Noam,

I recently got different outputs from these tools: XmlStar, XPathBuilder, XmlLint (win32 version, with libxml2 underneath) and XPathTester (online tool), for the XML file below:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
  <table>
    <rec id="1">
      <para type="error" position="1"/>
      <para type="error" position="2"/>
      <para type="error" position="3"/>
    </rec>
    <rec id="2">
      <para type="warning" position="1"/>
      <para type="warning" position="2"/>
      <para type="warning" position="3"/>
    </rec>
    <rec id="3">
      <para type="info" position="1"/>
      <para type="info" position="2"/>
      <para type="info" position="3"/>
    </rec>
  </table>
</xml>

While my original expected outputs (which really is XmlStar and XPathBuilder's output) are: "<para type="error" position="1"/><para type="warning" position="1"/><para type="info" position="1"/>", the XmlLint and XPathTester give some empty lines actually.

I asked this question on the libxml2 mailing list and got the answer below:

From: Liam R E Quin <li...@ho...>
To: Ming Chen <cim...@ya...>
Cc: "xm...@gn..." <xm...@gn...>
Sent: Tuesday, January 10, 2012 11:22 AM
Subject: Re: [xml] Does not support the expression : /descendant::node()/rec/child::node()[1]?

On Mon, 2012-01-09 at 18:54 -0800, Ming Chen wrote:

> According to the XPath spec (V2.0 section 3.2.3 Unabbreviated
> Syntax) : child::node() selects all the children of the context node. Note
> that no attribute nodes are returned, because attributes are not
> children.

Note, libxml2 actually only supports XPath 1, not XPath 2. However, /descendant::node()/rec/child::node()[1] will match text nodes, and you're getting the blank (whitespace-only) text node that's the first child of elements, since your input is "indented".

> Shouldn’t it have the same output as
> /descendant::node()/rec/child::*[1] and /descendant::node()/rec/para[1]?

No. The first child node in

<rec id="1">
  <para type="error" position="1"/>

is the newline and spaces between id="1"> and <para.

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/

I think that explanation is valid. What's your opinion on this?

Thanks,
Ming |
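Liam's point about the whitespace-only first child can be checked with any DOM parser that keeps blank text nodes. Below is a minimal Python sketch using the standard-library xml.dom.minidom rather than libxml2; the one-record document and its two-space indentation are assumptions made for the illustration.

```python
from xml.dom import minidom

# A one-record excerpt of the test file, with assumed two-space indentation.
doc = minidom.parseString(
    '<table>\n'
    '  <rec id="1">\n'
    '    <para type="error" position="1"/>\n'
    '  </rec>\n'
    '</table>'
)

rec = doc.getElementsByTagName("rec")[0]

# Equivalent of child::node()[1]: the very first child node of any kind.
first_node = rec.childNodes[0]
first_is_blank_text = (
    first_node.nodeType == first_node.TEXT_NODE and first_node.data.strip() == ""
)

# Equivalent of child::*[1]: the first child that is an element.
first_element = next(
    n for n in rec.childNodes if n.nodeType == n.ELEMENT_NODE
)

print(first_is_blank_text)    # True
print(first_element.tagName)  # para
```

So child::node()[1] and child::*[1] differ on indented input: the former selects the indentation text node, the latter the first <para> element.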
From: Noam P. <npo...@us...> - 2012-01-10 16:27:09
|
On Tue, Jan 10, 2012 at 12:05 AM, Ming Chen <cim...@ya...> wrote:

> While my original expected outputs (which really is XmlStar and
> XPathBuilder's output) are: "<para type="error" position="1"/><para
> type="warning" position="1"/><para type="info" position="1"/>", the
> XmlLint and XPathTester give some empty lines actually.

I get blank lines from XMLStarlet as well, which is not surprising since it's based on the same code as xmllint. What command line arguments did you use for XMLStarlet? I used

xml sel -t -c "/descendant::node()/rec/child::node()[1]" test.xml

Noam |