Hi,
I'm a fan of VTD-XML and have been using it for many years.
I seem to have hit a bug when extracting the node text of a Base64-encoded CDATA value.
This is the code I use to grab node text:
String nodeText = "";
while (vn.getTokenType(t) == 5 || vn.getTokenType(t) == 11) {
    nodeText += vn.toString(t++);
}
I have two test cases, one that works as expected and one that returns a node text value that is almost twice the size of the input XML, using the loop above. The only difference between the test cases is the Base64-encoded value.
In the successful test case, the loop reads a single token of type 5 (character data). In the failing test case, the loop reads multiple tokens of type 11 (CDATA).
The included Java test file has instructions at the top to compile and execute the test case.
Kindly let me know if this is a defect in VTD or if my loop above is not the correct way to read node text.
Thank you!
Robert
Thanks! I will look into it and get back to you.
OK, this is actually not a bug. VTD will break CDATA or character data longer than 1MB into multiple tokens of at most 1MB each, and when you call toString(), VTD will automatically assemble those tokens back into one string of the desired length. The problem is that you are calling toString() on the first token (which already assembles the entire string for you), then moving on to the next token and appending the string representation of the second token (which could include the third, fourth, and later tokens). So use a while loop in this ... it is not a bug, but expected behavior.
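A toy model may make the double counting easier to see. The sketch below is not VTD-XML code: the class ChunkedTokenModel and all of its methods are hypothetical, and the only assumption built into it is the behavior described above, namely that converting chunk i to a string yields chunk i together with every chunk after it. Under that assumption, a loop that appends the string for every chunk over-counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser that splits long text into chunks.
class ChunkedTokenModel {

    // Split text into consecutive chunks of at most maxChunk characters.
    static List<String> split(String text, int maxChunk) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChunk) {
            chunks.add(text.substring(start, Math.min(text.length(), start + maxChunk)));
        }
        return chunks;
    }

    // Assumed behavior: the string for chunk i is chunk i plus all later chunks.
    static String toStringFrom(List<String> chunks, int i) {
        return String.join("", chunks.subList(i, chunks.size()));
    }

    // The problematic pattern: append the string for every chunk in turn.
    static String naiveLoop(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            sb.append(toStringFrom(chunks, i));
        }
        return sb.toString();
    }

    // Same pattern, computed on lengths only so large inputs stay cheap.
    static long naiveLoopLength(long totalLength, long maxChunk) {
        long sum = 0;
        for (long start = 0; start < totalLength; start += maxChunk) {
            sum += totalLength - start; // chunk at 'start' yields the rest of the text
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> chunks = split("abcdefgh", 3);        // "abc", "def", "gh"
        System.out.println(toStringFrom(chunks, 0));       // prints abcdefgh
        System.out.println(naiveLoop(chunks));             // prints abcdefghdefghgh
        System.out.println(naiveLoopLength(2833752L, 1048575L)); // prints 5355531
    }
}
```

With a maximum chunk length of 1048575 and a text length of 2833752 (the getXPathStringVal() and TextIter lengths reported later in this thread), the model gives 5355531, which is the length the navigator loop reported for the failing case.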
let me know what you think...
Hi Jimmy,
Thanks, that makes sense, but what about when an XML node has mixed content like this:
<root>Some mixed<child>child text</child> content text</root>
VTDNav.getText() doesn't handle mixed content (it just returns "Some mixed" for the XML above instead of "Some mixed content text"), which is why I attempted to handle it with my while() loop.
So am I hitting a situation where a routine that might work for mixed content then breaks when it hits large CDATA > 1MB?
Can you please tell me the right way to grab all node text for every situation?
Thanks!
Robert
You won't need to worry about this in XPath since 2.13: there is a new method that handles mixed content in XPath according to the spec.
You could use a new method called getXPathStringVal(). Could you take a look and let me know what you think?
Last edit: jimmy zhang 2017-08-27
Hi Jimmy,
I tried getXPathStringVal() as follows:
Here is the test XML:
<root>Some mixed<child>child text</child>content text</root>
Here is the output:
getXPathStringVal() val = Some mixedchild textcontent text
I don't understand why it also grabbed the child node's text. I was expecting it would emit "Some mixedcontent text"
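For what it is worth, that result is what the XPath 1.0 data model defines as the string-value of an element: the concatenation of all descendant text nodes in document order, child elements included. The sketch below uses the JDK's built-in javax.xml.xpath engine rather than VTD-XML, so it says nothing about VTD-XML internals, and the class name StringValueDemo is made up. It only shows the difference between string(/root) and the element's direct text() children.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath engine, not VTD-XML.
class StringValueDemo {
    static final String XML =
        "<root>Some mixed<child>child text</child>content text</root>";

    // XPath string-value of <root>: all descendant text nodes, concatenated.
    static String stringValue() {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate("string(/root)", new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Only the text nodes that are direct children of <root>.
    static String directText() {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate(
                "/root/text()", new InputSource(new StringReader(XML)),
                XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stringValue()); // prints Some mixedchild textcontent text
        System.out.println(directText());  // prints Some mixedcontent text
    }
}
```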
Attached is the new test case.
I also tried getXPathStringVal() with the Base64 test case. Strangely, it doesn't work either. The >1MB Base64 test seemingly returns only about the first 1MB of data:
TestVTDBase64_pass input xml length = 2833781
getXPathStringVal() length = 1048575
Attached is the revised Base64 test case using getXPathStringVal()
Thank you for helping,
Robert
Sorry, I may have misunderstood your question. To access mixed content with huge CDATA text, you may want to resort to XPath or to an otherwise little-known class called TextIter, which is part of the VTD-XML API. This class iterates through all text nodes of an element. VTDNav has getText(), which is inadequate for mixed-content XML. Text nodes include character data and CDATA. Since version 2.8, selectText(), selectComment(), and selectPI() have been available. You basically instantiate the class, select the type of leaf nodes you want returned, call touch() to set the starting point (typically the current element, from which the mixed-content child nodes will be searched), and then call getNext(). Make sure you wrap it in a while loop so you don't miss anything.
But you can just use XPath, which automates the whole process. Either way, your choice.
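The TextIter steps described above could look roughly like the sketch below. It assumes the VTD-XML jar (package com.ximpleware) is on the classpath; the class name TextIterSketch and the helper allDirectText() are made up for illustration, and error handling is reduced to a single wrapper exception.

```java
import com.ximpleware.TextIter;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; requires the VTD-XML library on the classpath.
class TextIterSketch {

    // Concatenate every text node (character data and CDATA) that is a
    // direct child of the document's root element.
    static String allDirectText(String xml) {
        try {
            VTDGen vg = new VTDGen();
            vg.setDoc(xml.getBytes(StandardCharsets.UTF_8));
            vg.parse(false);              // false: no namespace awareness
            VTDNav vn = vg.getNav();      // cursor starts at the root element

            TextIter ti = new TextIter();
            ti.selectText();              // iterate text nodes (not comments or PIs)
            ti.touch(vn);                 // start from the cursor's current element

            StringBuilder sb = new StringBuilder();
            int i;
            while ((i = ti.getNext()) != -1) {
                sb.append(vn.toString(i)); // one toString() call per returned index
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>Some mixed<child>child text</child>content text</root>";
        System.out.println(allDirectText(xml));
    }
}
```

For the mixed-content sample in this thread, this should yield "Some mixedcontent text", in line with the TextIter result reported further down. Note that the loop relies only on the index returned by getNext() and never increments a token index by hand.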
Hi Jimmy,
I have put together a comprehensive test, attached, that demonstrates various ways in which the VTD-XML parser methods provide varying results when the goal is to fetch all the text of a node.
The test case consists of a standard single text value under 1MB, a large CDATA value over 2MB, and a mixed content test.
For each input XML, the code tests 4 ways of fetching the node text: XPath text(), getXPathStringVal(), a navigator loop, and a TextIter loop.
Here is the output from the attached test cases:
TestVTDBase64_pass input xml length = 48489
xpath text() length = 48472
getXPathStringVal() length = 48472
navigator loop length = 48472
TextIter loop length = 48472
TestVTDBase64_fail input xml length = 2833781
xpath text() length = 5355531
getXPathStringVal() length = 1048575
navigator loop length = 5355531
TextIter loop length = 2833752
Mixed content xml = <root>Some mixed<child>child text</child>content text</root>
xpath text() mixed result = Some mixedcontent text
getXPathStringVal() result = Some mixedchild textcontent text
navigator loop result = Some mixed
TextIter loop mixed result = Some mixedcontent text
As can be seen, only TextIter correctly retrieves all node text in every case. Thank goodness TextIter will meet my needs. You may want to fix or document the other methods, because it was not apparent to me that they had limitations or were broken.
Robert
Attaching a revised test case that just makes the code more correct by removing the t++ increment within the loop for the TextIter approach: