VTD-XML: The Future of XML Processing / Tickets / #14 Unexpected result reading node text from a Base64 CDATA value

jimmy zhang - 2017-08-26

Thanks! will look into it and get back

jimmy zhang - 2017-08-26

status: open --> pending

assigned_to: jimmy zhang

jimmy zhang - 2017-08-27

OK, this is actually not a bug. VTD will break CDATA or character data longer than 1MB into multiple tokens of at most 1MB each, and when you call toString(), VTD will automatically assemble those tokens back into one string of the full length. The problem is that you are calling toString() on the first token (which already assembles the entire string for you), then moving on to the next token and appending the string representation of the second token (which could include the third, fourth, and later tokens). So use a while loop in this case. It is not a bug, but expected behavior.

let me know what you think...
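
The double counting described above can be checked with a small stand-alone model. This is only a sketch, not VTD-XML code: the class and method names are hypothetical, the maximum token length of 1,048,575 is an assumption taken from the getXPathStringVal() length reported further down in this thread, and the text length of 2,833,752 is the TextIter figure from the same later post. The model works on lengths only, so it never builds the actual strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the token splitting described above; this is not VTD-XML code.
class ChunkedTokenModel {
    // Assumption: maximum token length, taken from the getXPathStringVal() output in this thread.
    static final int MAX_TOKEN_LEN = 1_048_575;

    // Split a text run of the given length into consecutive token lengths.
    static List<Integer> split(int textLen) {
        List<Integer> tokens = new ArrayList<>();
        for (int left = textLen; left > 0; left -= MAX_TOKEN_LEN) {
            tokens.add(Math.min(left, MAX_TOKEN_LEN));
        }
        return tokens;
    }

    // Length of the string produced by stringifying token `from`,
    // modelled as that token plus every token after it in the same run.
    static long assembledLength(List<Integer> tokens, int from) {
        long total = 0;
        for (int i = from; i < tokens.size(); i++) {
            total += tokens.get(i);
        }
        return total;
    }

    // The problematic pattern: stringify every token and concatenate the results.
    static long naiveLoopLength(List<Integer> tokens) {
        long total = 0;
        for (int i = 0; i < tokens.size(); i++) {
            total += assembledLength(tokens, i);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> tokens = split(2_833_752);
        System.out.println(tokens);                     // token lengths
        System.out.println(assembledLength(tokens, 0)); // one call on the first token
        System.out.println(naiveLoopLength(tokens));    // call on every token
    }
}
```

Under these assumptions the text splits into three tokens, a single call on the first token yields 2,833,752 characters, and the per-token loop yields 5,355,531, which is the length reported later in this thread for the xpath text() and navigator loop approaches.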

Robert Yeager - 2017-08-27

Hi Jimmy,

Thanks, that makes sense, but what about when an XML node has mixed content like this:

<root>Some mixed<child>child text</child> content text</root>

The VTDNav.getText() doesn't handle mixed content (VTDNav.getText() just returns "Some mixed" for the XML above instead of "Some mixed content text"), which is why I attempted to handle with my while() loop.

So am I hitting a situation where a routine that might work for mixed content then breaks when it hits large CDATA > 1MB?

Can you please tell me the right way to grab all node text for every situation?

Thanks!

Robert

jimmy zhang - 2017-08-27

You won't need to worry about this in XPath since 2.13. There is a new method that handles mixed content according to the XPath spec.

You could use a new method called getXPathStringVal(). Could you take a look and let me know what you think?

Last edit: jimmy zhang 2017-08-27

Robert Yeager - 2017-08-28

Hi Jimmy,

I tried getXPathStringVal() as follows:

VTDNav vn = vg.getNav();
if (vn.matchElement("root")) {
    System.out.println("getXPathStringVal() val = " + vn.getXPathStringVal());
}

Here is the test XML:

<root>Some mixed<child>child text</child>content text</root>

Here is the output:

getXPathStringVal() val = Some mixedchild textcontent text

I don't understand why it also grabbed the child node's text. I was expecting it would emit "Some mixedcontent text"
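
For comparison, the output above agrees with how XPath 1.0 defines the string-value of an element: the concatenation of all descendant text nodes in document order, child elements included. The sketch below shows this with the JDK's built-in javax.xml.xpath rather than VTD-XML, so it only illustrates the spec semantics and says nothing about VTD-XML internals; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration with the JDK XPath engine (not VTD-XML); names are hypothetical.
class MixedContentStringValue {
    static final String XML =
            "<root>Some mixed<child>child text</child>content text</root>";

    // XPath string-value of /root: every descendant text node, in document order.
    static String stringValue() {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate("string(/root)", new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Only the text nodes that are direct children of /root, concatenated.
    static String directText() {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate("/root/text()",
                    new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stringValue()); // Some mixedchild textcontent text
        System.out.println(directText());  // Some mixedcontent text
    }
}
```

So a string-value style call is expected to include the child's text, while selecting only the element's own text() nodes gives "Some mixedcontent text".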

Attached is the new test case.

I also tried getXPathStringVal() with the Base64 test case. Strangely, that doesn't work either; the >1MB Base64 test seems to return only roughly the first 1MB of data:

TestVTDBase64_pass input xml length = 2833781
getXPathStringVal() length = 1048575

Attached is the revised Base64 test case using getXPathStringVal()

Thank you for helping,

Robert

TestVTDBase64.zip

TestVTDMixed.zip

jimmy zhang - 2017-08-31

Sorry, I may have misunderstood your question. To access mixed content or huge CDATA text, you may want to resort to XPath or to an otherwise little-known class called TextIter, which is part of the VTD-XML API. This class iterates through all text nodes of an element. VTDNav has getText(), which is inadequate for mixed-content XML. Text nodes include character data and CDATA. Since version 2.8, selectText(), selectComment(), and selectPI() have been added.

You basically instantiate the class, select the type of leaf nodes you want returned, call touch() to set the starting point (typically the current element, from which the mixed content child nodes will be searched), and then call getNext(). Make sure you wrap it in a while loop so you don't miss anything.

But you can just use XPath, which automates the whole process. Either way, your choice.

Robert Yeager - 2017-09-02

Hi Jimmy,

I have put together a comprehensive test, attached, that demonstrates various ways in which the VTD-XML parser methods provide varying results when the goal is to fetch all the text of a node.

The test cases consist of a standard single text value of < 1MB, a large CDATA value > 2MB, and a mixed content test.

For each input XML, the code tests 4 ways of fetching the node text:

xpath text()

VTDNav getXPathStringVal()

VTDNav getTokenType() loop

TextIter

Here is the output from the attached test cases:

TestVTDBase64_pass input xml length = 48489
xpath text() length = 48472
getXPathStringVal() length = 48472
navigator loop length = 48472
TextIter loop length = 48472

TestVTDBase64_fail input xml length = 2833781
xpath text() length = 5355531
getXPathStringVal() length = 1048575
navigator loop length = 5355531
TextIter loop length = 2833752

Mixed content xml = <root>Some mixed<child>child text</child>content text</root>
xpath text() mixed result = Some mixedcontent text
getXPathStringVal() result = Some mixedchild textcontent text
navigator loop result = Some mixed
TextIter loop mixed result = Some mixedcontent text

As can be seen, only TextIter correctly retrieves all node text in all cases. Thank goodness TextIter will meet my needs. You may want to fix or document the other methods, because it was not apparent to me that they had limitations or were broken.

Robert

VTDTest.zip

Robert Yeager - 2017-09-02

Attaching a revised test case that just makes the code more correct by removing the t++ increment within the loop for the TextIter approach.

Before:

while ((t = ti.getNext()) != -1) {
    nodeText += vn.toString(t++);
}

After:

while ((t = ti.getNext()) != -1) {
    nodeText += vn.toString(t);
}

VTDTest.zip