
#14 Unexpected result reading node text from a Base64 CDATA value

1.0
pending
None
2017-09-02
2017-08-23
No

Hi,

I'm a fan of VTD-XML and have been using it for many years.

I've seemingly hit a bug with extracting the node text of a Base64-encoded CDATA value.

This is the code I use to grab node text:

                // 5 = VTDNav.TOKEN_CHARACTER_DATA, 11 = VTDNav.TOKEN_CDATA_VAL
                String nodeText = "";
                while (vn.getTokenType(t) == 5 || vn.getTokenType(t) == 11)
                {
                    nodeText += vn.toString(t++);
                }

I have two test cases, one that works as expected and one that returns a node text value that is almost twice the size of the input XML, using the loop above. The only difference between the test cases is the Base64-encoded value.

The successful test case indicates a single token type of "5" is returned. The failure test case shows that multiple token types of "11" are being read by the loop.

The included Java test file has instructions at the top to compile and execute the test case.

Kindly let me know if this is a defect in VTD or if my loop above is not the correct way to read node text.

Thank you!

Robert


Discussion

  • jimmy zhang

    jimmy zhang - 2017-08-26

    Thanks! I will look into it and get back to you.

     
  • jimmy zhang

    jimmy zhang - 2017-08-26
    • status: open --> pending
    • assigned_to: jimmy zhang
     
  • jimmy zhang

    jimmy zhang - 2017-08-27

    OK, this is actually not a bug. VTD will break CDATA or character data longer than 1MB into multiple tokens of at most 1MB each, and when you call toString(), VTD will automatically assemble those tokens back into a single string of the full length. The problem is that you are calling toString() on the first token (which already assembles the entire string for you), then moving on to the next token and appending the string representation of the second token (which could include the third, the fourth, and any tokens thereafter). So use while loop in this ... It is not a bug, but expected behavior.

    Let me know what you think.
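The explanation above can be checked with a little arithmetic. The sketch below is a toy model and does not call VTD-XML at all: the class and method names are hypothetical. It assumes that calling toString() on any token of a split text run returns everything from that token to the end of the run, which is how the reply above describes it. The text length (2833752) and the maximum token length (1048575) are taken from the output Robert reports later in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: no VTD-XML classes are used here.
final class TokenSplitModel {

    /** Splits a text run of totalLength chars into token lengths of at most maxTokenLength. */
    static List<Integer> splitIntoTokens(int totalLength, int maxTokenLength) {
        List<Integer> tokens = new ArrayList<>();
        for (int remaining = totalLength; remaining > 0; remaining -= maxTokenLength) {
            tokens.add(Math.min(remaining, maxTokenLength));
        }
        return tokens;
    }

    /**
     * Length built by a loop that appends toString(t) for every token t,
     * under the assumption that each call returns that token plus all later ones.
     */
    static long naiveLoopLength(List<Integer> tokens) {
        long total = 0;
        long suffix = 0;
        // Walk backwards so that suffix is the length from token i to the end.
        for (int i = tokens.size() - 1; i >= 0; i--) {
            suffix += tokens.get(i);
            total += suffix;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> tokens = splitIntoTokens(2_833_752, 1_048_575);
        System.out.println("token lengths = " + tokens);
        System.out.println("naive loop length = " + naiveLoopLength(tokens));
    }
}
```

Under these assumptions the run splits into three tokens and the loop accumulates 2833752 + 1785177 + 736602 = 5355531 characters, which is the "almost twice the size" figure reported for the failing test case.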

     
  • Robert Yeager

    Robert Yeager - 2017-08-27

    Hi Jimmy,

    Thanks, that makes sense, but what about when an XML node has mixed content like this:

    <root>Some mixed<child>child text</child> content text</root>

    VTDNav.getText() doesn't handle mixed content (it just returns "Some mixed" for the XML above instead of "Some mixed content text"), which is why I attempted to handle it with my while() loop.

    So am I hitting a situation where a routine that might work for mixed content then breaks when it hits large CDATA > 1MB?

    Can you please tell me the right way to grab all node text for every situation?

    Thanks!

    Robert

     
  • jimmy zhang

    jimmy zhang - 2017-08-27

    You won't need to worry about this in XPath since 2.13. There is a new method that handles mixed content according to the XPath spec.

    You could use a new method called getXPathStringVal(). Could you take a look and let me know what you think?

     

    Last edit: jimmy zhang 2017-08-27
  • Robert Yeager

    Robert Yeager - 2017-08-28

    Hi Jimmy,

    I tried getXPathStringVal() as follows:

            VTDNav vn = vg.getNav(); 
            if (vn.matchElement("root"))
            {
                System.out.println("getXPathStringVal() val = "+vn.getXPathStringVal());
            }
    

    Here is the test XML:

    <root>Some mixed<child>child text</child>content text</root>

    Here is the output:

    getXPathStringVal() val = Some mixedchild textcontent text

    I don't understand why it also grabbed the child node's text. I was expecting it would emit "Some mixedcontent text".

    Attached is the new test case.

    I also tried getXPathStringVal() with the Base64 test case. Strangely, it also doesn't work: the >1MB Base64 test seemingly returns just the first 1MB or so of data:

    TestVTDBase64_pass input xml length = 2833781
    getXPathStringVal() length = 1048575

    Attached is the revised Base64 test case using getXPathStringVal()

    Thank you for helping,

    Robert
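On the mixed-content result above: in XPath 1.0 the string-value of an element is the concatenation of all of its descendant text nodes, not only its direct text children, which would explain why a method named after the XPath string value includes "child text". The sketch below illustrates that definition with the JDK's built-in javax.xml.xpath, not VTD-XML, so it shows what the XPath data model says rather than how VTD-XML implements it. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration with the JDK XPath engine only; VTD-XML is not involved.
final class MixedContentStringValue {

    static final String XML =
            "<root>Some mixed<child>child text</child>content text</root>";

    /** XPath string-value of root: every descendant text node, in document order. */
    static String stringValue() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(/root)", new InputSource(new StringReader(XML)));
    }

    /** Concatenation of only the text nodes that are direct children of root. */
    static String directText() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/root/text()", new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            sb.append(nodes.item(i).getNodeValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("string(/root)  = " + stringValue());
        System.out.println("/root/text()   = " + directText());
    }
}
```

With the JDK engine, string(/root) yields "Some mixedchild textcontent text" and joining /root/text() yields "Some mixedcontent text". If only the element's own text is wanted, selecting its text() children (or iterating its text nodes) is the closer fit than the string value.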

     
  • jimmy zhang

    jimmy zhang - 2017-08-31

    Sorry, I may have misunderstood your question. To access mixed content and huge CDATA text, you may want to resort to XPath or an otherwise little-known class called TextIter, which is part of the VTD-XML API. This class iterates through all text nodes of an element. VTDNav has getText(), which is inadequate for mixed-content XML. Text nodes include character data and CDATA. Since version 2.8, selectText(), selectComment(), and selectPI() have been added. You can basically instantiate the class, select the type of leaf nodes you want returned, call touch() to initiate the action (from which point, typically the current element, the mixed content child nodes will be searched), and call getNext(). Make sure you wrap it in a while loop so you don't miss anything.
    But you can just use XPath, which automates the whole process. Either way, your choice.

     
  • Robert Yeager

    Robert Yeager - 2017-09-02

    Hi Jimmy,

    I have put together a comprehensive test, attached, that demonstrates various ways in which the VTD-XML parser methods provide varying results when the goal is to fetch all the text of a node.

    The test case consists of a standard single text node with < 1MB of data, a large CDATA value > 2MB, and a mixed content test.

    For each input XML, the code tests 4 ways of fetching the node text:

    1. xpath text()
    2. VTDNav getXPathStringVal()
    3. VTDNav getTokenType() loop
    4. TextIter

    Here is the output from the attached test cases:

    TestVTDBase64_pass input xml length = 48489
    xpath text() length = 48472
    getXPathStringVal() length = 48472
    navigator loop length = 48472
    TextIter loop length = 48472

    TestVTDBase64_fail input xml length = 2833781
    xpath text() length = 5355531
    getXPathStringVal() length = 1048575
    navigator loop length = 5355531
    TextIter loop length = 2833752

    Mixed content xml = <root>Some mixed<child>child text</child>content text</root>
    xpath text() mixed result = Some mixedcontent text
    getXPathStringVal() result = Some mixedchild textcontent text
    navigator loop result = Some mixed
    TextIter loop mixed result = Some mixedcontent text

    As can be seen, only TextIter correctly retrieves all node text in all cases. Thank goodness TextIter will meet my needs. You may want to fix or document the other methods, because it was not apparent to me that they had limitations or were broken.

    Robert

     
  • Robert Yeager

    Robert Yeager - 2017-09-02

    Attaching a revised test case that just makes the code more correct by removing the t++ increment within the loop for the TextIter approach.

    Before:

                    while ((t = ti.getNext()) != -1)
                    {
                        nodeText += vn.toString(t++);
                    }
    
    After:

                    while ((t = ti.getNext()) != -1)
                    {
                        nodeText += vn.toString(t);
                    }
    
     
