
Content changes after encoding/decoding

2012-10-14
2013-04-08
  • Victor Jorge

    Victor Jorge - 2012-10-14

    We are evaluating EXIficient to compress large XML messages, and the results are great. However, we stumbled on an issue where the XML content is corrupted (does not match the original) after decoding.

    We are running in a multithreaded WebSphere 7 environment using Java 5. We tested with the latest EXIficient release, and
    we are using the xercesImpl.jar and xml-apis.jar provided (in the endorsed folder due to compatibility issues with
    WebSphere jars). Our XML schema has about 60 different tags, and most data is passed as attributes instead of tag text. All tags are
    defined as complex types, and all data is defined as string values. The messages we used for testing are around 150k to 250k in size.
    In terms of multithreading, we create the grammar/EXIFactory once based on the schema and store it in a ConcurrentHashMap. The access to the factory in
    the ConcurrentHashMap is static, but each thread instantiates its own encoder/decoder object.

    For example, when we ran a test that encodes and decodes 5000 messages, we noticed that an attribute value had changed in 10 of them when compared to the original
    XML.

    before:
    <NMSPC:TEST …>

    <NMSPC:TEST_TAG ATTR_1="2930.33"  ATTR_2="3" … ATTR_15="1" />
    <NMSPC:TEST_TAG ATTR_1="749.00"   ATTR_2="4" … ATTR_15="1" />
    <NMSPC:TEST_TAG ATTR_1="13578.80" ATTR_2="5" … ATTR_15="1" />

    </NMSPC:TEST>

    after:
    <?xml version="1.0" encoding="UTF-8"?>
    <ns4:TEST …>

    <ns4:TEST_TAG ATTR_1="2930.33"  ATTR_2="3" … ATTR_15="1" />
    <ns4:TEST_TAG ATTR_1="749.00"   ATTR_2="4" … ATTR_15="1" />
    <ns4:TEST_TAG ATTR_1="13578.80" ATTR_2="3" … ATTR_15="1" />

    </ns4:TEST>

    In the example above the ATTR_2 value has changed after encoding/decoding the XML. Originally it was "5", but once decoded it changed to "3".
    We are puzzled and trying to understand what could cause this issue, and if there are any suggestions. We really want to use EXI for compression.

    Thanks
    Jorge

     
  • Daniel Peintner

    Daniel Peintner - 2012-10-15

    Hi Jorge,

    From what I am reading, this attribute value changes when multiple versions are running at the same time. Correct?
    Further, the attribute values "5" and "3" are typed as string. Also correct?

    That said, I would like to note that EXIficient hasn't really been tested in multithreaded software so far. My assumption is the following. The same grammar is used in various threads. On the encoder side, each attribute or characters grammar state checks whether the value is valid with respect to the schema facet. If the check succeeds, the value is encoded with the type given in the schema; if not, it is encoded as a deviation.

    Let's assume the following sequence:
    check value  "5" -> return TRUE for OK
    encode last checked value "5"

    The reason for doing so is simply performance. Let's assume the datetime value "2012-10-15". The check method tries to parse the string and split it into year, month, and day. When this is successful, the value is written using the previously parsed values and is not parsed again.

    In a multithreaded version the following may happen:
    Thread1 : check value  "5" -> return TRUE
    Thread2 : check value  "3" -> return TRUE
    Thread1 : encode last checked value "3"
    Thread2 : encode last checked value "3"
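
The interleaving above can be replayed in a few lines of Java. This is only a minimal sketch, assuming the shared grammar holds a datatype object that remembers the last checked value. `CachingStringDatatype` and its methods are hypothetical names for illustration, not EXIficient's actual classes, and the two threads are simulated sequentially so that the outcome is deterministic.

```java
// Hypothetical stand-in for a datatype object stored inside a shared grammar.
// It is NOT EXIficient code; it only mirrors the check-then-encode pattern.
class CachingStringDatatype {
    private String lastChecked; // mutable state seen by every user of the grammar

    boolean isValid(String value) {
        lastChecked = value;    // remembered so that encoding need not re-parse it
        return true;
    }

    String encodeLastChecked() {
        return lastChecked;
    }
}

class SharedStateRaceDemo {
    // Replays the four steps from the post against one shared datatype object.
    static String[] replayInterleaving() {
        CachingStringDatatype shared = new CachingStringDatatype();
        shared.isValid("5");                                  // Thread1: check "5"
        shared.isValid("3");                                  // Thread2: check "3"
        String writtenByThread1 = shared.encodeLastChecked(); // Thread1: encode
        String writtenByThread2 = shared.encodeLastChecked(); // Thread2: encode
        return new String[] { writtenByThread1, writtenByThread2 };
    }

    public static void main(String[] args) {
        String[] out = replayInterleaving();
        System.out.println("Thread1 wrote " + out[0] + ", Thread2 wrote " + out[1]);
    }
}
```

Both entries of the returned array are "3", so the value "5" that Thread1 checked is lost, which matches the symptom reported in the first post.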

    I need to verify this issue and check more closely. That said, currently I can give you two possible solutions:

    A) Threads should use different grammars/schemas.
      e.g. Thread 1 deals with Schema1 while Thread 2 deals with Schema2

    B) Threads create multiple versions of the same grammar if a grammar is already in use by another thread.
      If all threads deal with the same schema, just create an EXI grammar for each thread separately.
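
One way to read option B is as a small checkout pool: a thread borrows a grammar for exclusive use, and another copy is built only when every existing copy is busy. The sketch below is an assumption about how that could look, not part of EXIficient. `SchemaGrammar`, `GrammarPool`, `borrow`, and `giveBack` are hypothetical names, and the real (expensive) grammar construction is replaced by a plain constructor call.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical placeholder for a grammar built from one schema; not an EXIficient class.
class SchemaGrammar {
    final String schemaLocation;

    SchemaGrammar(String schemaLocation) {
        this.schemaLocation = schemaLocation; // a real grammar would be compiled here
    }
}

class GrammarPool {
    private final String schemaLocation;
    private final ConcurrentLinkedQueue<SchemaGrammar> idle =
            new ConcurrentLinkedQueue<SchemaGrammar>();

    GrammarPool(String schemaLocation) {
        this.schemaLocation = schemaLocation;
    }

    // Hands out an idle grammar, or builds another copy when all copies are in use.
    SchemaGrammar borrow() {
        SchemaGrammar g = idle.poll();
        return (g != null) ? g : new SchemaGrammar(schemaLocation);
    }

    // Returns a grammar to the pool once encoding or decoding has finished.
    void giveBack(SchemaGrammar g) {
        idle.offer(g);
    }
}
```

A worker would call `borrow()` before encoding and `giveBack(...)` in a `finally` block afterwards, so no grammar instance is ever used by two threads at once while the number of copies stays bounded by the peak concurrency.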

    Hope this helps. I will look into the issue more closely,

    -- Daniel
     
  • Victor Jorge

    Victor Jorge - 2012-10-17

    Thanks for the quick reply, Daniel.

    Your understanding is correct. Both values are defined as strings in the schema, and multiple threads use the same grammar to encode/decode the XML.

    I will try to test this by wrapping the grammar in a ThreadLocal object since building that grammar for every thread will be a performance hit.
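
The ThreadLocal idea could look roughly like the sketch below. It assumes a placeholder `Grammar` class and a schema location of "messages.xsd", both hypothetical and not taken from EXIficient. It overrides `initialValue()` instead of using newer factory methods so that it stays within the Java 5 API mentioned earlier in the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical placeholder for an expensive, non-thread-safe grammar object.
class Grammar {
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();
    final String schemaLocation;

    Grammar(String schemaLocation) {
        this.schemaLocation = schemaLocation;
        BUILD_COUNT.incrementAndGet(); // stands in for the costly schema compilation
    }
}

class PerThreadGrammar {
    private static final String SCHEMA = "messages.xsd"; // assumed schema location

    // initialValue() runs once per thread, on that thread's first call to get().
    private static final ThreadLocal<Grammar> GRAMMAR = new ThreadLocal<Grammar>() {
        @Override
        protected Grammar initialValue() {
            return new Grammar(SCHEMA);
        }
    };

    static Grammar get() {
        return GRAMMAR.get();
    }
}
```

With this arrangement the grammar is built once per thread rather than once per message, so in a pooled-thread server the build cost is paid only the first time each worker thread handles a message. Each grammar stays alive as long as its thread does, which is worth keeping in mind for memory use.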

    Let me know if you find out anything.

     
