We are evaluating EXIficient to compress large XML messages, and the results are great. However, we stumbled on an issue where the XML content is corrupted (it does not match the original) after decoding.
We are running in a multithreaded WebSphere 7 environment using Java 5. We tested with the latest EXIficient release, and we are using the xercesImpl.jar and xml-apis.jar provided (in the endorsed folder due to compatibility issues with the WebSphere jars). Our XML schema has about 60 different tags, and most data is passed as attributes instead of tag text. All tags are defined as complex types, and all data is defined as string values. The messages we used for testing are around 150k to 250k in size.
In terms of multithreading, we create the grammar/EXIFactory once based on the schema and store it in a ConcurrentHashMap. Access to the factory in the ConcurrentHashMap is static, but each thread instantiates its own encoder/decoder object.
For example, when we ran a test to encode/decode 5000 messages, we noticed that an attribute value had changed in 10 messages when compared to the original XML.
before:
<NMSPC:TEST …>
…
<NMSPC:TEST_TAG ATTR_1="2930.33" ATTR_2="3" … ATTR_15="1" />
<NMSPC:TEST_TAG ATTR_1="749.00" ATTR_2="4" … ATTR_15="1" />
<NMSPC:TEST_TAG ATTR_1="13578.80" ATTR_2="5" … ATTR_15="1" />
…
</NMSPC:TEST>
after:
<?xml version="1.0" encoding="UTF-8"?>
<ns4:TEST …>
<ns4:TEST_TAG ATTR_1="2930.33" ATTR_2="3" … ATTR_15="1" />
<ns4:TEST_TAG ATTR_1="749.00" ATTR_2="4" … ATTR_15="1" />
<ns4:TEST_TAG ATTR_1="13578.80" ATTR_2="3" … ATTR_15="1" />
…
</ns4:TEST>
In the example above, the ATTR_2 value of the third TEST_TAG element has changed after encoding/decoding the XML. Originally it was "5", but once decoded it changed to "3".
We are puzzled and trying to understand what could cause this issue, and whether there are any suggestions. We really want to use EXI for compression.
Thanks
Jorge
Hi Jorge,
From what I am reading, this attribute value changes when multiple threads are running at the same time. Correct?
Further, the attribute values "5" and "3" are typed as string. Also correct?
That said, I would like to note that EXIficient hasn't really been tested in multithreaded software so far. My assumption is the following: the same grammar is used in several threads. On the encoder side, each attribute or characters grammar state checks whether the value is valid with respect to the schema facet. If the check succeeds, the value is encoded with the type given in the schema; if not, it is encoded as a deviation.
Let's assume the following sequence:
check value "5" -> return TRUE for OK
encode last checked value "5"
The reason for doing so is simply performance. Let's assume the datetime value "2012-10-15". The check method tries to parse the string and split it into year, month and day. When this is successful, the value is written using the previously parsed values and is not parsed again.
In a multithreaded setting the following may happen:
Thread1 : check value "5" -> return TRUE
Thread2 : check value "3" -> return TRUE
Thread1 : encode last checked value "3"
Thread2 : encode last checked value "3"
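To make the interleaving above concrete, here is a minimal Java sketch. It does not use the EXIficient API: `CachingDatatype`, `isValid` and `encode` are hypothetical names that only mimic the assumed "check, then encode the last checked value" pattern. The four steps are replayed in a fixed order on a single thread so the outcome is deterministic; with real threads the same ordering would only occur occasionally.

```java
final class SharedStateRace {

    // Hypothetical stand-in for a datatype object that lives inside a shared grammar.
    // It is NOT an EXIficient class; it only models the assumed caching behaviour.
    static final class CachingDatatype {
        private String lastChecked; // mutable state seen by every caller

        boolean isValid(String value) {
            lastChecked = value;    // remember the value so encode() need not parse it again
            return true;
        }

        String encode() {
            return lastChecked;     // writes whatever was checked most recently
        }
    }

    // Replays the interleaving from the post, in that exact order, on one shared object.
    static String[] replayInterleaving() {
        CachingDatatype shared = new CachingDatatype();
        shared.isValid("5");                // Thread1: check value "5"
        shared.isValid("3");                // Thread2: check value "3"
        String byThread1 = shared.encode(); // Thread1: encodes the last checked value
        String byThread2 = shared.encode(); // Thread2: encodes the last checked value
        return new String[] { byThread1, byThread2 };
    }

    public static void main(String[] args) {
        String[] out = replayInterleaving();
        System.out.println("Thread1 wrote " + out[0] + ", Thread2 wrote " + out[1]);
    }
}
```

In this replay Thread1 ends up writing the value that Thread2 checked, which matches the "5" turning into "3" seen in the decoded messages.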
I need to verify this issue and check more closely. That said, I can currently give you two possible solutions:
A) Threads should use different grammars/schemas.
e.g. Thread 1 deals with Schema1 while Thread 2 deals with Schema2
B) Threads create their own copy of the same grammar if a grammar is already in use by another thread.
If all threads deal with the same schema, just create an EXI grammar for each thread separately.
Hope this helps. I will look into the issue more closely.
Daniel
Thanks for the quick reply, Daniel.
Your understanding is correct. Both values are defined as strings in the schema, and multiple threads use the same grammar to encode/decode the XML.
I will try to test this by wrapping the grammar in a ThreadLocal object since building that grammar for every thread will be a performance hit.
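A rough sketch of the ThreadLocal idea, under stated assumptions: `Grammar` and `buildGrammar()` are hypothetical placeholders for the real grammar object and the code that builds it from the schema, and are not EXIficient calls. The anonymous-subclass form of `initialValue()` is used because it works on Java 5.

```java
final class PerThreadGrammar {

    // Hypothetical stand-in for an expensive-to-build grammar that is not thread-safe.
    static final class Grammar {
        private String lastChecked;
        boolean isValid(String value) { lastChecked = value; return true; }
        String encode() { return lastChecked; }
    }

    // Hypothetical builder; real code would load the schema and build the grammar here.
    static Grammar buildGrammar() {
        return new Grammar();
    }

    // Each thread lazily gets its own Grammar the first time it calls get().
    private static final ThreadLocal<Grammar> GRAMMAR = new ThreadLocal<Grammar>() {
        @Override
        protected Grammar initialValue() {
            return buildGrammar();
        }
    };

    static Grammar grammarForCurrentThread() {
        return GRAMMAR.get();
    }

    // Helper for the demo: returns the grammar instance seen by a freshly started thread.
    static Grammar grammarSeenByNewThread() {
        final Grammar[] holder = new Grammar[1];
        Thread t = new Thread(new Runnable() {
            public void run() {
                holder[0] = grammarForCurrentThread();
            }
        });
        t.start();
        try {
            t.join(); // join() makes the write to holder[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }

    public static void main(String[] args) {
        Grammar mine = grammarForCurrentThread();
        System.out.println("same thread, same instance: " + (mine == grammarForCurrentThread()));
        System.out.println("other thread, other instance: " + (mine != grammarSeenByNewThread()));
    }
}
```

With a pooled set of worker threads, the grammar would be built once per pool thread rather than once per message, at the cost of holding one grammar copy in memory for each thread.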
Let me know if you find out anything.