Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

Compare whether an XML1 is a subset of XML2

Help
Dxun
2010-05-12
2013-03-03
  • Dxun
    Dxun
    2010-05-12

    Hello everyone.

    Recently, I've come up on a situation where I should compare two XMLs but in a somewhat non-standard way, so let me briefly describe the current specification.

    Consider two XMLs, A and B.
    Let us denote A a "comparison XML" and B a "reference XML".

    The comparison C that I need to construct is basically described with relation C = A ⊆B, and it should return a boolean operator: true for case when C holds, and false otherwise. So, the way I see it - whenever I find a node in A that is not in B, or is "semantically-different" from reference node in B, we should return false and terminate immediately.

    Let me give a brief example:

     B.xml - reference
    <A>
       <B>
           <C/>
           <D/>
       </B>
       <X>
           <Y/>
           <Z/>
       </X>
    </A>
    

    So, for this case C should return true:

     A1.xml - comparison
    <A>
       <B>
           <D/>
           <C/>
       </B>
    </A>
    

    Observe the ordering in B node - if complete ordering inside a node is permutated, we consider it to be "semantically-equivalent" with reference XML.
    This case would see C return false:

     A2.xml - comparison
    <A>
       <B>
           <D/>
       </B>
       <C/>
    </A>
    

    Also, trivially, C would also return false in this case:

     A2.xml - comparison
    <A>
       <B>
           <D/>
           <C/>
       </B>
       <P/>
    </A>
    

    So, since I am very fresh with XMLUtil (I discovered it just yesterday), I would basically need a few pointers (pardon me, references :) ) from people knowledgeable with XMLUtil API:
    - what interfaces should I override? First MatchTracker comes to mind with its methods matchFound() and printNode() (it would be nice to see where A differs on B, but it's not mandatory),
    - if possible, guesstimate how much work and implementation would this ensue? More specifically, would I be better off constructing the algorithm myself if it would take too much time to fiddle with API (DifferenceEngine) in order to get this thing done?

    Also, great help would be to make XMLUtil perform comparisons one-way only, i.e. to report differences in direction A -> B, not the other way around.

    Any other insight regarding this is also appreaciated.

    Thank you very much for your time.

     
  • Stefan Bodewig
    Stefan Bodewig
    2010-05-12

    There are two, maybe even three interfaces that play together for your case.

    DifferenceListener is notified whenever two things differ and has the power to say "ignore this difference".  This is where you'd filter out all differences that stem from elements that are only part of the reference but not the comparison XML.  See the manual.

    ComparisonControler says "I've seen enough, the relation doesn't hold".  You probably don't need an implementation of your own since the Diff class will do what you want - as long as your DifferenceListener will mark all A ⊆B as "recoverable".  In then end you'd get three states: identical, real subset (AKA similar) and not a subset at all.

    Depending on your XML you may need to specify an ElementQualifier if the element's name alone is not enough to find matches.  Make sure you set up XMLUnit to report unmatched elments rather than chose random matches.

    So "all you really need to do" is to figure out which sorts of differences are compatible with your notion of subset and downgrade those to "recoverable" in your DifferenceListener implementation.  It shouldn't take too long to implement that.

    XMLUnit's comparisons are symmetric and there is no way to avoid that, sorry.

     
  • Dxun
    Dxun
    2010-05-12

    Also, herr Bodewig, vielen dank fuer die schnelle Antwort - es wird eine echte Herausforderung dieses Problem zu schlagen.

    Firstly, an important point I perhaps neglected to mention earlier on - I need this functionality to function recursively; i.e. no matter what depth and sizes A and B may be, the engine will inspect and compare nodes and produce results. Is this guaranteed? If yes (and since I have quite limited experience with DOM and XML) - is there any considerations to be made regarding size and speed? As I understand, DOM stack is naturally limited by RAM, but I don't have the figures….and B could grow to considerable size (I've seen uncompressed XML dumps of B that exceed 3 MB on disk - not sure if that's small, large or too large, though….would you have any insight regarding this?).

    Secondly, I'll definitively delve into source code, so before I do, a few quick questions:
    1) regarding DifferenceListener - no questions right now, need to study your code first,
    2) regarding ComparisonController - recoverable difference basically advise Diff that the XML nodes are similar?
    3) regarding ElementQualifier - so basically, you advise to use this if XML nodes should have the same names but different attributes?

    Thank you for your time and explanations - they've been very helpful so far!

     
  • Stefan Bodewig
    Stefan Bodewig
    2010-05-14

    The DifferenceEngine works recursively, yes.  The only implementation right now is DOM based - well, using the Java version, that is - and this implies a certain performance and memory penalty.  Doing it a different way - the current .NET version uses a pull parser approach - is less flexible or more complex to code.

    The file size of an XML document usually doesn't give enough information since the depth of the tree may be quite different for two serialized documents.  It also depends on the parser implementation to a certain extent.  Your best option is to read some sample documents into DOM Document instances and see what you end up with.  Personally I wouldn't consider 3 MB to be big and would expect Java to handle that, although you may end up using a three digit amount of megs of memory.

    Regarding your question (2): yes.  No differences => equal. At least one difference, all differences are recoverable => similar.  At least one non-recoverable difference =>  different.

    The answer to (3) can become arbitrarily complex and if you look through the forums or mailing list archives or even the bug tracker you'll see that many questions are asking for the best ElementQualifier implementation that would allow certain documents to be considered similar.  If all that is needed to match the elements that should be compared are the element's name and some attributes, the built-in ElementNamAndAttributesQualifier will do.