CreateFromDocument() hangs

Help
Nicomachus
2013-06-16
2013-06-17
  • Nicomachus
    Nicomachus
    2013-06-16

    Greetings, I have been using PyXB Version 1.2.2 for a few weeks on a major project and have found it to be very capable and well suited to the processing of advanced schema. My organization is very excited about making use of this sophisticated technology. Recently I have run into a challenge that I am unable to resolve on my own. The surface symptom is an apparent "hang" in CreateFromDocument() when called from PyXB's Class.py generated file. The issue seems to be related to validation processing of a large number of links. Here is what the XML looks like:

       <MySpecialPool behavior="CreateUpdateDefaulting" excludedFromETag="false">
            <link href="https://10.0.0.1:12443/rest/api/uom/MySystem/caae9209-25e5-35cd-a71a-ed55c03f294d/MySpecialPool/f2b33279-e9b7-3444-b388-70c6e6fa74b6" rel="related"/>
            <link href="https://10.0.0.1:12443/rest/api/uom/MySystem/caae9209-25e5-35cd-a71a-ed55c03f294d/MySpecialPool/f5ef7694-fdf6-366d-86a0-39f2c481ee5c" rel="related"/>
            <link href="https://10.0.0.1:12443/rest/api/uom/MySystem/caae9209-25e5-35cd-a71a-ed55c03f294d/MySpecialPool/1ec6b678-2fd9-3ef8-9a86-e5c7ed1cbec4" rel="related"/>
            <link href="https://10.0.0.1:12443/rest/api/uom/MySystem/caae9209-25e5-35cd-a71a-ed55c03f294d/MySpecialPool/b99b936f-8666-3622-8fdb-c4c807909b8c" rel="related"/>
        <!-- 26 more similar links --> 
        </MySpecialPool>
    

    There are approximately 30 of these links. Experimentally, if I reduce the number of links to 3, CreateFromDocument() returns after a few minutes (so the "hang" is probably a very long delay). I have tracked the problem as far as the following PyXB code:

    pyxb/binding/content.py AutomatonConfiguration.step():

     372         # closures that will update the instance content based on the path.
     373         for (cfg, pending) in multi:
     374
     375             cand = cfg.candidateTransitions(sym)
     376             for transition in cand:
     377                 clone_map = {}
     378                 ccfg = cfg.clone(clone_map)
     379                 new_multi.append( (transition.apply(ccfg, clone_map), pending+(transition.consumedSymbol().consumingClosure(sym),)) ) 
     380         rv = len(new_multi) 
     381
     382         if 0 == rv:
     383             # No candidate transitions.  Do not change the state.
     384             return 0
     385         if 1 == rv:
     386             # Deterministic transition.  Save the configuration and apply the
     387             # corresponding updates.
     388             self.__multi = None
     389             (self.__cfg, actions) = new_multi[0]
     390             for fn in actions:
     391                 fn(self.__instance)
     392         else:
     393             # Non-deterministic.  Save everything for subsequent resolution     .
     394             self.__cfg = None
     395             self.__multi = new_multi
    

    Here is what I have been able to deduce with my crude debugging of this code snippet. '__multi' appears to grow exponentially always falling into the 'else' "Non-deterministic" path. I get the impression that there are unresolved automaton configurations piling up exponentially and the code is searching for a resolution.

    One additional data point. pyxb.RequireValidWhenParsing(False) bypasses this problem (the above code path is not called), although I suspect it skips other validation checks as well, which would not be desirable in the long run.

    Any assistance would be great appreciated! Thank you for listening!

     
    Last edit: Nicomachus 2013-06-16
  • Peter A. Bigot
    Peter A. Bigot
    2013-06-17

    This would probably be https://sourceforge.net/apps/trac/pyxb/ticket/173

    If you have control of the schema, you can probably eliminate the problem by refactoring it. If you can't modify the schema but they are publicly available, you could add a comment to the ticket describing where to obtain them and provide a sample document that reproduces the problem and I'll look into adding that feature.

     
    • Nicomachus
      Nicomachus
      2013-06-17

      Peter, I am very grateful for your quick response, ticket info, and suggestions! Things are starting to make sense.

      Can you expound a little bit more, in general terms, on how refactoring could eliminate this issue? Mainly, what category of changes should we be looking to make? I would like to be able to communicate a strategy to the team that authored the schema.

      Thank you again!

       
  • Peter A. Bigot
    Peter A. Bigot
    2013-06-17

    The example referenced in the ticket has a structure like the following:

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
      <!-- Essentially: (a*)* -->
      <xsd:element name="a" type="xsd:string"/>
      <xsd:complexType name="parametersType"> 
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
          <xsd:choice>
            <xsd:element ref="a" minOccurs="0" maxOccurs="unbounded"/>
          </xsd:choice>
        </xsd:sequence>  
      </xsd:complexType> 
      <xsd:element name="parameters" type="parametersType"/>
    </xsd:schema>
    

    Here you have something that becomes, in regular expression terms, (a*)*.

    Represent a structural match of the inner closure by parentheses. For a two
    character sequence there are two parses, depending on which closure is continued:

    (aa) (a)(a)
    

    For a three, there are four:

    (aaa) (aa)(a) (a)(aa) (a)(a)(a)
    

    You can see that the number of alternatives grows exponentially. Because the Finite Automaton with Counters doesn't know what comes next, it has to keep them all in case a subsequent element rules out one or more alternative.

    From the accepted language perspective (a*)* is equivalent to a*. From a structure perspective it is not. PyXB's content model strictly represents the structural derivations, which means the ambiguity in the schema is retained in the candidate parses.

    Eliminate the ambiguity. In the schema above that could be done with:

    <xsd:complexType name="parametersType"> 
      <xsd:sequence minOccurs="0" maxOccurs="unbounded">
        <xsd:choice>
          <xsd:element ref="a"/>
        </xsd:choice>
      </xsd:sequence>  
    </xsd:complexType>
    

    Doing so does not change the XML that is accepted, it just changes the description of how the XML is processed. If your schema team can identify the cases where this occurs they should be able to eliminate them.

    You can see more on the theoretical background of this, including references to special cases where the structure that was discarded by a previous greedy implementation is important, at http://pyxb.sourceforge.net/arch_content.html#validating-the-content-model.

     
    • Nicomachus
      Nicomachus
      2013-06-17

      Peter, thank you very much for taking the time to illuminate this issue. Your explanation has been very helpful. Cheers!