
Problems generating large dictionary

gbonk
2013-11-18
2013-11-19
  • gbonk

    gbonk - 2013-11-18

    I've been tasked with creating a dictionary of the OAGIS schema. I've increased the total memory allotted to CAM to 4 GB.

    For each BOD in the OAGIS standard, it takes a couple of days to generate the dictionary on my fairly new laptop. A typical CAM template is around 450 MB.

    I was wondering if there is anything more I can do to speed up the process, or whether there is a command-line batch method to process multiple files.

    Also, would a product like Oracle Enterprise Repository make it easier to generate dictionaries?

     
  • drrwebber

    drrwebber - 2013-11-18

    I had noticed that those OAGi BOD schemas are very large. I suspect there is a high degree of redundancy in there somewhere that we can probably shave off to help speed everything up - since you only need that stuff once, somewhere - not everywhere.

    I've not looked in detail at what OAGi has - but I seem to recall each schema is importing a lot of global stuff.

    For example - in the case of NIEM, what we did was comment out the structures.xsd metadata attributes. These are never used in an actual exchange - they are just cosmetic. Since every element had 5 of these by default, removing them shrank the CAM template by a factor of 5!

    Perhaps BOD message headers are one candidate?

     
  • drrwebber

    drrwebber - 2013-11-18

    BTW - I should add - we are VERY interested in having this succeed. An OAGi BOD dictionary would be super useful, and I know this is a non-trivial undertaking. The 12 NIEM dictionaries we already have took two months to generate successfully because we had to solve a number of technical issues in the process. Of course, once you get there, future updates are much quicker.

    It may make sense to pool our resources on this one and get together to talk it through.

    Thanks, David

     
  • gbonk

    gbonk - 2013-11-18

    I will attach the one that I'm working on... For this particular one, it is the GetSalesOrder.

    The CAM file is 6.8 MB, and with the 4 GB allotted to CAM it began to write the dictionary file but crashed after writing 28.8 MB of it to disk.

    My preference would be to optimize CAM so that the BOD itself does not have to be modified, but if there are some things that can be done quickly and easily for now, that would help. Or is there a way to break the BOD up into parts as well?

     
  • drrwebber

    drrwebber - 2013-11-18

    Makes sense. I just processed this and picked SalesOrder instead of GetSalesOrder, so I pulled in just the payload (and not the outer Application Area and Get parts).

    The size is then 6.7 MB.

    There are indeed a lot of attributes on items that are repeated groups and also elements.

    I see 340 repetitions of

    <DocumentID agencyRole="%string%"> 
    <AlternateDocumentID agencyRole="%string%">
    

    and then 5,182 repetitions of

    listID="%string%" listAgencyID="%string%" listAgencyName="%string%" listName="%string%" listVersionID="%string%" name="%string%" languageID="%en-US%" listURI="%http://wiki.oasis-open.org/cam%" listSchemeURI="%http://wiki.oasis-open.org/cam%"
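
To reproduce repetition counts like the ones above for another BOD, something along the following lines might help. It is only a sketch in Python, not part of CAM: it assumes the structure being counted can be read as well-formed XML, and the function name and the small sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_attribute_names(source):
    """Stream through an XML document and tally how often each attribute name occurs.

    `source` is a file name or a binary file-like object. Streaming with
    iterparse avoids building the whole tree in memory for large inputs.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts.update(elem.attrib.keys())
        elem.clear()  # drop content that has already been counted
    return counts


# Tiny hypothetical sample that mimics the patterns quoted in this thread.
SAMPLE = """<SalesOrder>
  <DocumentID agencyRole="%string%"/>
  <AlternateDocumentID agencyRole="%string%"/>
  <Code listID="%string%" listAgencyID="%string%" name="%string%"/>
  <Code listID="%string%" listAgencyID="%string%" name="%string%"/>
</SalesOrder>"""

counts = count_attribute_names(io.BytesIO(SAMPLE.encode("utf-8")))
for name, n in counts.most_common():
    print(name, n)
```

Attribute names that appear thousands of times are the natural first candidates for exclusion.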
    

    CAM does have the excludeAttribute() and excludeElement() rules to pluck out things that are not needed. Apply those, then run File / Export / Compress Template before continuing.

    You can try this and see. The Console tab window shows you how far the dictionary builder has progressed - key to knowing when it stops.
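
Before adding exclude rules, it may be worth estimating how much the code-list attributes actually contribute to the size. The Python sketch below does that on an in-memory copy and is not CAM's own mechanism: it does not write any file, the helper names are hypothetical, and treating any element that carries listID as a code-list element is an assumption based on the pattern quoted above.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the repeated pattern quoted in this thread.
CODE_LIST_ATTRS = {
    "listID", "listAgencyID", "listAgencyName", "listName", "listVersionID",
    "name", "languageID", "listURI", "listSchemeURI",
}


def strip_code_list_attributes(root):
    """Remove the code-list attribute group in place and return how many were removed.

    Assumption: only elements that carry listID are treated as code-list
    elements, so a generic attribute such as name is left alone elsewhere.
    """
    removed = 0
    for elem in root.iter():
        if "listID" not in elem.attrib:
            continue
        for attr in list(elem.attrib):
            if attr in CODE_LIST_ATTRS:
                del elem.attrib[attr]
                removed += 1
    return removed


def estimate_savings(xml_text):
    """Return (attributes removed, serialized bytes before, serialized bytes after)."""
    root = ET.fromstring(xml_text)
    before = len(ET.tostring(root))
    removed = strip_code_list_attributes(root)
    after = len(ET.tostring(root))
    return removed, before, after


# Tiny hypothetical sample; a real run would read the exported structure instead.
SAMPLE = """<SalesOrder>
  <DocumentID agencyRole="%string%"/>
  <Code listID="%string%" listAgencyID="%string%" name="%string%"/>
  <Code listID="%string%" listAgencyID="%string%" name="%string%"/>
</SalesOrder>"""

removed, before, after = estimate_savings(SAMPLE)
print(f"removed {removed} attributes, {before} -> {after} bytes")
```

If the estimated reduction is large, that is a reasonable sign the corresponding exclude rules are worth setting up in the template itself.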

    Otherwise, I am thinking along the lines of modifying the dictionary XSLT so it can take shortcuts around all this. Since we know the patterns OAGi is using above, we could, for example, put in an annotation to replace the attribute groups.

     
  • drrwebber

    drrwebber - 2013-11-19

    Reflecting on those listID, listAgencyID and similar metadata, it is an interesting point in the overall strategy of having your components in a dictionary rather than a schema. The schema should really focus on the exchange format, but here it is also being used to perform repository management roles. Going forward, you really want those kept separate.

    Currently the XSLT for dictionary build is looking to analyse your exchange components - so these metadata parts really need to be treated separately - and probably a new "bucket" provided in the dictionary to support these metadata. In ebXML Registry this is handled using slots. All food for thought. I think the first task is just to get something initial working - extracting the exchange components into a dictionary - then we can go from there.

     

    Last edit: drrwebber 2013-11-19
