OpenML for Predictive Modelling in Food / Discussion / Tec - Predictive microbial modelling: What information should be described in the exchange format of this sub-domain

Matthias Filter - 2013-09-25

An excerpt from previous discussions:

**(As a first attempt, this thread is intended to focus on the classical primary / secondary predictive model approach with one dependent variable.)**

Specification of models should contain:

Formula(s):
+ explicit specification (e.g. as MathML string) (mandatory)
+ name (optional),
+ literature reference (optional)

Model parameters / coefficients used in formula(s):
+ name (mandatory),
+ value (mandatory),
+ standard deviation (optional),
+ variance-covariance matrix (optional),
+ unit (optional)

Model variables used in formula(s):
+ name (mandatory),
+ valid range (mandatory),
+ unit (mandatory),
+ standard value (optional)

Environmental factors not included explicitly in the model (e.g. food matrix, pressure, CO2 etc.):
+ proposed area of validity (optional)

Model generation:
+ raw data used for model generation (optional)
+ measures of goodness of fit on raw data used for model generation (mandatory):
+ residuals (optional)

Model metadata:
+ name (mandatory),
+ (literature) reference (optional),
+ assignment to process type (growth / inactivation / survival / cross-contamination etc. or any combination thereof) (mandatory)
+ created by (optional),
+ created when (optional)
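
As an illustration only, part of the list above could be mapped onto a small data structure. This is a minimal sketch, not a proposed format: the class names, field names, the `problems` helper and all example values are hypothetical, and several items (variance-covariance matrix, environmental factors, raw data, residuals) are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Parameter:
    name: str                          # mandatory
    value: float                       # mandatory
    std_dev: Optional[float] = None    # optional
    unit: Optional[str] = None         # optional


@dataclass
class Variable:
    name: str                               # mandatory
    valid_range: Tuple[float, float]        # mandatory, (min, max)
    unit: str                               # mandatory
    standard_value: Optional[float] = None  # optional


@dataclass
class FittedModel:
    name: str                          # mandatory (model metadata)
    process_types: List[str]           # mandatory, e.g. ["growth"]
    formula_mathml: str                # mandatory, explicit formula
    goodness_of_fit: Dict[str, float]  # mandatory, e.g. {"RMSE": 0.12}
    parameters: List[Parameter] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    reference: Optional[str] = None    # optional
    created_by: Optional[str] = None   # optional

    def problems(self) -> List[str]:
        """Return a list of missing or inconsistent mandatory items."""
        found = []
        if not self.name:
            found.append("model name is missing")
        if not self.process_types:
            found.append("process type is missing")
        if not self.formula_mathml:
            found.append("formula is missing")
        if not self.goodness_of_fit:
            found.append("goodness-of-fit measure is missing")
        for var in self.variables:
            low, high = var.valid_range
            if low > high:
                found.append(f"variable {var.name}: invalid range")
        return found


# Hypothetical example values, for illustration only.
example = FittedModel(
    name="Hypothetical growth model",
    process_types=["growth"],
    formula_mathml="<math><apply><times/><ci>m</ci><ci>x</ci></apply></math>",
    goodness_of_fit={"RMSE": 0.12},
    parameters=[Parameter(name="m", value=0.8, unit="1/h")],
    variables=[Variable(name="x", valid_range=(0.0, 48.0), unit="h")],
)
print(example.problems())  # an empty list means no problems were found
```

Marking optional items with `None` defaults keeps the mandatory / optional split from the list visible in the type signatures.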

Last edit: Matthias Filter 2013-09-26
- Andras Gefferth - 2013-10-04
  
  In general I agree with the list above, and I agree with Fernando regarding the importance of these considerations.
  
  However, I have a few comments/questions.
  Some of these may only be due to my limited background in food modeling.
  
  1, Formula(s)
  So is it intended that more than one formula belongs to the model?
  E.g. one for the primary and some others for the secondary models?
  In this case they should be connected explicitly, so that it is known which secondary formula describes which primary parameter.
  Also in this case each formula section has to have its own parameters and variables section.
  
  Literature reference: is it different from the literature reference in #6?
  
  2, Model parameters
  
  Value: is this really mandatory? We could differentiate between bound and unbound models, where in the latter case the parameter values are left to the model user to calibrate based on their own measurements.
  
  Standard deviation, variance, covariance matrix: It is not clear what these mean. Are they connected to how the parameters were calibrated? But then they are affected by the calibration method. Or maybe I'm missing something here.
  
  5, Model generation
  I assume the raw data would be described using the "lab data description" ML format?
  
  General
  I do not see where we define what the model is for. E.g. if I create a model to describe what happens to Bacteria X during plasma treatment, how will the model user know that this model is for Bacteria X and for plasma treatment?
  In which section would this information be stored?
  
  Or maybe I want to create a model for a combination of two or more treatments, where can I define which treatments are these and in what order they were performed?
  
  Another point: I see nothing which would be specific to primary / secondary type of models in the above list.
  
  Last edit: Andras Gefferth 2013-10-04
  
  - Matthias Filter - 2013-10-05
    
    1, Formula(s)
    So is it intended that more than one formula belongs to the model?
    
    Yes, I think this should be possible.
    
    E.g. one for the primary and some others for the secondary models?
    In this case they should be connected explicitly, so that it is known which secondary formula describes which primary parameter.
    Also in this case each formula section has to have its own parameters and variables section.
    
    I think an explicit connection of formulas is only necessary, if the parameter and variable names do not fit together. E.g. as in the following example: y=m*x+n and r=T^2 +u with m=r
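
    The example above can be sketched in code. This is only an illustration under assumed names: the functions `primary`, `secondary` and `predict`, the `links` mapping and the numeric inputs are all hypothetical and not part of any agreed format.

    ```python
    # Primary formula from the example: y = m*x + n
    def primary(x, m, n):
        return m * x + n


    # Secondary formula from the example: r = T^2 + u
    def secondary(T, u):
        return T ** 2 + u


    # The names do not fit together (the secondary output is called "r",
    # the primary parameter is called "m"), so an explicit link is needed.
    links = {"m": "r"}  # primary parameter name -> secondary output name


    def predict(x, T, n, u):
        """Evaluate the secondary formula, then feed it into the primary one."""
        secondary_outputs = {"r": secondary(T, u)}
        primary_params = {"n": n}
        for param_name, output_name in links.items():
            primary_params[param_name] = secondary_outputs[output_name]
        return primary(x, **primary_params)


    # Hypothetical input values, for illustration only.
    print(predict(x=2.0, T=3.0, n=1.0, u=0.5))
    ```

    If the secondary formula were written as m=T^2 +u instead, the `links` mapping would be unnecessary, because the names would already fit together.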
    
    Literature reference: is it different from the literature reference in #6?
    
    Yes, #6 refers to a fitted model, i.e. a formula with estimated model parameters, where the reference points to the paper that made the fitting. #1 refers to the formula itself, e.g. like Baranyi or Gompertz etc. - usually these references are more mathematical in nature.
    
    2, Model parameters
    Value: is this really mandatory? We could differentiate between bound and unbound models, where in the latter case the parameter values are left to the model user to calibrate based on their own measurements.
    Standard deviation, variance, covariance matrix: It is not clear what these mean. Are they connected to how the parameters were calibrated? But then they are affected by the calibration method. Or maybe I'm missing something here.
    
    I think this is a misunderstanding, as we have not yet synchronized our terminology. I use the term model for an equation with already determined parameters / coefficients (only the variables have to be specified by the users to make a prediction). I think your comment refers to formulas where the parameters / coefficients are not yet determined. I personally thought that the information exchange format will probably only be used for already determined models.
    
    5, Model generation
    I assume the raw data would be described using the "lab data description" ML format?
    
    If provided, this would be reasonable. As you know, it is quite often the case that the raw data are no longer available and only the parameter estimates of the model can be extracted from a publication.
    
    General
    I do not see where we define what the model is for. E.g. if I create a model to describe what happens to Bacteria X during plasma treatment, how will the model user know that this model is for Bacteria X and for plasma treatment?
    In which section would this information be stored?
    Or maybe I want to create a model for a combination of two or more treatments, where can I define which treatments are these and in what order they were performed?
    
    Good point! I would suggest defining these items under #4 as well. There it should be possible to define environmental factors as time-dependent entities, by which one could also define the "order of treatments".
    
    Another point: I see nothing which would be specific to primary / secondary type of models in the above list.
    
    In a sense you are right, but if you go into the data mining domain (neural networks, decision trees, time series analysis etc.) there are so many special methods possible, that could in principle also be applied to predictive microbiology, that I thought it might be better to exclude these methods in this first discussion round.
    
    - Andras Gefferth - 2013-10-06
      
      I think an explicit connection of formulas is only necessary, if the parameter and variable names do not fit together.
      
      You are right. It may actually be a requirement that these have to fit, then there is no need for explicit connection.
      
      I personally thought that the information exchange format will probably only be used for already determined models.
      
      It is good that we agree on a terminology.
      
      My experience with these kinds of models is limited; I don't know how often a formula is recycled to create a new model. (For example, I don't know whether a recycled formula is typically applied to a totally different food matrix, or whether the same experiment is repeated, e.g. with more samples, so that a better fit can be obtained.)
      So the whole point is that I'm not the right person to judge whether an "empty" model, that is, the formula only, has any value in itself.
      However, if it has, then I think the ML could be easily extended to cover this case.
      For example, I have the following scenario in mind, where such a formula definition would come in handy:
      1. I download a formula definition file.
      2. I make measurements.
      3. I store my measurements in the database with appropriately named columns.
      4. The software then matches the formula with my measurements and calibrates the parameters automatically, creating the model from the formula and the measurements.
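
      The scenario above can be sketched as follows. This is a minimal sketch under strong assumptions: the dictionary layout of `formula`, the function `calibrate_linear` and the measurement values are hypothetical, and the linear formula y = m*x + n from earlier in the thread is used only because it can be fitted in closed form. Real primary models are usually nonlinear and would need an iterative fitting routine.

      ```python
      # Step 1: a (hypothetical) formula definition, as it might be downloaded.
      formula = {
          "expression": "y = m*x + n",
          "dependent": "y",
          "variables": ["x"],
          "parameters": ["m", "n"],
      }

      # Steps 2 and 3: measurements stored with appropriately named columns.
      # The numbers are made up for illustration.
      measurements = [
          {"x": 0.0, "y": 1.0},
          {"x": 1.0, "y": 3.0},
          {"x": 2.0, "y": 5.0},
          {"x": 3.0, "y": 7.0},
      ]


      def calibrate_linear(formula, rows):
          """Step 4: match columns to the formula and fit m and n by ordinary least squares."""
          needed = formula["variables"] + [formula["dependent"]]
          missing = [col for col in needed if any(col not in row for row in rows)]
          if missing:
              raise ValueError(f"measurement columns missing: {missing}")

          xs = [row[formula["variables"][0]] for row in rows]
          ys = [row[formula["dependent"]] for row in rows]
          n_obs = len(xs)
          mean_x = sum(xs) / n_obs
          mean_y = sum(ys) / n_obs
          sxx = sum((x - mean_x) ** 2 for x in xs)
          sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          slope = sxy / sxx
          intercept = mean_y - slope * mean_x
          return {"m": slope, "n": intercept}


      fitted = calibrate_linear(formula, measurements)
      print(fitted)
      ```

      The column check is the part that depends on the formula definition: it only works if the variable names in the file and the column names in the database fit together.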
      
      Another issue relating to the same point: Even in this case I am not sure about the meaning of the standard deviation/variance/cov matrix of the parameters. E.g. if we perform a mean-square estimation then we have one set of parameters and the residual mean-square error describes how well it was fitted. If we perform maximum likelihood (or some similar) then we can measure success with a probability density value.
      
      I would suggest defining these items under #4 as well. There it should be possible to define environmental factors as time-dependent entities, by which one could also define the "order of treatments".
      
      Actually one reason why I ask this is that this would bring us to my "favourite" subject :), the flow-chart specification.
      
      if you go into the data mining domain (neural networks, decision trees, time series analysis etc.) there are so many special methods possible
      
      I see. That's right. It is also important to have a good idea about what is not in the scope of the ML. We can add stochastic modelling to this list as well.
      
      - Matthias Filter - 2013-10-11
        
        I personally thought that the information exchange format will probably only be used for already determined models.
        ...
        It is good that we agree on a terminology.
        ...
        However, if it has, then I think the ML could be easily extended to cover this case.
        
        I agree. The point is, that formulas ("empty" models) are much, much easier to define. AND: in my understanding only formulas can be "recycled". Once you have created a model by fitting one or more formulas to your data, that's it. Of course one can think of "updating an existing model" (as e.g. in the Bayesian or Neural Network domain), but I would prefer to handle even these updated models as independent entities (like e.g. in the software domain, where you can have software with different versions).
        
        ...E.g. if we perform a mean-square estimation then we have one set of parameters and the residual mean-square error describes how well it was fitted. If we perform maximum likelihood (or some similar) then we can measure success with a probability density value.
        
        In my opinion this relates to the question "Which item should be part of the format definition". I think what you describe fits into "section 5 - Model generation" and there into "+ measures of goodness of fit on raw data used for model generation (mandatory)". So a "probability density value" could be one of the possible optional values (as e.g. MSE, R^2, RMSE, BIC, AIC, SSE etc.).
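
        For illustration, several of the measures named above can be computed from observed and predicted values. This is a sketch under stated assumptions: the function name `goodness_of_fit` and the data are hypothetical, MSE is taken as SSE divided by the number of observations, and AIC and BIC use the common least-squares form for Gaussian errors with additive constants dropped.

        ```python
        import math


        def goodness_of_fit(observed, predicted, n_params):
            """Compute SSE, MSE, RMSE, R^2, AIC and BIC for a fitted model."""
            n = len(observed)
            residuals = [obs - pred for obs, pred in zip(observed, predicted)]
            sse = sum(res ** 2 for res in residuals)
            mse = sse / n
            rmse = math.sqrt(mse)
            mean_obs = sum(observed) / n
            sst = sum((obs - mean_obs) ** 2 for obs in observed)
            r_squared = 1.0 - sse / sst
            # Least-squares form, assuming Gaussian errors; requires sse > 0.
            aic = n * math.log(sse / n) + 2 * n_params
            bic = n * math.log(sse / n) + n_params * math.log(n)
            return {
                "SSE": sse,
                "MSE": mse,
                "RMSE": rmse,
                "R^2": r_squared,
                "AIC": aic,
                "BIC": bic,
            }


        # Made-up values, for illustration only.
        observed = [1.0, 3.0, 5.0, 7.0]
        predicted = [1.1, 2.9, 5.2, 6.8]
        measures = goodness_of_fit(observed, predicted, n_params=2)
        print(measures)
        ```

        Returning the measures as name / value pairs matches the idea that any of them may be supplied as one of the possible values.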
        
        Actually one reason why I ask this is that this would bring us to my "favourite" subject :), the flow-chart specification.
        
        I think, this also needs a bit of clarification. Is "flow-chart specifications" the same as a description of "food processing / handling chains"? If that's the case, then this might be a bit out of scope of this discussion thread. Usually you would not create ONE model for the whole production process, but you would apply several different models consecutively (e.g. pasteurization of milk might use growth, survival and inactivation models). My thinking in this area is very much based on the Modular Process Risk Model concept. So if we want to achieve harmonization of the description of single process steps and process chains it could be useful to open up a new discussion thread for that topic (which still is of highest relevance for all!).
        
        Andras Gefferth - 2013-10-15
        
        one can think of "updating an existing model" (...), but I would prefer to handle even these updated models as independent entities
        
        totally agree
        
        In my opinion this relates to the question "Which item should be part of the format definition".
        
        I think I was completely misleading here. I actually wanted to understand what the standard deviation and the variance-covariance matrix in section 2 mean; I just used these as examples. But I know it wasn't very clear.
        
        "food processing / handling chains"?
        
        Yes, I mean the food processing chain. I agree that there will not be a single model for the entire chain; however, one model may sometimes cover more than a single step, which can be described by a mini-chain. But even in the case of a single step, which is a border case of a chain, it has to be defined what this step is, or what kind of treatment the model is for.
        
        So if we want to achieve harmonization of the description of single process steps and process chains it could be useful to open up a new discussion thread for that topic
        
        I agree that this is another topic and needs to go to a different thread.
        
Fernando Perez-Rodriguez - 2013-09-26

In the first instance, I would like to highlight the importance of these initial thoughts and proposals about the relevant elements that we should consider for the harmonization, better interpretation and use of predictive models. So thanks to Matthias for that. Regarding these specifications, I would like to briefly comment that units should also be provided for each environmental factor. For best-fit estimates of model parameters, a CI or SE should be provided in order to assess the model uncertainty or error. In microhibro, which is mainly based on models taken from the literature, we have run into this limitation: in many cases neither CIs nor even SEs are reported for the regression parameters.
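
To illustrate why a reported SE is already useful, an approximate CI can be derived from it. This is a minimal sketch that assumes a normally distributed estimator; the function name `approx_confidence_interval` and the numbers are hypothetical, and for small samples a t-distribution would be more appropriate.

```python
from statistics import NormalDist


def approx_confidence_interval(estimate, standard_error, level=0.95):
    """Approximate two-sided CI from an estimate and its standard error."""
    # Normal approximation: estimate +/- z * SE.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * standard_error
    return (estimate - half_width, estimate + half_width)


# Made-up parameter estimate and standard error, for illustration only.
low, high = approx_confidence_interval(estimate=0.25, standard_error=0.02)
print(round(low, 4), round(high, 4))
```

The reverse direction does not work without further information, which is one reason to store the SE (or the CI together with its level) in the exchange format.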
