I am investigating the potential risks of making a part-of-speech tagger model that was trained on sensitive/private data generally available to the public. We have some clinical notes that have been tagged for part of speech, and we are in the process of training the OpenNLP part-of-speech tagger on our corpus plus the Penn Treebank and GENIA (or some subset of these three). Because the clinical data contains personal medical information, it is vitally important that we do not compromise the confidentiality of the original data. The question I want a definitive answer to is this: if we made such a model available, is there any way to reverse engineer it so that one could reconstruct fragments of the original data (e.g. word bigrams, word trigrams, or even likely sentences) that would violate our patient confidentiality requirements? Those requirements are very strict; we would hate to see, for example, a trigram like "Thomas Mortan halitosis" surface!
I have been looking into this question, and my answer is that there is no risk in releasing a part-of-speech tagger model (to the extent that word unigrams are not a problem). I have been examining the contents of the available "tag.bin.gz" model, which is a GISModel. My conclusion is that no parameters directly contain word bigrams or trigrams, which is a direct result of what DefaultPOSContextGenerator is doing (or not doing). Furthermore, it would be difficult to extract even likely bigrams (much less trigrams), because a context corresponding to a "previous word" has a parameter tied to the outcome (the POS tag) rather than to the word being classified. So the only way you could recover a likely bigram is if some outcome had very few contexts associated with it.
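To make that concrete, here is an illustrative sketch of the shape of the contexts I am describing. This is not the actual DefaultPOSContextGenerator source, and the feature names are made up; the point is that every feature pairs a single word (or a previous tag) with the outcome, so no word-word pair is ever stored in the model:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only -- not the real DefaultPOSContextGenerator.
    // Each feature involves at most one word; the "previous word"
    // feature is paired by the model with the outcome TAG, not with
    // the word being classified, so no word bigram appears anywhere.
    public class ContextSketch {
        public static String[] contextsFor(String[] toks, String[] prevTags, int i) {
            List<String> feats = new ArrayList<String>();
            feats.add("w=" + toks[i]);                  // current word (unigram)
            if (i > 0) {
                feats.add("pw=" + toks[i - 1]);         // previous word, by itself
                feats.add("pt=" + prevTags[i - 1]);     // previous tag, not a word
            }
            // (suffix, prefix, and capitalization features omitted)
            return feats.toArray(new String[0]);
        }
    }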
The same conclusion would apply to releasing a chunker model, but only if its context generator did not include word bigrams, which the DefaultChunkerContextGenerator does.
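By contrast, a context generator that emits word-bigram features embeds actual word pairs from the training data in the predicate names themselves. A hypothetical sketch of such a feature (the "pw,w" name is made up for illustration):

    public class BigramFeatureSketch {
        // Hypothetical bigram feature: the predicate string itself embeds
        // two adjacent training words, e.g. "pw,w=Thomas,Mortan" -- exactly
        // the kind of fragment that must not surface.
        public static String bigramFeature(String[] toks, int i) {
            return "pw,w=" + toks[i - 1] + "," + toks[i];
        }
    }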
Any advice, clarification, or correction regarding the concerns and observations above would be greatly appreciated.
Thanks,
Philip
Hi,
Your analysis looks basically correct. Also note that features have to occur a certain number of times (I think this parameter is set to 10 for the POS tagger), so very infrequent words won't make it into your model.
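A minimal sketch of where that cutoff enters, assuming the classic opennlp.maxent GIS API (check your version's javadoc for the exact trainModel signature): the third argument is the feature-frequency cutoff, so any feature seen fewer than that many times, including a rare name unigram, is discarded before the model is built.

    import opennlp.maxent.EventStream;
    import opennlp.maxent.GIS;
    import opennlp.maxent.GISModel;

    public class TrainWithCutoff {
        // Assumes the classic opennlp.maxent API; the signature of
        // GIS.trainModel may differ in your version.
        public static GISModel train(EventStream events) throws java.io.IOException {
            int iterations = 100;
            int cutoff = 10;  // features occurring fewer than 10 times are dropped
            return GIS.trainModel(events, iterations, cutoff);
        }
    }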
You can also verify this by converting your model to the text format; the main of SuffixSensitiveGISModelReader will do this for you.
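Something along these lines should work (the classpath and exact arguments may vary by version, so check the class's usage output):

    java -cp maxent.jar opennlp.maxent.io.SuffixSensitiveGISModelReader tag.bin.gz

You can then search the resulting plain-text dump for any patient names to confirm that nothing sensitive survived the cutoff.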
For the chunker, you might try replacing capitalized words which aren't in a dictionary (one that you've cleaned of your personal data) with a default token (like _NAME) so that your performance doesn't suffer too much.
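A hypothetical pre-processing sketch of that replacement (the dictionary is assumed to be loaded elsewhere and already scrubbed of names, and _NAME is a placeholder for whatever token you choose):

    import java.util.Set;

    public class NameMasker {
        // Hypothetical sketch: replace capitalized tokens that are not in
        // the (scrubbed) dictionary with a placeholder, so the chunker's
        // training data never contains the real names.
        public static String mask(String token, Set<String> dictionary) {
            boolean capitalized = token.length() > 0
                    && Character.isUpperCase(token.charAt(0));
            if (capitalized && !dictionary.contains(token.toLowerCase())) {
                return "_NAME";
            }
            return token;
        }
    }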
Hope this helps...Tom
Thank you for the suggestions! I really appreciate them.
Regards,
Philip