Hi!
I have a system going that takes records from local emergency rooms and classifies them by "syndrome" based on rules written in SQL. It uses a full text index (tsearch2) in PostgreSQL and works just fine.
However, it would be a great extension of this system to also run a Bayesian test on new records, based on prior classifications, to try to catch subtle attributes of records that would be missed by the simple SQL word search. It would also allow the use of probabilities rather than the binary AND/OR/NOT logic I currently have in SQL.
I have done a little testing, and dbacl has proven it can do the job, but I am not sure if I am doing it the best way.
Currently, each record is tested with a SQL select for each syndrome, and if it matches, it matches. There are cases where syndromes are mutually exclusive, in which case the SQL for the excluded syndrome is simply "AND NOT IN"'d onto the included syndrome's SQL.
I extracted data for a couple of syndromes (records that matched the SQL select) and trained dbacl on them. I thought I could run them one at a time and use a threshold as indicated in the docs, but I can't really figure it out. I want something like "This record has a 24% probability of being an actual case of syndrome X." How do you get that kind of result for a single-category test?
What I do have working is a comparison between several categories, and an "All Others" category. I think this is really the way dbacl was meant to work since it is analogous to "Spam/Notspam" and so many examples look that way. The only issue with that is that if the record matches multiple syndromes, the percentages get low as they are spread over multiple syndromes, so I don't know where to set my threshold. For example, a record could be 23% sepsis, 12% respiratory, 18% asthma, etc.
I think I would like to be able to do a single-category test with a threshold, to keep this as close as possible to the existing system so I don't scare anyone when I try to explain it.
Hope this makes sense...
Ian
This sounds very interesting, but I'd like to offer a caveat before going into the technical details.
dbacl is probably not the best tool for what you want to do. It classifies text by looking at all the words equally, so while it is true that it will pick up learned syndromes in free-form text (I assume you are talking about free-form text?), it can also pick up other artifacts, such as people's names if they occur frequently. If your records have a more rigid structure, such as containing keywords only, then dbacl's assessment might be more meaningful medically.
The medical statistics community has developed much more appropriate methods of inference based on real models and experience. Free systems such as BUGS (http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml) are designed to rapidly prototype such models graphically, and R is the statistical equivalent of Perl, with a lot of ready-made modules. It may be worth your while to have a quick chat with a local statistician, or to ring up a university department and ask about Bayesian medical inference. Such people have many simple methods at their fingertips and can give useful advice.
On to your question...
For each category that is loaded, the "-n -v" switch combination gives you minus the log probability of the sequence of words under that category's model. For example:
% cat mail/testing/octet | dbacl -c spam -n -v
spam 20.50 * 167.0
This says that 167 tokens were used, and that the probability under the spam model was 1/2^(3423.5), i.e. 20.50 bits per token times 167 tokens, which is very small (close to 0%). You won't usually get meaningful percentages, because each token of input is considered a clue in its own right, and the more clues there are, the more confident dbacl gets, so to speak. However, the number 20.50 above is the average evidence against the model per token, so you can compare 20.50 against a threshold (say 10). Any value below the threshold means you accept the model. How you choose the threshold is up to you :-(, but this approach is a standard one-sided test procedure.
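As a concrete sketch, such a single-category test could be scripted as below (the "sepsis" category, the record.txt file, and the threshold of 10 are all assumptions for illustration; the awk field position relies on the "-n -v" output format shown above):

# Accept the record for category "sepsis" if the average
# per-token score falls below the chosen threshold (10 here).
score=`dbacl -c sepsis -n -v < record.txt | awk '{print $2}'`
if awk -v s="$score" 'BEGIN { exit !(s < 10) }'; then
    echo "accept as sepsis (score $score)"
else
    echo "reject (score $score)"
fi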
When dbacl loads two or more categories at once, it picks the lowest-scoring category, which is equivalent to a mutually exclusive choice. But with the "-n -v" switches, every category score is listed in the output, so instead of picking a single best category you can pick several, if each of their scores is below a common or individual threshold. The scores are computed separately by dbacl, so it's safe to mix and match.
% cat mail/testing/octet | dbacl -c spam -c notspam -n -v
spam 20.50 * 167.0 notspam 21.93 * 167.0
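Building on the previous sketch, you could select every category whose per-token score stays below the threshold (again a hypothetical record and hypothetical syndrome categories; the parsing assumes the scores are printed on one line in "name score * tokens" groups, as above):

% dbacl -c sepsis -c respiratory -c asthma -n -v < record.txt |
    awk '{ for (i = 1; i <= NF; i += 4)    # fields come in groups of 4
             if ($(i+1) < 10) print $i }'  # keep categories scoring below 10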
When you use the -U switch, you get a percentage, but it's only meaningful for a mutually exclusive choice, and it technically represents not the probability of the model itself, but rather the probability that dbacl made the correct choice, given the variability of the computations. This percentage is a kind of "unsure" tag: when it is zero, two or more categories are equally likely; when it is 100%, the best choice is sufficiently separated from the other possibilities.
Your idea of creating an "all others" category is clever. It's possible to create pairs of categories such as "sepsis" + "not-sepsis", "respiratory" + "not-respiratory", etc. (i.e., your sample documents can be used for overlapping categories). Then you can compare "sepsis" against "not-sepsis" as a single binary classification and use the -U percentage output to decide for that syndrome, then repeat with "respiratory" versus "not-respiratory", and so on. In general this will give you different results than the score threshold method, but remember that all categories being compared are treated exclusively.
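For instance, the training and the per-syndrome binary tests might look like this (the file names and category names are assumptions; -l tells dbacl to learn a category from the given text):

% dbacl -l sepsis sepsis_records.txt
% dbacl -l not-sepsis other_records.txt
% dbacl -c sepsis -c not-sepsis -U < record.txt

and then the same again with "respiratory" versus "not-respiratory", each pair forming its own independent two-way test.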
Thank you! That is very helpful. The Chief Complaint and Diagnosis data I have consists of a very small number of words, with the same words appearing very frequently. Tons of acronyms, but they are well known. Something like
LOC HEAD LAC FALL
would be typical, meaning "loss of consciousness and head laceration resulting from a fall." Other records might have only one word, such as
PNEUMONIA
While the vocabulary is relatively small, the way the words are combined is pretty distinct for a given syndrome. I think dbacl will work pretty well.
I actually use the R language inside PostgreSQL (plr), which is extremely cool. All I use it for right now is generating graphs, but I could easily use it for more if I knew anything about statistics! Come to think of it, someone smarter than me could probably re-create some small subset of dbacl's functionality in plr along with tsearch, just for this project, keeping everything neatly in the database. Hmm....
Thank you again! I have my work cut out for me now.
I agree that with this kind of input things are probably going to work quite well. Here are some other tips you might be able to use:
When the number of tokens is small (say ten at most), the -N switch is appropriate. It gives the true normalized posterior probabilities instead of the logarithmic scores. So you would type, for example:
% echo "loc head lac fall" | dbacl -c shake -c twain -Nv
shake 81.08% twain 18.92%
Here my categories are inappropriate, but you can see the (mutually exclusive) percentages rather than scores, and the percentages always sum to 100%. When the number of tokens gets too large, the posterior probability concentrates on a single category, so you get 100% and 0%, which isn't helpful; you're better off using scores in that case. This is the common case with long pieces of text, and it is why I didn't mention -N.
By varying the categories on the command line, you can explore conditional probabilities, i.e., what happens if you find that one category is inappropriate and want to exclude it, etc. But remember that if you give only one category on the command line, then the posterior is 100% and you must look at the score instead.
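For instance (reusing the hypothetical syndrome categories from above), dropping one category and re-running gives the posterior renormalized over the remaining choices:

% echo "loc head lac fall" | dbacl -c sepsis -c respiratory -c asthma -Nv
% echo "loc head lac fall" | dbacl -c respiratory -c asthma -Nv

The second command shows the percentages conditional on the record not being sepsis.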
And finally, you probably know this already, but CRAN (http://cran.r-project.org/) lists hundreds of packages, so you can certainly find something that does what dbacl does and more. Look for packages that do Bayesian classification or regression, or contingency tables, that sort of thing.