| **A good lesson on Baysian statistics.** | dr hoo
*Jan 21, 2004 9:24 AM* | | If you use a spam filter, it likely uses Baysian stats. They have application to health issues (like false positives, false negatives in testing), and all sorts of other issues.
(paste from the link at the end of this post)
------------------
Here's a story problem about a situation that doctors often encounter:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
What do you think the answer is? If you haven't encountered this kind of problem before, please take a moment to come up with your own answer before continuing.
--------
Most doctors get this question wrong, btw. Like 85% of them!
If you want to learn, this link does a good job of explaining things in relatively simple terms.
http://yudkowsky.net/bayes/bayes.html |
| **Not to nitpick, but that's Bayesian to you, (nm)** | TJeanloz
*Jan 21, 2004 10:47 AM* | | |
| **yes, to nitpick. But I should know better.** | dr hoo
*Jan 21, 2004 10:56 AM* | | And I did it twice, so I can't claim typo!
Given your knowledge, if you took a look at it, do you think it does a good job for the layperson? |
| **I wouldn't have pointed it out if it hadn't surprised me** | TJeanloz
*Jan 21, 2004 11:03 AM* | | I mean, I don't expect most people to be right on that, but coming from you I couldn't resist.
I haven't actually read the link yet, is it the same as appeared in the Wall Street Journal yesterday? |
| **WSJ? Don't know. I picked it up from...** | dr hoo
*Jan 21, 2004 11:19 AM* | | ... kuro5hin.org, which referred to a Nature article and a NYT article, as well as the link I posted.
http://www.kuro5hin.org/story/2004/1/20/43756/9394
What surprised me was the breadth of influence and use of Bayesian stats. I know that decision theories use it, and I know what jobs that employ Bayesian stats can pay (boy do I know!) but claims like this:
"What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case?"
...are striking. It makes sense to me, given science as ongoing research programs and not single experiments, but seeing it stated that way took me aback for a few seconds. |
| **Aren't you assuming...** | Tri_Rich
*Jan 21, 2004 12:43 PM* | | that those numbers are not independent variables? |
| **I'm not sure what you are asking.** | dr hoo
*Jan 21, 2004 3:45 PM* | | Could you clarify your question a bit? Which numbers? Which variables? |
| **I'm not sure what you are asking.** | Tri_Rich
*Jan 21, 2004 4:52 PM* | | If the error rate of false positives is independent of the other rates, then the question of whether or not the woman has cancer is merely the probability of a true positive.
The 9.6% represents the probability that she does not have cancer given that she tested positive. Wouldn't a Bayesian approach apply to the question, what is the possibility that a 40-year-old woman will be correctly diagnosed with cancer?
I am looking at the textbook from when I had to take this in college, and remembering why I didn't like the class. |
| **Think simple.** | Continental
*Jan 21, 2004 6:30 PM* | | I think you're making it too complex. It's a simple calculation:
True positives /(false positives + true positives) |
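Continental's ratio can be worked through with the numbers from the story problem. What follows is a minimal Python sketch, not anything taken from the linked page; the function name and argument names are made up for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(cancer | positive) = true positives / (true positives + false positives)."""
    # Fraction of all women screened who have cancer AND test positive.
    true_positives = prevalence * sensitivity
    # Fraction of all women screened who do NOT have cancer but test positive.
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Story problem: 1% have cancer, 80% of those test positive,
# 9.6% of women without cancer also test positive.
answer = posterior_given_positive(0.01, 0.80, 0.096)
print(f"P(cancer | positive mammography) = {answer:.1%}")  # about 7.8%
```

So a positive mammography moves the chance of cancer from 1% to just under 8%, which is far lower than most people guess.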
| **Ah, that's why!** | dr hoo
*Jan 21, 2004 6:38 PM* | | not to be pedantic, but...
You start with asking about error, and error is *not* part of the model that is easy to talk about. All models are MODEL + error. We like to ignore the error, at least once we have an idea how big it is. Anything more is really MODEL and not error any more.
Much of the error is associated with prior probabilities. Just stick all the error over there for now. Trust me on this one. It's the big stick of statistics. A or ~A. Model or ~Model.
Now, a gratuitous and pompous paste from the link:
--------------
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren't available on Kazaa?
A. There's a small, cluttered antique shop in a back alley of San Francisco's Chinatown. Don't ask about the bronze rat.
------------
So, now that I've dealt with the first 3 words of your post, let's move on to word 4. There are some issues with the choice of the word "rate" that will take significant discussion!
----can't----go--------on!
The sad thing is, I really can. And I will!
Right, let's go to the graphical aid, assuming your java applets work. It's about 1/3 of the way down the original link http://yudkowsky.net/bayes/bayes.html , and has the number 9.6% in a little box.
This is the FALSE POSITIVE RATE for this cancer test. So given a woman who does NOT have cancer, we get a FALSE POSITIVE test 9.6% of the time. This means someone who tests POSITIVE, but really does NOT have cancer.
Is that a BAD mistake to make? Not really, a retest should be done, which will determine cancer. Likely a biopsy. Relatively minimal expense, but not death. As a side benefit we know the error rate of our original test BY doing the retests!
Now, one of the goals of the Bayesian model is to NARROW things down. Decision theory, focus resources, maximize something... but you must have a test with SOME degree of accuracy. The applet shows this, with a large rectangle tapering down to a small rectangle.
Start by INCIDENCE. In the case of this chart and example, cancer in a population (2%). It could also be something GOOD by the way, but mostly in medicine it is not so good.
80% accuracy for the test CATCHING the cancer, a TRUE POSITIVE. As already said, the FALSE POSITIVE rate is 9.6%.
Now, since almost everyone that tests positive will get RETESTED, our error rates for test-retest will go down significantly. (reasons lie outside the scope of this discussion) So FALSE POSITIVES will get caught almost every time.
Plug any different number into any of those boxes in the applet, and look at how the graph and the other numbers change. Then think about what those numbers mean.
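For anyone whose applet will not load, the same experiment can be approximated in a few lines. This is only a rough stand-in sketched in Python, with hypothetical names, using the 1% / 80% / 9.6% figures from the story problem and an arbitrary population of 10,000 women; change the arguments to see the four boxes move.

```python
def outcome_counts(prevalence, sensitivity, false_positive_rate, population=10_000):
    """Split a screened population into the four test outcomes."""
    sick = population * prevalence
    healthy = population - sick
    return {
        "true positive": sick * sensitivity,            # sick, test says sick
        "false negative": sick * (1.0 - sensitivity),   # sick, test says healthy
        "false positive": healthy * false_positive_rate,          # healthy, test says sick
        "true negative": healthy * (1.0 - false_positive_rate),   # healthy, test says healthy
    }


counts = outcome_counts(0.01, 0.80, 0.096)
for name, count in counts.items():
    print(f"{name:>14}: {count:8.1f}")
```

With these inputs roughly 80 women are true positives, 20 are false negatives, about 950 are false positives, and the rest are true negatives.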
The true positives are no problem, they are sick and know it. They can be treated.
The false positives are no problem, as they will get retested. Minimal expense error, better safe than sorry. And for a cancer retest (biopsy), we have a MORE accurate test.
True negatives are no problem, they are not sick and know it.
False negatives are people that will die. Painfully. That is a BAD thing in a medical test. It might not be so bad in other contexts. |
| **-- continued, though somewhat less pedantic** | dr hoo
*Jan 21, 2004 6:41 PM* | | Given any change in the model (plug in whatever numbers you want) the outcome of this box will move. This is the IMPORTANT information. You need to focus on THIS box and minimize the NUMBER or FREQUENCY of or PROBABILITY of this outcome. It is BAD!!!!! If it is high that is.
If it were good, you would want to play with the numbers to make it happen more.
That's all I have off the top of my head. Good question! I probably made all sorts of mistakes, but I don't teach stats, I only use them. Talking about them in other than equations always confuses me. |
| **The real problem...** | Tri_Rich
*Jan 22, 2004 6:06 AM* | | is that this sentence does not say what I at first thought it said.
9.6% of women without breast cancer will also get positive mammographies.
This is not the same as 9.6% of positives are false, which was the way I originally read it. |
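The gap between the two readings is easy to see numerically. A short Python sketch using only the figures from the story problem (the variable names are illustrative):

```python
prevalence = 0.01            # P(cancer)
sensitivity = 0.80           # P(positive | cancer)
false_positive_rate = 0.096  # P(positive | no cancer), as the problem states it

# Overall chance of a positive test, from both groups of women.
p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate

# The reversed conditional: what share of positive tests are false.
p_no_cancer_given_positive = (1 - prevalence) * false_positive_rate / p_positive

print(f"P(positive | no cancer) = {false_positive_rate:.1%}")
print(f"P(no cancer | positive) = {p_no_cancer_given_positive:.1%}")
```

The first number is the 9.6% given in the problem; the second comes out at roughly 92%, so the two statements are very far from interchangeable.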
| **The lesson needs a little extension** | Continental
*Jan 21, 2004 2:19 PM* | | Disclaimer: this is a quick question, late in the afternoon.
My understanding is that the key to the Bayesian statistics is that you know something about the population you are analyzing. In this case you know that 1% of the population has cancer. If you have the same test but don't know what percent of the population has cancer, you need to use conventional statistics.
The Bayesian statistics are so much more powerful than conventional statistics that even using what you think you know about the population may give a better answer than conventional statistics. For example, in this case you may not know exactly how many women have cancer, but you're sure that it is more than 0.5% and less than 2%. You can get better results on small populations running Bayesian statistics with both estimates than you can using conventional statistics.
Is this about right? At least good enough to fool my boss? |
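Running the calculation with both estimates is straightforward. The Python sketch below only shows how the answer depends on the assumed prior; it says nothing about how Bayesian and conventional methods compare. It reuses the 80% and 9.6% figures from the story problem, and the function name is hypothetical.

```python
def posterior(prior, sensitivity=0.80, false_positive_rate=0.096):
    """P(cancer | positive test) for a given prior P(cancer)."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Low estimate, the story problem's value, and high estimate of the prior.
for prior in (0.005, 0.01, 0.02):
    print(f"prior {prior:.1%} -> P(cancer | positive) = {posterior(prior):.1%}")
```

Across that range of priors the answer stays somewhere between about 4% and about 15%, so even a rough bracket on the prior rules out the intuitive guess of around 80%.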
| **I am tempted to say...** | dr hoo
*Jan 21, 2004 6:57 PM* | | ...look in the above answer. You ask about the model, but your question leads to discussing error. Tri_Rich asked about error, but that really led to discussing the model.
So, really, look in the reply to Tri_Rich.
Start with the cancer equations, drop in estimates where you have no knowledge, and minimize the number/frequency/probability of your boss being NOT FOOLED and you KNOWING it (true negative), or NOT FOOLED and you DON'T KNOW it (false positive).
See, we have a problem already. There are TWO test outcomes that are good for you (FOOLED boss) and TWO that are bad for you (NOT FOOLED boss). In the cancer "test" or "estimates", there was only ONE outcome that we needed to alter. False Negative, which led to death. If you really only want to MOVE that one box a certain direction, you can play with any other number.
If we try to apply your problem, the (four outcome/one that matters) analysis collapses to a (one outcome/other outcome) analysis. Then the error in the estimates CAN overwhelm MODEL, and you will never know it. So you gain no useful information by employing the model.
As for the error rate of you being able to "fool your boss", I certainly would estimate that with less accuracy than YOU.
dr (math in words is hard to do) hoo |
| **There's really only one bad outcome** | Continental
*Jan 22, 2004 7:00 AM* | | Not fooling the boss and not knowing it could really suck. Not fooling the boss but knowing he is not fooled can be handled with minimal damage. Quite an arcane discussion for a non-cycling forum, but it's been informative and thought provoking. Thanks. |
| **Arcane? Yeah, it is.** | dr hoo
*Jan 22, 2004 8:15 AM* | | I'm glad you enjoyed it. I know TRYING to explain it helped me understand it better. Especially with regard to spam filters, even though I did not talk about that in this thread. You can probably see how adjusting the probabilities of spam (based on previous messages, during the "training phase" Bayesian filters use) makes the spam filters more efficient over time. Start with rough estimates of priors and sharpen them with each case.
The medical cases I have run across and applied, but the fun (so to speak) is applying the general concepts to more than one situation. I may need to do some deeper reading on this stuff. |
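The "sharpen them with each case" idea can be sketched as repeated updating, where each posterior becomes the prior for the next piece of evidence. The Python below is an illustration only: it reuses the mammography numbers and assumes, purely for the sake of the example, several independent positive results with identical rates. Real retesting (a biopsy, say) and real spam filters do not work exactly this way, and all names are hypothetical.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One application of Bayes' rule for a single piece of evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1.0 - prior) * p_evidence_if_false)


belief = 0.01          # rough starting prior
history = [belief]
for _ in range(3):     # three positive results, ASSUMED independent
    belief = update(belief, 0.80, 0.096)
    history.append(belief)

for step, value in enumerate(history):
    print(f"after {step} positive result(s): {value:.1%}")
```

Each pass feeds the previous answer back in as the new starting point, which is the general pattern dr hoo describes for a filter that improves as it sees more cases.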
| |