1
00:00:02,530 --> 00:00:07,220
You've tested positive for a rare and deadly
cancer that afflicts 1 out of 1000 people,
2
00:00:07,220 --> 00:00:11,980
based on a test that is 99% accurate. What
are the chances that you actually have the
3
00:00:11,980 --> 00:00:17,560
cancer? By the end of this video, you'll be
able to answer this question!
4
00:00:17,560 --> 00:00:22,510
This video is part of the Probability and
Statistics video series. Many natural and
5
00:00:22,510 --> 00:00:28,260
social phenomena are probabilistic in nature.
Engineers, scientists, and policymakers often
6
00:00:28,260 --> 00:00:31,930
use probability to model and predict system
behavior.
7
00:00:31,930 --> 00:00:38,469
Hi, my name is Sam Watson, and I'm a graduate
student in mathematics at MIT.
8
00:00:38,469 --> 00:00:43,679
Before watching this video, you should be
familiar with basic probability vocabulary
9
00:00:43,679 --> 00:00:46,989
and the definition of conditional probability.
10
00:00:46,989 --> 00:00:51,219
After watching this video, you'll be able
to calculate the conditional probability
11
00:00:51,219 --> 00:00:56,269
of a given event using tables and trees, and
understand how conditional probability can
12
00:00:56,269 --> 00:01:03,269
be used to interpret medical diagnoses.
13
00:01:04,479 --> 00:01:11,479
Suppose that in front of you are two bowls,
labeled A and B. Each bowl contains five marbles.
14
00:01:11,600 --> 00:01:18,240
Bowl A has 1 blue and 4 yellow marbles. Bowl
B has 3 blue and 2 yellow marbles.
15
00:01:18,240 --> 00:01:23,229
Now choose a bowl at random and draw a marble
uniformly at random from it. Based on your
16
00:01:23,229 --> 00:01:28,200
existing knowledge of probability, how likely
is it that you pick a blue marble? How about
17
00:01:28,200 --> 00:01:33,109
a yellow marble?
18
00:01:33,109 --> 00:01:40,109
Out of the 10 marbles you could choose from, 4 are
blue. So the probability of choosing a blue
19
00:01:55,130 --> 00:01:58,020
marble is 4 out of 10.
20
00:01:58,020 --> 00:02:03,109
There are 6 yellow marbles out of 10 total,
so the probability of choosing yellow is 6
21
00:02:03,109 --> 00:02:03,270
out of 10.
22
00:02:03,270 --> 00:02:04,109
When the number of possible outcomes is finite,
and all outcomes are equally likely, the probability
23
00:02:04,109 --> 00:02:05,009
of one event happening is the number of favorable
outcomes divided by the total number of possible
24
00:02:05,009 --> 00:02:05,070
outcomes.
25
00:02:05,070 --> 00:02:09,470
What if you must draw from Bowl A? What's
the probability of drawing a blue marble,
26
00:02:09,470 --> 00:02:16,470
given that you draw from Bowl A?
27
00:02:18,239 --> 00:02:25,239
Let's go back to the table and consider only
Bowl A. Bowl A contains 5 marbles of which
28
00:02:25,599 --> 00:02:31,300
1 is blue, so the probability of picking a
blue one is 1 in 5.
29
00:02:31,300 --> 00:02:36,610
Notice the probability has changed. In the
first scenario, the sample space consists
30
00:02:36,610 --> 00:02:42,330
of all 10 marbles, because we are free to
draw from both bowls.
31
00:02:42,330 --> 00:02:48,220
In the second scenario, we are restricted
to Bowl A. Our new sample space consists of
32
00:02:48,220 --> 00:02:55,220
only the five marbles in Bowl A. We ignore
these marbles in Bowl B.
33
00:02:55,670 --> 00:03:00,909
Restricting our attention to a specific set
of outcomes changes the sample space, and
34
00:03:00,909 --> 00:03:07,730
can also change the probability of an event.
This new probability is what we call a conditional probability.
35
00:03:07,730 --> 00:03:08,370
In the previous example, we calculated the
conditional probability of drawing a blue
36
00:03:08,370 --> 00:03:09,659
marble, given that we draw from Bowl A.
37
00:03:09,659 --> 00:03:14,510
This is standard notation for conditional
probability. The vertical bar ( | ) is read
38
00:03:14,510 --> 00:03:21,510
as "given." The probability we are looking
for precedes the bar, and the condition follows
39
00:03:25,099 --> 00:03:32,099
the bar.
40
00:03:32,299 --> 00:03:37,439
Now let's flip things around. Suppose someone
picks a marble at random from either bowl
41
00:03:37,439 --> 00:03:43,939
A or bowl B and reveals to you that the marble
drawn was blue. What is the probability that
42
00:03:43,939 --> 00:03:46,989
the blue marble came from Bowl A?
43
00:03:46,989 --> 00:03:52,159
In other words, what's the conditional probability
that the marble was drawn from Bowl A, given
44
00:03:52,159 --> 00:03:59,159
that it is blue? Pause the video and try to
work this out.
45
00:04:01,159 --> 00:04:06,400
Going back to the table, because we are dealing
with the condition that the marble is blue,
46
00:04:06,400 --> 00:04:11,170
the sample space is restricted to the four
blue marbles.
47
00:04:11,170 --> 00:04:17,290
Of these four blue marbles, one is in Bowl
A, and each is equally likely to be drawn.
48
00:04:17,290 --> 00:04:22,320
Thus, the conditional probability is 1 out
of 4.
49
00:04:22,320 --> 00:04:27,160
Notice that the probability of picking a blue
marble given that the marble came from Bowl
50
00:04:27,160 --> 00:04:33,030
A is NOT equal to the probability that the
marble came from Bowl A given that the marble
51
00:04:33,030 --> 00:04:40,030
was blue. Each has a different condition,
so be careful not to mix them up!
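The table reasoning above can be sketched in Python. This is only an illustration, not something shown in the video: the names `table` and `conditional` are hypothetical, and the sketch assumes every marble is equally likely to be drawn, which holds here because both bowls contain five marbles.

```python
from fractions import Fraction

# Marble counts from the two-bowl example: table[bowl][color].
table = {
    "A": {"blue": 1, "yellow": 4},
    "B": {"blue": 3, "yellow": 2},
}

def conditional(event, given):
    """Return P(event | given) by counting equally likely marbles.

    Both arguments are predicates that take (bowl, color).
    """
    cells = [(b, c, n) for b, row in table.items() for c, n in row.items()]
    # Restrict the sample space to the marbles that satisfy the condition.
    restricted = sum(n for b, c, n in cells if given(b, c))
    # Count the marbles in that restricted space that also satisfy the event.
    favorable = sum(n for b, c, n in cells if given(b, c) and event(b, c))
    return Fraction(favorable, restricted)

def is_blue(bowl, color):
    return color == "blue"

def from_bowl_a(bowl, color):
    return bowl == "A"

def anything(bowl, color):
    return True

p_blue = conditional(is_blue, anything)             # 4 out of 10
p_blue_given_a = conditional(is_blue, from_bowl_a)  # 1 out of 5
p_a_given_blue = conditional(from_bowl_a, is_blue)  # 1 out of 4
```

Swapping the two arguments swaps the event and the condition, which is why the last two values differ.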
52
00:04:42,919 --> 00:04:47,650
We've seen how tables can help us organize
our data and visualize changes in the sample
53
00:04:47,650 --> 00:04:49,060
space.
54
00:04:49,060 --> 00:04:53,180
Let's look at another tool that is useful
for understanding conditional probabilities
55
00:04:53,180 --> 00:04:55,870
- a tree diagram.
56
00:04:55,870 --> 00:05:02,870
Suppose we have a jar containing 5 marbles;
2 are blue and 3 are yellow. If we draw any
57
00:05:03,220 --> 00:05:08,280
one marble at random, the probability of drawing
a blue marble is 2/5.
58
00:05:08,280 --> 00:05:14,280
Now, without replacing the first marble, draw
a second marble from the jar. Given that the
59
00:05:14,280 --> 00:05:20,330
first marble is blue, is the probability of
drawing a second blue marble still 2/5?
60
00:05:20,330 --> 00:05:27,250
NO, it isn't. Our sample space has changed.
If a blue marble is drawn first, you are left
61
00:05:27,250 --> 00:05:31,569
with 4 marbles; 1 blue and 3 yellow.
62
00:05:31,569 --> 00:05:36,130
In other words, if a blue marble is selected
first, the probability that you draw blue
63
00:05:36,130 --> 00:05:42,580
second is 1/4. And the probability you draw
yellow second is 3/4.
64
00:05:42,580 --> 00:05:49,580
Now pause the video and determine the probabilities
if the yellow marble is selected first instead.
65
00:05:54,660 --> 00:06:00,539
If a yellow marble is selected first, you
are left with 2 yellow and 2 blue marbles.
66
00:06:00,539 --> 00:06:06,389
There is now a 2/4 chance of drawing a blue
marble and a 2/4 chance of drawing a yellow
67
00:06:06,389 --> 00:06:08,060
marble.
68
00:06:08,060 --> 00:06:15,060
What we have drawn here is called a tree diagram.
The probability assigned to the second branch
69
00:06:18,550 --> 00:06:21,660
denotes the conditional probability given
that the event on the first branch happened.
70
00:06:21,660 --> 00:06:23,139
Tree diagrams help us to visualize our sample
space and reason out probabilities.
71
00:06:23,139 --> 00:06:27,500
We can answer questions like "What is the
probability of drawing 2 blue marbles in a
72
00:06:27,500 --> 00:06:32,550
row?" In other words, what is the probability
of drawing a blue marble first AND a blue
73
00:06:32,550 --> 00:06:34,479
marble second?
74
00:06:34,479 --> 00:06:39,880
This event is represented by these two branches
in the tree diagram.
75
00:06:39,880 --> 00:06:46,880
We have a 2/5 chance followed by a 1/4 chance.
We multiply these to get 2/20, or 1/10. The
76
00:06:48,849 --> 00:06:53,349
probability of drawing two blue marbles in
a row is 1/10.
77
00:06:53,349 --> 00:06:58,729
Now you do it. Use the tree diagram to calculate
the probabilities of the other possibilities:
78
00:06:58,729 --> 00:07:05,729
blue, yellow; yellow, blue; and yellow, yellow.
79
00:07:10,050 --> 00:07:16,750
The probabilities each work out to 3/10. The
four probabilities add up to a total of 1,
80
00:07:16,750 --> 00:07:18,800
as they should.
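The tree diagram can also be sketched in Python. This is a minimal illustration under the assumptions of the jar example (2 blue and 3 yellow marbles, two draws without replacement); the function name `draw_two` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def draw_two(jar):
    """Return {(first, second): probability} for two draws without replacement."""
    total = sum(jar.values())
    paths = {}
    for first, second in product(jar, repeat=2):
        # First branch: probability of the first color.
        p_first = Fraction(jar[first], total)
        # Second branch: conditional probability after removing one marble.
        remaining = dict(jar)
        remaining[first] -= 1
        p_second = Fraction(remaining[second], total - 1)
        # Multiply along the path, as in the tree diagram.
        paths[(first, second)] = p_first * p_second
    return paths

paths = draw_two({"blue": 2, "yellow": 3})
for path, probability in paths.items():
    print(path, probability)
```

Each entry multiplies the two branch probabilities along one path, and the four entries sum to 1.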
81
00:07:18,800 --> 00:07:22,599
What if we don't care about the first marble?
We just want to determine the probability
82
00:07:22,599 --> 00:07:26,330
that the second marble is yellow.
83
00:07:26,330 --> 00:07:30,699
Because it does not matter whether the first
marble is blue or yellow, we consider both
84
00:07:30,699 --> 00:07:37,699
the blue, yellow, and the yellow, yellow paths.
Adding the probabilities gives us 3/10 + 3/10,
85
00:07:38,099 --> 00:07:41,139
which works out to 3/5.
86
00:07:41,139 --> 00:07:45,190
Here's another interesting question. What
is the probability that the first marble drawn
87
00:07:45,190 --> 00:07:48,819
is blue, given that the second marble drawn
is yellow?
88
00:07:48,819 --> 00:07:54,050
Intuitively, this seems tricky. Pause the
video and reason through the probability tree
89
00:07:54,050 --> 00:08:01,050
with a friend.
90
00:08:01,370 --> 00:08:05,680
Because we are conditioning on the event that
the second marble drawn is yellow, our sample
91
00:08:05,680 --> 00:08:09,289
space is restricted to these two paths: (blue,
yellow) and (yellow, yellow).
92
00:08:09,289 --> 00:08:14,690
Of these two paths, only the top one meets
our criterion - that the blue marble is drawn
93
00:08:14,690 --> 00:08:16,759
first.
94
00:08:16,759 --> 00:08:21,919
We represent the probability as a fraction
of favorable to possible outcomes. Hence,
95
00:08:21,919 --> 00:08:26,199
the probability that the first marble drawn
is blue, given that the second marble drawn
96
00:08:26,199 --> 00:08:33,199
is yellow, is 3/10 divided by (3/10 + 3/10),
which works out to 1/2.
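The same calculation can be sketched in Python. This is an illustration only: the path probabilities are copied from the tree diagram, and the variable names are hypothetical.

```python
from fractions import Fraction

# Path probabilities from the tree diagram (first draw, second draw).
paths = {
    ("blue", "blue"): Fraction(2, 5) * Fraction(1, 4),
    ("blue", "yellow"): Fraction(2, 5) * Fraction(3, 4),
    ("yellow", "blue"): Fraction(3, 5) * Fraction(2, 4),
    ("yellow", "yellow"): Fraction(3, 5) * Fraction(2, 4),
}

# Restrict the sample space to the paths whose second marble is yellow.
p_second_yellow = sum(
    p for (first, second), p in paths.items() if second == "yellow"
)

# Within that restricted space, keep the path whose first marble is blue.
p_blue_then_yellow = paths[("blue", "yellow")]
p_first_blue_given_second_yellow = p_blue_then_yellow / p_second_yellow
```

The division is the step that restricts the sample space: the favorable path is weighed against all paths that satisfy the condition.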
97
00:08:33,450 --> 00:08:39,000
I hope you appreciate that tree diagrams and
tables make these types of probability problems
98
00:08:39,000 --> 00:08:46,000
doable without having to memorize any formulas!
99
00:08:47,210 --> 00:08:51,830
Let's return to our opening question. Recall
that you've tested positive for a cancer that
100
00:08:51,830 --> 00:08:56,780
afflicts 1 out of 1000 people, based on a
test that is 99% accurate.
101
00:08:56,780 --> 00:09:02,950
More precisely, out of 100 test results, we
expect about 99 correct results and only 1
102
00:09:02,950 --> 00:09:05,100
incorrect result.
103
00:09:05,100 --> 00:09:10,290
Since the test is highly accurate, you might
conclude that the test is unlikely to be wrong,
104
00:09:10,290 --> 00:09:13,110
and that you most likely have cancer.
105
00:09:13,110 --> 00:09:19,320
But wait! Let's first use conditional probability
to make sense of our seemingly gloomy diagnosis.
106
00:09:19,320 --> 00:09:20,850
Now pause the video and determine the probability
that you have the cancer, given that you test
107
00:09:20,850 --> 00:09:20,940
positive.
108
00:09:20,940 --> 00:09:24,470
Let's use a tree diagram to help with our
calculations.
109
00:09:24,470 --> 00:09:30,650
The first branch of the tree represents the
likelihood of cancer in the general population.
110
00:09:30,650 --> 00:09:36,910
The probability of having the rare cancer
is 1 in 1000, or 0.001. The probability of
111
00:09:36,910 --> 00:09:41,790
having no cancer is 0.999.
112
00:09:41,790 --> 00:09:46,260
Let's extend the tree diagram to illustrate
the possible results of the medical test that
113
00:09:46,260 --> 00:09:49,020
is 99% accurate.
114
00:09:49,020 --> 00:09:56,020
In the cancer population, 99% will test positive
(correctly), but 1% will test negative (incorrectly).
115
00:09:57,150 --> 00:10:01,200
These incorrect results are called false negatives.
116
00:10:01,200 --> 00:10:07,520
In the cancer-free population, 99% will test
negative (correctly), but 1% will test positive
117
00:10:07,520 --> 00:10:13,590
(incorrectly). These incorrect results are
called false positives.
118
00:10:13,590 --> 00:10:19,270
Given that you test positive, our sample space
is now restricted to only the population that
119
00:10:19,270 --> 00:10:24,920
tests positive. This is represented by these
two paths.
120
00:10:24,920 --> 00:10:30,760
The top path shows the probability you have
the cancer AND test positive. The lower path
121
00:10:30,760 --> 00:10:37,440
shows the probability that you don't have
cancer AND still test positive.
122
00:10:37,440 --> 00:10:44,440
The probability that you actually do have the cancer, given that you
test positive, is (0.001 * 0.99) / ((0.001 * 0.99) + (0.999 * 0.01)),
123
00:10:55,720 --> 00:11:01,150
which works out to about 0.09 - less than
10%!
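The diagnosis calculation can be sketched in Python. This is a minimal illustration, assuming that "99% accurate" means both a 99% chance of a positive result when cancer is present and a 1% chance of a positive result when it is not, as in the tree diagram; the function name `posterior` and its parameter names are hypothetical.

```python
from fractions import Fraction

def posterior(prevalence, sensitivity, false_positive_rate):
    """Return P(cancer | positive test) from the two positive paths of the tree."""
    # Top path: have the cancer AND test positive.
    true_positive = prevalence * sensitivity
    # Lower path: no cancer AND still test positive.
    false_positive = (1 - prevalence) * false_positive_rate
    return true_positive / (true_positive + false_positive)

p_cancer_given_positive = posterior(
    prevalence=Fraction(1, 1000),
    sensitivity=Fraction(99, 100),
    false_positive_rate=Fraction(1, 100),
)
print(float(p_cancer_given_positive))
```

Raising `prevalence` while leaving the other two arguments unchanged raises the result, which suggests that the rarity of the cancer, not the quality of the test, keeps this probability low.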
124
00:11:01,150 --> 00:11:06,880
The error rate of the test is only 1 percent,
but the chance that a positive result is wrong is more than
125
00:11:06,880 --> 00:11:13,550
90%! Chances are pretty good that you do not
actually have cancer, despite the rather accurate
126
00:11:13,550 --> 00:11:16,760
test. Why is this so?
127
00:11:16,760 --> 00:11:22,440
The accuracy of the test actually reflects
the conditional probability that one tests
128
00:11:22,440 --> 00:11:25,070
positive, given that one has cancer.
129
00:11:25,070 --> 00:11:30,520
But in practice, what you want to know is
the conditional probability that you have
130
00:11:30,520 --> 00:11:37,520
cancer, given that you test positive! These
probabilities are NOT the same!
131
00:11:37,520 --> 00:11:42,550
Whenever we take medical tests, or perform
experiments, it is important to understand
132
00:11:42,550 --> 00:11:47,260
what events our results are conditioned on,
and how that might affect the accuracy of
133
00:11:47,260 --> 00:11:53,180
our conclusions.
134
00:11:53,180 --> 00:11:57,180
In this video, you've seen that conditional
probability must be used to understand and
135
00:11:57,180 --> 00:12:02,810
predict the outcomes of many events. You've
also learned to evaluate and manage conditional
136
00:12:02,810 --> 00:12:06,830
probabilities using tables and trees.
137
00:12:06,830 --> 00:12:11,310
We hope that you will now think more carefully
about the probabilities you encounter, and
138
00:12:11,310 --> 00:12:14,200
consider how conditioning affects their interpretation.