## 2020

STATISTICS ROUNDTABLE

# In a Certain Way

## Quantifying uncertainty for meaningful and objective results

by Christine M. Anderson-Cook

Whenever we estimate a population parameter from a sample, in addition to providing a point estimate, we should also include an interval to characterize the associated uncertainty.

Let’s explore what the confidence level means (the “95” in a 95% confidence interval), how statistical intervals differ from subjective ones, and where problems commonly arise in quantifying the level of a subjective interval. Our ability to understand and explain uncertainty is instrumental to the good use of statistical methods.

Consider collecting a representative sample of parts from a stable manufacturing process, with the goal of estimating the mean population length of the parts by looking at the average sample length. Suppose you observe the average sample length to be 14.18 inches. It is unlikely our sample estimate will exactly match the true population value, but we are counting on it to be relatively close.

Without a range to quantify our uncertainty, it is difficult to interpret our results meaningfully. Could the mean population length be as large as 14.21 inches? 14.63 inches? 16.81 inches? What does relatively close really mean?

Suppose the 95% confidence interval based on our sample is 14.18 +/- 0.73 = [13.45, 14.91]. We can describe this as the range of sensible values for the population mean length based on the information gained from the observed sample. The 95% confidence level implies that if we repeated the procedure of collecting a sample many times, about 95% of the resulting intervals would include the true population mean length.
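The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size and normal distribution are all assumptions chosen for the example, not values from the manufacturing process described above. It draws many samples, builds a t-based 95% interval from each and counts how often the interval contains the true mean.

```python
import random
import statistics

# Assumed values for illustration only; none of these come from the article.
TRUE_MEAN = 14.0   # hypothetical population mean length, in inches
TRUE_SD = 2.0      # hypothetical population standard deviation
N = 30             # sample size
T_CRIT = 2.045     # two-sided 95% t critical value for N - 1 = 29 degrees of freedom

def coverage_rate(trials=4000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
        mean = statistics.fmean(sample)
        half_width = T_CRIT * statistics.stdev(sample) / N ** 0.5
        if mean - half_width <= TRUE_MEAN <= mean + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Intervals containing the true mean: {rate:.1%}")
```

The rate should land close to 95%. Each individual interval either contains the true mean or does not; the 95% describes the procedure over many repetitions.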

This is not the same as saying there is a 95% chance the population mean length is contained in [13.45, 14.91]. For our particular interval based on a single sample, the true but unknown population value is either in the interval or not. But how should you think about the confidence level associated with this range?

### Quiz time

On the surface, it appears to be relatively simple. Yet many of us are weak at appropriately describing and quantifying our uncertainty. We are chronically overoptimistic about what we know.

For example, consider the 20-question quiz shown in Table 1. The questions all have known answers (if only you could get to a computer to do a few Google searches, right?). Take a moment to answer the questions using only your current knowledge.

Really—stop reading and fill in the blanks, but don’t use the web or anything else to find answers. This will give us a basis for discussion for how well you can quantify your uncertainty.

The goal is to provide intervals for each question that correctly capture your knowledge by making them wide enough that you are confident 95% of the intervals will correctly contain the true value.

Before interpreting the quiz results, let’s compare what the 95% confidence intervals and the exercise you have just done have in common and where they differ. One similarity is that the goal of the intervals is the same: to bound sensible values for a fixed but unknown value, calibrated by the long-run average of the true values being contained in their respective intervals 95% of the time.

Note that for a particular question, we are not sure whether the answer will be correct. We only know that over a large range of questions, we should maintain a specified standard.

One important difference is that a confidence interval is based on information obtained from a specific set of data. The quiz intervals were not based on data, but rather on less precise expert knowledge (OK, some of us may not want to claim to be experts on the topics of these questions). Hence, the confidence interval is objective; namely, for a given sample, we would all obtain the same answer, as long as we applied the appropriate method correctly.

The quiz intervals are clearly subjective. They are dependent on what information and background we have available. Even from one day to the next, we might answer a particular question differently. This leads to another difference between the two sets of intervals: Every time we use the confidence interval procedure, the chance of obtaining a sample that leads to a confidence interval that correctly includes the true value is the same.

Because we are each using a different collection of information for the answers to the subjective quiz, however, we are more likely to get some questions correct than others.

### Calibrating uncertainty

Next, we explore our own calibration of uncertainty for the quiz. Take a moment to match your intervals with the answers listed in Table 2. Mark your answer correct if the true value lies between your lower and upper bound. Give yourself a score out of 20. Remember from the instructions, your goal was to make the bounds wide enough to get 95% (19 of 20 answers) correct.
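The scoring rule just described can be sketched in a few lines. The numbers below are hypothetical placeholders, not the quiz questions or answers from the tables, and treating the bounds as inclusive is an assumption.

```python
def score_quiz(intervals, true_values):
    """Count the intervals whose bounds contain the corresponding true value."""
    hits = 0
    for (lower, upper), truth in zip(intervals, true_values):
        # Assumption: a true value exactly on a bound counts as correct.
        if lower <= truth <= upper:
            hits += 1
    return hits

# Hypothetical placeholder numbers, NOT taken from the quiz tables.
my_intervals = [(10, 50), (100, 300), (1900, 1950), (2.0, 4.0)]
answers = [42, 350, 1912, 3.5]

score = score_quiz(my_intervals, answers)
print(f"{score} of {len(answers)} intervals contain the true value")
```

With 20 questions and a 95% target, the same function should return a score near 19.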

How did you do? If you’re average, you probably were not too close to the target value of 19 of 20. Table 3 shows a stem-and-leaf plot of the results of a nonscientifically obtained sample of my friends and colleagues (all of whom are trained in mathematics, statistics or engineering).

Recall that a stem-and-leaf plot is similar to a histogram, but it gives the detailed values of the quiz scores. The stems (left of line) give the first digit of the score out of 20, and the leaf (right of line) gives the second digit. For example, the top line of the plot with “0 | 0 1” means the two lowest scores were “00” (or 0) and “01” (or 1) out of 20. The highest score, denoted by “1 | 8”, was 18.

Why are so many of us so consistently miscalibrated? Examine your own thought processes for selecting an answer. Often, the incorrect answers are the ones for which we would have said that we were more confident about the answer, and thus we did not build in enough margin of error to capture our true uncertainty.

There is also a sense that to be a good expert, we need to be precise about our answer for it to be useful, and so we are reluctant to build appropriate width into the intervals.

Subconsciously, we are reluctant to reveal how little we know. The majority of my participants answered only 40 to 70% of the questions correctly. The expert elicitation literature suggests this is quite typical. The only person in my sample who came close to achieving the target of 95% correct (with 18/20) admitted he first came up with a set of ranges, and then he intentionally doubled the widths of the intervals. Research has shown that while there are many things that experts can estimate well, incorporating appropriate uncertainty is done poorly on a consistent basis, regardless of quantitative training.[1]
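To see how far such scores are from good calibration, we can ask how likely they would be for someone whose intervals really do capture the truth 95% of the time. The sketch below treats the 20 questions as independent, which is a simplifying assumption, and uses the binomial distribution; the function name is illustrative.

```python
from math import comb

def prob_score_at_most(k, n=20, p=0.95):
    """P(score <= k) when each of n intervals independently hits with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 70% of 20 questions is a score of 14.
p_14_or_fewer = prob_score_at_most(14)
# Chance that a well-calibrated person scores 19 or 20.
p_19_or_more = 1 - prob_score_at_most(18)

print(f"P(score <= 14) = {p_14_or_fewer:.5f}")
print(f"P(score >= 19) = {p_19_or_more:.3f}")
```

Under these assumptions, a score of 14 or lower would be very rare for a well-calibrated person, so scores in the 40 to 70% range point to intervals that are too narrow rather than to bad luck.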

### So what?

Why does it matter that we do not quantify uncertainty well? First, when presenting results, it is important to define what an interval means. We may see results published in the form “14.18 +/- 0.73” with no explanation as to whether this is a confidence interval or a mean plus or minus some number of standard deviations.

Without this additional information, the associated confidence level is left a mystery. Understanding how the interval should be interpreted and precisely specifying the definition of an interval in our own presentations will make the quantitative result much more meaningful.
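A short sketch shows how much the number after the “+/-” can change with its definition. The ten lengths below are hypothetical values made up for illustration, and the t critical value of 2.262 assumes a two-sided 95% interval with 9 degrees of freedom.

```python
import statistics

def plus_minus_summaries(sample, t_crit):
    """Return the mean and three common meanings of the 'plus or minus' value."""
    n = len(sample)
    sd = statistics.stdev(sample)   # sample standard deviation
    se = sd / n ** 0.5              # standard error of the mean
    return {
        "mean": statistics.fmean(sample),
        "one SD": sd,
        "one SE": se,
        "95% CI half-width": t_crit * se,
    }

# Hypothetical part lengths in inches, NOT data from the article.
lengths = [13.1, 15.2, 14.0, 14.9, 13.6, 14.4, 15.0, 13.8, 14.3, 13.5]
summaries = plus_minus_summaries(lengths, t_crit=2.262)

for label, value in summaries.items():
    print(f"{label}: {value:.3f}")
```

The same sample yields three different values after the “+/-”, which is why a reported interval needs its definition stated alongside it.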

Second, an awareness of this systematic miscalibration should lead to improvements in our own ability to think about uncertainty and to interpret other subjective intervals that we encounter in our lives.

Finally, these problems with subjective intervals should make us appreciate the available statistical tools, which provide consistently meaningful and objective results for our scientific investigations.

### Reference

1. Mary A. Meyer and Jane M. Booker, Eliciting and Analyzing Expert Judgment, ASA-SIAM, 2001.

Christine M. Anderson-Cook is a research scientist at Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario, Canada. Anderson-Cook is a fellow of the American Statistical Association and a senior member of ASQ.
