How Should We Understand Confidence Intervals?


Whenever you read the scientific literature, you'll frequently encounter estimates that include "confidence intervals." Generally speaking, most of us take these to tell us how sure scientists are that their estimates "got it right." And yet there is frequent confusion, and even miscommunication, about what confidence intervals mean, which can sometimes cause us to unintentionally misrepresent (even if slightly) what scientists are actually claiming. It's something I've done in the past, and it's something that affects many well-intentioned people just trying to represent the data accurately. What I want to do in this post is describe precisely what is being stated when confidence intervals (CIs) are reported, and then offer some reflections on how this can impact the way we communicate the confidence we can have in scientific evidence.

Confidence vs Probability

At the heart of the confusion is a failure to distinguish between the admittedly similar concepts of "confidence" and "probability." Let's start by defining them. "Confidence" refers to how sure we are that an estimated result will fall within a certain range, while "probability" refers to the odds of a particular outcome occurring. A couple of examples can help flesh out this distinction. Take flipping a two-sided coin ten times. We can say with 100% confidence that the coin will land on "heads" between 0 and 10 times, but there is a ~50% probability that the coin will land on heads on any given flip. We could also say there's a 100% probability that the coin will land on heads between 0 and 10 times, but it's better to use the word "confidence" in that instance.

Likewise, we can say that there is a 0% chance that if you flip a coin 10 times it will land on heads 11 times, or we can say that we have 0% confidence that the coin will land on heads 11 times in 10 flips; again, "confidence" is the better word here. And because we're 100% confident that the coin will land on heads between 0 and 10 times, given 10 flips, we are also 100% confident that it will not land on heads between 11 and ∞ times. We'll come back to this in a minute.
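If you want to see the arithmetic behind the coin example, here's a minimal Python sketch (the variable names and the helper function are just mine for illustration):

```python
# Binomial probabilities for 10 flips of a fair coin.
from math import comb

n, p = 10, 0.5  # 10 flips, fair coin

def prob_heads(k: int) -> float:
    """Probability of exactly k heads in n flips."""
    if k < 0 or k > n:
        return 0.0  # impossible outcomes, e.g. 11 heads in 10 flips
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(sum(prob_heads(k) for k in range(0, 11)))  # ~1.0 -> "100% confidence" of 0-10 heads
print(prob_heads(11))                            # 0.0  -> 11 heads in 10 flips is impossible
print(p)                                         # 0.5  -> probability of heads on any single flip
```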

Sample Mean and Population Mean

Whenever we want the average of a population, but the population is too large or complex to measure entirely, we choose a random sample of the total population and calculate the mean of that sample. We then want to understand how well that sample mean (x̄) represents the mean of the total population (μ). So, for instance, say we want to know the true value of the global mean surface temperature (GMST) anomaly. There is no way to measure this directly. We need to use thermometers spread out over the surface of the earth, and from that representative sample we calculate the sample mean and assess how well it represents the entire population. We can do this many times. Each time we collect a sample and calculate its sample mean, that adds to our understanding of the population mean. The more sample means we have, the better.
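To make the distinction concrete, here's a rough Python sketch using a made-up "population." The numbers are entirely hypothetical; the point is only to show how individual sample means scatter around μ:

```python
# Sample mean (x̄) vs. population mean (μ), using a hypothetical population.
import random

random.seed(1)
population = [random.gauss(1.0, 0.3) for _ in range(100_000)]  # made-up population, true mean ~1.0
mu = sum(population) / len(population)                          # population mean (μ)

# Draw several random samples and compute each sample mean (x̄).
sample_means = []
for _ in range(5):
    sample = random.sample(population, 500)
    sample_means.append(sum(sample) / len(sample))

print(f"population mean μ = {mu:.4f}")
print("sample means x̄    =", [round(m, 4) for m in sample_means])
# Each x̄ differs slightly from μ; collecting more sample means sharpens our picture of μ.
```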

Confidence Levels and Intervals

Confidence intervals have two aspects. The first is the confidence level (CL). This tells us the accuracy of our estimate. Confidence levels are usually described in percentages or in standard deviations (σ) from a calculated mean. The larger the CL, the greater the accuracy. Assuming a normal distribution, each standard deviation from a sample mean (x̄) will cover a certain percentage of possible values. Here are some common CLs expressed in terms of standard deviations:

x̄ ± 1σ = 68% of values
x̄ ± 2σ = 95% of values
x̄ ± 3σ = 99.7% of values
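These percentages are easy to check for yourself. Here's a small Python snippet that computes the coverage of a normal distribution within ±kσ using the standard error function (nothing here is specific to any dataset):

```python
# Coverage of a normal distribution within k standard deviations of the mean.
from math import erf, sqrt

def coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"x̄ ± {k}σ covers {coverage(k) * 100:.1f}% of values")
# x̄ ± 1σ covers 68.3% of values
# x̄ ± 2σ covers 95.4% of values
# x̄ ± 3σ covers 99.7% of values
```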

The most common CL I see is 95%, or 2σ. An estimate with this level of accuracy should contain 95% of possible values. The confidence interval (CI) is the range of values contained within the accuracy of the confidence level, and it gives you a measure of the precision of the estimate. So, to return to the example of GMST anomalies, NASA calculated that 2020 averaged 1.02 C warmer than the 1951-1980 mean, with a 2σ (95%) CI of ±0.05 C. If NASA wanted greater accuracy, they could have provided a 3σ CL with a wider CI (less precise); if they wanted more precision, they could have reported a 1σ CL with a narrower CI (less accurate).
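As a rough illustration of that trade-off, here's a short sketch that takes NASA's 1.02 C value and treats the reported ±0.05 C as the 2σ interval, then scales it to 1σ and 3σ. The scaling is just simple arithmetic on my part, not something NASA reports:

```python
# Accuracy/precision trade-off, assuming NASA's ±0.05 C is the 2σ interval.
anomaly = 1.02      # 2020 GMST anomaly (C), relative to the 1951-1980 mean
sigma = 0.05 / 2    # one standard deviation implied by the 2σ CI of ±0.05 C

for k, label in [(1, "1σ (~68%)"), (2, "2σ (~95%)"), (3, "3σ (~99.7%)")]:
    half_width = k * sigma
    print(f"{label}: {anomaly:.2f} ± {half_width:.3f} C -> "
          f"[{anomaly - half_width:.3f}, {anomaly + half_width:.3f}]")
# Wider interval = higher confidence level ("accuracy"); narrower interval = more precision.
```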

What Does this All Mean?

So far so good. But here's how this sometimes gets misinterpreted or misreported. What is often assumed is that when we're given a 95% confidence interval, there's a 95% chance that the true value (the population mean) falls within the range given by the confidence interval. So in our example above, a common interpretation is that there's a 95% probability that the true GMST anomaly for 2020 falls between 0.97 C and 1.07 C. That's not technically correct; at the very least, that's not what a 95% confidence interval means. It means, rather, that we can have 95% confidence that the calculated GMST anomaly CI includes the population mean, and those are two similar but different things.

Confidence intervals are designed to express how well a sample represents a population. And for a population of sufficient complexity (like GMST), there are innumerable samples that could be used to calculate a sample mean. Think of each sample mean calculated from a sample as an experiment. What a 95% confidence interval means is that 95% of these experiments will produce 95% CIs that overlap the population mean. The difference may seem small, but it is significant and worth pointing out. Here's an example of where it matters.

Let's suppose Johnny attempts to estimate the mass of a lead ball (known to have a mass of 1 kg) by using the density of lead (11.29 g/cm³ at 20 C) and the volume of water displaced when the ball is placed in a tank of water. The experiment is done properly, so that the only source of error is random error. Johnny calculates the mass of the lead ball to be 0.92 kg with a 95% CI of ±0.05 kg. What is the probability that Johnny's 95% CI contains the true mass of the lead ball? It's not 95%; it's 0%, because the interval [0.87, 0.97] kg simply doesn't contain 1 kg. However, given that Johnny has a well-designed experiment without significant biases, repeated experiments will produce 95% CIs that overlap 1 kg about 95% of the time. And before Johnny conducted his first experiment, he could have had 95% confidence that its CI would overlap the population mean.
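Here's a toy simulation of Johnny's situation. The per-measurement noise level and the number of measurements per experiment are assumptions I picked so that each experiment produces roughly a ±0.05 kg 95% CI; the point is only to show that about 95% of such intervals cover the true 1 kg:

```python
# Repeated "experiments" with only random error, each producing a 95% CI.
# Roughly 95% of those CIs should overlap the true mass of 1 kg.
import random

random.seed(42)
true_mass = 1.0        # kg
n_trials = 10_000      # number of repeated experiments
n_measurements = 25    # measurements averaged per experiment (assumed)
sigma = 0.125          # per-measurement random error in kg (assumed); 1.96*sigma/sqrt(25) ≈ 0.05

covered = 0
for _ in range(n_trials):
    measurements = [random.gauss(true_mass, sigma) for _ in range(n_measurements)]
    x_bar = sum(measurements) / n_measurements
    half_width = 1.96 * sigma / n_measurements**0.5   # ≈ ±0.05 kg
    if x_bar - half_width <= true_mass <= x_bar + half_width:
        covered += 1

print(f"Fraction of 95% CIs that contain 1 kg: {covered / n_trials:.3f}")  # ≈ 0.95
# Any single realized interval, like Johnny's [0.87, 0.97] kg, either contains 1 kg or it doesn't.
```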

Implications for Climate Science

I think we can begin to reflect a bit on the implications of this for climate science. We can start with perhaps the most fundamental metric, the increase in GMST anomalies. Let's begin by comparing the 2020 GMST anomaly from several datasets, reported here with respect to the 1951-1980 mean, with 95% confidence intervals:

Dataset      Anomaly (C)   95% CI Low (C)   95% CI High (C)
NASA         1.02          0.97             1.07
HadCRUT5     0.999         0.965            1.034
Berkeley     1.039         1.01             1.068

I chose these three because their 95% CIs were easy for me to find. When needed, I converted the values to the 1951-1980 baseline. The numbers are similar to each other, especially when seen in light of their respective CIs.
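As a quick sanity check (my own, not something the dataset providers publish), here are a few lines that take the three intervals from the table and compute the range where they all overlap:

```python
# Do the three 95% CIs from the table overlap, and where?
intervals = {
    "NASA":     (0.97, 1.07),
    "HadCRUT5": (0.965, 1.034),
    "Berkeley": (1.01, 1.068),
}

low = max(lo for lo, hi in intervals.values())
high = min(hi for lo, hi in intervals.values())

if low <= high:
    print(f"All three 95% CIs overlap in the range [{low:.3f}, {high:.3f}] C")
else:
    print("The intervals do not all overlap")
# Prints: All three 95% CIs overlap in the range [1.010, 1.034] C
```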

  1. Viewing any particular 2020 calculation as an "experiment": if NASA were to repeat their 2020 experiment many times, 95% of the resulting 95% CIs should overlap the population mean, provided that biases are negligible.
  2. Multiple GMST datasets are incredibly helpful. Since these three estimates with their 95% CIs are very similar, that should increase our confidence that the population mean for the GMST anomaly lies within the range of these three estimates. This, to me, is a good argument for replication studies and for maintaining multiple datasets of GMST anomalies.
  3. The true value of the GMST anomaly is still unknown. What we know are several independent sample means, and we can have high confidence that these sample means represent the population mean for the GMST anomaly.
  4. Let's imagine we had a true outlier here; say a sample mean of 0.90 C [0.83, 0.97]. Assuming the experiment is sound, the "outlier" shouldn't be discarded. But the likelihood that this estimate does not contain the population mean is higher than for the other three.

Even more important here is how this bears on scientific predictions related to climate change and on the common contrarian technique of claiming that there have been "false predictions" that somehow falsify climate science. These claims of false predictions tend to follow a now-familiar pattern. Here's what frequently happens:

  1. The scientific literature claims with 95% confidence that X will be in range R by a certain set of dates D.
  2. Somebody (a politician, a media personality) says that X could be at one end of range R by the beginning of the set of dates D.
  3. That beginning date comes and goes, X isn't yet at that end of range R, and the blogosphere blows up saying that scientists predicted X would be at that end of range R by that date, so all of climate science is wrong.
Here's an example. We hear frequently that climate science is wrong because Al Gore predicted that the Arctic would be free of summer ice by 2013. But is that what he said? Actually, no. While I can't vouch for every time he made his point about Arctic sea ice, this seems representative of what he has said: "Some of the models suggest to Dr [Wieslaw] Maslowski that there is a 75% chance that the entire north polar ice cap during some of the summer months could be completely ice-free within the next five to seven years." So already we have problems. The contrarian claim of a failed prediction doesn't quote Al Gore properly. But did Al Gore explain the science properly? Not according to Dr. Maslowski. He writes, "It's unclear to me how this figure was arrived at. I would never try to estimate likelihood at anything as exact as this."[1] The study is available online, and here's what the paper says:
When considering this part of the sea ice–volume time series, one can estimate a negative trend of −1,120 km³ year⁻¹ with a standard deviation of ±2,353 km³ year⁻¹ from combined model and most recent observational estimates for October–November 1996–2007. Given the estimated trend and the volume estimate for October–November of 2007 at less than 9,000 km³ (Kwok et al. 2009), one can project that at this rate it would take only 9 more years or until 2016 ± 3 years to reach a nearly ice-free Arctic Ocean in summer. Regardless of high uncertainty associated with such an estimate, it does provide a lower bound of the time range for projections of seasonal sea ice cover.[2]

Notice that even though the estimated trend could end up with an ice-free summer Arctic as early as 2013, the central estimate was 2016, and the estimate of the negative trend had a huge standard deviation that overlaps 0. There can be very little confidence from this that we would experience an ice-free Arctic by 2013; it's a "lower bound" estimate. Most models project an ice-free summer Arctic sometime in the second half of the 21st century. In my opinion, it was wrong of Al Gore to make it appear as though this was likely according to the scientific literature, and it was equally wrong for people to misrepresent what Gore said.
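For those who want to check the arithmetic, here's a back-of-the-envelope sketch using only the numbers quoted from the paper; the calculation itself is my own rough illustration:

```python
# Back-of-the-envelope check of the figures quoted from the paper.
trend = -1120        # km^3 per year (estimated trend)
trend_sd = 2353      # km^3 per year (standard deviation of the trend)
volume_2007 = 9000   # km^3 ("less than 9,000 km^3" for October-November 2007)

years_to_ice_free = volume_2007 / abs(trend)
print(f"~{years_to_ice_free:.1f} years after 2007 at that rate; the paper states 2016 ± 3 years")

# The trend's 1σ range runs from strongly negative to positive, i.e. it overlaps zero:
print(f"trend ± 1σ: [{trend - trend_sd}, {trend + trend_sd}] km^3/yr")  # [-3473, 1233]
# So an ice-free summer Arctic "as early as 2013" is the low end of 2016 ± 3 years,
# not a central estimate, and the trend itself is highly uncertain.
```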

And there's one more point we should make about this. If climate science is doing its work well, we would still expect about 5%, or 1 in 20, of these predictions with a 95% CI to fail. So to show that climate science is wrong, we need to do more than show that one study made a flawed prediction. We need to show that significantly more than 5% of the predictions in climate science are failing. So far I've not seen anyone attempt a meta-study showing how well the actual predictions of climate science are performing. The lesson here (at least to me): don't believe a word the media and politicians tell you until you look up the source in the scientific literature and get the original claim with its associated confidence interval.
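Just to put numbers on the "1 in 20" point, here's a small sketch. The 100 predictions are hypothetical, and treating them as independent is an assumption I'm making purely for illustration:

```python
# If predictions carry 95% CIs and the science is sound, about 1 in 20 should still "fail."
from math import comb

n, p_fail = 100, 0.05  # hypothetical: 100 independent predictions, each with a 5% failure rate
expected_failures = n * p_fail
p_at_least_one = 1 - (1 - p_fail)**n

print(f"Expected failures out of {n}: {expected_failures:.0f}")        # 5
print(f"Probability of at least one failure: {p_at_least_one:.3f}")    # ~0.994
# Binomial probability of exactly k failures, e.g. k = 5:
k = 5
print(f"P(exactly {k} fail) = {comb(n, k) * p_fail**k * (1 - p_fail)**(n - k):.3f}")
```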

References:
[1] https://www.reuters.com/article/factcheck-climate-change/fact-check-al-gore-did-not-predict-ice-caps-melting-by-2013-but-misrepresented-data-idUSL1N2RV0K6

[2] https://www.researchgate.net/publication/234145875_The_Future_of_Arctic_Sea_Ice
