How Good Are My Data?
Understanding and testing for the pitfalls of data that look too good to be true
by Julia E. Seaman and I. Elaine Allen
In many everyday applications, statistics is used to demonstrate superiority, inferiority or equivalence, or to test for outliers. Generally, this means the analysis looks for differences among groups in the measured outcomes, and standard statistical tests such as Student’s t-test or analysis of variance (ANOVA) provide the results.
For example, a generic manufacturing research group is comparing the cleaning power of its new soap formulation with that of the original product to test whether the new one is as good. After a few tests, it becomes clear that the new soap formulation appears identical to the original in every dimension.
Is this new formulation a perfect generic product, or are the equivalent results too good to be true?
How do I know if my data are too good? What checks should I institute to answer this question—especially if I didn’t gather the data myself?
In 1936, R.A. Fisher reexamined Gregor Mendel’s data on garden peas and plant hybridization1 and speculated that there were misclassification errors in Mendel’s analysis: The data seemed too good to be true. Using simple statistical analyses (chi-squared tests), Fisher concluded that the observed data fit the expected values too closely (p > 0.9999).
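The logic of a “too good” fit can be sketched in a few lines of Python. This is not Fisher’s full analysis, which pooled many experiments. It only applies a chi-squared goodness-of-fit test to Mendel’s well-known dihybrid seed counts (315, 101, 108 and 32, against a 9:3:3:1 ratio) and reports the probability that an honest repetition would fit worse. The function names are illustrative, and the chi-squared distribution is computed from a standard series rather than from a statistics library.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for a chi-squared variable, using the series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def goodness_of_fit(observed, ratios):
    """Return (statistic, df, probability of a worse fit than observed)."""
    n = sum(observed)
    expected = [n * r / sum(ratios) for r in ratios]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, 1.0 - chi2_cdf(stat, df)

# Mendel's dihybrid cross: round/yellow, wrinkled/yellow, round/green, wrinkled/green
stat, df, p_worse = goodness_of_fit([315, 101, 108, 32], [9, 3, 3, 1])
print(f"chi-squared = {stat:.3f}, df = {df}, P(worse fit) = {p_worse:.3f}")
```

For these counts, the statistic is about 0.47 on 3 degrees of freedom, so roughly 93% of honest repetitions would fit worse. One such result is unremarkable. The suspicion arises when many independent experiments all fit this closely and the pooled probability approaches 1.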
While the debate on Mendel’s data gathering and motives continues, the past 10 years have seen an increase in attention to data and scientific integrity, and an increase in retractions of papers based on the quality of the data reported.2 The Neuroskeptic blog3 from Discover magazine has covered published experiments in which the data fit the a priori predictions almost perfectly, along with the statistical techniques used to detect this.4
This article is concerned not with uncovering data fraud, but with the methods used to establish data integrity. The focus is not on data that are too messy or noisy, but on the opposite case: data that appear to fit too well.
The stakes extend beyond science and medicine. An organization’s reputation also may be at risk, and falsifying information to show compliance for products that do not meet regulations can, if uncovered, result in enormous fines (as seen with Volkswagen and its emissions scandal a few years ago5).
What is data integrity? The U.S. Food and Drug Administration defines it as complete, consistent and accurate data that should be attributable, legible and contemporaneously recorded with an original or true copy.6 A search uncovers websites with three, seven, 11 and even 15-point checklists to ensure data integrity. Most deal with uncovering technical errors, change control and safe storage rather than examining the results of an analysis that fit the data too well.
From its earliest identified uses,7 statistics has focused on the differences between groups, with Fisher’s guidelines defining the p = 0.05 significance level based on testing agricultural practices across different trials.8 Superiority, equivalence and outlier testing, which can be more complex, continue this focus on the differences (or the absence of differences) among groups.
To determine data integrity, however, it is necessary to look at the similarity among groups. Therefore, there are some new techniques and altered uses of common tools to test whether data are too good.
The first step in determining data integrity is the same as in any analysis: data curation. The process of checking for outliers and entry errors also may reveal issues with data integrity. It is good practice to check for duplicate data, large numbers of edited entries or inconsistent entry styles, such as varying numbers of significant figures.
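As a rough illustration of these curation checks, the following Python sketch counts repeated entries and tallies how many decimal places each value was typed with. The readings are hypothetical, and the function names are ours. Values are kept as strings because converting to numbers would discard the entry style we want to inspect.

```python
from collections import Counter

def decimal_places(entry):
    """Number of digits after the decimal point in a value as typed."""
    _, _, fraction = entry.strip().partition(".")
    return len(fraction)

def curation_report(entries):
    """Return (values entered more than once, tally of decimal places used)."""
    counts = Counter(e.strip() for e in entries)
    duplicates = {value: n for value, n in counts.items() if n > 1}
    precision = dict(Counter(decimal_places(e) for e in entries))
    return duplicates, precision

# Hypothetical raw entries, exactly as they were keyed in
raw = ["50.1", "49.8", "50.1", "50.25", "50", "49.8", "50.1"]
duplicates, precision = curation_report(raw)
print("Repeated values:", duplicates)
print("Decimal places used:", precision)
```

Repeated values and mixed precision are not errors in themselves, since coarse instruments repeat readings naturally. They are prompts to look at how the entries were recorded.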
Simple descriptive statistics may reveal data integrity problems. The range, mean and standard deviation of samples should show the variation expected with that data type. In Table 1, for example, two testers measured 10 random samples three times, creating 60 points of data. Analyzing the data reveals that there are important similarities and differences between testers one and two.
Specifically, the samples from tester one all have the same standard deviation value, showing no variability, while the overall mean is identical to that of tester two.
While this alone may not indicate that the three replicates from tester one are inaccurate, further investigation shows that the averages and ranges from tester one are much smaller and much less variable than those from tester two. The values from tester one are suspicious, and we can use the data to perform a statistical analysis to confirm the suspicion.
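A descriptive check of this kind might look like the sketch below. Because Table 1 is not reproduced here, the replicate values are hypothetical and chosen only to mimic the pattern described: one tester whose replicate sets all share the same standard deviation and one whose spread varies from sample to sample.

```python
import statistics

def summarize(replicates_by_sample):
    """Mean, standard deviation and range for each sample's replicates."""
    return [
        {
            "sample": sample_id,
            "mean": statistics.mean(reps),
            "sd": statistics.stdev(reps),
            "range": max(reps) - min(reps),
        }
        for sample_id, reps in replicates_by_sample.items()
    ]

def distinct_sds(rows, ndigits=3):
    """Set of distinct per-sample standard deviations after rounding."""
    return {round(row["sd"], ndigits) for row in rows}

# Hypothetical replicate readings (three per sample), not the article's Table 1
tester_one = {1: [50.0, 50.1, 49.9], 2: [50.3, 50.4, 50.2], 3: [49.7, 49.8, 49.6]}
tester_two = {1: [48.9, 50.6, 51.2], 2: [49.5, 50.1, 51.3], 3: [50.8, 48.7, 49.9]}

for name, data in [("tester one", tester_one), ("tester two", tester_two)]:
    rows = summarize(data)
    print(name, "distinct SDs:", sorted(distinct_sds(rows)))
```

A tester whose replicate sets collapse to a single standard deviation, as the first one does here, is worth a closer look before any formal test is run.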
In Table 1, both testers have identical means of 50 but different standard deviations (tester one = 0.32 and tester two = 1.13). A test comparing the means shows no difference, but a test comparing the variances shows a statistically significant difference.
Suppose the long-term mean is known to be 50 with a standard deviation of 1.5. Against that benchmark, tester two’s standard deviation of 1.13 would show no statistically significant difference, while tester one’s standard deviation of 0.32 would show a highly significant difference.
Therefore, tester one’s data are statistically unlikely and should be investigated for potential measurement errors or fraud.
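The comparison against a known long-term standard deviation can be sketched as a chi-squared test for a single variance, looking at the lower tail because the concern is too little spread. The sample size of 10 per tester is an assumption, the data are assumed to be normally distributed, and the p-values depend on both choices. The function names are illustrative.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for a chi-squared variable, using the series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def low_variance_test(sample_sd, n, reference_sd):
    """Return (statistic, lower-tail p-value) for H0: sigma = reference_sd."""
    stat = (n - 1) * sample_sd ** 2 / reference_sd ** 2
    return stat, chi2_cdf(stat, n - 1)

LONG_TERM_SD = 1.5  # from the text
N_PER_TESTER = 10   # assumption: one value per sample

stat_one, p_one = low_variance_test(0.32, N_PER_TESTER, LONG_TERM_SD)
stat_two, p_two = low_variance_test(1.13, N_PER_TESTER, LONG_TERM_SD)
print(f"tester one: statistic = {stat_one:.2f}, lower-tail p = {p_one:.2g}")
print(f"tester two: statistic = {stat_two:.2f}, lower-tail p = {p_two:.2g}")
```

Under these assumptions, tester two’s spread is well within what the long-term process would produce, while a spread as small as tester one’s would be expected far less than once in a thousand samples.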
If possible, examining the data generation process can reveal areas that are vulnerable to intentional or unintentional data tampering. Ideally, the data were gathered following a specified protocol and contain a record of how and when specific data entered the database (change control). Any place in the process that requires an active step has the potential to bias the experiment.
Often, common human or technical errors are the source of the issue. For example, a mistyped key or misscanned barcode may cause the same sample to be run twice, creating duplicate data.
Additionally, this process may reveal more variables that may influence the data quality, such as time of day, temperature, measurement device and personnel.
Statistical tests can be used for further testing of suspicious data. There are multiple approaches to analyze the data depending on the situation.
- Hypothesis testing (“The results are too close to the truth”): If a data set is too good, it may have been reverse-engineered from the known testing hypothesis. This is the claim that Fisher made about Mendel’s data: They are too perfect to be accurate. For this analysis, Fisher used a chi-squared statistic and an exact test (now known as Fisher’s exact test) to see whether the observed and expected values were too close. In any randomized experiment, some variability is expected, and seeing identical results is as unusual as seeing extremely different results (see the second approach).
- Historical data testing (“The results are too far from the standard”): If we are testing a new therapy vs. a standard of care of the same type, it would be suspicious to see results showing the new therapy to be very different from the standard of care unless the sample size is extremely large. These analyses typically involve comparing treatment results by important factors of the randomized groups in an ANOVA or general linear model.
- Reproducibility testing (“The quality of my results doesn’t vary—or varies too much”): Diagnostic tests using biologic assays require the use of controls to ensure that the results are valid. If repeated testing (using a repeated-measures statistical method) shows values over time that are varying too much or too little from historic data, a new control may be needed. Control charts can show the change in the control over time, and whether there is absolutely no change in the control value or a trend indicating a new control is needed.
- Process changes (“The process shows no variability and no defects over time”): While quality control and quality assurance methods are designed to show when a process is going out of control, some rejections for defects are expected. Processes involving human intervention and machines may be subject to variability from temperature, power outages and components wearing out. Seeing little or no variability is a signal that the data are just too good.
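For the last two approaches, a control chart can be checked for too little movement as well as too much. The sketch below builds an individuals chart with limits from the average moving range, a common textbook construction that the discussion above does not prescribe, and reports the longest run of consecutive readings that did not change at all. The control readings are hypothetical.

```python
import statistics

def individuals_chart(values):
    """Center line, lower and upper 3-sigma limits, and moving ranges."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = statistics.mean(values)
    # 1.128 is the d2 constant for moving ranges of two consecutive points
    sigma_hat = statistics.mean(moving_ranges) / 1.128
    return center, center - 3 * sigma_hat, center + 3 * sigma_hat, moving_ranges

def longest_flat_run(moving_ranges, tolerance=0.0):
    """Longest streak of consecutive moving ranges at or below tolerance."""
    longest = current = 0
    for mr in moving_ranges:
        current = current + 1 if mr <= tolerance else 0
        longest = max(longest, current)
    return longest

# Hypothetical control readings over ten runs
readings = [10.2, 9.8, 10.1, 10.0, 10.0, 10.0, 10.0, 10.0, 10.3, 9.9]
center, lcl, ucl, mrs = individuals_chart(readings)
outside = [x for x in readings if not lcl <= x <= ucl]

print(f"center = {center:.2f}, limits = ({lcl:.2f}, {ucl:.2f})")
print("points outside limits:", outside)
print("longest run with no change:", longest_flat_run(mrs))
```

Here every point is inside the limits, so a conventional out-of-control rule raises no alarm, yet five identical readings in a row may deserve a second look. How long a flat run must be before it is suspicious depends on the resolution of the measurement and should be set from historical data.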
Becoming a bigger focus
Checking data integrity is an important step prior to analysis of a data set. While most attention is paid to outliers, out-of-control processes and data errors, understanding and testing for the pitfalls of data that look too good is equally important.
It is becoming a bigger focus for the pharmaceutical and biotechnology industry as competition for treatments and therapies moves to biosimilars, and for all regulatory organizations as technology changes, new regulations take effect and the supply chain for a product becomes truly global in terms of manufacturing and production.
References and Note
- Walter W. Piegorsch, “Fisher’s Contributions to Genetics and Heredity, With Special Emphasis on the Gregor Mendel Controversy,” Biometrics, December 1990, Vol. 46, No. 4, pp. 915-924.
- Retraction Watch, “The Center for Scientific Integrity,” https://retractionwatch.com/the-center-for-scientific-integrity.
- Neuroskeptic, “Using Science to Sniff Out Science That’s Too Good to Be True,” Discover, Sept. 19, 2012, https://tinyurl.com/retract-watch-blog.
- Uri Simonsohn, “Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone,” Psychological Science, 2013, Vol. 24, No. 10, pp. 1,875-1,888.
- Associated Press, “Volkswagen Fined $2.8 Billion in U.S. Diesel Emission Scandal,” April 21, 2017, https://tinyurl.com/ap-vw-scandal.
- See Daniel 1:1-16 in the Bible, which compares a vegetarian-water diet to a meat-wine diet.
- Shrikant I. Bangdiwala, “Understanding Significance and P-Values,” Nepal Journal of Epidemiology, 2016, Vol. 6, No. 1, pp. 522-524.
Julia E. Seaman is research director of the Quahog Research Group and a statistical consultant for the Babson Survey Research Group at Babson College in Wellesley, MA. She earned her doctorate in pharmaceutical chemistry and pharmacogenomics from the University of California, San Francisco.
I. Elaine Allen is professor of biostatistics at the University of California, San Francisco, and emeritus professor of statistics at Babson College. She is also director of the Babson Survey Research Group. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.