A quick overview of the reproducibility crisis

By: Jack McAlpine


 

Nothing tastes like a home-cooked meal. I have tried to emulate my mother’s, father’s, grandmother’s, aunt’s, and basically every other family member’s food. Most of the time it tastes basically the same, but there have been a couple of times when things did not taste like I remembered. I would go over the recipe, check for mistakes, and give it another try. Still, things would not taste right. Eventually, I would call the author of the recipe and ask what I could be doing wrong. More often than not, I was told to do something that was not written down. These frustrations map easily onto scientific research. Nothing is more frustrating than trying to repeat another researcher’s results and getting different numbers or trends. I am not alone in these frustrations; reproducing data is a problem for all scientists. In fact, the problem is so widespread that it has a name: the reproducibility crisis. I have linked an article about scientists’ views on the crisis in the additional resources below, and I think it is best read in its entirety. In brief, the majority of scientists polled have had trouble reproducing not only other people’s data, but also their own. I hope this post gives a small survey of ideas and the encouragement to find more on your own.

A recipe from my grandmother. “Use what spices you wish” does not help me make it like she does.

The Stanford computer science department runs a project in which both undergraduate and graduate students try to reproduce the results of published papers.1 Most years, over 90% of students reproduce the results of the studies they were assigned, which is what you would hope for from published work. The publication describes several groups that ran into difficulties with the assignment, often for unexpected reasons. One group’s hardware was severely limited. This could happen to any research group: if your spectrometer, computer, or general supplies are not up to the task, reproducing the data will be impossible. A measurement can also be too coarse, glossing over important details and giving the illusion of correct results. Another group had to contact the authors of the original paper for guidance. The publication does not make clear whether this was the fault of the students or of the authors, but either way, the methods of the assigned paper were not clear enough on their own. After those conversations the group obtained closer results, but was still unable to recreate them completely. The students believe a difference in hardware caused the discrepancy; since the original work was conducted years earlier, the corresponding hardware had been dismantled and repurposed, so the hypothesis could not be checked. Some students were even able to improve on the results of their assigned paper, which was attributed to advances in computational algorithms and technology: enough time had passed that previous limitations were no longer a hindrance. So while procedures may be miscommunicated, the researcher’s instrumentation also needs to be called into question. This study gives a positive outlook on data reproducibility. Instrumentation appears to be a common culprit when data does not match, and by contacting the authors of the paper you are working from, you can gain better insight.

While the research at Stanford appears very positive, the sample size is relatively small, and small sample sizes leave results vulnerable to experimental noise. It is nearly impossible to avoid all forms of interference when conducting an experiment. Dr. Erik Loken of the University of Connecticut and Dr. Andrew Gelman of Columbia University fear that this noise leads to the overstatement of effects.2 They use the analogy of a runner wearing a heavy backpack: the runner will not go as far as they would without it. This analogy gets messily applied to data collection. Seeing a response in the presence of noise would mean the response is even greater without the noise, right? Loken and Gelman say it would not. A small data set combined with selection bias can show strong correlations in a sea of interference. Some groups have been accused of ‘p-hacking’, where data sets are selected or omitted not because of their validity but because they support the team’s hypothesis. Surface-level statistics will support the correlation even if the conclusion is misguided, and a small or nonexistent response can be blown out of proportion. While researchers chase these inflated values, productivity stalls.
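
To see how that plays out, here is a minimal simulation sketch (my own illustration, not something taken from Loken and Gelman’s paper): a small true effect is measured through heavy noise in many small studies, and only the ‘statistically significant’ results are kept, the way selective reporting or p-hacking would keep them. The effect size, noise level, and sample size are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1        # small real effect (illustrative value)
noise_sd = 1.0           # measurement noise, large relative to the effect
n = 20                   # small sample per simulated study
n_studies = 10_000

all_estimates = []
significant_estimates = []

for _ in range(n_studies):
    # each simulated study measures the effect through heavy noise
    sample = true_effect + noise_sd * rng.standard_normal(n)
    estimate = sample.mean()
    std_error = sample.std(ddof=1) / np.sqrt(n)
    all_estimates.append(estimate)
    # keep only studies that clear a rough significance cutoff (|t| > 2)
    if abs(estimate / std_error) > 2:
        significant_estimates.append(estimate)

print(f"true effect:                   {true_effect:.3f}")
print(f"mean estimate, all studies:    {np.mean(all_estimates):.3f}")
print(f"mean of 'significant' studies: {np.mean(significant_estimates):.3f}")
```

Averaged over all simulated studies, the estimate sits close to the truth, but the subset that clears the significance cutoff comes out several times too large. That is the overstatement Loken and Gelman warn about: noise plus selection exaggerates a result rather than understating it.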

Dr. Roger Peng of the Johns Hopkins Bloomberg School of Public Health argues that failures to reproduce data come from too many people handling it.3 He believes the failure lies less with the collection of the data itself than with the transformation and statistical analysis of the raw values. Publications often do not include the raw data, only the final transformations in nice figures. That makes them easy to read but nearly impossible to reproduce. Does the error in the transformation lie with the original author or with the scientist trying to reproduce the results? Without the raw data, how can another researcher evaluate the validity of what is reported? Dr. Peng believes the average researcher is not properly trained in data analysis, and the solution he offers is to pair better education with data analysis strategies that are robust and reproducible. It is a grand idea, but a hard one to implement. How can researchers better communicate their methods and data collection practices?
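
One partial answer, at least on the analysis side, might look something like the sketch below: keep the raw values, make every transformation an explicit step, and publish both alongside the final figure. This is my own hedged illustration of Peng’s general advice, not his method; the measurements, step names, and file names are all made up.

```python
import csv
import json
import statistics

# Hypothetical raw measurements; in practice these would come straight from
# the instrument's output file and be archived alongside the paper.
raw_current_mA = [0.52, 0.48, 0.55, 0.47, 0.51]

# Every transformation is recorded as an explicit, named step.
steps = []

baseline = min(raw_current_mA)
steps.append({"step": "baseline_subtraction", "baseline_mA": baseline})
corrected = [x - baseline for x in raw_current_mA]

mean_corrected = statistics.mean(corrected)
steps.append({"step": "mean_of_corrected_values", "value_mA": mean_corrected})

# Publish the raw values and the transformation log, not just the final figure.
with open("raw_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["current_mA"])
    writer.writerows([[x] for x in raw_current_mA])

with open("analysis_log.json", "w") as f:
    json.dump({"raw_file": "raw_data.csv", "steps": steps}, f, indent=2)

print(f"mean baseline-corrected current: {mean_corrected:.3f} mA")
```

The point is not this particular format; it is that another researcher could open the two files and rerun, or question, every step between the raw numbers and the reported value.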

Dr. Frank Caruso of the University of Melbourne has started to film procedures.4 The videos do not show everything the group does, but they outline the procedures it follows for producing its materials. Filming is the best way to show everything that is done in the lab, so why has it not caught on? Dr. Caruso works in materials synthesis, where many steps require great care and there are often visual cues as to what should be happening. Recording the methods of a different field could yield a video that provides little useful information. For my work, I mix salts into solution, put them in a vial, and attach a couple of electrodes to run a computer program. Filming this would not provide valuable information unless I was really screwing something up, which is not uncommon. However, data that is funky from the get-go should not be publishable anyway. It would be more insightful for me to publish the sequence of techniques I have the computer run for my electrochemical experiments. Hopefully, other scientists will find ways to display all of their techniques as helpfully as Dr. Caruso does.
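
For the electrochemistry case, publishing that computer-run sequence could be as simple as writing it out in a structured, machine-readable form. The sketch below is hypothetical; the technique names and parameter values are illustrative placeholders rather than settings from any real experiment.

```python
import json

# A hypothetical description of an electrochemical measurement sequence.
# The parameters are placeholders, not values from an actual run.
sequence = [
    {"technique": "open_circuit_potential", "duration_s": 600},
    {"technique": "cyclic_voltammetry",
     "start_V": -0.2, "vertex_V": 1.0, "end_V": -0.2,
     "scan_rate_mV_per_s": 50, "cycles": 3},
    {"technique": "electrochemical_impedance",
     "frequency_range_Hz": [100000, 0.1], "amplitude_mV": 10},
]

# Writing the sequence to a file makes it easy to attach to a paper's
# supporting information, so another lab can see and rerun the same steps.
with open("experiment_sequence.json", "w") as f:
    json.dump(sequence, f, indent=2)
```

A file like this attached to the supporting information would tell another lab exactly which steps ran, in which order, and with which settings.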

Data transformation can go through so many steps that it can be unclear how you even got there. (Image source: Ralph Savelsberg)

When I started writing this, I was unaware of the reproducibility crisis and was simply lamenting problems that my peers and I had run into. While I knew that some published data was faulty, I was unaware of the global concern. Technological limitations, statistical inaccuracies, and failures to properly explain a process are only the tip of a problem that is becoming more and more prevalent. Maybe Dr. Peng is right that having so much information available leads to poor analysis on everyone’s part. As with all things in life, having multiple sources is always a good idea. Part of that is taking what you have read here with a grain of salt: I have referenced four studies and linked an article summarizing one survey, a small sample size prone to noise.

 

 


Additional Resources:

If you’re interested in learning more, here are some useful (and free!) resources to check out:

 

  1. To look at some of the videos Frank Caruso’s group produces, follow this link.
  2. For a more detailed look on how scientists trust data follow this link.

 


Works Cited:

 

  1. Yan, L., & McKeown, N. (2017). Learning networking by reproducing research results. Computer Communication Review, 47(2), 19–26. https://doi.org/10.1145/3089262.3089266
  2. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585. https://doi.org/10.1126/science.aal3618
  3. Peng, R. (2015). The reproducibility crisis in science. Significance, 12(3), 30–32.
  4. Faria, M., & Caruso, F. (2016). Advancing research using action cameras. Chemistry of Materials, 28, 8441–8442. https://doi.org/10.1021/acs.chemmater.6b04639