Clean Data: the Key to Accurate Conclusions

By Cheskie Rosenzweig, MS, Leib Litman, PhD & Aaron Moss, PhD

Considering how much time and effort goes into designing, fielding, and analyzing online surveys, it is critical that the responses researchers collect are high quality. When companies make decisions based on inaccurate data, they inevitably plan poorly for the future, and can lose control of their brand or alienate their consumer base. This is why survey data cleaning is important – it removes low-quality respondents, leading to more accurate conclusions.

Why Evaluating Survey Responses Can Be Time-Consuming and Expensive

Most research teams follow a standard set of policies to evaluate survey responses and determine data quality. Typically, individual responses are flagged based on the following participant response characteristics: 

  • speeding 
  • missed attention check questions
  • nonsense open-ended answers
  • inconsistent responses 
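The flagging criteria above can be sketched in code. This is a minimal illustration, not any team's actual pipeline: the field names (`seconds`, `answers`, `open_ended`), thresholds (40% of the median completion time, five-character open-ends), and the paired-item consistency check are all hypothetical choices that real studies would tune per survey.

```python
def flag_respondent(resp, median_seconds, attention_keys):
    """Return a list of quality flags for one survey response.

    Illustrative sketch only: field names and cut-offs are hypothetical.
    """
    flags = []
    # Speeding: completion time far below the sample median.
    if resp["seconds"] < 0.4 * median_seconds:
        flags.append("speeding")
    # Missed attention checks: answers that don't match the instructed value.
    missed = sum(1 for q, correct in attention_keys.items()
                 if resp["answers"].get(q) != correct)
    if missed > 0:
        flags.append("missed_attention_check")
    # Nonsense open-ends: trivially short or keyboard-mash text.
    text = resp["open_ended"].strip()
    if len(text) < 5 or len(set(text.lower())) <= 2:
        flags.append("nonsense_open_ended")
    # Inconsistency: contradictory answers to a repeated (paired) item.
    if resp["answers"].get("age") != resp["answers"].get("age_confirm"):
        flags.append("inconsistent")
    return flags
```

A response that trips several checks at once (very fast, wrong attention check, mashed open-end, mismatched paired items) would come back with multiple flags, while a careful respondent returns an empty list.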

Examining these characteristics can be time-consuming and expensive, and it sometimes leaves researchers with little confidence in the quality of their data. Researchers often tweak their data-quarantining procedures and may apply them differently depending on the sample providers they use or the kind of survey data they collect.

How Insufficient Data Cleaning Can Lead to Inaccurate Research Conclusions

Data quality can have an outsized impact on research conclusions when researchers are studying low-incidence behaviors. Low-incidence behaviors are behaviors that occur infrequently within a population of interest. When researchers study low-incidence behaviors, even a few inattentive participants can throw research conclusions far off. This is because respondents who click through a survey at random or without paying attention may indicate that they engage in such behaviors, artificially inflating how often those behaviors appear to occur.
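The inflation mechanism is simple mixture arithmetic, and a worked example makes the magnitude concrete. The numbers below are purely illustrative (not from any study cited here): a 1% true rate, 5% random clickers, and a 50% endorsement rate for a random click on a yes/no item.

```python
def observed_rate(true_rate, bad_share, bad_report_rate):
    """Measured prevalence when attentive respondents (true_rate) are mixed
    with a bad_share of inattentive ones who endorse the item at
    bad_report_rate. All inputs are proportions between 0 and 1."""
    return (1 - bad_share) * true_rate + bad_share * bad_report_rate

# Hypothetical inputs for illustration only.
obs = observed_rate(true_rate=0.01, bad_share=0.05, bad_report_rate=0.50)
# obs = 0.95 * 0.01 + 0.05 * 0.50 = 0.0345
```

With these assumed inputs, the measured rate is 3.45% — more than three times the true 1% — even though only one respondent in twenty is answering at random. The rarer the behavior, the larger this multiplier becomes.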

Even small errors in determining whether respondents are high quality or not can sometimes lead to drastic outcomes. For example, imagine a study that asks teenagers about their history of drug use. Because the topic is sensitive, you may be concerned that participants will give socially desirable responses and underreport how often they use different drugs. While people do sometimes respond in socially desirable ways, research suggests that people also sometimes overreport their use of different drugs. 

In a real study of adolescent drug use in Norway, researchers found that less than 1% of participants reported using a fake drug “zetacyllin” in the past. Yet, among this small group of poor respondents, self-reported use of other drugs and alcohol was much higher than in the rest of the sample. For example, 79% of participants who said they used the fake drug zetacyllin also said they had used cannabis in the previous year compared to just 11% who said they had not used zetacyllin. In fact, the participants who said they had used the fake drug accounted for a large share of overall drug use in the entire sample. Results like this show that without proper screening and cleaning procedures, researchers can be led astray by poor data.

Identifying High-Quality Participants Using SENTRY™

A case just like this made headlines recently: researchers at the CDC claimed that close to 39% of Americans engaged in high-risk cleaning practices to prevent the spread of COVID-19, including some low-incidence behaviors such as using bleach on food products and inhaling or ingesting cleaners and disinfectants. We replicated this study but included a standard protocol to identify high-quality participants so we could analyze their data separately. We not only showed that there was a much lower rate of high-risk behaviors among high-quality participants, but also that 90% of high-risk cleaning behaviors were reported by problematic respondents.

[Chart: “In the past month, have you or a household member engaged in the following behaviors to prevent coronavirus?”]

To separate high-quality participants from low-quality participants, we used SENTRY, a system that identifies and removes inattentive or fraudulent respondents from any sample source. SENTRY is flexible, meaning that researchers can set stringent or more lenient cut-offs depending on the needs of an individual study. For a study like the one described above, we used the highest level of SENTRY because the study’s goal was to measure very low-incidence behaviors.

As a system, SENTRY greatly reduces the burden on research teams by preventing inattentive, inconsistent, fraudulent, and otherwise poor respondents from entering your surveys. SENTRY ensures that you can have confidence in your data, and it can be customized to the demands of different studies. Learn more about SENTRY, or sign up for a webinar to hear how SENTRY works and how it can help your research.
