Chapter 12

Data Cleaning

Screening survey data for quality

Aaron J. Moss, PhD, Leib Litman, PhD, & Jonathan Robinson, PhD

Introduction

Imagine this.

You work for a large multinational food corporation. Your team is creating a marketing campaign to prepare for the launch of a new line of spicy snacks. To see what consumers think of potential names, you design a survey and recruit participants online.

In the survey, you measure what people think of the concept of 'fire eaters' (remember, these are spicy snacks). Although the term is clever and holds great marketing potential, some people on your team are worried about negative associations. As it turns out, fire eaters was the term used to refer to a group of pro-slavery, southern Democrats within the United States around the time of the Civil War. Your survey assesses what people think of 'fire eaters' overall, and whether they know of the historical connection to slavery. Then, after telling people about the history of the term, the survey measures whether people think it is appropriate for a brand to use modern-day fire eaters in an ad. The survey reveals three things.

First, people generally feel positively toward the term 'fire eaters.' Second, about a third of people (35%) know about the term's historical connection to slavery. Third, even after people are informed about the historical context of the term, few disapprove of a brand using modern-day fire eaters in an ad. With these results, your team develops a marketing campaign, spends hundreds of thousands of dollars, and prepares to launch the new snacks. So far, this is how research is supposed to work.

But there's a catch. Unbeknownst to you, the data from your survey are of low quality. The platform used to sample people has poor procedures for vetting participants. In addition, because the survey did not include measures of data quality, your team cannot tell whether participants were engaged or not. The only indication that something might be off is the number of people who reported knowing about the history of the term fire eaters, a number that seems high. If the survey is full of fraud, then unreliable respondents may have systematically skewed the data, causing your team to overestimate how positively people feel toward 'fire eaters' and underestimate how many people think the campaign is a bad idea. In the end, your brand could get burned.

If this situation sounds contrived, it is not. This is the situation researchers at Kellogg's encountered a few years ago. After conducting a small pilot study, Kellogg's was concerned that something may be off with their data. Before moving forward with their ideas, they began working with CloudResearch to detect survey fraud and gather better insights. You can learn about their story by visiting https://www.cloudresearch.com/fire/ and watching a presentation that was given to a conference of marketing professionals. Throughout this chapter, we will examine how the researchers went about cleaning and screening their survey data.

In Module 12.1, we will learn how to evaluate the quality of a dataset. Drawing on the advice presented in Chapter 11, we will download a different version of the fire eaters dataset and evaluate the attention checks and open-ended responses that were included to measure data quality. This section will demonstrate step-by-step procedures for identifying which participants to exclude from further analyses, and it will explain how to approach decisions about data quality systematically.

In Module 12.2, we will learn about some advanced screening methods. This section introduces techniques like individual consistency measures, response pattern analysis, timing data, and multivariate outlier detection. While each of these methods can contribute to higher quality data, we will learn about their limitations and why the simpler approach demonstrated in Module 12.1 is more practical for most behavioral research studies.

By mastering the screening techniques in this chapter, you can ensure the hard work that goes into each research project produces reliable insights rather than misleading conclusions. The credibility of your research depends upon data cleaning, so let's learn to do it right.

Chapter Outline

Module 12.1

The Fundamentals of Data Screening

Walk through the basics of screening and cleaning online data

Let's say we have designed a study, recruited participants, and watched the responses roll in. Now comes the fun part: diving into the data. But before we can know what the data say, there is a crucial step that determines whether the results will be meaningful or misleading: screening the data for quality. What does screening entail?

First, we need to confirm the data were properly collected. Did random assignment work? Were there programming errors that prevented participants from answering questions? Were all items recorded on consistent scales? Did any stimuli or media display improperly? Second, we need to confirm participants answered all (or nearly all) of the questions in the study. People with large amounts of missing data may need to be excluded from the analyses, and participants with small to moderate amounts of missing data may require action to replace their missing scores. While examining the completeness of the data, we also need to ensure each participant has only one row of data. If there are duplicate entries, we must decide which row to keep (usually the first).
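Deduplication can be done through "Data" → "Identify Duplicate Cases," and the same result can be scripted. Below is a minimal syntax sketch for keeping only the first row per participant; "ParticipantID" is a hypothetical identifier name, so substitute whatever ID variable the recruitment platform provides.

  * Keep only the first row recorded for each participant.
  * "ParticipantID" is a placeholder; use the ID variable in your file.
  SORT CASES BY ParticipantID(A).
  MATCH FILES
    /FILE=*
    /BY ParticipantID
    /FIRST=first_row.
  EXECUTE.
  FILTER BY first_row.
  EXECUTE.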

Finally, we should examine the quality of the responses. How did people perform on attention checks? Did anyone provide patterned responses? Are there outliers or other signs that identify participants who may harm data quality? If so, we want to prevent these people from negatively influencing the analyses. Only after methodically cleaning the data can we perform our analyses and test our hypotheses.

The Purpose of Data Screening

Why must we invest time cleaning and screening data before answering our research questions? Why can't we get right to the fun stuff?

As discussed in Chapter 10, the answer is that unscreened data can lead to misleading or entirely false conclusions. Across online sources, it is common for 20-40% of participants to provide questionable responses (e.g., Reavey et al., 2024), making data screening a critical skill for any researcher working with online samples.

One way to approach data screening is to imagine it as detective work. The task is to gather evidence about each participant's engagement and compliance with study procedures. Like in the legal system, participants deserve the presumption of innocence (i.e., they gave a good effort) until evidence suggests otherwise. When signs of inattention, low effort, fraud, or failure to follow instructions appear, we must determine whether sufficient evidence exists to exclude the person's data from analyses. The credibility of our research depends upon this decision.

Evaluating Data Quality: An Example

To demonstrate the steps of cleaning data, we will evaluate a version of the fire eaters survey mentioned at the start of this chapter. This survey was conducted by CloudResearch to evaluate the quality of participant responses on Connect. Because the survey was borrowed from a real marketing campaign, it provided a realistic opportunity to measure data quality.

The survey QSF file, study materials, and data file are on the Research in the Cloud OSF page: https://osf.io/a8kev/. To get started, download the QSF file and import it into Qualtrics or Engage. Then, take a minute to explore the format.

As you examine the survey, you will see that participants were initially asked some demographic questions. Then, they were presented with pictures of modern-day fire eaters and asked what feelings or associations come to mind. The next several questions asked participants how they felt about people who perform with fire and how they felt about a company using fire eaters in an ad. Finally, people were asked if they knew about the historical connection between fire eaters and slavery. After learning about the history, people were asked again how they felt about a brand using fire eaters in an ad. Throughout this short survey, there were several measures of data quality that follow the recommendations presented in Chapter 11.

After reviewing the survey, download the data file. It is named "RITC_DATA_CH12_FireEaters.sav." Inside, there are responses from over 700 participants who completed the survey in July of 2024. Once the file is open, we are ready to screen the data.

Evaluating Attention Checks

With most studies, the first thing to do after verifying the data were properly recorded is to evaluate measures of quality. The goal is to identify participants who should be excluded from the analyses or scrutinized further.

The fire eaters survey included five attention check questions. Three of the questions were yea-saying items ("Are you currently a member of the LPAKE group?" "Are you currently employed as a Petroleum Engineer?" and "At this moment, are you in New Rock, Indiana?"); one question instructed participants on the correct response ("Please select 'somewhat positive'"); and one item asked about an impossible event ("I can run 2 miles in two minutes"). We want to know how many people failed more than one of these questions.

Computing an Attention Check Score

To evaluate people's performance on the attention check questions, we need to create a score that indicates how many checks each participant failed. People who failed too many items should be excluded from the analyses.

Recoding Variables

Tallying up how many checks each participant failed requires recoding the questions into a pass or fail format. We might, for example, recode all the answers to the question "Are you currently a member of the LPAKE group?" (1 = yes, 2 = no) into a new variable named "LPAKE_pass" with a 0 = pass, 1 = fail format. Then, we can do the same for all the other attention checks before we sum how many items each person failed.

To recode a variable in SPSS:

  1. Select "Transform" → "Recode into Different Variables"
  2. Move the attention check variable into the "Input Variable → Output Variable" box (Figure 12.1)
  3. Name the new output variable (e.g., "LPAKE_pass")
  4. Click "Old and New Values" to specify the coding scheme (Figure 12.2)

After specifying the coding scheme, select "Ok" to execute the command or "Paste" to add it to a syntax file. Then, we can repeat the process for the other attention checks.

While the LPAKE example has a "Yes" or "No" response scale, the question asking participants to select 'somewhat positive' has a 5-point scale ranging from extremely negative to extremely positive. Using the dialog box, we would recode the old value of 4 (somewhat positive) into a new value of 0 (indicating a passing answer). Then, we would recode all other values into a 1 (indicating a failing answer). After recoding all attention checks in the study, we need to tally the number of items failed by each participant.
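For readers who prefer syntax to the dialog boxes, the recodes might look like the sketch below. The raw item names ("LPAKE" and "instresponse") are placeholders; check the variable names in the downloaded data file before running anything.

  * Recode the LPAKE item (1 = yes, 2 = no) so that 0 = pass, 1 = fail.
  * Claiming membership in the fictitious LPAKE group counts as a failure.
  RECODE LPAKE (2=0) (1=1) INTO LPAKE_pass.
  * Recode the instructed response item: 4 ('somewhat positive') passes;
  * every other answer fails.
  RECODE instresponse (4=0) (ELSE=1) INTO instresponse_pass.
  EXECUTE.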

[Image: SPSS "Recode into Different Variables" dialog box]
Figure 12.1. To recode a variable, find it in the list on the left and then move it to the right using the arrow.
[Image: SPSS "Old and New Values" dialog box]
Figure 12.2. In the recode box, enter the old value, the new value, and then select "Add."

Computing an Overall Score

Once the attention checks are recoded, we can compute an overall score that tells how many checks each participant failed. To do that, we will use SPSS's "Compute" function.

The "Compute" function allows researchers to create new variables based on a wide variety of mathematical operations. In this case, we will sum the number of attention checks each person failed. To do so:

  1. Select "Transform" → "Compute Variable."
  2. Type the name of the new variable into the "Target Variable" box. We might, for example, type "attn_total" or "ac_fails" (Figure 12.3).

In the 'Numeric Expression' box we will write the formula for the new variable. For instance, to sum the attention check questions named LPAKE_pass, petroleum_pass, instresponse_pass, newrock_pass, and twomiles_pass, we would enter:

SUM(LPAKE_pass, petroleum_pass, instresponse_pass, newrock_pass, twomiles_pass)

Finally, select 'Ok' to execute the action or 'Paste' to add the command to a syntax file. Then check the new variable in the data file. It is a good idea to 'spot check' a few values by manually confirming the calculation was correctly executed.
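Selecting "Paste" instead of "Ok" produces a command like the sketch below, assuming the recoded items use the names listed earlier.

  * Tally how many attention checks each participant failed.
  COMPUTE ac_fails = SUM(LPAKE_pass, petroleum_pass, instresponse_pass,
      newrock_pass, twomiles_pass).
  EXECUTE.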

[Image: SPSS "Compute Variable" dialog box]
Figure 12.3. To name a new variable, type something into the "Target Variable" box. The "Numeric Expression" box is where we enter the formula for calculating the variable.

Tallying Pass and Fail Rates

With the overall score computed, we are ready to examine the percentage of participants who passed and failed each item, as well as people's overall performance. To start, we will run a frequency report for each item in SPSS.

  1. Select "Analyze" → "Descriptive statistics" → "Frequencies."
  2. Find the variable(s) we want to examine (e.g., LPAKE_pass). Move each item into the "Variable" box on the right and select 'Ok' to run the analysis.
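The equivalent syntax is a single FREQUENCIES command; the sketch below again assumes the recoded variable names used throughout this walkthrough.

  * Request pass/fail counts for each attention check and the total score.
  FREQUENCIES VARIABLES=LPAKE_pass petroleum_pass instresponse_pass
      newrock_pass twomiles_pass ac_fails.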

The output will appear similar to Figure 12.4. There, we can see that most participants passed the LPAKE question. Out of 770 participants in this example, 766 passed (99.5%). Just 4 participants failed the question.

[Image: SPSS frequency table for the LPAKE item]
Figure 12.4. A frequency output table that shows the percentage of participants who passed and failed the LPAKE question.

Next, we can examine the pass rate for each of the other items. Doing so shows that only a handful of participants failed any single item, although 24 people failed the instructed response item. More importantly for the analysis, however, we want to know how many people failed multiple attention checks. These people should be removed from further analyses.

To find that information, we can conduct a frequency report on the total attention check scores. When we perform that analysis, the results look like Table 12.1.

Items Failed    Frequency    Percent
0               736          95.6
1               29           3.8
2               3            0.4
3               2            0.3

Table 12.1. A frequency report shows that most participants passed all the attention checks in the fire eaters survey.

From the table, we see that over 95% of participants passed all the attention checks. Just under 4% of people failed one item, and less than one percent of people failed multiple items. This level of performance reinforces the lesson from Chapter 9 where we learned that researcher-centric platforms often provide better data quality than market research panels.

Before analyzing the data further, we would create a filter that excludes the participants who failed multiple attention checks. This helps protect the data from being distorted by inattentive participants. However, attention checks are just one metric of data quality. To better evaluate each person's responses, we can also evaluate their open-ended answers.

Evaluating Open-Ended Responses

Open-ended questions are an important measure of data quality, as explained in Chapter 11. Open-ended items often expose participants who use AI to write responses, copy and paste answers from the internet, fail to follow instructions, or engage in survey fraud. Unfortunately, there is no quick way to evaluate open-ended answers. Even so, some answers are so reliably linked to fraud that it is worth the time to read people's responses. Let's look at how to evaluate open-ended answers.

Sorting the File

The first thing we recommend when examining open-ended responses is to sort the file in alphabetical order. Sorting allows us to easily spot duplicate responses or other patterns in what people have written.

To sort the file, navigate to the column with the open-ended question in the "Data View." In the fire eaters dataset, this column is labeled 'explain.' The question asked participants to explain more about the associations and words that came to mind when they viewed images of fire eaters earlier in the study. Because the question required participants to explain a response from earlier in the survey, it offered a good test of whether participants were paying attention or engaged in fraud. To sort the 'explain' column, right-click on the column header and select "Sort Ascending." The answers will appear in alphabetical order.

After sorting the file, we can create a new variable to indicate whether each participant passed or failed the open-ended item. Create this variable by right-clicking next to the column we sorted and selecting "Insert Variable." A new variable will be added to the file. We can name this variable something like "explain_fail." Then, we are ready to grade each response.
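In syntax, the sorting step and the new grading variable might look like the sketch below, using the 'explain' and 'explain_fail' names from this walkthrough.

  * Sort the open-ended responses alphabetically to surface duplicates.
  SORT CASES BY explain(A).
  * Create an empty grading variable; scores are entered by hand while
  * reading each response.
  NUMERIC explain_fail (F1.0).
  EXECUTE.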

Grading Responses

Grading each response requires reading it and judging whether it is acceptable. We will score all passing answers with a 0, any response we are unsure about with a 1, and all failing answers with a 2. This coding will allow us to easily sort the responses later.

When grading open-ended responses, keep the information from Chapters 10 and 11 in mind. While most answers will be short and simple, some will be unrelated to the question prompt (such as "good"), will not follow the instructions (e.g., writing a single word when the question asked for a sentence), or will appear copied and pasted from the web (e.g., "Fire-eaters were southern political ideologues who had uncompromising demands and played an important part of driving the nation"). In each of these cases, mark the response as a failure and exclude the participant from further analyses.

To get a sense for the difficulty of this task, try grading 50 or 100 open-ended answers. Then, continue with the next section as if we have scored all the responses.

Excluding Participants

After evaluating the measures of quality, the next step is to exclude people whose data may harm the analyses. Following the advice from Chapter 11, we will exclude anyone who failed more than one attention check or gave an open-ended response that indicated fraud.

To exclude people, we need to create a variable named "exclude" or "drop." Within this column, we will assign a value of '1' to all participants we intend to remove from the analyses and a value of '0' to participants we will retain.
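With the exclusion rule used here (more than one failed attention check, or an open-ended answer graded as a failure), a starting point for the 'exclude' variable and an accompanying filter can be computed with syntax like this sketch; responses we graded as uncertain can then be resolved by hand, as discussed next.

  * Start everyone at 0 (retain), then flag participants for exclusion.
  COMPUTE exclude = 0.
  IF (ac_fails > 1) exclude = 1.
  IF (explain_fail = 2) exclude = 1.
  EXECUTE.
  * Filter flagged participants out of subsequent analyses.
  COMPUTE keep = (exclude = 0).
  FILTER BY keep.
  EXECUTE.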

In assigning values to each participant, we will sometimes confront evidence from attention checks and open-ended responses that presents a mixed picture. We will also be forced to make decisions about participants whose open-ended responses we were unsure of. In these situations, the decision may depend on the research objectives. Descriptive research requires guarding against low-quality data because, as detailed in Chapter 10, even a small number of unreliable respondents can easily distort the findings. Correlational and experimental designs are more resilient to a low single-digit percentage of problematic respondents, especially if those respondents are randomly distributed across experimental conditions.

Once we have decided which respondents to exclude from further analyses, it is worth comparing the included and excluded participants on any benchmarking questions the study included (see Chapter 11) or on the study's main outcomes. Unreliable participants often show a noticeably different pattern on benchmarking questions than reliable respondents, providing confirmation that the right people were excluded. In the dataset from Connect, there were too few questionable responses for exclusions to make a large difference in the outcomes. In the data that Kellogg's collected with CloudResearch, however, excluding people made a big difference, as shown in the conference presentation.
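One simple way to make such a comparison is an independent-samples t-test on a benchmark or outcome variable, grouped by the exclusion variable. In the sketch below, 'brand_attitude' is a hypothetical placeholder for whatever benchmark item the study included.

  * Compare retained (0) and excluded (1) participants on a benchmark item.
  T-TEST GROUPS=exclude(0 1)
    /VARIABLES=brand_attitude.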

While the screening detailed above is effective, in some situations, researchers want to screen their data more closely. The advanced techniques described in the next module offer further scrutiny. As we will see, however, these techniques can be difficult to implement and are not fit for every study.

Module 12.2

Advanced Techniques of Data Screening

Learn about statistical measures of data quality and their limitations

After learning about the attention checks and open-ended screening methods above, researchers sometimes wonder: "Isn't there anything more sophisticated? Aren't there screening methods grounded in statistics?" The answer is 'yes.' There are numerous statistical methods for identifying low-quality responses in survey data (e.g., DeSimone et al., 2015). However, these approaches come with significant limitations that make them less practical than the approach outlined above.

In this section, we introduce some of the advanced methods for screening data, provide a conceptual understanding of how they are used, and highlight their strengths and limitations. For most studies, the approach described in Module 12.1 will be sufficient and more practical to implement.

What Advanced Methods Do

The appeal of advanced screening methods lies in their covert nature. These measures use information that participants do not realize is being monitored to assess response quality. Unlike attention checks, which participants can clearly identify, advanced methods work behind the scenes, analyzing patterns in how people respond throughout the entire survey. The core assumption is that effortful, attentive responding should produce certain patterns in timing, consistency, and response choices, while careless or random responding will violate these patterns in ways a researcher can detect.

Advanced screening methods generally fall into three categories. First, consistency-based measures examine whether participants provide similar responses to items that should logically receive similar answers, or different responses to items that should receive different answers. Measures in this category include synonym-antonym pairs that identify similar or opposite items within a survey and personal consistency measures that examine whether individual participants respond reliably within each scale they complete (DeSimone et al., 2015). It also includes measures like the Squared Discrepancy Score, which examines the consistency of people's responses to scale items (e.g., Litman et al., 2017).

Second, timing-based measures assume that thoughtful responding requires a minimum amount of time. These measures flag participants who complete surveys or individual items too quickly to have read and considered the content.

Third, pattern-based measures look for aberrant response patterns, such as selecting the same response option repeatedly (straight-lining) or producing response combinations that are statistically unusual compared to the rest of the sample (e.g., Mahalanobis distance).
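To make these categories concrete, the sketch below shows one way timing and pattern flags might be computed in SPSS. The variable names ('duration_seconds', 'case_id', and 'item1' through 'item5') are hypothetical placeholders, and the cutoffs (60 seconds; a chi-square test at p < .001) are illustrative choices rather than established standards.

  * Timing-based flag: mark participants who finished implausibly fast.
  COMPUTE too_fast = (duration_seconds < 60).
  EXECUTE.
  * Pattern-based flag: save each case's Mahalanobis distance by regressing
  * an arbitrary numeric identifier on the scale items.
  REGRESSION
    /DEPENDENT case_id
    /METHOD=ENTER item1 item2 item3 item4 item5
    /SAVE MAHAL(mahal_d).
  * Flag distances beyond the chi-square critical value (df = 5 items).
  COMPUTE mahal_flag = ((1 - CDF.CHISQ(mahal_d, 5)) < .001).
  EXECUTE.

Even with flags like these in hand, a researcher still has to decide what counts as "too fast" or "too unusual," which is exactly the kind of subjective judgment discussed below.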

The appeal of these approaches is that, by using measures participants cannot easily evade, researchers can catch people who are not paying attention. Furthermore, by adopting statistical approaches to identify who provides reliable and unreliable data, these advanced methods promise to catch different types of problematic responding in a manner that feels more objective than examining performance on attention check questions. In reality, however, using these measures isn't so simple.

Why Advanced Methods are Hard to Implement

While advanced screening methods appear rigorous, they prove much harder to implement than many researchers initially expect. In fact, the practical challenges often make them impractical for most studies.

The first major hurdle is that these methods typically require long surveys with many items per construct to function effectively. Synonym and antonym pair approaches need sufficient items to reliably identify pairs with high correlations, which usually means having multiple items measuring similar concepts spread throughout the survey. These data quality measures are, of course, in addition to whatever measures the survey is actually focused on. Personal consistency measures also require multiple scales with numerous items each to produce stable reliability estimates. This creates a fundamental tension for researchers who want or need to conduct short studies. The brevity that makes online studies fast and affordable undermines the screening methods that require many items to assess quality.

Even with appropriately long surveys, however, there is no guarantee that reliable synonym or antonym pairs will emerge in any specific dataset. This means researchers often face a difficult choice: either invest extensive time before data collection identifying and pretesting potential item pairs or wait until after data collection to identify pairs statistically. The first approach requires investing time and money that may be wasted if the planned pairs do not perform as expected in the sample. The second approach means the researcher will not know if the method is feasible until after the data are collected.

Perhaps most surprising for methods that appear statistically rigorous, many advanced screening techniques require researchers to make subjective judgments about things like cutoff scores. Because measures of reliability or consistency can vary from measure to measure, there may be few established benchmarks for any particular measure. This means many decisions are made without clear empirical guidance about what constitutes a reasonable standard.

The cumulative effect of these challenges means that advanced screening methods often consume far more time and resources than anticipated, with uncertain benefits that may not justify the investment. Many researchers discover that straightforward attention checks and open-ended screening questions provide a good assessment of data quality with far less complexity.

When Advanced Methods Might Be Worth Considering

Advanced screening measures can be worthwhile in large studies that include multiple scales. In a long study that tests many items for a new measure or assesses many different psychological constructs, researchers may find it appropriate to include some of the advanced screening methods.

For most student projects and many professional studies, however, the combination of well-designed attention checks and open-ended questions provides a more practical approach to screening. The sophistication of advanced methods does not automatically make them superior. Often, simpler approaches that can be implemented well are more valuable than complex methods that introduce their own complications.

Good data screening should match a study's specific needs, timeline, and expertise rather than defaulting to whichever method appears most statistically sophisticated. Sometimes the most rigorous approach is the one you can execute with the time and resources that are available.

Summary

Data screening is critical to determining whether research findings are meaningful or misleading. This chapter demonstrated how to screen data to avoid misleading conclusions.

Typically, the data screening process involves three steps. First, a researcher downloads the data and ensures the file is complete, things worked as expected in the study, and the data are ready to analyze. Second, the researcher evaluates the quality of each participant's responses. Finally, the researcher decides who to exclude from analyses.

We demonstrated a straightforward approach to data screening that is suitable for most online research, and we gave you an opportunity to practice that approach with a real dataset from a marketing study. We also introduced more advanced measures and described both their strengths and weaknesses for online studies. Regardless of how exactly survey data are screened, it is important to remember the goal of data screening: removing unreliable responses that can mislead your research findings. In the next chapter, we will learn how to design survey studies so that they are easy for participants to navigate, improving data quality.

Additional Readings

  • Berinsky, A. J., Margolis, M. F., & Sances, M. W. (2014). Separating the shirkers from the workers? Making sure respondents pay attention on self-administered surveys. American Journal of Political Science, 58(3), 739-753.
  • Hartman, R., Moss, A. J., Rabinowitz, I., Bahn, N., Rosenzweig, C., Robinson, J., & Litman, L. (2023). Do you know the Wooly Bully? Testing era-based knowledge to verify participant age online. Behavior Research Methods, 55(7), 3313-3325.
  • Kung, F. Y., Kwok, N., & Brown, D. J. (2018). Are attention check questions a threat to scale validity? Applied Psychology, 67(2), 264-283.
  • Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437.
  • Rivera, E. D., Wilkowski, B. M., Moss, A. J., Rosenzweig, C., & Litman, L. (2022). Assessing the efficacy of a participant-vetting procedure to improve data-quality on Amazon's Mechanical Turk. Methodology, 18(2), 126-143.

Frequently Asked Questions

What are the main steps in the data screening process?

The data screening process involves three main steps: First, download the data and ensure the file is complete and the data are ready to analyze. Second, evaluate the quality of each participant's responses using attention checks and open-ended questions. Third, decide which participants to exclude from analyses based on the evidence gathered.

How should I evaluate attention check questions in my survey data?

To evaluate attention checks, recode each question into a pass/fail format (0 = pass, 1 = fail), then compute an overall score that tallies how many checks each participant failed. Participants who failed multiple attention checks should typically be excluded from analyses. Run frequency reports to see pass and fail rates for each item and overall performance.

Why are open-ended questions important for data quality screening?

Open-ended questions are important because they can expose participants who use AI to write responses, copy and paste answers from the internet, fail to follow instructions, or are engaged in survey fraud. While there is no quick way to evaluate them, some answers are so reliably linked to fraud that reading responses is worth the time investment.

What are advanced data screening methods and when should I use them?

Advanced screening methods include consistency-based measures (synonym-antonym pairs, personal consistency), timing-based measures (flagging fast responders), and pattern-based measures (straight-lining, Mahalanobis distance). While they sound sophisticated, they require long surveys with many items and often involve subjective cutoff decisions. For most studies, straightforward attention checks and open-ended screening provide adequate quality assessment with less complexity.

How do I decide which participants to exclude from my analyses?

Create an 'exclude' variable and assign a value of 1 to participants you intend to remove and 0 to those you will retain. Generally, exclude anyone who failed more than one attention check or gave an open-ended response indicating fraud. The decision may depend on research objectives: descriptive research requires stricter screening since even a small number of unreliable respondents can distort findings, while correlational and experimental designs are more resilient to a low single-digit percentage of problematic respondents.

Key Takeaways

  • Data screening is critical to determining whether research findings are meaningful or misleading
  • The screening process involves three main steps: verify data collection, evaluate response quality, and decide who to exclude
  • Approach data screening like detective work, gathering evidence about each participant's engagement while presuming innocence until evidence suggests otherwise
  • Attention checks should be recoded into pass/fail format and summed to identify participants who failed multiple items
  • Open-ended questions can expose AI-generated responses, copy-pasted answers, and failure to follow instructions
  • Sort open-ended responses alphabetically to easily spot duplicate responses or patterns
  • Descriptive research requires stricter screening than correlational or experimental designs
  • Advanced screening methods (consistency measures, timing data, pattern analysis) sound sophisticated but are harder to implement than expected
  • Advanced methods require long surveys with many items and often involve subjective cutoff decisions
  • For most studies, straightforward attention checks and open-ended screening provide adequate quality assessment with less complexity