Chapter 11

Data Quality Solutions

Protecting research from fraud and inattention

Aaron J. Moss, PhD, Leib Litman, PhD, & Jonathan Robinson, PhD

Introduction

If you conduct enough research, you will encounter problems with data quality. Sometimes these problems are caused by people who do not pay attention, speed through the survey, or misrepresent themselves. Often, the problem is fraud. As discussed in Chapter 10, the percentage of questionable responses across online panels (as opposed to researcher-centric platforms) often ranges from 30-40% (e.g., Weber, 2023). Thus, while the degree of the problem may vary from study to study and from one sample source to another, the problem itself is endemic to online research.

In this chapter, we will learn about effective techniques for identifying and removing problematic respondents from a study. As we learned in the last chapter, low quality data can produce not only misleading conclusions, but a fundamentally incorrect picture of reality.

In Module 11.1, we will examine how to identify problematic participants and how to validate attention checks with benchmark questions. Among the questions we will explore are: How do researchers measure data quality? How do they decide which participants provided honest answers? How can researchers be confident their measures of quality are accurate, and what tradeoffs must they consider when determining the appropriate threshold for removing respondents? To explore these questions, we will walk through an extended example from a large-scale study that examined data quality online (Reavey et al., 2024).

In Module 11.2, we will review the various types of attention checks commonly used in behavioral research. We will examine different approaches to identifying low-quality respondents, such as attention checks, comprehension checks, yea-saying questions, and open-ended validation, comparing their strengths and limitations. We will also cover best practices for writing new attention check questions and introduce a website, called SurveyDefense, that can be used to generate different kinds of attention checks.

Throughout the chapter, we will learn about a "fit-for-purpose" approach to data quality. Different research questions require different levels of quality control. The chapter will show how to make informed decisions about appropriate thresholds for specific research goals, and by the end, you will have practical tools to enhance data quality in your online (and offline) research studies.

Chapter Outline

Module 11.1

Detecting Fraudulent Responses

Learn how to spot fraudulent participants

Imagine you have just collected data from 2,000 people online. Before you can analyze the data, you must identify which participants gave thoughtful and honest answers and which ones did not. How would you do it?

The most common approach is to incorporate special questions in the study that detect inattention and fraudulent responding. These questions are often referred to as attention checks (Oppenheimer et al., 2009).

There are many different types of attention checks. In Module 11.2 we will review a variety of techniques that researchers use to measure attention. But first, we will examine how researchers detect a particular form of problematic responding called "yea-saying". The examples we will examine come from a real study that examined data quality in a large online sample (see Reavey et al., 2024).

Detecting Yea-Saying

The last chapter described the global network of survey fraud and the behaviors people use to siphon money out of online surveys. One of the most common strategies employed by fraudsters is to say 'yes' to nearly every question, a behavior known as yea-saying. To address this tendency, researchers often insert questions into their surveys that ask about impossible or highly unlikely scenarios. People who agree with several of these questions are flagged as unreliable and examined more closely for possible removal from the study.

In the Reavey et al. (2024) study, five questions were included to detect yea-saying (see Table 11.1). The first question asked: "Have you seen any live music shows in Shea Stadium, NY, in the last two years?" Shea Stadium was demolished in 2009, making it impossible for anyone to have attended a concert there recently. Participants who answered 'yes' to this question were either lying or not paying attention.

Table 11.1. Five yea-saying questions from Reavey et al. (2024).

Question Type | Question Text | Why It Works
Nonexistent venue | "Have you seen any live music shows in Shea Stadium, NY, in the last two years?" | Shea Stadium was demolished in 2009, making "yes" responses impossible.
Fictional cruise lines | "Please indicate if you have cruised with any of the following companies within the last two years: a. Sail Thru Vacations, b. Yacht-ify Cruises, c. Pacific Travails LTD, d. Outstanding Ocean Outings, e. None of these." | None of these cruise companies exist, so only "None of these" is correct.
Nonexistent products | "Please select which of the following haircare products you have used in the last six months: a. Bunfocore, b. Pleistorene Grow, c. Truefolica Treatment, d. None of the above." | None of these brands exist, making "None of the above" the only correct answer.
Fictional restaurants | "Have you recently eaten food from any of these small food chain stores? a. Cheese and Wine Flavor Junction, b. Zesty's Tomato Pies, c. Paulina's Poutine King, d. Toasty Pita BBQ, e. None of the above." | None of these food chains exist, so "None of the above" is the only correct answer.
Highly improbable event | "Have you filed a homeowners insurance claim due to damage from lightning within the past three months?" | Lightning damage claims are extremely rare (filed by fewer than 0.0001% of U.S. households in any three-month period).

Another question asked participants which of several cruise lines they had traveled with in the past two years. The answer options included names like "Sail Thru Vacations" and "Pacific Travails LTD." None of these businesses exist, meaning there was only one correct answer: "None of these."

Another question in the study asked about something that could really happen but is extremely rare: "Have you filed a homeowners insurance claim due to damage from lightning within the past three months?" Less than 0.0001% of U.S. households experience lightning damage in any three-month period. Other questions asked about non-existent hair care products and restaurant chains. How did participants do?

[Bar chart: proportion of problematic and fraudulent respondents across three online platforms. Platform 1: 32.42% (n = 1,095); Platform 2: 18.22% (n = 461); Platform 3: 22.41% (n = 406).]
Figure 11.1. Proportion of problematic and fraudulent respondents across three online platforms.

Figure 11.1 shows that, from a total sample of 2,000 respondents, between 18% and 32% of participants on each of the three platforms tested said 'yes' to at least two of the five questions. In other words, many participants provided incorrect or unreliable information. But this figure raises several interesting questions.

First, how can we know that the participants who failed these items are bad respondents who should be removed from the study? Perhaps these are good participants who misunderstood the question or inadvertently provided the wrong answer, what is referred to as a false positive. Second, how can we know whether the yea-saying questions caught all the bad respondents? Perhaps the study included other inattentive or unreliable participants that these questions did not catch. This would be a miss, or a false negative. And third, what is the rationale for choosing two out of five as the threshold for identifying problematic responses? Why not use a more lenient threshold (say, 3 out of 5 questions wrong) or a more stringent one (maybe 1 out of 5 wrong)? Let's examine each question one at a time.
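Before doing so, it may help to see how simple the flagging rule itself is to apply. The sketch below is a minimal, hypothetical example (the column names and data are ours, not values from Reavey et al.) of counting yea-saying failures and applying the two-out-of-five threshold in pandas.

```python
import pandas as pd

# Hypothetical data: each "yea_" column is coded 1 if the participant endorsed
# an impossible or highly improbable option, 0 otherwise.
df = pd.DataFrame({
    "participant_id":   [1, 2, 3, 4],
    "yea_shea_stadium": [0, 1, 0, 1],
    "yea_cruise_lines": [0, 1, 0, 0],
    "yea_haircare":     [0, 0, 0, 1],
    "yea_restaurants":  [0, 1, 0, 0],
    "yea_lightning":    [0, 0, 0, 0],
})

yea_items = [col for col in df.columns if col.startswith("yea_")]

# Count how many of the five yea-saying items each participant failed.
df["yea_failures"] = df[yea_items].sum(axis=1)

# Flag participants using the threshold from this module: two or more failures.
THRESHOLD = 2
df["flagged_unreliable"] = df["yea_failures"] >= THRESHOLD

print(df[["participant_id", "yea_failures", "flagged_unreliable"]])
```

The only real decision in this sketch is the value of THRESHOLD, which is exactly the judgment call the rest of this module explores.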

Using Benchmarks to Validate Attention Checks

How can we know that the yea-saying questions identified unreliable participants? This is a critical question. If researchers are going to remove 18-32% of their sample, they should be confident they are removing the right people.

The answer lies in what are called benchmark validation questions. These are questions for which researchers know, with a high degree of certainty, what percentage of people in the population should say 'yes'. Because that population rate is known, it serves as a base rate against which the percentage of 'yes' responses in the sample can be compared, making it possible to tell whether people are responding accurately (see Figure 11.2).

In the Reavey et al. (2024) study, there were three benchmark questions:

  1. "Do you own a Tesla?" approximately 1% of the U.S. population owned a Tesla at the time of the study).
  2. "Have you gone scuba diving within the last 12 months?" (less than 1% of people do this annually).
  3. "Do you follow a vegetarian diet?" (approximately 5% of U.S. adults are vegetarian).
[Diagram: the distinct roles of data quality questions, shown in two columns. "Used to Screen Participants": yea-saying questions, such as items about nonexistent venues and products. "Used to Validate Screening": benchmark questions about rare behaviors with known population rates, such as Tesla ownership and scuba diving.]
Figure 11.2. Two sets of questions. The first set contains attention checks. The second set contains benchmarking questions. The benchmarking questions are used to validate the attention check questions.

Using these items to validate the yea-saying questions works through a straightforward logic. If the attention checks identify unreliable participants, then the people who fail those items should show a different pattern of response on the benchmarking questions than people who pass the items. Specifically, people who engage in yea-saying should report significantly higher rates of rare behaviors like owning a Tesla, going scuba diving, and eating a vegetarian diet.
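To make this logic concrete, here is how the comparison could be computed, continuing the hypothetical DataFrame from the earlier sketch. The benchmark column names, values, and base rates below are illustrative assumptions, not data from the study.

```python
# Continuing the hypothetical DataFrame from the previous sketch.
# Benchmark columns are coded 1 if the participant reported the behavior.
df["owns_tesla"] = [0, 1, 0, 1]
df["scuba_12mo"] = [0, 1, 0, 1]
df["vegetarian"] = [1, 0, 0, 1]

benchmarks = ["owns_tesla", "scuba_12mo", "vegetarian"]
base_rates = pd.Series(
    {"owns_tesla": 0.01, "scuba_12mo": 0.01, "vegetarian": 0.05},
    name="population_base_rate",
)

# Proportion endorsing each benchmark behavior, split by the reliability flag.
# If the yea-saying items work, the flagged group should sit far above the
# base rates, while the unflagged group should sit close to them.
group_rates = df.groupby("flagged_unreliable")[benchmarks].mean()
print(group_rates)
print(base_rates)
```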

As shown in Figure 11.3, this is exactly what happened. Among participants flagged as unreliable because they said 'yes' to two or more attention check questions, between 40% and 60% claimed to have gone scuba diving in the last 12 months. That is more than 40 times the actual population rate! A similar percentage of participants also reported owning a Tesla (base rate = ~1%) and following a vegetarian diet (base rate = ~5%).

In contrast, participants who passed the yea-saying questions (i.e., the "reliable" group) reported values closer to the population base rates: about 4% reported owning a Tesla, between 3 and 5% reported scuba diving, and between 5 and 9% reported eating a vegetarian diet. While not a perfect match to the population rates, these numbers are much closer to reality.

Another way to look at the data is to examine how many participants reported any of the three rare behaviors. Assuming the three behaviors are independent, the expected base rate of reporting at least one of them is approximately 7% (1 − 0.99 × 0.99 × 0.95 ≈ 6.9%). As shown in Figure 11.4, more than 60% of respondents who failed the yea-saying questions, and nearly 80% on two of the three platforms, reported at least one of the benchmarking activities. Meanwhile, among participants who passed the yea-saying questions, between 9 and 12% reported at least one rare behavior.

[Bar chart: reported scuba diving rates across three platforms. Unreliable participants: 61%, 57%, 38%; Reliable participants: 3.65%, 3.18%, 5.40%; Full sample: 28.67%, 14.96%, 12.81%. Expected population rate: ~1%.]
Figure 11.3. Reports of scuba diving among three groups: 1) participants flagged by attention checks, 2) those not flagged by attention checks, and 3) the full sample. The prevalence of scuba diving in the general population is approximately 1%.
[Bar chart: rates of reporting at least one benchmark behavior across three platforms. Unreliable participants: 77.46%, 77.38%, 61.54%; Reliable participants: 10.27%, 8.75%, 12.38%. Expected population rate: ~6.89%.]
Figure 11.4. Reported rates of engaging in at least 1 of the 3 benchmarking activities (scuba-diving, Tesla ownership, or vegetarianism) among three groups: 1) participants flagged by attention checks, 2) those not flagged by attention checks, and 3) the full sample. The prevalence of at least one of the three activities in the general population is approximately 7%.

The dramatic difference between groups validates the yea-saying measures. More specifically, the benchmark questions help establish two things. First, they establish that people marked as unreliable are, indeed, providing unreliable information, not just on the yea-saying questions but across the entire survey (i.e., on the benchmarking questions, too). Second, they establish that people marked as reliable are providing reliable information, because their responses to the benchmarking questions are mostly aligned with what is expected in the general population.

At the same time, the benchmarking questions raise other interesting considerations. If participants who pass the yea-saying items are generally honest, then why do they report slightly higher rates of rare behaviors than the population averages? For example, why do 3 to 4% of people report owning a Tesla when the base rate is closer to 1%? It is, of course, possible that some degree of unreliable responding remains in the sample even after removing participants who failed multiple yea-saying items. If so, this observation leads directly to the next question we will examine: does setting more stringent exclusion criteria further improve data quality?

Choosing an Appropriate Exclusion Threshold

Whenever researchers use yea-saying items or attention checks, they must set a threshold for identifying people who should be removed from the analyses. So far, we have examined a threshold of failing at least two out of five yea-saying questions, an approach that proved effective at identifying unreliable participants. However, the benchmarking data also showed that participants in the reliable group were not perfectly aligned with the population parameters. This raises the question of whether it would help to use more stringent exclusion criteria.

To explore this issue, consider what happens to just the scuba diving question when we change the threshold from failing two or more questions out of five to failing one or more questions.

As shown in Figure 11.5, the more stringent threshold reduces the percentage of participants reporting scuba diving, bringing it closer to the population average. Among reliable participants (those who passed all five yea-saying questions), the reported rate of scuba diving drops even lower than it was previously. On Platform 2, for example, it drops from 3.2% to 2.1%. As a reminder, the expected population value is approximately 1%. This suggests the more stringent threshold removed additional unreliable respondents who would have been missed with the more lenient threshold.

[Bar chart: reported scuba diving rates across three platforms under the stringent threshold. Unreliable participants: 51.11%, 43.80%, 32.26%; Reliable participants: 3.18%, 2.06%, 4.26%; Full sample: 28.67%, 14.96%, 12.81%. Expected population rate: ~1%.]
Figure 11.5. Reported rates of scuba diving among three groups when a more stringent threshold is applied. Endorsement of even 1 out of 5 yea-saying questions qualifies a respondent as unreliable or fraudulent.

However, something interesting occurs among people marked as unreliable. With the stricter exclusion criterion, the percentage of unreliable participants who claim to have gone scuba diving also decreases, by about 13 points from the earlier analysis. On Platform 2, for example, 57.1% of respondents who failed at least two yea-saying questions reported having gone scuba diving, whereas among respondents who failed at least one yea-saying question the reported rate was 43.8%. In other words, the more stringent threshold flagged an additional group of respondents, roughly 13 percentage points' worth, who performed perfectly well on the benchmarking question. Why?

The answer lies in how the two groups are defined. When the more stringent threshold is applied, the unreliable group includes every participant who failed at least one attention check. Many of these people may be otherwise reliable respondents who made a single mistake, misunderstood the question, or were inattentive for a small part of the survey. They are not, however, systematically giving false information.

Figure 11.6 illustrates the difference between unreliable and reliable respondents across all three benchmarking questions. You can compare it to Figure 11.4, which uses the more lenient threshold. When comparing the two figures, we see that the percentage of reliable respondents who claim to engage in the benchmarking activities moves closer to the true population value, 6.9%. At the same time, however, the difference between the unreliable and reliable groups shrinks because the unreliable group contains participants who may have made just a single error in the study but are otherwise good respondents.

[Bar chart: rates of reporting at least one benchmark behavior across three platforms under the stringent threshold. Unreliable participants: 66.15%, 61.98%, 54.03%; Reliable participants: 8.09%, 6.76%, 9.93%. Expected population rate: ~6.89%.]
Figure 11.6. Rates of engaging in at least 1 of the 3 benchmarking activities (scuba-diving, Tesla ownership, or vegetarianism) among three groups when using a more stringent threshold. Endorsement of even 1 out of 5 yea-saying questions qualifies a respondent as unreliable or fraudulent.

The smaller difference between groups demonstrates an important tradeoff in setting exclusion criteria. While being stringent removes more potentially unreliable participants, it also increases the number of false positives. False positives in this situation are reliable participants who are categorized as unreliable. So how does a researcher know where to set the threshold?

[Line chart: reported rates of benchmark behaviors as exclusion thresholds become more stringent. Scuba diving drops from 10.52% to 2.51%, vegetarianism from 8.77% to 4.11%, and Tesla ownership from 7.07% to 1.82%.]
Figure 11.7. How the reported rate of rare behaviors changes as people fail more yea-saying questions.

The answer depends on the study's objectives. Figure 11.7 provides a visualization of how different thresholds affect data quality. Moving from left to right on the x-axis, the figure shows increasingly stringent thresholds, from only removing participants who failed all five attention checks to removing anyone who failed even a single attention check. The y-axis shows the percentage of participants reporting each benchmark behavior in the cleaned dataset.
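A threshold sweep like the one plotted in Figure 11.7 can be approximated in a few lines of code. Continuing the same hypothetical DataFrame from the earlier sketches, the sketch below recomputes the benchmark rates in the retained sample at each possible exclusion threshold.

```python
# For each exclusion threshold k (remove anyone who failed k or more yea-saying
# items), recompute benchmark rates in the retained ("cleaned") sample.
sweep = {}
for k in range(5, 0, -1):  # most lenient (k = 5) to most stringent (k = 1)
    retained = df[df["yea_failures"] < k]
    sweep[f"remove if failures >= {k}"] = retained[benchmarks].mean()

sweep_table = pd.DataFrame(sweep).T
print(sweep_table)
# As k shrinks, the retained sample's rates should move toward the population
# base rates, at the cost of excluding more participants.
```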

What is particularly revealing is how the pattern changes across exclusion thresholds. At the most lenient threshold, the reported rates of benchmarking behaviors are not only inflated but the opposite of what exists in the population. People report scuba diving at a higher rate than following a vegetarian diet, despite scuba diving being much rarer in the population.

Yet as the threshold becomes more stringent, the data corrects itself. When participants who failed two or more attention checks are removed, the pattern starts to align with the population. Following a vegetarian diet is now reported more frequently than scuba diving. But it's only at the most stringent threshold that the data approaches population benchmarks, with scuba diving reported at 2.51%, Tesla ownership at 1.82%, and vegetarianism at 4.11%. All of these numbers are much closer to their population base rates.

With this understanding, we can see that Figure 11.7 illustrates the trade-offs researchers face. Using more stringent exclusion criteria brings the data closer to ground truth but at the cost of removing valid participants. This tradeoff holds several important lessons about data quality.

First, participants are not simply good or bad; they and their data exist on a spectrum of quality. Some percentage of otherwise reliable respondents will lose focus and fail an attention check while performing well across the rest of the study. Truly unreliable respondents, on the other hand, are likely to fail multiple attention checks.

Second, it is often wise to take a fit-for-purpose approach to data quality. Different research questions require different levels of data scrutiny.

When research aims to examine the association between variables or test an experimental effect, a few percentage points of inattentive data are unlikely to threaten validity. For these studies, you can often use a threshold like failing two out of five attention check questions. Indeed, Reavey et al. (2024) found their expected effects using this threshold. The goal of a more lenient threshold is to remove the most unreliable participants while retaining those who may have made an occasional error.

If, on the other hand, the research aims to describe people's behavior or to uncover the prevalence of rare events—like the CDC study on dangerous cleaning practices—a more stringent exclusion threshold should be used. This is because even a small percentage of unreliable responses can dramatically inflate the prevalence of rare behaviors. So, to protect the data you might use a threshold of failing one out of five questions, even though it carries a higher risk of false positives.

No matter how strict your threshold is, some unreliable participants may slip through. Therefore, when the goal of the research is to estimate the prevalence of rare behaviors, we recommend another step that goes beyond asking attention check questions: verifying people's responses with open-ended questions.

Open-Ended Validation

For descriptive studies that require precision in their point estimates, we recommend following up on people's responses to multiple choice questions with an open-ended item that asks them to describe their experience. The value of this approach can be seen in the study described in Chapter 10, where people claimed to have drunk bleach during the COVID-19 pandemic (e.g., Litman et al., 2023).

In the attempted replication study, 15% of the participants reported behaviors like drinking or gargling bleach, using household cleaners on their bare skin, or inhaling bleach vapors to prevent COVID-19. After applying data quality techniques like those outlined above, the rate dropped to around 1%. However, when the behavior in question defies logic, like drinking bleach, it is critical to understand: is this 1% real, or is it an artifact of unreliable participants who were not caught by the attention checks?

To answer that question, everyone who passed the attention checks and reported a dangerous behavior was asked to describe what happened in their own words. Among these participants, not a single person described drinking bleach. What they did describe were misunderstandings and misreadings of the question.

For example, when asked, "Did you drink or gargle soapy water to prevent COVID-19?" one participant, who passed all attention checks, answered 'yes.' When asked to describe what happened, they wrote: "My mother made me wash my mouth out with soap water because I was cursing, and I accidentally swallowed some."

This response reveals that the participant read the part of the question that mentioned drinking soapy water but apparently did not read the qualifier about whether this was done to prevent COVID-19. Without the open-ended question, the researchers would have incorrectly counted this as evidence that people were drinking soapy water as a COVID prevention strategy.

Another person who said 'yes' to the question, "Did you use household cleaner to clean or disinfect bare hands or skin to prevent COVID-19?" wrote: "I washed my hands with antibacterial soap." Regular handwashing with antibacterial soap, however, is not the dangerous behavior the CDC was concerned about. Instead, it appears, this person thought antibacterial soap was an example of a "household cleaner."

When another participant was asked, "Did you inhale the vapors of household cleaners like bleach to prevent COVID-19?" they answered 'yes.' For the open-ended question they wrote: "I poured product on the floor and began to mop." This response reveals accidental exposure during routine cleaning rather than deliberate inhalation of bleach vapors to prevent COVID-19.

Overall, not a single participant described engaging in the dangerous cleaning practices that were the focus of the study. That means open-ended questions provided additional protection above and beyond what closed-ended attention checks provided. With both approaches combined, it is possible to fully vet the responses in a sample, even in a sensitive study that examines the prevalence of rare events. What began as a headline, "Millions of Americans are drinking bleach," turned out to be the result of fraud, inattention, misreadings, and misunderstandings, something researchers would never know without proper data quality measures.

Module 11.2

Types of Attention Checks

Explore the questions researchers use to measure attention

In the previous section, we examined how yea-saying questions can be used to remove inattentive and fraudulent participants from a study. Yea-saying questions are not, however, the only type of question used to detect careless responses. In this section, we will review the main options available to behavioral scientists.

What Are Attention Check Questions?

Attention check questions are a tool of quality control in behavioral research. They do exactly what their name suggests: check if participants are reading the study materials and thinking about their answers.

Attention checks are used in studies to identify participants who might harm data quality. As we saw above, removing unreliable participants often yields different results because they provide systematically biased data (e.g., Litman et al., 2023; Oppenheimer et al., 2009). Hence, these checks protect the quality of research.

While attention checks have been around for decades (e.g., Petzel et al., 1973), they have become especially important in the era of online research (Arndt et al., 2022). Because researchers cannot see online participants or control the environment people take studies in, attention checks offer a way to ensure some level of participant engagement.

Types of Attention Checks

Researchers commonly use several kinds of attention check questions. Some directly measure attention, while others check for people's willingness to follow instructions or whether a study's manipulation worked as intended. Attention checks can also vary in whether they tap into attention, memory, or language skills. Below, we discuss various forms that attention checks can take, when each is used, and the potential tradeoffs of using different forms of attention checks.

Instructed Response Items

Instructed response items tell participants how to respond. They might, for example, say "select agree to show you're paying attention" (Gummer et al., 2021; Meade & Craig, 2012).

Researchers often embed instructed response items within groups of similarly structured questions, as shown in Figure 11.8. Participants who are unengaged will likely miss these instructions.

[Survey matrix: a question asking how positive or negative participants feel about terms including Fire Eater, Fire Dancer, Fire Performers, an attention check item reading "Please select somewhat positive," Flame Throwers, and Pyromancer, with response options ranging from Extremely negative to Extremely positive.]
Figure 11.8. A matrix of multiple-choice questions is often a good place to embed an instructed response attention check.

Longer versions of instructed response items measure whether participants have carefully read instructions or long passages of text (Oppenheimer et al., 2009). They often look like this:

"Most modern theories of decision making recognize that decisions do not take place in a vacuum. Individual preferences and knowledge, along with situational variables, can greatly influence the decision-making process. In order to facilitate our research on decision making we are interested in knowing certain factors about you, the decision maker. Specifically, we are interested in whether you actually take the time to read the directions. If not, then some of our manipulations that rely on change in the instructions will be ineffective. So in order to demonstrate that you have read the instructions please ignore the question text below and select the third answer from the bottom of the list as your answer.

Which of the following is your favorite hobby?
Fishing
Movies
Gardening
Reading instructions
Walking
Exercise
Music
Do not enjoy hobbies
Other

If you read the entire question, then you likely noticed the instructions to select a specific answer, regardless of what your favorite hobby is. If, on the other hand, you skimmed the question or skipped the instructions, then you, like the participants who fail these items, probably provided the incorrect answer.

In our experience, it is best to avoid long instructional items like the one above. This is because the items do not accurately separate participants who are paying attention from those who are not. In studies where it is important that participants carefully read and comprehend long passages, comprehension checks are a better option. Unlike instructional items that rely on arbitrary passages for the sole purpose of checking for attention, comprehension checks are tied to the stimulus materials of the actual study.

Comprehension Checks

Comprehension checks test whether participants understand the content of the study. They often focus on whether participants grasp what they are supposed to do by asking a few questions about the instructions or having participants explain the task in their own words. People who fail these items will probably perform the task poorly.

The instructions below are from a well-known economic bargaining task, the "Ultimatum game." After reading instructions like these, participants may be asked the three closed-ended questions that follow to assess their comprehension of the instructions.

This task is about dividing money between yourself and another person to whom you are randomly matched. You do not know this other person and you are unlikely to ever meet him/her.

You have been randomly assigned the role of the "allocator". The other person is in the role of the "recipient".

At the start of the task, you will be endowed with $100 and the recipient is endowed with $0. You can decide how much of your $100 to transfer to the recipient. You can choose any amount between $0 and $100.

After you make your proposal, the recipient's task is to decide whether to accept or reject your offer. Your final payment and the recipient's final payment depend on the recipient's decision.

If the recipient accepts, you and the recipient both receive the amounts contained within your proposal. That is, you each receive the amounts that you allocated to the two of you.

If the recipient rejects, however, neither you nor the recipient receives any money; that is, both of you get $0.

  1. Which role have you been assigned to in the upcoming task? [Allocator, Recipient, Judge, Observer]
  2. How much money will you be endowed with at the start of the task? [$0, $1, $10, $100]
  3. What happens if the person you are paired with rejects the proposal? [We both receive $0, We split the money 50/50, I make a second offer, The person makes a counterproposal]

Such comprehension checks are best used in cognitive psychology experiments or any study that does not contain a long list of survey questions. This is because studies that lack long question batteries do not lend themselves to easily embedding yea-saying questions or instructed response items into the survey flow. Instead, for experiments and other similar studies, adding several questions that check people's comprehension of the study's instructions and stimulus materials is a good way to measure attention.

Manipulation Checks

Another way to use comprehension checks is to see whether respondents notice key elements of the experimental manipulation. As an example, imagine participants are asked to read a vignette that manipulates a person's occupation. Half of participants are told that a man works as an accountant and the other half are told he is a janitor. A manipulation check may ask participants to identify the man's occupation.

"What was John's occupation?"
Accountant
Janitor
Entrepreneur
School teacher
Unemployed

Sometimes manipulation checks work at a group level rather than at the individual level. In such cases, they examine whether the experimental manipulation produced a psychological difference between groups, or they assess how the manipulated information affected participants psychologically. If the purpose of the manipulation was to signal different levels of socioeconomic status (high vs. low), a manipulation check might ask participants:

How prestigious was John's occupation?

1 - Not very prestigious ... 5 - Very Prestigious

Although these latter measures can increase a researcher's confidence that participants are paying attention at a group level, they do not help when looking at individual behavior. In addition, a lack of difference between groups does not necessarily mean participants weren't paying attention. It may simply mean the manipulation was not effective (see Hauser et al., 2018).

Attention Checks that Pull for Yea-Saying

As we saw in the previous module, yea-saying questions are a type of attention check that takes advantage of a behavior often exhibited by fraudulent and inattentive participants: their tendency to say "yes" and "agree" to most questions.

Yea-saying questions ask about behaviors or experiences that are either extremely rare or completely impossible. For example, asking participants if they have visited a small town with just a handful of residents, if they own items that do not exist, or if they have experienced statistically improbable events are all forms of yea-saying questions.

The effectiveness of these questions comes from how they tap into the strategic behavior of fraudulent participants. As we saw in Module 11.1, respondents working in click farms, not paying much attention, or otherwise attempting to game the survey know that agreeing with most statements helps them qualify for more studies. By presenting plausible-sounding but verifiably false questions, researchers can identify participants who are employing this strategy.

A key advantage of yea-saying questions is that they can appear naturally within the survey. Rather than using obviously artificial checks like, "Have you ever had a fatal heart attack while watching TV?" researchers can create a subtle measure that blends into the survey's content. For instance, asking about fictional products alongside real ones ("Which of these haircare brands have you used recently?") or inquiring about nonexistent venues ("Have you visited the Meridian Theater in Chicago during the past year?") makes these items harder to detect.

Subtle items like these also help researchers avoid the negative reactions that obvious attention checks can provoke, which range from participants feeling their intelligence is being insulted to feeling that researchers are trying to trick them (e.g., Hauser & Schwarz, 2015; see Shamon & Berning, 2020).

Another advantage of yea-saying questions is their flexibility. Researchers can easily create new variations tailored to their specific research context, making them difficult for fraudulent participants to recognize and circumvent. While traditional attention checks often become recognizable to experienced survey-takers, yea-saying questions can be continually refreshed with new content while maintaining their effectiveness.

Open-Ended Checks

Open-ended questions ask participants to write a response in their own words. Often, these items work best when they are relevant to the study topic, but a general item our team sometimes uses asks participants to "Please describe the last thing you remember cooking and where you cooked it. Write at least one complete sentence."

Open-ended questions help researchers spot concerning behaviors like copying text from the internet or from AI, ignoring the question instructions, writing about an unrelated topic, providing generic responses common to click farms like "NICE" or "great product" in response to a question about breakfast, or using automated tools to fill in answers. These questions are especially good at catching people engaged in survey fraud, as these people often struggle to write natural-sounding responses. As we learned in Module 11.1, open-ended items can also ask participants to provide context for rare events or unusual behaviors that they endorsed in multiple choice items, checking the validity of their responses.
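When a study collects hundreds or thousands of open-ended answers, a few simple heuristic flags can help triage responses before manual review. The sketch below is an illustrative possibility rather than a procedure from the studies discussed in this chapter: it flags answers that are very short, duplicated across participants, or identical to generic stock phrases. The data, column names, and phrase list are hypothetical.

```python
import pandas as pd

# Hypothetical open-ended answers to the cooking prompt described above.
responses = pd.DataFrame({
    "participant_id": [1, 2, 3, 4],
    "cooking_answer": [
        "I made spaghetti with garlic bread in my kitchen last night.",
        "NICE",
        "great product",
        "great product",
    ],
})

# Illustrative list of stock phrases; extend it for your own study.
GENERIC_PHRASES = {"nice", "good", "great product", "very good", "i like it"}

text = responses["cooking_answer"].str.strip().str.lower()

responses["flag_too_short"] = text.str.split().str.len() < 5
responses["flag_generic"]   = text.isin(GENERIC_PHRASES)
responses["flag_duplicate"] = text.duplicated(keep=False)

# Any flagged response goes to manual review; the rest can be spot-checked.
flag_cols = ["flag_too_short", "flag_generic", "flag_duplicate"]
responses["needs_review"] = responses[flag_cols].any(axis=1)

print(responses[["participant_id", "needs_review"]])
```

Heuristics like these only surface candidates for review; the final judgment about whether an answer is usable should still be made by a person reading the response.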

Issues With Attention Checks

Psychological Effects on Participants

Participants sometimes interpret attention checks as an attempt to trick them (Hauser & Schwarz, 2015; Silber et al., 2022). Thus, it should not be surprising to learn that attention checks can influence participants' behavior later in the study.

Some participants react to attention checks by deliberately failing the questions in an act of defiance (Silber et al., 2022). Others tend to think more carefully about subsequent tasks (Hauser & Schwarz, 2015), and some see the questions as a challenge and pay more attention afterward (Kung et al., 2017; Shamon & Berning, 2020). These reactions show that attention checks are not neutral measurement tools. Instead, they are measures that can have a psychological effect on participants, just like everything else within a study (e.g., Hauser et al., 2018).

False Positives

One of the biggest risks with attention checks is false positives, or incorrectly marking reliable participants as unreliable. As touched upon above, this can happen because participants misunderstand the question, inadvertently select the wrong answer, or make some other mistake. For example, someone who is asked, "Have you ever been to McMullen, Alabama?" might think of a similar-sounding town they really have visited. Or, someone who is asked about using fictional products might confuse the names with real brands.

Beyond individual questions, the risk of false positives increases when researchers set exclusion criteria that are too stringent. As we saw in Module 11.1, requiring participants to pass all attention checks can provide modest improvements in data quality but at the expense of removing many people who provided otherwise useable data.

Mitigating Negative Effects

How can researchers avoid the negative consequences outlined above? We believe the best way is to adopt the balanced approach outlined in Module 11.1: use multiple, relatively simple checks and require participants to fail at least two items before being classified as problematic.

Three characteristics make an attention check effective: 1) writing a clear question, 2) giving it one correct answer, and 3) ensuring the question measures attention rather than another construct.

Write Clear Questions

Attention checks should verify people are paying attention, not trick them. This means attention checks should be easy to pass for people who are paying attention. Effective questions begin with direct prompts that stand on their own. In several studies we have seen, researchers seem to assume the goal of attention checks is to trick participants. Questions like "If John's father's brother is married to Mary's sister, how are John and Mary related?" or "If today is Tuesday and yesterday was Monday, what day will it be three days after tomorrow?" assess much more than attention. They require participants to rely on logic and sequencing, working memory, and careful reading. Because these questions may be confusing, they raise the risk of false positives.

Measure Attention, Not Other Constructs

Many researchers mistakenly use questions that assess memory, general knowledge, intelligence, or other cognitive constructs instead of attention. For example, in a study where participants read about two friends who have lunch to discuss work issues, asking "How many oranges did Steven have before trading with Josephine?" tests memory of peripheral details rather than attention to the study itself. Participants might have read carefully but failed to recall minor details. Put another way, if performance on an attention check correlates with cognitive ability, it is not a good attention check.

Consider another example that meets the elements discussed so far (a clear prompt and a single correct answer) but may nevertheless be challenging for participants: "How many times have you seen the 1997 film Titanic starring Brad Pitt?" Regardless of how many times someone has seen Titanic, the correct answer must be zero, because the movie starred Leonardo DiCaprio, not Brad Pitt. Yet by assessing general knowledge and memory together, the question relies on participants' prior knowledge, making it a poor measure of attention.

A Strategy for Implementing Attention Checks

So, what should you do with the information above? What is the best way to protect data quality?

Based on our experience with tens of thousands of researchers who have conducted studies with millions of participants, we recommend a strategy that balances simplicity with effectiveness, making it applicable across a range of research contexts.

For a 10–15-minute study, we recommend including 4 or 5 attention checks plus one open-ended item. As shown in Module 11.1, this combination has proven highly effective at identifying participants who provide low-quality data. The number can be adjusted based on your study's length and the sensitivity of your data, but this basic framework serves as a solid foundation for most research.

When using this approach, we suggest embedding your attention checks within groups of similar-looking items, so they don't stand out. For example, in a study about social media usage you might include the question "Do you have an active Vine account?" among questions about social media platforms. Since Vine shut down in 2017, anyone answering "Yes" needs closer examination. This approach feels natural to participants while still effectively measuring inattention.

When assessing people's performance, we recommend removing participants who fail more than one attention check or who provide an unusable open-ended answer. If the data you are gathering is descriptive in nature, you can require participants to pass all of your attention checks before they are included in the analyses.
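For readers who score their data programmatically, this rule could be wrapped in a small helper like the hypothetical function below, which assumes attention checks are coded 1 = failed and 0 = passed, and that the open-ended answer has already been judged usable or not. It is a sketch of the general strategy described above, not a standard library routine.

```python
import pandas as pd

def apply_exclusions(data: pd.DataFrame,
                     check_cols: list,
                     open_ended_ok_col: str,
                     descriptive_study: bool = False) -> pd.DataFrame:
    """Return the retained sample under the rule described above.

    check_cols: columns coded 1 = failed the attention check, 0 = passed.
    open_ended_ok_col: column coded True if the open-ended answer was usable.
    descriptive_study: if True, require passing every attention check, the
    stricter fit-for-purpose threshold recommended for prevalence estimates.
    """
    failures = data[check_cols].sum(axis=1)
    max_failures = 0 if descriptive_study else 1
    keep = (failures <= max_failures) & data[open_ended_ok_col]
    return data[keep]

# Example usage (hypothetical column names):
# cleaned = apply_exclusions(df, ["check_1", "check_2", "check_3", "check_4"],
#                            "open_ended_usable")
```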

Summary

Throughout this chapter, we have explored the importance of data quality in behavioral research and the various methods researchers can use to identify and remove unreliable responses. As we have seen, low-quality data is a persistent challenge, but researchers have several tools to spot low-quality responses and remove them from a dataset.

Among the effective approaches to protecting data quality we have learned about, yea-saying questions are especially important. These questions leverage the tendency of fraudulent respondents to agree with most statements, allowing researchers to identify people providing unreliable answers. When used in combination with benchmark validation questions, researchers can compare participants' response patterns to known population parameters, confirming the effectiveness of attention checks and validating the quality of the remaining data.

Beyond yea-saying questions, there are various forms that other attention checks can take, including instructed response items, comprehension checks, and manipulation checks. Each of these question types has advantages in different contexts.

By incorporating multiple attention checks, validating them with benchmark questions, and using open-ended items to verify unusual responses, researchers can significantly improve data quality. However, these strategies work best within a fit-for-purpose approach to data quality. Different research questions require different levels of quality control—studies examining associations between variables may tolerate some inattentive data, while research on rare behaviors demands more stringent screening and additional validation through open-ended responses.

When implementing these techniques, it is important to carefully consider potential trade-offs. While more stringent exclusion criteria bring data closer to ground truth, they also increase the risk of false positives, where researchers remove valid participants who may have made a single error. Finding the right balance requires understanding both your research objectives and the spectrum of data quality that exists in any sample.

As online research continues to evolve, maintaining high data quality standards remains essential. The tools and strategies outlined in this chapter provide a practical framework for researchers to ensure their findings accurately reflect reality rather than artifacts of inattention or fraud. Ultimately, high data quality advances our collective pursuit of scientific knowledge.

Additional Readings

  • Arndt, A. D., Ford, J. B., Babin, B. J., & Luong, V. (2022). Collecting samples from online services: How to use screeners to improve data quality. International Journal of Research in Marketing, 39(1), 117-133.
  • Berinsky, A. J., Margolis, M. F., Sances, M. W., & Warshaw, C. (2021). Using screeners to measure respondent attention on self-administered surveys: Which items and how many?. Political Science Research and Methods, 9(2), 430-437.
  • Kay, C. S., & Saucier, G. (2023). The Comprehensive Infrequency/Frequency Item Repository (CIFR): An online database of items for detecting careless/insufficient-effort responders in survey data. Personality and Individual Differences, 205, 112073.
  • Litman, L., Rosen, Z., Hartman, R., Rosenzweig, C., Weinberger-Litman, S. L., Moss, A. J., & Robinson, J. (2023). Did people really drink bleach to prevent COVID-19? A guide for protecting survey data against problematic respondents. PLOS ONE, 18(7), e0287837.
  • Reavey, B., Bruggemann, P., Rosenzweig, C., & Litman, L. (2024). Sentry In-Survey: A tool for preventing survey fraud. [Manuscript under review. Contact authors for copy].

Frequently Asked Questions

What are attention checks in behavioral research?

Attention checks are special questions incorporated into studies to detect inattention and fraudulent responding. They verify that participants are reading study materials and thinking about their answers. Common types include instructed response items, comprehension checks, yea-saying questions, and open-ended validation questions.

What is yea-saying and how do researchers detect it?

Yea-saying is a common strategy employed by fraudulent participants who say 'yes' to nearly every question. Researchers detect this by inserting questions about impossible or highly unlikely scenarios, such as visiting venues that have been demolished or using products that don't exist. People who agree with several of these questions are flagged as unreliable.

How do benchmark validation questions work?

Benchmark validation questions ask about behaviors with known population rates, such as Tesla ownership (approximately 1%) or following a vegetarian diet (approximately 5%). By comparing participants' responses to these known rates, researchers can validate whether their attention checks are correctly identifying unreliable respondents.

What is the fit-for-purpose approach to data quality?

The fit-for-purpose approach recognizes that different research questions require different levels of quality control. Studies examining associations between variables may tolerate some inattentive data with a threshold of failing 2 out of 5 attention checks. Research on rare behaviors demands more stringent screening, potentially requiring participants to pass all attention checks.

Why are open-ended validation questions important?

Open-ended validation questions provide additional protection beyond closed-ended attention checks. They help researchers verify unusual responses by asking participants to describe their experiences in their own words. This approach can reveal misunderstandings, misreadings, or fraudulent responses that multiple-choice attention checks might miss.

Key Takeaways

  • Data quality problems are endemic to online research, with 30-40% of responses from online panels often being questionable
  • Attention checks are special questions that detect inattention and fraudulent responding by verifying participants are reading and thinking about their answers
  • Yea-saying is a common fraudulent strategy where participants say 'yes' to nearly every question; researchers detect this using questions about impossible or highly unlikely scenarios
  • Benchmark validation questions ask about behaviors with known population rates (like Tesla ownership at ~1%) to verify that attention checks are correctly identifying unreliable respondents
  • Participants exist on a spectrum of quality—some otherwise reliable respondents may fail a single attention check while truly unreliable respondents fail multiple checks
  • The fit-for-purpose approach recognizes that different research questions require different levels of quality control
  • For studies examining associations between variables, a threshold of failing 2 out of 5 attention checks is often appropriate
  • For descriptive research on rare behaviors, more stringent thresholds (failing 1 out of 5) should be used, along with open-ended validation
  • Open-ended validation questions provide additional protection by asking participants to describe their experiences, revealing misunderstandings that multiple-choice questions miss
  • Effective attention checks should be clear, have one correct answer, and measure attention—not memory, general knowledge, or intelligence
  • For a 10-15 minute study, include 4-5 attention checks plus one open-ended item, embedded naturally within similar-looking questions