By Aaron Moss, PhD & Leib Litman, PhD
If you studied human behavior 30 years ago, your options for finding people to take your studies were limited. If you worked at a university, you might be able to rely on a student subject pool, or, with a lot of legwork, identify people in the community to participate in your studies. If you worked in the marketing industry, your company might conduct a focus group or hire an outside firm to conduct a phone survey, a mail survey or an in-person study with your target audience. Either way, the options for finding participants were slow, costly and restricted.
The internet changed all that.
Due to technology, research in the social and behavioral sciences has undergone a rapid revolution. Today, researchers can easily identify participants, quickly collect data and affordably recruit hard-to-reach groups. Online studies allow researchers to examine human behavior in exciting ways and at scales not possible in the past.
Even though online research has benefits for researchers, businesses, and science, it also presents some unique challenges. When conducting studies online, researchers must direct extra attention to data quality, an important and complex issue. So, let’s take a deeper look at what data quality is and why it’s important.
Data quality is a complex and multifaceted construct, making it difficult to precisely define. Nevertheless, perhaps one of the simplest definitions of data quality is that quality data 1) are fit for their intended purpose, and 2) have a close relationship with the construct they are intended to measure.
This definition may sound a bit abstract, so consider the example below.
Imagine you are a data scientist at a music streaming service such as Spotify. Your job is to use data from the songs people have listened to in the past to predict what kind of music they might listen to in the future.
The data you have in this case — songs people have listened to in the past — are likely high quality because the music people have listened to in the past probably predicts what they want to hear in the future.
The data are also likely high quality because the music people have listened to in the past is directly related to the construct you’re interested in measuring: musical preferences. In other words, your data possess the defining characteristics of high-quality data.
Generally speaking, your data in this case can suffer only from factors that cause them to lack completeness or credibility.
For example, if a user listens to music on your streaming service only once every six months, your data represent an incomplete picture of your user’s musical preferences. With such limited data, it’s difficult to ascertain what the user truly likes and dislikes.
A second way your data quality may suffer is because of a lack of credibility or consistency. Suppose a user allowed a friend or family member who likes very different music to use their account. Your data for this user would now be an inaccurate and perhaps inconsistent representation of their preferences, meaning the quality of your data would be lower as a result.
Although assessing data quality is relatively easy in the scenario above, measuring the quality of data collected in online research is often much more difficult. Such research may require people to answer survey questions, evaluate products, engage with psychological manipulations, walk through user-testing sessions, or reflect on their past experiences.
Researchers typically assess data quality at both the group level and the individual level. At both levels, researchers look for evidence that the data are: 1) consistent, 2) correct, 3) complete and 4) credible.
Evaluating the consistency of people’s responses at the group level often means examining measures of internal reliability, such as Cronbach’s alpha.
Measures of reliability tell the researcher how consistently a test measures what it is meant to measure. For validated measures that have been used before, a low reliability score can indicate inconsistent responses from research participants.
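To make the idea concrete, here is a minimal sketch of how Cronbach’s alpha can be computed from item responses using the standard formula. The function name and the sample responses are hypothetical and only for illustration; in practice most researchers would rely on a statistics package.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participants, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    items = list(zip(*responses))  # one tuple of scores per item
    item_variances = [variance(item) for item in items]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five participants answering a three-item scale.
sample = [
    [5, 5, 4],
    [4, 4, 4],
    [2, 1, 2],
    [1, 2, 1],
    [3, 3, 3],
]

print(round(cronbach_alpha(sample), 2))
```

Because the three items rise and fall together across participants, alpha for this made-up sample is high. Responses that bounce around at random would pull it down.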
Researchers assess the consistency of responses at the individual level by identifying either logical contradictions in people’s responses or inconsistent answers to specific questions designed to elicit the same information (e.g., “What is your age?” “What year were you born?”). People who provide many inconsistent responses are often removed from the dataset.
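The age and birth-year check can be sketched in a few lines. This is an illustration under stated assumptions: the field names, the survey year, and the sample records are all hypothetical, and the rule allows a one-year difference because a participant may not yet have had a birthday in the survey year.

```python
SURVEY_YEAR = 2024  # assumed year of data collection


def age_matches_birth_year(reported_age, birth_year, survey_year):
    """True when the reported age is possible given the reported birth year."""
    implied_age = survey_year - birth_year
    # Someone whose birthday has not yet occurred this year is one year younger.
    return reported_age in (implied_age, implied_age - 1)


# Hypothetical participant records.
participants = [
    {"id": "p1", "age": 34, "birth_year": 1990},
    {"id": "p2", "age": 27, "birth_year": 1996},
    {"id": "p3", "age": 45, "birth_year": 1999},
]

flagged = [
    p["id"]
    for p in participants
    if not age_matches_birth_year(p["age"], p["birth_year"], SURVEY_YEAR)
]
print(flagged)
```

Whether a single contradiction like this justifies removing someone is a judgment call. Many researchers look for several inconsistent responses before excluding a participant.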
What does it mean for data to be “correct”? Simply put, correct data are data that accurately measure a construct of interest. A construct might be happiness, customer satisfaction, people’s intention to buy a new product or something as complex as feelings of regret.
Regardless of the specific construct, researchers typically assess the group-level correctness of their data by examining whether the data are related to similar constructs they should relate to (convergent validity) and dissimilar from constructs they should not relate to (discriminant validity).
For example, a researcher studying life satisfaction might look for evidence that people who say they are satisfied with their life also say they are happy and not depressed.
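One simple way to look for this pattern is to correlate the measures with one another. The sketch below uses a hand-written Pearson correlation and made-up scores for seven participants; the variable names and values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical scores on 1-7 scales, one value per participant.
life_satisfaction = [1, 2, 3, 4, 5, 6, 7]
happiness = [2, 1, 3, 5, 4, 6, 7]
depression = [7, 6, 6, 4, 3, 2, 1]

print("satisfaction vs. happiness:", round(pearson_r(life_satisfaction, happiness), 2))
print("satisfaction vs. depression:", round(pearson_r(life_satisfaction, depression), 2))
```

In this invented sample, life satisfaction correlates strongly and positively with happiness and strongly and negatively with depression, which is the pattern a researcher would hope to see.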
Assessing the correctness of data at the individual level involves evaluating whether people provided consistent responses to similar items. This is not possible for all measures, but when possible, researchers may administer questions that are either synonymous (“On most days I feel happy” and “On most days I am in a good mood”) or antonymous (“I often feel good” and “I seldom feel good”) and examine the distance between participant responses to each item.
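As a rough sketch, the distance between paired items might be computed like this. The function name and the 1-to-5 scale are assumptions for illustration; for antonym pairs, one item is reverse-scored first so that consistent answers produce a distance near zero.

```python
def pair_distance(first, second, antonym=False, scale_min=1, scale_max=5):
    """Absolute distance between two paired item responses.

    For antonym pairs, the second response is reverse-scored before comparing.
    """
    if antonym:
        second = scale_min + scale_max - second
    return abs(first - second)


# Synonym pair: "On most days I feel happy" / "On most days I am in a good mood"
print(pair_distance(4, 5))                # small distance, consistent
# Antonym pair: "I often feel good" / "I seldom feel good"
print(pair_distance(5, 1, antonym=True))  # agree with one, disagree with the other
print(pair_distance(5, 5, antonym=True))  # agreeing with both is contradictory
```

A participant whose distances are large across many pairs is probably not reading the items carefully.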
At the group level, complete datasets are those where most people answer all items in the survey and those who start the survey finish it (i.e., low attrition).
At the individual level, complete responses often mean the same thing. However, a researcher may specify before collecting data that people must have seen or responded to key parts of the study, such as a manipulation, a manipulation check, or important outcome measures.
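A completeness check along these lines could look like the following sketch. The item names, the list of key items, and the records are hypothetical, and a skipped item is represented as `None`.

```python
def proportion_answered(record):
    """Share of items a participant answered (None marks a skipped item)."""
    return sum(answer is not None for answer in record.values()) / len(record)


def answered_key_items(record, key_items):
    """True when every pre-specified key item has a response."""
    return all(record.get(item) is not None for item in key_items)


# Key items the researcher decided on before collecting data (assumed names).
KEY_ITEMS = ["manipulation_check", "outcome"]

# Hypothetical records: one dictionary of answers per participant.
records = {
    "p1": {"q1": 4, "q2": 5, "manipulation_check": "yes", "outcome": 6},
    "p2": {"q1": 3, "q2": None, "manipulation_check": "yes", "outcome": 2},
    "p3": {"q1": 5, "q2": None, "manipulation_check": None, "outcome": None},
}

for pid, record in records.items():
    print(pid, proportion_answered(record), answered_key_items(record, KEY_ITEMS))
```

Here the second participant skipped one item but still answered everything the researcher designated as essential, while the third dropped out before reaching the key measures.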
Credible datasets are those in which respondents make a good faith effort to answer the questions in a study.
At the group level, credibility can sometimes be assessed by comparing the effect size of specific manipulations to those previously obtained with other samples. At the individual level, researchers have several tools for detecting participant responses that lack credibility.
These tools range from measures designed to detect overly positive or negative self-presentation to a variety of measures assessing people’s attention, effort, anomalous response patterns, speed through the survey and deliberate misrepresentation of demographic information.
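Two of these checks, survey speed and anomalous response patterns, are easy to sketch. The cutoff used below (half the median completion time) and the sample data are assumptions chosen for illustration, not an established standard.

```python
from statistics import median


def flag_speeders(durations, fraction=0.5):
    """Return IDs of participants who finished faster than a fraction of the median time."""
    cutoff = fraction * median(durations.values())
    return [pid for pid, seconds in durations.items() if seconds < cutoff]


def is_straightlining(answers):
    """True when a participant gave the identical answer to every item in a block."""
    return len(answers) > 1 and len(set(answers)) == 1


# Hypothetical completion times in seconds.
durations = {"p1": 300, "p2": 280, "p3": 60, "p4": 320}
print(flag_speeders(durations))

# Hypothetical answers to a block of rating items.
print(is_straightlining([3, 3, 3, 3, 3]))
print(is_straightlining([3, 4, 2, 5, 3]))
```

Neither flag proves a response lacks credibility on its own. Researchers usually combine several indicators before deciding to exclude a participant.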
Because data quality is a complex construct, researchers who collect data over the Internet strive to ensure the credibility of individual participant responses.
Everyone knows you can’t make good decisions with poor-quality data. The idiom “garbage in, garbage out” has traveled far beyond the realm of computer science — where it originated — because it captures the idea that if you don’t begin with good information, you can’t make effective decisions. But how, exactly, does low-quality data impair decision-making?
Low-quality datasets can lead researchers to make bad decisions by inflating the relationship between variables or making it appear two variables are related when they are not.
Spurious relationships might lead a healthcare analyst to determine that people with a specific set of symptoms prefer one treatment plan to another when people really prefer neither plan. Spurious relationships can also allow a university researcher to find results that later studies cannot replicate.
A low-quality dataset can introduce noise (i.e., error variance) that obscures or weakens the relationship between variables.
Noise within a dataset may cause a marketing team to determine there is no difference in the effectiveness of various messages intended to increase brand awareness when there actually is one. Noise may also lead a university researcher to decide there is no need to follow up on an exploratory study with an experiment because the primary variables of interest appear to be unrelated.
Regardless of the exact situation, noisy data produced by inattentive participants can cause researchers to overlook relationships that actually exist. Spurious relationships create the opposite problem: they can cause researchers and businesses to invest money in a future course of action that won’t pay dividends because the study’s findings are not reliable.
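A small simulation can show how noise weakens an observed relationship. This is a sketch under simple assumptions: two variables that are truly related, with random error of a chosen size added to one of them. The function names, sample size, and noise levels are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def observed_correlation(noise_sd, n=2000, seed=0):
    """Correlation between x and a copy of x contaminated with random error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [value + rng.gauss(0, noise_sd) for value in x]
    return pearson_r(x, y)


print("little noise:", round(observed_correlation(noise_sd=0.2), 2))
print("heavy noise: ", round(observed_correlation(noise_sd=3.0), 2))
```

The underlying relationship is identical in both runs. Only the amount of error variance changes, yet the heavily contaminated data show a much weaker correlation, which is how a real effect can be missed.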
CloudResearch clients know they can rely on quality data. That is why more than 3,000 academic institutions, multiple Fortune 500 companies, and federal agencies alike all trust CloudResearch with data collection. Get in touch today to learn how we can make your next research project a success.