What Is Data Quality and Why Is It Important?

The CloudResearch Guide to Data Quality, Part 1:
Defining and Assessing Data Quality in Online Research

If you were a researcher studying human behavior 30 years ago, your options for identifying participants for your studies were limited. If you worked at a university, you might be able to rely on a student subject pool, or, with a lot of legwork, identify people in the community to participate in your studies. If you worked in the marketing industry, your company might conduct a focus group or hire an outside firm to conduct a phone survey, a mail survey or an in-person study with your target audience. Either way, the options for finding participants were slow, costly and restricted.

The internet changed all that. Due to advances in technology, research in the social and behavioral sciences has undergone a rapid revolution. Today, researchers can easily identify participants, quickly collect data and affordably recruit difficult-to-reach groups. Online research makes it possible for researchers to study human behavior in exciting ways and at scales not possible in the past.

Although online research has myriad benefits for individuals, businesses and science as a whole, it also presents some unique challenges. When conducting studies online, researchers must direct extra attention to data quality, an important and complex issue. So let’s take a deeper look at what data quality is and why it’s important.


What Is Data Quality?

Data quality is a complex and multifaceted construct, making it difficult to precisely define. Nevertheless, perhaps one of the simplest definitions of data quality is that quality data 1) are fit for their intended purpose, and 2) have a close relationship with the construct they are intended to measure. This definition may sound a bit abstract, so consider the example below.

An Example of Data Quality in Market Research

Imagine you are a data scientist at a music streaming service such as Spotify. Your job is to use data from the songs people have listened to in the past to predict what kind of music they might listen to in the future.

In this case, the data you have (the songs people have listened to in the past) are likely high quality. Past listening is fit for the purpose at hand, because it is a good basis for predicting what people might want to hear in the future, and it has a direct relationship with the construct you're interested in measuring: people's musical preferences. Your data therefore possess both defining characteristics of high-quality data.

What might harm the quality of your data in this example? Generally speaking, your data can suffer from factors that cause them to lack completeness or credibility. For example, if a user listens to music on your streaming service only once every six months, your data represent an incomplete picture of that user's musical preferences. With such limited data, it's difficult to ascertain what the user truly likes and dislikes. A second way your data quality may suffer is through a lack of credibility or consistency. Suppose a user allowed a friend or family member who likes very different music to use their account. Your data for this user would now be an inaccurate and perhaps inconsistent representation of their preferences, and your data would be of lower quality as a result.

Assessing data quality is relatively straightforward in the above scenario. It is often harder in online research that asks participants to answer survey questions, evaluate products, engage with psychological manipulations, walk through user-testing sessions or reflect on their past experiences, because the correspondence between the construct the researcher wants to measure and the tools used to measure it is far looser. In addition, while the dataset in the music streaming example is produced by the listening behavior of users who were motivated to seek out and use the service, the data in online survey studies come from participants the researcher finds online and compensates for their time. As we will see, both measurement error and participant motivation can cause data quality problems.


4 Dimensions Used to Assess Data Quality

Researchers who collect data in online studies typically assess data quality at both the group level and the individual level. At both levels, researchers look for evidence that the data are: 1) consistent, 2) correct, 3) complete and 4) credible.

Consistency

Researchers evaluate the consistency of participant responses at the group level by examining measures of internal reliability, such as Cronbach's alpha. Measures of reliability tell the researcher how consistently a set of items measures the same underlying construct. For validated measures that have been used before, a low reliability score can indicate inconsistent responses from research participants.
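
To make this concrete, here is a minimal sketch of computing Cronbach's alpha for a handful of hypothetical survey items in Python; the response data and the commonly cited 0.70 rule of thumb are illustrative assumptions, not a prescription.

```python
import numpy as np

# Hypothetical responses: rows are participants, columns are items
# on the same 5-point scale intended to measure one construct.
responses = np.array([
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
])

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # values below ~0.70 often prompt a closer look
```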

Researchers assess consistency at the individual level by identifying either logical inconsistencies in participant responses or contradictory answers to specific questions designed to elicit the same information (e.g., “What is your age?” “What year were you born?”). Participants who provide largely inconsistent responses are often removed from the dataset.
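
As an illustration, a check like the age/birth-year comparison might look like the sketch below, assuming a hypothetical survey table and a fixed survey year; the one-year tolerance simply allows for birthdays that fall before or after the survey date.

```python
import pandas as pd

SURVEY_YEAR = 2024  # assumed year the survey was fielded

# Hypothetical responses to two questions that should agree.
df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age": [34, 52, 29],
    "birth_year": [1990, 1972, 1968],
})

# Age implied by the reported birth year.
df["implied_age"] = SURVEY_YEAR - df["birth_year"]

# Flag responses that disagree by more than a one-year tolerance.
df["inconsistent"] = (df["age"] - df["implied_age"]).abs() > 1

print(df[["participant_id", "age", "implied_age", "inconsistent"]])
```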

Correctness

What does it mean for data to be “correct”? Simply put, correct data are data that accurately tap into the researcher’s construct of interest. A construct might be happiness, customer satisfaction, people’s intention to buy a new product or something as complex as feelings of regret.

Regardless of the specific construct, researchers typically assess the correctness of their data at the group level by examining how strongly the data relate to measures they should relate to (convergent validity) and how weakly they relate to measures they should not relate to (discriminant validity). For example, a researcher studying life satisfaction might check whether people who say they are satisfied with their life also generally say they are happy and not depressed.
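
A rough sketch of that kind of group-level check appears below, using small, made-up scale scores; real analyses would use validated instruments and far larger samples.

```python
import pandas as pd

# Hypothetical scale scores for a handful of participants.
scores = pd.DataFrame({
    "life_satisfaction": [6.1, 4.2, 5.5, 3.0, 6.8, 2.5],
    "happiness":         [6.4, 4.0, 5.1, 3.3, 6.5, 2.9],   # should correlate positively (convergent)
    "depression":        [1.5, 3.8, 2.2, 4.9, 1.2, 5.3],   # should correlate negatively (discriminant)
})

corr = scores.corr()
print("Life satisfaction vs. happiness: ", round(corr.loc["life_satisfaction", "happiness"], 2))
print("Life satisfaction vs. depression:", round(corr.loc["life_satisfaction", "depression"], 2))
```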

Researchers assess the correctness of data at the individual level by ascertaining whether participants provided consistent responses to similar items. Although this is not possible for all measures, when possible, researchers may administer questions that are either synonymous (“On most days I feel happy” and “On most days I am in a good mood”) or antonymous (“I often feel good” and “I seldom feel good”) and examine the distance between participant responses to each item.
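
Here is one way such a within-person check might be sketched in code, assuming a 7-point response scale and a two-point tolerance; both numbers are illustrative choices rather than established cutoffs.

```python
import pandas as pd

SCALE_MAX = 7   # assumed 1-7 response scale
MAX_GAP = 2     # assumed tolerance before a pair of answers is flagged

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "happy_most_days":  [6, 2, 7],   # "On most days I feel happy"
    "good_mood_most":   [6, 6, 7],   # "On most days I am in a good mood" (synonymous)
    "often_feel_good":  [6, 3, 7],   # "I often feel good"
    "seldom_feel_good": [2, 3, 1],   # "I seldom feel good" (antonymous, reverse-coded below)
})

# Distance between synonymous items: should be small.
syn_gap = (df["happy_most_days"] - df["good_mood_most"]).abs()

# Reverse-code the antonymous item so it points the same direction, then compare.
reversed_item = (SCALE_MAX + 1) - df["seldom_feel_good"]
ant_gap = (df["often_feel_good"] - reversed_item).abs()

df["flag_inconsistent"] = (syn_gap > MAX_GAP) | (ant_gap > MAX_GAP)
print(df[["participant_id", "flag_inconsistent"]])
```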

Completeness

At the group level, complete datasets are those where most participants answer all items in the survey and most people who start the survey finish it (i.e., low attrition). At the individual level, complete responses often mean the same thing, but a researcher may specify before collecting data that participants must have seen or responded to key questions within the study, such as the manipulation, manipulation checks or primary dependent measures.
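
As a simple illustration, the sketch below computes a completion rate and flags respondents who missed a key question; the column names (finished, q1, q2, dv_main) are hypothetical stand-ins for a real survey export.

```python
import numpy as np
import pandas as pd

# Hypothetical survey export: NaN marks an unanswered item.
df = pd.DataFrame({
    "participant_id": [1, 2, 3, 4],
    "finished":       [True, True, False, True],
    "q1":      [4, 3, np.nan, 5],
    "q2":      [5, np.nan, np.nan, 4],
    "dv_main": [6, 5, np.nan, np.nan],   # primary dependent measure
})

# Group level: completion (attrition) rate and overall item non-response.
completion_rate = df["finished"].mean()
item_missing_rate = df[["q1", "q2", "dv_main"]].isna().mean().mean()
print(f"Completion rate: {completion_rate:.0%}, average item missingness: {item_missing_rate:.0%}")

# Individual level: require that the key question was answered and the survey was finished.
df["meets_completeness_rule"] = df["dv_main"].notna() & df["finished"]
print(df[["participant_id", "meets_completeness_rule"]])
```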

Credibility

Credible datasets are those in which participants make a good-faith effort to answer the questions in a study. At the group level, credibility sometimes can be assessed by comparing the effect size of specific manipulations to effect sizes previously obtained with other samples. At the individual level, researchers have several tools for detecting responses that lack credibility. These tools range from measures designed to detect overly positive or negative self-presentation to a variety of measures assessing participants’ attention, effort, anomalous response patterns, speed through the survey and deliberate misrepresentation of demographic or other information.
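
Two of the more mechanical screens, speeding and straight-lining, can be sketched as follows; the minimum-duration cutoff and the item grid are assumptions made for the example.

```python
import pandas as pd

MIN_SECONDS = 120   # assumed minimum plausible completion time for this survey

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "duration_seconds": [540, 75, 610],
    "q1": [4, 5, 2],
    "q2": [5, 5, 6],
    "q3": [3, 5, 1],
    "q4": [4, 5, 2],
})

items = ["q1", "q2", "q3", "q4"]

# Speeders: finished implausibly fast for the length of the survey.
df["speeder"] = df["duration_seconds"] < MIN_SECONDS

# Straight-liners: gave the identical answer to every item in a grid.
df["straightliner"] = df[items].nunique(axis=1) == 1

df["credibility_flag"] = df["speeder"] | df["straightliner"]
print(df[["participant_id", "speeder", "straightliner", "credibility_flag"]])
```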

Because data quality is a complex construct, researchers who collect data over the internet strive to ensure the credibility of individual participant responses. Measures of participant credibility can be represented in a hierarchy that runs from assessments of participant attention to participant effort to participant ability.

  • Attention: The first step on the ladder to quality data is attention. Are participants paying enough attention to read and respond to the instructions and questions they are asked? Attention is the minimum criterion necessary for quality data.
  • Effort: The second element of data quality is effort. Measurements of participant effort typically tap into how much participants are willing to engage with the measures and manipulations in a study.
  • Ability: The highest element of data quality is ability. Researchers assess participants’ higher-order functioning: How well do participants solve problems? How creative are they when asked to complete a novel task? How accurate are their predictions? Any time researchers assess how well participants perform a task, they are assessing their abilities.

Why Is Data Quality Important to an Organization or Researcher?

Everyone knows you can’t make good decisions with bad data. The idiom “garbage in, garbage out” has traveled far beyond the realm of computer science, where it originated, because it captures the idea that if you don’t begin with good information, you can’t make effective decisions. But how, exactly, do low-quality data impair decision-making?

One way is by introducing noise (i.e., error variance) that obscures or weakens the relationship between variables. Noise within a dataset may cause a marketing team to conclude there is no difference in the effectiveness of various messages intended to increase brand awareness; it might lead a university researcher to decide there is no need to follow up an exploratory study with an experiment because the primary variables of interest appear unrelated. Regardless of the exact situation, noisy data produced by inattentive participants can cause researchers to overlook relationships that actually exist.
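
To see why noise matters, consider the small simulation below: it correlates two genuinely related variables, then adds random measurement error to one of them and correlates again. The sample size, effect size and noise level are arbitrary values chosen only to make the attenuation visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two variables with a genuine underlying relationship.
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

# The same outcome measured with extra noise, e.g., from inattentive responding.
y_noisy = y + rng.normal(scale=2.0, size=n)

print("Correlation with careful measurement:", round(np.corrcoef(x, y)[0, 1], 2))
print("Correlation with noisy measurement:  ", round(np.corrcoef(x, y_noisy)[0, 1], 2))
```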

A second way low-quality data can lead to bad decisions is by inflating the relationship between variables or making it appear that two variables are related when they are not. Spurious relationships (those that capitalize on chance and don’t hold up to repeated observations) might cause a healthcare analyst to conclude that people with a specific set of symptoms prefer one treatment plan to another; they can lead a university researcher to report results that later studies cannot replicate. Either way, spurious relationships can cause researchers and businesses to invest money in a future course of action that won’t pay dividends because the study’s findings are not reliable.


CloudResearch clients know they can rely on quality data. That is why more than 3,000 academic institutions, multiple Fortune 500 companies and federal agencies trust CloudResearch with data collection. Get in touch today to learn how we can make your next research project a success.
