Chapter 4

Measurement

Learn how behavioral scientists measure psychological constructs

Aaron J. Moss, PhD, Leib Litman, PhD, & Jonathan Robinson, PhD ~30 min read

Introduction

Imagine scrolling through social media and seeing this headline: "Nearly Half of Black Americans Don't Think It's Okay to Be White." The story cites a major polling company. The statistics seem clear. The methodology appears sound. Would you question it? Or would you, like millions of others, accept the headline as fact?

This isn't a hypothetical scenario. In February 2023, Rasmussen Reports released a poll showing that only 53% of Black Americans agreed with the statement, "It's okay to be White" (Rasmussen Reports, 2023). The poll spread rapidly through social media and news outlets, fueling debates about racial divisions in America. Those debates intensified when the creator of Dilbert—one of America's most popular comic strips—used the poll to make incendiary statements that eventually ended his career (Medina, 2023). But there was a fundamental problem: the survey question didn't measure what it claimed to measure.

When this textbook's authors conducted a follow-up study, we discovered that people who disagreed with the statement weren't expressing negative attitudes toward White people. Instead, they were confused by the question's meaning or responding to its use as a political slogan. When the question was redesigned to be clearer, the supposed racial divisions disappeared. As one respondent wrote, "Color should not matter in this day and age we are all the same inside" (Hartman et al., 2023). This sentiment was echoed by the overwhelming majority of participants in the survey.

Although this story is about polling and race relations, it is also a reminder of why measurement matters. In behavioral research, the questions researchers ask and how they ask them have consequences. Poor measurement doesn't just produce bad data—it can shape public opinion, mislead policy makers, and affect how people think about others.

As we learn about measurement in this chapter, keep this case in mind. It illustrates three important principles:

  1. Single questions rarely capture complex attitudes accurately.
  2. How you ask a question matters as much as what you ask.
  3. Good measures need to be systematically tested for reliability and validity.

While these principles might seem like technical details, this chapter's opening shows why they matter. Getting measurement right is about more than good science; it's about responsibility to truth and the people influenced by a study's findings.

The modules in this chapter walk through the process of finding, creating, and validating measures for behavioral research. In Module 4.1, you will learn how measurement scales work and how researchers transform abstract psychological constructs into variables they can analyze. In Module 4.2, you will learn how to find existing measures and create your own, including how to use AI tools to generate and refine measurement items. Module 4.3 covers how researchers test their measures for reliability and validity, and Module 4.4 explores the different types of measurement scales (nominal, ordinal, interval, and ratio) and how they affect data analysis.

By the end of this chapter, you will know how to both find and create measurement instruments for behavioral research. This skill is valuable not just for academic research but for anyone who needs to gather systematic information about how people think, feel, and behave. So, let's take a measured step in your research journey!

Chapter Outline

Module 4.1

Working with Scale Instruments

Learn how researchers use questionnaires to measure psychological constructs

In Chapter 2, we introduced tools for measuring four kinds of data in the behavioral sciences: self-reported opinions and attitudes, cognitive performance, physiological responses, and behavioral tracking (see Figure 2.10). In this chapter, we focus on self-report measurements that take the form of a questionnaire, called scale instruments.

A scale instrument is a collection of carefully designed questions that work together to measure a psychological construct. The Generalized Anxiety Disorder scale, or GAD-7, introduced in the previous chapter is an example. Creating these instruments is both an art and a science: an art because researchers must find creative ways to capture complex human experiences through carefully constructed questions, and a science because these measures must be systematically tested to ensure they are both reliable and valid. We will use the GAD-7 to see how behavioral scientists measure psychological constructs and how they validate these measures for use in research.

As you will recall, the GAD-7 was designed to assess anxiety. It consists of seven questions that ask how often a person has been bothered by specific problems over the past two weeks. Each item is scored on a scale from 0 ("Not at all") to 3 ("Nearly every day").

For example, one item asks: "Over the last 2 weeks, how often have you been bothered by feeling nervous, anxious, or on edge?" If someone responds "Not at all," they receive 0 points. If they say "Several days," they receive 1 point. A response of "More than half the days" equals 2 points, and "Nearly every day" equals 3 points.

To calculate a person's total anxiety score, a researcher would sum the scores from all seven items. This creates a total score that can range from 0 to 21, with higher scores indicating more severe anxiety. This summed score becomes the operational definition of anxiety.
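
To make the scoring concrete, here is a minimal sketch in Python that scores one hypothetical respondent. The response labels and their 0-3 codes come from the GAD-7 itself; the particular answers are invented for illustration.

```python
# GAD-7 response options and their numeric codes.
codes = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}

# One hypothetical respondent's answers to the seven items.
answers = [
    "Several days", "Not at all", "More than half the days",
    "Several days", "Nearly every day", "Not at all", "Several days",
]

# The operational definition of anxiety: the sum across all seven items (0-21).
total_anxiety = sum(codes[a] for a in answers)
print(total_anxiety)  # 1 + 0 + 2 + 1 + 3 + 0 + 1 = 8
```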

Table 4.1 shows GAD-7 responses from 10 hypothetical participants. Each row represents one person, and each column represents their response to one of the GAD-7 items. The rightmost column shows total anxiety scores—the sum of people's responses across all items.

Table showing GAD-7 responses and total scores for 10 hypothetical participants. Each row shows a participant's responses (0-3) for all seven items, with total scores ranging from 0 to 18.
Table 4.1. GAD-7 Responses and Total Scores for 10 hypothetical participants.

Looking at the table, there is considerable variation in anxiety. Participant 7 reported no anxiety symptoms (total score = 0), while Participant 9 reported very high anxiety (total score = 18). This variation allows researchers to examine patterns and relationships among people.

In the research activity below, you will learn how scale instruments are used in research. You will work with a dataset of 500 real participants who completed the GAD-7 along with three other measurement scales. You will learn how to use SPSS to calculate a total score for each participant, compute descriptive statistics on the sample, and create a histogram that characterizes the distribution of anxiety. Through this hands-on experience, you will see how scale instruments allow researchers to quantify psychological constructs.

📊

Research Activity 4.1: Working with Measurement Instruments

For this exercise, you will analyze data from four validated instruments that measure depression (PHQ-9; Kroenke et al., 2001), anxiety (GAD-7; Spitzer et al., 2006), trauma (PC-PTSD-5; Prins et al., 2016), and sleep disturbances (ISI; Bastien et al., 2001). Each of these measurement scales assesses a clinical outcome, with higher scores indicating a higher level of disorder. For example, people with higher scores on the depression scale have higher levels of depression, and people with higher scores on the sleep disturbance scale have worse sleep compared to those with lower sleep disturbance scores.

To get started, download the SPSS data file "RITC_DATA_CH04_Measurement.sav" from the Research in the Cloud OSF page: https://osf.io/a8kev/. Once the file is open, you will see 500 rows of data, with each row representing one respondent. You will also see columns that correspond to each question that was asked in Qualtrics or Engage. Figure 4.1 shows several scores from the GAD-7.

SPSS data view showing seven anxiety scale columns (ANX_1 through ANX_7) with participant responses. Annotations indicate that each column represents a single GAD-7 question about anxiety and each row represents a participant's data.
Figure 4.1. Seven anxiety scale columns in the Clinical Study SPSS file.

Calculating Total Scores: Anxiety, Depression, Trauma, and Sleep Disturbance

To calculate the anxiety score for each respondent, you need to add all seven scores together to obtain a total anxiety score. The instructional video available online guides you through creating the total anxiety score and generating a histogram to examine the distribution: https://bit.ly/Ch4_ts. You can also follow the HOW TO instructions in Box 4.1. After creating total anxiety scores, you can repeat the process to calculate total scores for the depression, sleep, and trauma scales.

Box 4.1

How to Calculate Anxiety Scores in SPSS

Follow the steps below to calculate total scores for the anxiety scale. Then, you can apply the same steps to calculate scores for the depression, sleep, and trauma scales.

Open the dataset

  • Open SPSS and navigate to File → Open → Data
  • Find the "RITC_DATA_CH04_Measurement.sav" file from where you downloaded it

Create the total score variable

  • Click "Transform" in the top menu
  • Select "Compute Variable..."
  • In the "Target Variable" field, type "TotalAnxiety"
  • In the "Numeric Expression" field, create the following formula: SUM(Anx_1, Anx_2, Anx_3, Anx_4, Anx_5, Anx_6, Anx_7)
  • Click "OK"

Verify the calculation

  • The new variable should appear at the end of the dataset
  • Examine or "spot check" a few cases to ensure the accuracy of the scores

Create a histogram to visualize the data

  • Click "Graphs" in the top menu
  • Select "Chart Builder"
  • In the gallery at the bottom left, select "Histogram"
  • Drag the histogram icon into the preview area
  • Drag the "TotalAnxiety" variable to the x-axis (horizontal) box
  • Click "OK" to create the figure
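
If you would rather work in code than in SPSS menus, the steps in Box 4.1 can be reproduced in Python. The sketch below is a rough equivalent, assuming the file sits in your working directory and the anxiety columns are named Anx_1 through Anx_7 as in the formula above; pandas reads .sav files through the optional pyreadstat package.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Open the dataset (pass convert_categoricals=False if the labeled values
# come back as text instead of 0-3 numeric codes).
df = pd.read_spss("RITC_DATA_CH04_Measurement.sav")

# Create the total score variable by summing the seven items (range 0-21).
anx_items = ["Anx_1", "Anx_2", "Anx_3", "Anx_4", "Anx_5", "Anx_6", "Anx_7"]
df["TotalAnxiety"] = df[anx_items].sum(axis=1)

# Verify the calculation by spot-checking a few cases.
print(df[anx_items + ["TotalAnxiety"]].head())

# Create a histogram to visualize the distribution of total anxiety.
df["TotalAnxiety"].plot(kind="hist", bins=22, edgecolor="black")
plt.xlabel("Total anxiety (GAD-7)")
plt.ylabel("Frequency")
plt.show()
```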

Visualizing the Distributions of Anxiety, Depression, Trauma, and Sleep Disturbance from an Online Sample of 500 People

Using either the video or the HOW TO box, create a histogram for the GAD-7. It should look like Figure 4.2. This histogram reveals an important pattern in the data known as a positive skew. A positively skewed distribution means that most people reported low anxiety (scores near 0), while progressively fewer people reported high levels of anxiety. In other words, the "tail" of the distribution extends to the right. This pattern shows that although most people experienced minimal symptoms of anxiety, a smaller subset reported moderate to severe anxiety (scores of 15–21). Unlike the marathon times we discussed in the previous chapter, which were approximately normally distributed, clinical outcomes such as anxiety and depression (along with many other variables) often follow a positively skewed distribution.

Histogram showing the distribution of anxiety scores as measured by the GAD-7. The distribution is positively skewed, with most scores clustered near 0 and fewer high scores.
Figure 4.2. The distribution of anxiety scores, as measured by the GAD-7.

Once you have created a visualization of the anxiety scores, repeat the process for depression, trauma, and sleep disturbances.

📝

Research Portfolio

Portfolio Entry #8: Reporting on the Descriptive Statistics of Anxiety and Other Clinical Measures

After you have generated the histograms for the four variables in the dataset, paste them into your portfolio and write a few sentences about what the histograms show about their distributions. What does the distribution of depression look like? Is the distribution normal or skewed? Why do you think clinical variables, such as anxiety, depression, trauma, and sleep disturbance, are positively skewed?

What the Measures Reveal

Now that you have completed the exercise, let's examine what it reveals about measurement.

Through calculating total anxiety scores, we learned how researchers combine multiple items or questions to measure psychological constructs (e.g., anxiety). Each GAD-7 item captures a different aspect of anxiety, and together they provide a more complete picture of the construct than any single question could.

This exercise also demonstrated the essential connection between theoretical constructs like anxiety and measurable data. Without measurement instruments like the GAD-7, researchers could not study psychological phenomena like anxiety, visualize their distributions, or make meaningful comparisons between people and groups. Anytime researchers want to measure a psychological construct, they need scale instruments like the GAD-7 that help transform people's experiences into a precise measurement.

Now that we know how to measure a variable, let's discuss where to find measurement scales. There are two options: use an existing scale or create your own.

Module 4.2

Finding and Creating Measures

Learn how to locate existing scales and develop new measurement instruments

Anytime you want to measure a psychological construct, you should start by looking at what measures already exist. Behavioral scientists have developed thousands of validated measures over the years, and many are freely available. There is no need to reinvent the measuring stick.

One place to look for existing measures is in a database. Several of these exist, such as Psychology Tools (https://psychology-tools.com) for clinical scales and PsyToolkit (https://www.psytoolkit.org/survey-library/) and PsyTests (https://psytests.org) for general scales. These websites provide a free library with many validated measures.

The measures in these databases are often organized into categories that reflect different areas of research. For example, on PsyToolkit, under the category of personality, you can find measures of the Big Five we explored in Chapter 1, as well as questionnaires that measure characteristics like impulsivity and the need for cognition (a measure of how much people like to think). In the mental health category, there are validated measures for anxiety, depression, stress, and well-being. There are also categories for social behavior, political attitudes, consumer decision-making, and several other topics.

Beyond measurement repositories, many colleges and universities have subscriptions to databases such as PsycTests and PsycINFO. PsycTests provides access to a wide range of psychological tests and measures, many of which are available for direct use. Each entry includes a description about the purpose of the test, how it is scored, and what it measures, along with background information on its development, including reliability and validity data. PsycINFO, on the other hand, provides access to peer-reviewed publications. Like Google Scholar, PsycINFO contains scholarly articles, books, and dissertations that describe specific measures. By searching for keywords like "self-esteem scale" or "anxiety questionnaire," you can find research that discusses how these measures were developed, validated, or applied.

📊

Research Activity 4.2: Finding Existing Measures

This activity will help you navigate measurement databases and learn to find questionnaires that are of interest to you.

Start by visiting PsyToolkit (https://www.psytoolkit.org/survey-library/). Navigate to the library of questionnaires and find a topic you are interested in. There are hundreds of options like social media use, romantic jealousy, sleep quality, political attitudes, or how connected people feel to nature.

Once you have a topic, spend 15 minutes or so browsing the available measures. When you "run the demo" associated with a measure, you will be able to read through the items. As you explore measures, note which aspects of your topic each scale measures, and which scales seem most useful for the research you may want to do.

📝

Research Portfolio

Portfolio Entry #9: Reporting on the Measure You Found

After you have found a measure and read about it, paste a reference to the instrument in your research portfolio. Then, describe the topic you explored and what kind of measures you found. How did you settle on the instrument you chose? What does it measure, how many items does it contain, and what kind of scale do participants use to respond?

After describing your measure, write a few sentences explaining how you could use this instrument in a study. How would factors like scale length or response format affect your research?

Creating New Measures with AI

Existing measurement instruments are an effective way to measure what you are studying. There are times, however, when this won't work. The measure you need might not exist, or the ones you find might not suit your research needs. For example, an existing instrument might be too long, it might fail to capture the specific aspects of the construct you are interested in, or it might use outdated language. When you encounter these challenges, it is time to create your own measure.

Creating a measure requires decisions that shape how your scale works. Three of these decisions are: how many questions to ask, what kind of questions to ask, and how to label the response options.

How Many Questions?

The designer of every measurement instrument must decide how many items to include. While it might seem simple to ask a single question, we have already seen the peril in that approach—remember the example that opened this chapter. To accurately measure a construct, researchers create measures with multiple items. We have seen this approach with the TIPI, which uses two items per personality trait, and the GAD-7, which uses seven items to assess anxiety.

Using multiple items allows researchers to capture different aspects of the construct being measured. In the case of anxiety, one person might experience physical symptoms like restlessness while another experiences psychological symptoms like excessive worry. A scale is more likely to capture these different experiences when it uses multiple items.

Determining the optimal number of items for a scale requires balancing competing priorities. If a scale has too few items, it may fail to capture important aspects of the construct. If, on the other hand, the scale has too many items, it becomes burdensome for participants to complete.

The scale you will create should have five to ten items. This range typically provides enough coverage to capture the essential aspects of the construct you are interested in while remaining practical to administer and validate. As you progress in your research career, you will encounter situations that call for both shorter and longer measures, but five to ten items is a solid starting point for scale development in this course.

What Kind of Questions?

Just as important as the number of items is the type of questions you use. This decision affects how well the scale will capture the construct being studied.

Table 4.2 lists the main question types used in measurement scales. Some are simple, like yes/no questions. Others use frequency scales that measure how often something occurs, or rating scales that capture degrees of intensity.

Among the options, Likert (pronounced "lick-ert") items have become especially popular in behavioral science. Developed by psychologist Rensis Likert in the 1930s, these items present statements that participants rate their agreement with. What makes Likert items especially useful is that nearly any question can be transformed into a Likert item while maintaining its core meaning. This means researchers can write multiple questions that maintain a consistent response format, making the scale easier to create and easier to administer.

Question Type Description Example
Yes/No Simple binary choice "Have you ever had a flu shot?"
Multiple Choice One answer from several options "What is your primary mode of transportation?"
[Car Bus Train Bicycle Walking]
Rating Scale Numerical rating of intensity "On a scale of 1-10, how satisfied are you with your job?"
Frequency Scale How often something occurs "How often do you exercise?"
[Never, Rarely, Sometimes, Often, Very Often]
Agreement Scale (Likert) Agreement with statements "Exercise helps me feel better"
[Strongly Disagree, Disagree, Agree, Strongly Agree]
Table 4.2. Types of questions commonly found in research studies.

When constructing your scale, we recommend starting with Likert items. They balance flexibility and simplicity, and they work well for measuring attitudes, beliefs, and experiences. As you draft items, try writing each one as a clear statement that participants can agree or disagree with.

Which Response Options?

After deciding which kind of questions to use, you must decide how participants will respond to them. This involves both the number of response options and how they are labeled.

Typically, each response option is assigned a number. We have encountered a few different response formats in this book. In the TIPI, for instance, the scale looked like this:

TIPI 7-point response scale showing: Strongly Disagree (1), Disagree (2), Somewhat Disagree (3), Neither Agree nor Disagree (4), Somewhat Agree (5), Agree (6), Strongly Agree (7)

In the GAD-7, by contrast, participants were given just four response options:

GAD-7 4-point response scale showing: Not at all (0), Several days (1), More than half the days (2), Nearly every day (3)

These numerical assignments allow researchers to calculate scores by adding or averaging responses across items. For instance, on the GAD-7, someone who responds "More than half the days" to four items and "Several days" to three items would receive a total score of 11.

When choosing response options for your scale, consider two factors. First, how precise does the measure need to be? More response options allow for finer distinctions but may overwhelm respondents. Second, do you want a neutral middle point? Five- and seven-point scales include a neutral option, while even-numbered scales force respondents to lean one way or the other. We recommend five response options for the Likert items you use in your scale.

Strategies for Writing Strong Scale Items

Beyond the decisions above, creating effective scale items requires paying attention to how each item is written.

First, each item should measure one thing. Double-barreled statements combine multiple concepts, and they make a mess of measurement. For instance, "I study regularly and get good grades" asks students about both study habits and academic performance. What if someone studies regularly but gets poor grades? How are they supposed to answer? Replace double-barreled statements with separate items: "I study regularly" and "I typically get good grades."

Second, use straightforward language to avoid double negatives. Instead of writing "I am not uncomfortable speaking in public," opt for clear statements like "I feel comfortable speaking in public" or "Public speaking makes me nervous."

Third, it is important to include reverse-scored items—statements written in the opposite direction of what you are measuring. Reverse-scored items can identify participants who are not reading carefully. For example, the TIPI measures extraversion with both "Extraverted, enthusiastic" and "Reserved, quiet." When calculating scores, responses to reverse-scored items are flipped before being combined with other items.
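
To see what "flipping" looks like in practice, here is a small sketch in Python. On a 7-point scale like the TIPI's, a response r becomes (7 + 1) - r, so 1 maps to 7, 2 maps to 6, and so on; the column names below are hypothetical stand-ins for the two TIPI extraversion items.

```python
import pandas as pd

def reverse_score(item: pd.Series, scale_max: int = 7) -> pd.Series:
    """Flip a reverse-worded item: on a 1-7 scale, 1 -> 7, 2 -> 6, ..., 7 -> 1."""
    return (scale_max + 1) - item

# Hypothetical ratings: a forward-worded and a reverse-worded extraversion item.
df = pd.DataFrame({"extraverted_enthusiastic": [7, 2, 5],
                   "reserved_quiet": [1, 6, 3]})

# Flip the reverse-scored item, then combine the pair into a trait score.
df["reserved_quiet_rev"] = reverse_score(df["reserved_quiet"])
df["extraversion"] = df[["extraverted_enthusiastic", "reserved_quiet_rev"]].mean(axis=1)
print(df)
```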

Chapter 13 describes more advanced issues relating to questionnaire design within the context of implementing online surveys. The information above will get you started creating your own measures, which we show how to do with the help of AI in the next activity.

📊

Research Activity 4.3: Designing Your Own Measure

You are ready to create your own measurement scale. To streamline the effort, you will use AI as your research assistant.

The first, and perhaps hardest, step is deciding what to measure. Don't worry about whether a scale already exists for your chosen topic. Just choose a psychological or behavioral characteristic that you are interested in and follow the steps below. You should read the example and tips for using AI before working on your measure.

Remember, the goal of this activity is for you to create a scale with about seven Likert items on a topic that interests you. In future chapters, you will have the option to validate the measure or use it in independent studies. You should be able to create a scale in 20 to 30 minutes.

Using AI to Generate Scale Items

Creating good items has traditionally taken several weeks, if not months, of brainstorming, consulting with colleagues, reading research, gathering data, and refining questions through multiple drafts. While this approach will always be valuable, artificial intelligence tools like ChatGPT, Claude, Google's Gemini, and others have opened new avenues for scale development. AI can generate initial items, explore different ways of asking about the construct, and identify elements researchers might have overlooked.

The key is to think of AI as a collaborator in scale development. Just as you might bounce ideas off a friend or an expert, you can use AI to generate items and critically evaluate them. AI does not replace your role in the process. But it provides a starting point you can refine based on your understanding of the construct and your knowledge of scale development. Let's look at an example.

Imagine you want to create a scale that measures academic stress. You could spend hours thinking of how stress manifests in academic settings or you could ask AI to generate some items and then use your judgment to evaluate which ones capture the construct well, which need revision, and what the AI might have missed.

Much of your success or failure hinges on knowing how to effectively "prompt" the AI; in other words, how to ask for what you want in a way that generates useful responses. Vague requests like "Give me some items for a measure of academic stress" will produce disappointing results. But a detailed prompt that specifies what you are measuring and how you want to measure it can generate surprisingly useful suggestions.

Here is an example prompt: "I am creating a scale to measure academic stress among college students. Generate 10 potential Likert items that capture different aspects of academic stress. Each item should be clear, specific, and follow best practices for scale development (e.g., no double-barreled questions, no double negatives). The items will use a 5-point response scale from Strongly Disagree to Strongly Agree."

When we put this prompt into Claude, it produced the items in Table 4.3. Take a minute to review the items. Jot down your thoughts. Are there elements of academic stress you think are missing? Are the questions flawed? Is there overlap that could be eliminated? After you think about the items, we will show you how to evaluate them with AI.

Item Strongly Disagree Disagree Neutral Agree Strongly Agree
The amount of coursework I have feels overwhelming
I struggle to complete all my assignments on time
I worry that my grades aren't good enough
The pressure to perform well in school interferes with my sleep
I struggle to understand the course material
Thinking about my future academic goals makes me anxious
I find myself constantly rushing to meet deadlines
My schoolwork prevents me from getting enough rest
My academic workload leaves me feeling exhausted
I feel stressed about managing my academic responsibilities
Table 4.3. Items to measure academic stress, produced by Claude in response to an effective prompt.

Evaluating Scale Items with AI

Once you have an initial set of items, you can use AI to evaluate them. For example, here is an effective prompt we gave Claude to evaluate the items in our academic stress scale:

"You are an expert in scale development. Please evaluate the items on this academic stress scale, looking for potential problems like double-barreled questions, unclear wording, redundancy, or any other problems that an expert would look for. For each item that needs improvement, explain the issue and suggest a revision."

In response, the AI pointed out several potential problems with the items and suggested improvements. For instance, it pointed out that the fourth item—The pressure to perform well in school interferes with my sleep—and the eighth item—My schoolwork prevents me from getting enough rest—were redundant. It suggested removing one of them. The AI also suggested that item ten—I feel stressed about managing my academic responsibilities—was too general and overlapped with other items. If you want to see how AI might analyze this scale, you can enter the prompt above into ChatGPT or Claude along with the ten-item scale and examine the full output. Later, you can repeat this step with your items.

Assembling Your Scale

It is your turn to create a measurement scale. Using the topic you chose, follow the process we demonstrated with the academic stress scale. Start by asking AI to generate about 10 items, being as specific in your prompt as we were in the example. Remember to specify that you want Likert items and the response scale the items should use.

Once you have an initial pool of items, use AI to evaluate them. Look at the AI's feedback but remember to trust your judgment too—you may notice issues the AI missed or disagree with some of its suggestions. Don't be afraid to work through multiple iterations, to change the prompts, and to explore your own approach.

Your final scale should include approximately seven items (plus or minus one or two). A seven-item scale typically provides enough coverage to measure your construct while remaining manageable for participants.

In the next section, we will explore how to evaluate your newly created scale for reliability and validity—key steps to make sure a measure works as intended.

📝

Research Portfolio

Portfolio Entry #10: Reporting on Your Measure

After creating your instrument, report on your work in your research portfolio. First, paste the items from your instrument into your portfolio along with the answer scale. In a few sentences, describe the construct you are measuring. Then, reflect on the process of creating the instrument. Write a few sentences describing the criteria you used to retain or reject items and how you decided which answer scale to use.

Next, describe the feedback you received from the AI. Share an example of how you used the AI's feedback to improve an item and explain your reasoning for accepting or rejecting specific AI suggestions. How did this process deepen your understanding of scale development?

Module 4.3

Reliability and Validity

Examine how researchers evaluate the quality of their measures

After creating a scale, researchers must test whether it works as intended.

Think about the academic stress scale we developed. Before using it in research, we need to answer two fundamental questions: Does it measure academic stress consistently? And does it measure academic stress accurately? The first question concerns the reliability of the measure; the second question concerns validity. Let's use a simple analogy to make both concepts clear.

Imagine you are measuring your temperature with a thermometer. A reliable thermometer will give you the same reading as long as your temperature has not changed. If you measure yourself twice in a row, you should get the same number. That's reliability (i.e., a consistent score). But the thermometer also needs to be valid—it needs to actually measure temperature, not humidity, not air pressure, and not how long you have held the thermometer under your tongue. That's validity (i.e., an accurate score). Psychological measures must be both reliable and valid.

Figure 4.3 illustrates these concepts. If you think of the bullseye as the true value of the characteristic being measured, then the location of the dots represents different combinations of reliability and validity. When the dots are clustered together but far from the bullseye, as in the first target, the observations are reliable but not valid. A measure like this yields similar scores each time it is used, but it does not accurately assess the construct.

In the second target, the observations are neither reliable nor valid. When a measure is inconsistent, it sometimes captures a construct accurately, but at other times it is off. Therefore, when a measure lacks reliability, it cannot be valid. Reliability, in other words, is a precursor to validity. You cannot have a measure that is valid but not reliable.

Finally, in the third target, the observations are both reliable and valid. A measure with both qualities consistently measures the construct it is supposed to measure.

Three target diagrams illustrating reliability and validity. Left target shows dots clustered together but off-center (reliable but not valid). Middle target shows dots scattered across the target (neither reliable nor valid). Right target shows dots clustered at the center (both reliable and valid).
Figure 4.3. An illustration showing the relationship between reliability and validity in measurements.

An Example: Establishing Reliability and Validity of the GAD-7 Anxiety Scale

Let's look at a real example of how to establish reliability and validity. When developing the GAD-7, researchers needed to show it could accurately identify people with clinical levels of anxiety. So, they collected data from nearly 3,000 people at primary care clinics in the United States (e.g., Spitzer et al., 2006). During the following week, about 1,000 of these people had a telephone interview with a mental health professional. Based on these interviews, each person was diagnosed as either having clinical anxiety or not. This diagnosis created a "gold standard" of accuracy to test against the GAD-7.

The researchers then examined how well GAD-7 scores predicted people's clinical diagnoses. They found that people diagnosed with clinical anxiety by a mental health professional typically scored 10 or higher on the GAD-7, while those without anxiety typically scored below 10. Thus, a score of 10 proved to be an optimal cutoff point for distinguishing between people who did and did not have clinical anxiety.

But the researchers did not stop with this one assessment of validity. They also examined whether GAD-7 scores correlated with other indicators of anxiety. For example, they found that higher scores on the GAD-7 predicted more disability days, meaning days when anxiety prevented people from carrying out their normal activities. Higher scores also predicted more visits with doctors and mental health professionals. People scoring higher on the GAD-7 reported greater difficulty with social relationships and more problems at work. These correlations provided additional evidence of validity because the measure predicted outcomes that a good measure of anxiety should predict.

Finally, the researchers assessed the measure's reliability. They contacted the participants one week after the initial study and asked them to complete the measure again. They found that people who had high scores on the first test also had high scores on the second test, and people who had low scores on the first test also had low scores the second time. This is called test-retest reliability. The researchers also examined how consistently each person responded to the different items in the measure. As we will learn next, this is a measure of the scale's internal reliability, or how well all the items in the instrument assess the same construct.

Types of Reliability

When testing if a measure works consistently, researchers examine different types of reliability. All reliability analyses rely on correlation, a statistical tool that measures how strongly two things are related.

Correlation coefficients range from -1 to +1, with +1 indicating a perfect positive relationship, 0 indicating no relationship, and -1 indicating a perfect negative relationship. Within the context of reliability, researchers generally look for positive correlations above .70.

Researchers use different techniques to verify that measures work consistently. Sometimes researchers want to know if all the items in a measure are assessing the same construct, as we just discussed with the GAD-7. In other situations, they want to know if two people rating the same behavior are assigning similar scores. There are many ways to check reliability, each suited to different types of measures and research situations (Figure 4.4).

Diagram showing four types of reliability: Internal Consistency (correlations among items within a scale), Split-Half (correlation between two halves of a long test), Inter-Rater (agreement between different observers), and Test-Retest (stability of scores over time).
Figure 4.4. Different measures of reliability are used to evaluate different kinds of data.

Internal Consistency

Let's start with the most common type of reliability: internal consistency. Internal consistency examines whether different items in the same scale correlate with each other. Theoretically, all the items should be correlated if they assess the same construct.

Figure 4.5 shows what internal consistency assesses. The curved, gray lines represent all possible correlations between the four items in a hypothetical scale. Meanwhile, Table 4.4 depicts the strength of each correlation. For example, Item 1 correlates with Item 2 at .65. Item 1 is correlated with Item 3 at .58, and with Item 4 at .60. In a reliable scale, the correlations should be in this ballpark or higher. If the correlations were below .30, it would suggest the items are not measuring the same thing.

Diagram showing four items connected by curved lines representing correlations between each pair of items, illustrating how internal consistency assesses correlations among all items in a scale.
Figure 4.5. Internal consistency assesses how strongly each item within a scale is correlated with every other item.
Item Item 1 Item 2 Item 3 Item 4
Item 1 1.00
Item 2 .65 1.00
Item 3 .58 .63 1.00
Item 4 .60 .57 .61 1.00
Table 4.4. The inter-item correlations should all be moderately strong and positive in a scale where all items are measuring the same construct.

While it is useful to examine the individual correlations between items, a more meaningful metric is a single number that captures how well all the items work together. This measure is called Cronbach's alpha.

Cronbach's alpha is a statistic that ranges from 0 to 1, with higher values indicating better internal consistency. Most researchers use the conventions in Table 4.5 to interpret Cronbach's alpha.

Cronbach's Alpha Interpretation
Above .90 Excellent
.80 to .90 Good
.70 to .80 Acceptable
Below .70 Needs improvement
Table 4.5. Cronbach's alpha scores above .70 indicate generally acceptable internal consistency (see George & Mallery, 2003).

In our example, the alpha is equal to .86, indicating good internal consistency. Thus, while the individual correlations between items were moderate (averaging .61), together the items created a reliable scale where we can be reasonably confident that each item measures the same underlying construct.
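
If you are curious where a number like .86 comes from, the sketch below computes the standardized form of Cronbach's alpha directly from the average inter-item correlation in Table 4.4. (The more common raw-score form of alpha is computed from item variances instead and can differ slightly; SPSS reports both.)

```python
import numpy as np

# The six inter-item correlations from Table 4.4.
inter_item_r = [0.65, 0.58, 0.60, 0.63, 0.57, 0.61]

k = 4                          # number of items in the scale
r_bar = np.mean(inter_item_r)  # average inter-item correlation (~.61)

# Standardized Cronbach's alpha: k * r_bar / (1 + (k - 1) * r_bar).
alpha = (k * r_bar) / (1 + (k - 1) * r_bar)
print(round(alpha, 2))  # 0.86 -- "good" by the conventions in Table 4.5
```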

Split-Half Reliability

When a scale or test is longer than about 30 items, researchers often examine split-half reliability rather than internal consistency. Split-half reliability involves dividing a measure into halves and checking how well both halves correlate.

Split-half reliability is especially common with long personality inventories. The Minnesota Multiphasic Personality Inventory (MMPI), for example, contains over 500 true/false items. Comparing people's scores on the two halves helps confirm that personality is being measured consistently and that things like the test's length or participant fatigue are not undermining the measure.
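
As an illustration, here is a minimal sketch of a split-half analysis in Python with simulated (not MMPI) data: the items are split into odd- and even-numbered halves, the two half-scores are correlated, and the Spearman-Brown formula steps the estimate up to the full test length, since each half is only half as long as the real test.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Simulate 200 people answering 40 true/false items driven by one trait.
trait = rng.normal(size=(200, 1))
items = (trait + rng.normal(size=(200, 40)) > 0).astype(int)

# Split the test into odd- and even-numbered items and sum each half.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the halves, then apply the Spearman-Brown correction.
r_half, _ = pearsonr(odd_half, even_half)
split_half = (2 * r_half) / (1 + r_half)
print(round(split_half, 2))
```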

Inter-Rater Reliability

Some measures in the behavioral sciences require trained observers to rate people's behavior rather than self-reports. For instance, researchers might assess children's social skills by having observers rate playground interactions. Or, they might evaluate leadership qualities by having experts score recorded presentations. In these situations, inter-rater reliability indicates whether different observers assign similar ratings to the same behaviors.

To verify inter-rater reliability, researchers calculate a correlation between the observers' ratings. Strong positive correlations (typically above .70) suggest that the observers are applying the rating criteria consistently.

Test-Retest Reliability

While each form of reliability discussed so far applies to a specific situation, researchers are interested in test-retest reliability with every measure they use. That is because test-retest reliability examines whether a measure gives consistent results over time.

To assess test-retest reliability, researchers give people the same measures on two different occasions. The time between measurements should be long enough that people do not simply remember and repeat their previous answers, but short enough that the underlying characteristic has not changed. For example, if we gave the academic stress scale to students twice during the same semester (avoiding exam periods when stress levels might change), we would expect scores to correlate. A correlation of .80 or higher between the time points suggests good test-retest reliability.
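
In code, assessing test-retest reliability amounts to correlating the two administrations. A tiny sketch with hypothetical scores:

```python
from scipy.stats import pearsonr

# Hypothetical academic stress totals for five students, two weeks apart.
time1 = [12, 25, 18, 30, 8]
time2 = [14, 24, 17, 28, 10]

# Test-retest reliability is the correlation between the two occasions;
# values of .80 or higher suggest good stability over time.
r, _ = pearsonr(time1, time2)
print(round(r, 2))
```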

Types of Validity

While reliability assesses if a scale measures something consistently, validity assesses if it measures what it is intended to measure. At first glance, validity might seem difficult to establish. How can researchers really know if questions about feeling nervous or worried measure anxiety? Or if questions about feeling overwhelmed really capture academic stress? The answer, as we saw with the GAD-7, lies in prediction.

If researchers have a valid measure of anxiety, then people who score high on the scale should show other signs of anxiety. In the study validating the GAD-7, this included a mental health diagnosis, visits to the doctor, and days when anxiety kept people from completing normal activities. Similarly, a valid measure of academic stress should predict relevant experiences like lower grades, more visits to academic support services, or higher dropout rates.

Just as researchers use correlations to establish reliability, they use correlations to test for validity. For instance, a valid anxiety measure should correlate with other indicators of anxiety. The stronger the correlations, the more confident the researcher can be that the scale measures what it is supposed to measure. There are two main approaches to validity: construct validity and criterion validity.

Construct Validity

Construct validity examines whether a scale correlates with other measures in expected and theoretically meaningful ways. This involves two kinds of predictions. First, a valid scale should correlate strongly with other established measures of the same construct. For example, the GAD-7 correlates strongly with other well-validated measures of anxiety, demonstrating what behavioral scientists call convergent validity—the scale converges, or aligns, with similar measures of the same underlying construct.

The second kind of prediction is that a scale should not correlate (or only be weakly correlated) with measures of unrelated constructs. This is called discriminant validity, a scale's ability to discriminate between what it is supposed to measure and other characteristics. For example, a valid measure of conscientiousness should show little correlation with agreeableness, even though both are socially desirable traits. Whereas conscientiousness reflects self-discipline and organization, agreeableness involves cooperation and concern for others. A person could be highly organized but competitive or very kind but disorganized. Similarly, a valid measure of self-esteem should show little correlation with narcissism. Even though both constructs involve positive self-regard, self-esteem reflects genuine self-worth while narcissism involves grandiosity and exploitation of others. These theoretically weak or absent correlations help confirm the scale is measuring its intended construct rather than unrelated constructs.
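
Here is a minimal sketch of this pattern-checking with simulated data standing in for a new scale, an established measure of the same construct, and an unrelated construct. Convergent validity shows up as a strong correlation, discriminant validity as a near-zero one.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 300

# Simulate a shared underlying trait plus measurement noise.
trait = rng.normal(size=n)
new_scale = trait + rng.normal(scale=0.5, size=n)    # new measure of the trait
established = trait + rng.normal(scale=0.5, size=n)  # established measure
unrelated = rng.normal(size=n)                       # unrelated construct

convergent, _ = pearsonr(new_scale, established)   # expect strong (~.80 here)
discriminant, _ = pearsonr(new_scale, unrelated)   # expect near zero
print(round(convergent, 2), round(discriminant, 2))
```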

Criterion Validity

While construct validity assesses theoretically meaningful relationships, criterion validity focuses on predicting real-world outcomes. When the predictions involve current outcomes, researchers call it concurrent validity; when the predictions involve future outcomes, they call it predictive validity.

When developing the GAD-7, researchers examined whether people's scores predicted a clinical diagnosis of anxiety. They also looked at whether GAD-7 scores predicted days missed at work or school and healthcare visits (e.g., Spitzer et al., 2006).

A strong measure has both criterion and construct validity. It correlates with related measures in meaningful ways (construct validity) and it predicts relevant, real-world outcomes (criterion validity).

Content and Face Validity: Exceptions to Prediction

Not all types of validity rely on prediction. Behavioral scientists also assess both content validity and face validity, which depend on an examination of the scale items themselves rather than their relationships with other measures.

Content validity asks whether a scale covers all important aspects of the construct it is supposed to measure. For example, there are several elements to anxiety. Some people experience anxiety as physical symptoms while others experience it more as subjective feelings. A scale that only measures one of these elements would have poor content validity. The GAD-7, on the other hand, has good content validity because its items were carefully chosen to capture different manifestations of anxiety, from psychological symptoms like worry to physical symptoms like restlessness.

Face validity simply refers to whether the items appear to measure what they claim to measure. While this might seem trivial from a scientific standpoint, it has practical implications. Participants might not take the scale seriously if it does not look like a serious measurement tool. Face validity is also important when consulting experts. For example, an expert in the treatment of anxiety might immediately see that a scale is missing key elements, in which case it would have low face validity.

Overall, developing a reliable and valid measure requires careful attention to all the components introduced above. For research to be valuable, researchers must ensure their scale produces consistent results (reliability) while actually measuring the intended construct (validity), and only measures that meet these high standards should be trusted for research.

Module 4.4

Scales of Measurement

Understand the different types of measurement scales and their properties

Think about the questions we have used in research so far. When collecting demographic information, we asked about gender (male, female, non-binary, other) and ethnicity—categories without any numeric values. With the GAD-7, people rated how often they experience anxiety symptoms on a scale ranging from "Not at all" to "Nearly every day." When we asked about age, people gave answers on a continuous scale (0-100). These examples show how questions exist on different measurement scales. Understanding these scales is important for analyzing the data from any behavioral study.

There are four types of measurement scales in behavioral science. You can remember them with the acronym "NOIR": Nominal, Ordinal, Interval, and Ratio. Each scale has different properties that determine the comparisons and calculations that can be made with the data. Table 4.6 provides an overview.

Scale Properties Examples from Our Projects Other Examples
Nominal Categories only Gender identity (male, female, non-binary) Major field of study
Ordinal Order matters Educational attainment (High school degree, Associates, Bachelors, Masters, PhD) Course grades (A, B, C, D, F)
Interval Equal intervals Total GAD-7 score (0-21) WAIS IQ scores
Ratio True zero point Response time in milliseconds Age, Height
Table 4.6. Types of measurement scales.

Nominal Measurement

The most basic type of measurement puts things into categories. When people select their major field of study from a list or indicate their gender identity, it is an example of nominal measurement. The numbers or labels assigned to these categories have no mathematical meaning—there is no sense in which "male = 1" is less than "female = 2" or in which "psychology = 1" is less than "sociology = 2." All a nominal measurement says is whether things are the same or different.

Ordinal Measurement

Most questionnaire items use ordinal measurement. Any time participants rate their agreement from "Strongly Disagree" to "Strongly Agree," or rate frequency from "Never" to "Always," they are using ordinal scales. These Likert-type items are ordinal because while there's a clear ordering to the responses, researchers cannot assume the differences between responses are equal.

Consider a question from the TIPI asking people to rate their agreement with the statement "I am outgoing, sociable" using options from "Strongly Disagree" to "Strongly Agree." While there is a clear ordering to the answers ("Agree" indicates more sociability than "Disagree"), we cannot assume the differences between points are equal. The psychological distance between "Strongly Disagree" and "Disagree" might be different than the distance between "Disagree" and "Neutral," or between "Agree" and "Strongly Agree." The only thing we know for sure is that each response represents more agreement than the last.

Interval Measurement

A step above ordinal measurement, interval measurement represents a level of measurement where the differences between values are meaningful and consistent. With interval scales, not only can we rank order things (like with ordinal scales), but we can also say the distance between any two consecutive points on the scale is the same. This consistency allows us to perform mathematical operations like addition and subtraction. It also allows us to calculate meaningful averages.

A classic example of interval measurement in psychology is the Wechsler Adult Intelligence Scale (WAIS), which assesses IQ. With WAIS IQ scores, we can make meaningful comparisons between people: someone with an IQ of 130 is 30 points higher than someone with an IQ of 100, and this 30-point difference is the same as the difference between scores of 85 and 115. Again, with interval measurement, the intervals between scores are equal and meaningful.

Interval measurement allows behavioral scientists to perform mathematical operations and statistical analyses. We can calculate averages (the mean IQ is 100), examine how far someone deviates from the average, and compare groups to see if they differ significantly in intelligence.

While interval measurements allow behavioral scientists to make comparisons, the scale lacks a true zero point. An IQ score of zero doesn't mean a complete absence of intelligence; in fact, the WAIS doesn't even have a score of zero. The lowest possible score is around 40, and even this doesn't represent "no intelligence" but rather the lower bound of what the test can measure. The lack of a true zero point is what distinguishes interval scales from the final type of measurement: ratio scales.

Ratio Measurement

Ratio scales have a true zero point, which allows researchers to know not only that Person A is 5 points higher on a scale than Person B, but also that one person's score is two or three times higher than another person's score. For example, someone who is eight feet tall is twice as tall as someone who is four feet tall. Height is a ratio scale because it has a true zero point. IQ tests, however, lack a true zero point: there is no such thing as an IQ of zero. Therefore, while it is possible to say that someone with an IQ of 100 has a higher score than someone with a score of 50, we cannot say the person is twice as intelligent.

Most measurements in the behavioral sciences exist on an interval scale, not ratio. However, because interval and ratio scales allow the same statistical tests, researchers seldom distinguish between these two scales.

Why These Scale Types Matter in Research

Understanding measurement scales is important because it affects how you can analyze and visualize your findings in SPSS and other statistical software.

Think about the data we have seen so far. When we measured anxiety using the GAD-7, the result was numerical scores that ranged from 0 to 21. But when we measured gender identity, the result was data indicating different categories. These measurements are handled differently when analyzing data.

Imagine you are trying to show patterns in your research findings. For interval or ratio data, like GAD-7 scores or age, histograms work beautifully. Remember the histogram we created to show the distribution of anxiety scores (see Figure 4.2)? It revealed that most people had relatively low anxiety while a few people reported high levels of anxiety. This pattern can only be seen with interval or ratio data.

But what if you tried to make a histogram of gender identity or college majors? It would not make sense because nominal categories have no numerical order. Instead, bar charts are used to display nominal and ordinal data. A bar chart showing the percentage of participants in different majors, for instance, clearly displays how a sample is distributed across categories.

Even more important than visualization is choosing the right statistical techniques. When you want to understand relationships between variables—something we will explore in the next chapter—the measurement scale determines which statistical tests you can perform. For example, if you are curious about whether anxiety (interval) relates to age (ratio), you can use a correlation. But if you want to know whether gender (nominal) relates to college major (nominal), you need a different approach called a chi-square analysis.
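
A brief sketch, with made-up data, of how the measurement scale steers the choice of test: a Pearson correlation for two numeric (interval or ratio) variables, and a chi-square test of independence for two nominal variables.

```python
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

# Interval/ratio pair (anxiety totals and age): Pearson correlation.
anxiety = [3, 8, 12, 5, 15, 7]
age = [19, 22, 30, 21, 35, 24]
r, p_r = pearsonr(anxiety, age)

# Nominal pair (gender and major): chi-square on a contingency table.
df = pd.DataFrame({"gender": ["M", "F", "F", "M", "F", "M"],
                   "major": ["Psych", "Psych", "Bio", "Bio", "Psych", "Bio"]})
table = pd.crosstab(df["gender"], df["major"])
chi2, p_chi, dof, expected = chi2_contingency(table)

print(f"correlation r = {r:.2f}; chi-square = {chi2:.2f}")
```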

Think of measurement scales like choosing the right tool for a job. Different types of data require different analytical tools. In future chapters, we will learn exactly which statistical techniques work best for different combinations of measurement scales. For now, the key is understanding that how variables are measured shapes how they can be analyzed.

📊

Research Activity 4.4: Classify and Work with Demographic Variables in SPSS

Virtually every research study includes demographic questions. These questions help researchers understand who participated in the research and can reveal interesting patterns in the data. But different demographic characteristics are measured on different scales, which affects how the data can be analyzed. Let's explore this using the dataset from earlier in this chapter.

For this exercise, we will work with the dataset we used to examine anxiety scores, but this time we will focus on the demographic information collected from participants. Open the "RITC_DATA_CH04_Measurement.sav" data file. You will notice that along with the clinical items, the file contains several demographic variables such as age, gender, and so on. Your task is to identify the measurement scale for each demographic variable.

In SPSS, variables can be classified as Nominal, Ordinal, or Scale (which combines interval and ratio measurement). Look at each demographic question and identify which scale it exists on. For example, gender is nominal—the categories have no inherent order. Education level is ordinal—"Graduate degree" represents more education than "Bachelor's degree." Age exists on a scale—you can meaningfully calculate averages and differences between ages. For each demographic variable, go into the SPSS variable view and label the variables according to their appropriate measurement scale. The instructional video will show you how to check and change measurement scales in SPSS or you can follow HOW TO Box 4.2.

Box 4.2

How to Change Measurement Scales in SPSS

These steps allow you to properly classify variables by their level of measurement.

Open the dataset

  • Open the "RITC_DATA_CH04_Measurement.sav" file.

Switch to Variable View

  • Click on the "Variable View" tab at the bottom of the data window
  • This displays all the properties of your variables in rows

Locate the Measure Column

  • Scroll horizontally and find the "Measure" column
  • This column displays the current measurement level of each variable

Change the Measurement Level

  • Click on the cell in the "Measure" column for the variable you want to modify
  • A dropdown arrow will appear when you click on the cell
  • Click the dropdown arrow and choose the appropriate level of measurement
    • Choose "Nominal" for categorical variables with no inherent order (e.g., gender)
    • Choose "Ordinal" for categorical variables with a meaningful order (e.g., education level)
    • Choose "Scale" for continuous variables with equal intervals (e.g., age)

Save your Changes

  • Save your dataset by clicking on "File" → "Save"
  • Your variables are now properly classified and ready for analysis
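
For readers who also work in Python, pandas offers a loose analogue of SPSS's Measure column: plain categorical dtypes behave like nominal variables, ordered categoricals like ordinal ones, and numeric dtypes like "Scale." A hypothetical sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "non-binary", "female"],
    "education": ["High school", "Bachelors", "Masters", "Bachelors"],
    "age": [24, 31, 27, 45],
})

# Nominal: categories with no inherent order.
df["gender"] = df["gender"].astype("category")

# Ordinal: categories with a meaningful order.
levels = ["High school", "Associates", "Bachelors", "Masters", "PhD"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)

# Scale (interval/ratio): keep a numeric dtype; means and differences are meaningful.
print(df.dtypes)
print(df["age"].mean())
```
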
📝

Research Portfolio

Portfolio Entry #11: Working with Measurement Scales in SPSS

  1. In your portfolio, write down the demographic variables that were in the SPSS dataset and the measurement scale you assigned to each one.
  2. Were there any demographic variables for which you weren't sure which measurement scale to assign?
  3. Take a screenshot of the SPSS measurement columns that you worked with. Paste it into the portfolio.

Summary

In this chapter, we learned about the foundations of measurement in behavioral research. We began with an example showing how a poorly designed poll question about racial attitudes led to misinterpreted headlines and the spread of misinformation. This case illustrated three core principles of measurement: complex attitudes require multiple questions, how questions are asked matters as much as what is asked, and good measures must be tested for reliability and validity.

The activities throughout the chapter reinforced these principles. First, we learned how questionnaire instruments transform abstract psychological constructs into measurable variables. Through hands-on activities with clinical scales like the GAD-7, we calculated anxiety scores and visualized their distribution, seeing firsthand how researchers assess psychological phenomena through multiple items.

Next, we explored strategies for finding existing measures in specialized databases and creating new ones. You considered questions like how many items to include and which response format to use, and you practiced using AI to develop new measures.

The chapter emphasized two essential criteria for evaluating measurement tools: reliability (consistency) and validity (accuracy). Like a thermometer that gives consistent readings and measures temperature rather than humidity, good psychological measures must consistently assess what they claim to measure.

Finally, we examined the four scales of measurement—Nominal, Ordinal, Interval, and Ratio (NOIR)—understanding how measurement type determines appropriate analyses.

Going forward, remember that measurement is both art and science. It requires creativity to capture human experience through carefully crafted questions, and scientific rigor to make sure those questions work as intended. As we move into examining relationships between variables in the next chapter, keep in mind that the quality of measurements directly determines the quality of a study's conclusions. Good science begins with good measurement.

Frequently Asked Questions

What is the difference between reliability and validity in measurement?

Reliability refers to the consistency of a measure: whether it produces the same results under consistent conditions. Validity refers to whether a measure accurately captures what it is intended to measure. A measure must be reliable to be valid, but reliability alone does not guarantee validity. Think of it like a thermometer: reliability means it gives the same reading consistently, while validity means it actually measures temperature and not something else.

What are the four scales of measurement (NOIR)?

NOIR stands for the four scales of measurement: Nominal (categories only, like gender), Ordinal (ordered categories, like education level), Interval (equal intervals with no true zero, like IQ scores), and Ratio (equal intervals with a true zero, like height or age). Understanding these scales is important because they determine which statistical analyses are appropriate for your data.

How can AI tools help with creating measurement scales?

AI tools like ChatGPT and Claude can assist with scale development by generating initial items, exploring different ways to ask about a construct, and identifying elements researchers might have overlooked. The key is using detailed prompts that specify what you're measuring and how you want to measure it. AI serves as a collaborator in the process, but researchers must still critically evaluate and refine the generated items.

What is Cronbach's alpha and what values are acceptable?

Cronbach's alpha is a statistic that measures internal consistency: how well items in a scale correlate with each other. It ranges from 0 to 1, with higher values indicating better internal consistency. Generally, values above .90 are excellent, .80-.90 are good, .70-.80 are acceptable, and below .70 indicates the scale needs improvement.

Key Takeaways

  • Scale instruments are collections of carefully designed questions that work together to measure psychological constructs, transforming abstract experiences into quantifiable data.
  • Finding existing measures in databases like PsyToolkit and PsycTests is often preferable to creating new ones, as these measures have already been validated.
  • When creating new measures, use multiple items, follow best practices for item writing (avoid double-barreled questions and double negatives), and consider using AI as a collaborator.
  • Reliability refers to the consistency of a measure—including internal consistency, split-half, inter-rater, and test-retest reliability.
  • Validity refers to whether a measure accurately captures the intended construct—including construct validity (convergent and discriminant) and criterion validity (concurrent and predictive).
  • The four scales of measurement (NOIR: Nominal, Ordinal, Interval, Ratio) have different properties that determine which statistical analyses are appropriate.
  • Good measurement is fundamental to all behavioral research—without reliable and valid measures, research findings cannot be trusted.