Research in the Cloud Textbook, forthcoming with Cambridge University Press

Reliability and Validity: What Thomas Edison's Famous Failure Teaches Us About Measurement

Aaron Moss, PhD | April 16, 2026 | 7 min read


In this post:

Why Thomas Edison’s 1921 employment test became a public controversy and what it reveals about a fundamental mistake still made today

The difference between reliability and validity in measurement

A modern example of the same measurement error, drawn from a widely misread 2023 opinion poll on race

What good measurement really looks like, using a clinically validated anxiety scale as a model

Three core principles that separate rigorous measurement from guesswork and how to apply them in behavioral research

In 1921, Thomas Edison created an employment test. Looking to replace aging executives with new graduates, Edison compiled a list of questions he thought people joining his organization ought to be able to answer. Among the items were:

“Who invented logarithms?”
“How is leather tanned?”
“Name two locks on the Panama Canal”
“What is the first line of the Aeneid?”
“What is the lightest wood?”
“How fast does sound travel per foot per second?”, and
“What is the weight of air in a room 20 × 30 × 10?”

If those questions have you scratching your head, you’re not alone. The test was a fiasco.

Of the 700 or so people who took Edison’s test, just 4% passed. When a disgruntled applicant leaked questions to The New York Times, the test became a controversy. Reporters asked other public intellectuals, including Nikola Tesla and Albert Einstein, for their thoughts. Einstein reportedly said: “The value of a college education is not the learning of many facts but the training of the mind to think.” Harper’s Magazine went further and accused Edison of taking pleasure in exposing the ignorance of other people.

Although harsh, the criticism missed a deeper problem: Mr. Edison didn’t understand measurement. Edison claimed his test measured “curiosity” and “executive quality.” However, there’s no evidence he evaluated the test’s ability to predict job performance, leadership ability, or innovative thinking—the very qualities that made Edison himself successful.

Within the social and behavioral sciences, good measurement requires both reliability and validity. Measures that lack these qualities lead to mistakes like the one Edison made, and like the one in the modern example that follows.

Reliability and Validity: The Two Pillars of Measurement

Reliability refers to the consistency of a measure. A reliable instrument produces the same results under the same conditions. If you measure something twice and get wildly different answers, the measure isn’t reliable.

Validity refers to the accuracy of a measure. A valid instrument actually measures what it claims to measure. A scale might reliably give you the same number every time, but if that number doesn’t reflect your actual weight, it’s not valid.
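
To make the bathroom-scale idea concrete, here is a minimal Python sketch. Every number in it is invented for illustration, and the names (TRUE_WEIGHT, readings, summarize) are hypothetical. It treats the spread of repeated readings as a rough stand-in for consistency, and the gap between the average reading and the true weight as a rough stand-in for accuracy.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150.0  # hypothetical true weight in pounds (illustrative only)

# Hypothetical repeated readings from three bathroom scales (made-up numbers).
readings = {
    "consistent_and_accurate": [150.1, 149.9, 150.0, 150.2, 149.8],
    "consistent_but_off": [157.0, 157.1, 156.9, 157.0, 157.0],
    "inconsistent": [143.0, 158.0, 149.0, 161.0, 139.0],
}

def summarize(values, true_value):
    """Return (spread, bias).

    spread: standard deviation of the repeated readings (lower = more consistent)
    bias:   average reading minus the true value (closer to 0 = more accurate)
    """
    return pstdev(values), mean(values) - true_value

summary = {name: summarize(vals, TRUE_WEIGHT) for name, vals in readings.items()}

for name, (spread, bias) in summary.items():
    print(f"{name}: spread={spread:.2f}, bias={bias:+.2f}")
```

The second scale is the one described above: its readings barely vary, yet every one of them is about seven pounds too high. The third scale shows the opposite failure, with readings that swing so widely that no single one can be trusted.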

The Measurement Problem Edison Didn’t See

Thomas Edison’s mistake wasn’t asking unusual questions. It was assuming his questions automatically measured the things he was interested in. This assumption still trips people up, over a century later.

Consider an example from February 2023. The polling firm Rasmussen Reports released a survey showing that only 53% of Black Americans agreed with the statement “It’s okay to be White.” The poll spread rapidly, fueling debate about racial divisions in America. A popular cartoonist also used the poll to make inflammatory statements that ended his career.

But there was a problem: the survey question didn’t measure what it claimed to measure.

When researchers conducted a follow-up study, they discovered that people who disagreed with the statement weren’t expressing negative attitudes toward White people. Instead, they were confused by the question’s meaning or responding to its use as a political slogan. When the question was rewritten to be clearer, the supposed racial divisions disappeared. As one respondent wrote, “Color should not matter in this day and age—we are all the same inside.”

The same error that haunted Edison in 1921 plagued pollsters in 2023: assuming a question measures what you think it measures without actually testing that assumption.

What Good Measurement Looks Like: Reliability and Validity

So, how do behavioral scientists avoid measurement mistakes? The answer lies in the two concepts introduced earlier, reliability and validity, which together form the foundation of measurement.

Reliability: Consistency of Measurement

Perhaps the easiest way to understand the concepts of reliability and validity is to think of a thermometer.

A reliable thermometer gives you the same reading every time you use it, as long as your temperature hasn’t changed. If you measure yourself twice in a row, you should get the same number.

Validity: Accuracy of Measurement

But the thermometer also needs to be valid. That is, it needs to actually measure temperature, not humidity, not air pressure, and not something else. Validity is about accuracy.

Edison’s test failed on both counts. He never checked whether the same applicants would earn similar scores if tested again (reliability), and he never verified that scores on his questions actually reflected “curiosity” or “executive quality” (validity).

Modern instruments used in social and behavioral research take a different approach. Consider the GAD-7, a widely used measure of anxiety. When researchers developed this scale, they didn’t just write questions and assume they worked. They collected data from nearly 3,000 people and had mental health professionals conduct independent diagnostic interviews. They found that people diagnosed with clinical anxiety by professionals consistently scored above a certain threshold on the test, providing evidence that the measure captured what it claimed to capture.
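
One common way to summarize this kind of check is to compare a score cutoff against the independent diagnosis and count how often the two agree. The sketch below is an illustration under stated assumptions, not the original researchers' analysis: the records are made up, the cutoff of 10 is used only as an example threshold, and screen_accuracy is a hypothetical helper.

```python
# Hypothetical (made-up) data: each pair is (total questionnaire score,
# whether an independent clinical interview diagnosed anxiety).
records = [
    (15, True), (12, True), (18, True), (9, True), (11, True),
    (3, False), (6, False), (10, False), (2, False), (7, False),
]

CUTOFF = 10  # illustrative threshold: scores at or above it count as a positive screen

def screen_accuracy(data, cutoff):
    """Return (sensitivity, specificity) of the rule `score >= cutoff`,
    judged against the interview diagnosis."""
    true_pos = sum(1 for score, dx in data if dx and score >= cutoff)
    false_neg = sum(1 for score, dx in data if dx and score < cutoff)
    true_neg = sum(1 for score, dx in data if not dx and score < cutoff)
    false_pos = sum(1 for score, dx in data if not dx and score >= cutoff)
    sensitivity = true_pos / (true_pos + false_neg)  # diagnosed people the screen catches
    specificity = true_neg / (true_neg + false_pos)  # undiagnosed people the screen clears
    return sensitivity, specificity

sensitivity, specificity = screen_accuracy(records, CUTOFF)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

If both numbers are high, scores above the cutoff line up with what clinicians independently concluded, which is the kind of evidence described above.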

But the researchers didn’t stop there. They also examined whether GAD-7 scores predicted real-world outcomes that a good anxiety measure should predict: days missed at work, doctor visits, relationship difficulties, and problems at work. These correlations provided additional evidence of validity.

Finally, the researchers assessed reliability by giving the same people the test twice, one week apart. People who scored high the first time also scored high the second time. People who scored low initially stayed low. The measure was reliable.
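
A test-retest check like this is often summarized with a correlation between the two administrations. The sketch below uses a Pearson correlation on invented scores; the statistic is an assumption chosen for illustration (the original researchers may have used a different one), and pearson_r, time_1, and time_2 are hypothetical names.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = mean(x), mean(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(var_x * var_y)

# Hypothetical total scores for the same eight people, tested one week apart.
time_1 = [3, 15, 8, 20, 5, 11, 1, 17]
time_2 = [4, 14, 9, 19, 6, 10, 2, 16]

retest_r = pearson_r(time_1, time_2)
print(f"test-retest correlation: {retest_r:.3f}")
```

A value near 1 means people kept roughly the same rank order across the two sessions. A value near 0 would mean the first score tells you almost nothing about the second.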

Three Principles That Would Have Saved Edison

Chapter 4 of Research in the Cloud: An Introduction to Modern Methods in Behavioral Science distills the lessons of measurement into three principles.

1. Single questions rarely capture complex constructs accurately.

Edison tried to measure “curiosity” and “executive quality” with esoteric trivia questions. Modern researchers know that complex psychological characteristics require multiple items that capture different facets of the construct. The GAD-7 uses seven questions because anxiety manifests in different ways: some people experience physical symptoms like restlessness, others experience psychological symptoms like excessive worry. A single question can’t capture this complexity.
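
As a rough sketch of how a multi-item scale works in practice, the Python below sums seven item responses into a total score and computes Cronbach's alpha, one common index of how well a set of items hangs together. The responses are invented, the 0-to-3 coding of each item reflects how the GAD-7 is commonly scored, and the choice of alpha is an assumption made for illustration, since the post does not name a specific statistic. The function names are hypothetical.

```python
from statistics import pvariance

# Hypothetical responses: one row per respondent, seven items each rated 0-3.
responses = [
    [0, 1, 0, 0, 1, 0, 0],
    [2, 2, 3, 2, 1, 2, 2],
    [1, 1, 1, 0, 1, 1, 0],
    [3, 3, 2, 3, 3, 2, 3],
    [0, 0, 1, 0, 0, 0, 1],
    [2, 1, 2, 2, 2, 1, 2],
]

def total_scores(rows):
    """Sum each respondent's item responses into one total (0-21 for seven 0-3 items)."""
    return [sum(row) for row in rows]

def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])                 # number of items
    items = list(zip(*rows))         # regroup responses by item instead of by person
    item_variance_sum = sum(pvariance(item) for item in items)
    total_variance = pvariance(total_scores(rows))
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

totals = total_scores(responses)
alpha = cronbach_alpha(responses)
print(totals)
print(f"alpha={alpha:.2f}")
```

Because the items in this made-up data rise and fall together, alpha comes out high. Items that measured unrelated things would pull it toward zero, a sign that summing them into one score makes little sense.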

2. How you ask a question matters as much as what you ask.

Most pollsters know that question wording matters. The “It’s okay to be White” poll failed because the question’s wording triggered confusion and political associations rather than measuring actual attitudes. Question design is both an art and a science—it requires creativity to capture human experience through carefully crafted language, and rigor to ensure those questions work as intended.

3. Good measures must be systematically tested for reliability and validity.

Edison never tested whether his questions consistently measured curiosity or leadership potential (reliability) or whether they predicted the outcomes he cared about (validity). He simply assumed his intuitions were correct. A century of behavioral science has taught us that intuition isn’t enough.

Mastering Measurement in Behavioral Research

The measurement chapter in Research in the Cloud doesn’t just explain the concepts above; it teaches you how to apply them. Through hands-on activities, you’ll work with real instruments like the GAD-7, calculating anxiety scores and visualizing their distribution across hundreds of participants. You’ll also learn to find existing measures in specialized databases and create your own measurement scales using AI tools to generate and refine items. Finally, you’ll test reliability and validity using the same techniques researchers use, and you’ll learn the four scales of measurement (nominal, ordinal, interval, and ratio) and why they matter for data analysis.

By the end of the chapter, you’ll have the skills to evaluate whether any measure, from a clinical anxiety scale to an employee satisfaction survey, actually captures what it claims to capture. And in that regard, you will be one step ahead of Thomas Edison. Chapter 4 is free to read.

This post is part of a series exploring the chapters of Research in the Cloud: An Introduction to Modern Methods in Behavioral Research by Aaron Moss, Jonathan Robinson, and Leib Litman.
