Validity in research: A primer, with a focus on clinical research
This blog post focuses on research and how we can judge whether the results of a research study are trustworthy. I talk a lot on this site and in my work about clinical studies that are “well-designed.” A well-designed study is one that can withstand multiple threats to internal and external validity. This post, then, is meant to be a reference for interested readers who want to understand what I mean when I say a study is or is not well-designed.
There are two main types of validity: internal and external. Internal validity refers to the degree to which a study is methodologically sound. Can you believe the actual results of the study? External validity refers to how well your study generalizes outside of the research lab. Does what happened in your study actually happen in the real world?
Internal validity can be compromised by a number of things. I won’t review them all here, but some of the most common threats in clinical research involve Maturation, Statistical Regression, Selection, Experimental Mortality, Testing, Instrumentation, and Design Contamination.
Maturation: Occurs when changes in the outcome variable are due to normal developmental processes over time rather than to the treatment. (The outcome variable is the change you’re interested in; for example, in a study exploring the effects of a new weightlifting program on muscle definition, muscle size may be the outcome.) Lots of variables just change on their own over time. The easiest way to control for maturation is to have at least two groups of participants in your study and to have the groups be age-matched so that they mature at about the same rate. It also helps to have larger sample sizes (more participants) because individual differences in maturation rate tend to average out. There are other ways to control for this without having two or more groups, for example by measuring maturation rate and then controlling for it statistically, but doing this is far more difficult than it sounds and involves very complex math that even most researchers have not studied.
Statistical Regression: Usually, to qualify for a treatment study, a person has to score high on initial symptom scales to show that they actually have symptoms that need to be treated. So, researchers usually look for people who score toward the extreme end of a scale, because by definition these are the people who need treatment. Statistical regression refers to the idea that, even without any treatment, people who initially score very high on a scale tend to move closer to the “average” score the next time they take it. This is called “regression to the mean.” So, if the average score on a test is 100, and I score 150 the first time I take it, my score is far more likely to go down (toward 100) than it is to go up (above 150). The same works on the other side. If I score 60, my score is more likely to go up than down the next time I take the test. In treatment studies, then, this can be a problem because participants can look like they are getting better just because of regression to the mean. In other words, if my depression score is in the extremely high range at baseline (and it usually would be, to qualify for the study), it’s more likely to go down (showing I’m better) than stay the same or go up (showing I’m worse). As with many other validity threats, statistical regression can be controlled by having multiple, matched groups. It can also be controlled by using reliable tests (tests where you already know how much people’s scores vary from one administration to the next). Then, you can do some fancy math to see if your participants changed more than people usually change on that test.
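To make regression to the mean concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name, the idea that an observed score is a stable "true" level plus random noise, the cutoff of 130, and the specific means and spreads. Nobody in the simulation receives any treatment, yet the people selected for scoring high the first time score lower, on average, the second time.

```python
import random
import statistics

def simulate_regression_to_mean(n_people=10_000, cutoff=130, seed=0):
    """Simulate two administrations of a test with NO treatment in between.

    Assumption for illustration: observed score = stable true level + noise.
    """
    rng = random.Random(seed)
    first_scores, second_scores = [], []
    for _ in range(n_people):
        true_level = rng.gauss(100, 10)         # stable underlying level
        first = true_level + rng.gauss(0, 10)   # baseline measurement
        second = true_level + rng.gauss(0, 10)  # retest with fresh noise
        if first >= cutoff:                     # only high scorers "qualify"
            first_scores.append(first)
            second_scores.append(second)
    return (statistics.mean(first_scores),
            statistics.mean(second_scores),
            len(first_scores))

baseline_mean, retest_mean, n_selected = simulate_regression_to_mean()
print(f"Selected {n_selected} high scorers")
print(f"Mean at baseline: {baseline_mean:.1f}")
print(f"Mean at retest:   {retest_mean:.1f}")
```

The retest mean lands between the baseline mean and the population average of 100, which is exactly the pattern that can be mistaken for improvement in a single-group treatment study.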
Selection: If you have more than one group, you also have to specify how participants get selected into each group, and this process has to be completely fair. All participants must have an equal chance of being placed in one group or the other. Otherwise, a researcher could manipulate a study by, for example, placing all of the really sick people in one group and all the really healthy people in another.
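One common way to give every participant an equal chance of landing in either group is to let a random shuffle decide. The sketch below is only an illustration: the function name `randomize`, the group labels, and the participant IDs are all hypothetical, and real trials typically use more elaborate schemes than this.

```python
import random
from collections import Counter

def randomize(participant_ids, groups=("treatment", "control"), seed=None):
    """Shuffle participants, then deal them out to the groups in turn.

    Every participant has the same chance of ending up in each group,
    and group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    assignment = {}
    for position, pid in enumerate(shuffled):
        assignment[pid] = groups[position % len(groups)]
    return assignment

# Hypothetical study with 20 participants numbered 1 through 20.
assignment = randomize(range(1, 21), seed=42)
group_sizes = Counter(assignment.values())
print(group_sizes["treatment"], group_sizes["control"])
```

Because the shuffle, not the researcher, decides who goes where, there is no opportunity to steer the sickest or healthiest people into a particular group.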
Experimental mortality: This refers to how many participants are lost across groups (in this case, lost usually means people dropped out of the study on their own). If this rate isn’t the same across groups, then it can cause difficulty in interpreting results and can even threaten the validity of results if not addressed properly.
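As a rough sketch of how unequal dropout might be checked, the snippet below computes the dropout rate in each group and the gap between them. The function name and the enrollment and completion counts are made up for illustration; how large a gap counts as a problem is a judgment call that this code does not settle.

```python
def attrition_rates(enrolled, completed):
    """Return the fraction of enrolled participants who dropped out, per group."""
    rates = {}
    for group, n_enrolled in enrolled.items():
        dropped = n_enrolled - completed[group]
        rates[group] = dropped / n_enrolled
    return rates

# Hypothetical counts, not data from any real study.
enrolled = {"treatment": 50, "control": 50}
completed = {"treatment": 45, "control": 35}

rates = attrition_rates(enrolled, completed)
differential = abs(rates["treatment"] - rates["control"])

for group, rate in rates.items():
    print(f"{group}: {rate:.0%} dropped out")
print(f"Difference between groups: {differential:.0%}")
```

In this made-up example the control group loses three times as many people as the treatment group, so the participants who remain in each group may no longer be comparable.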
Testing: For many types of tests, especially those used in psychological intervention studies, just taking a test once can affect how you do on it the second time. For example, taking the test at baseline can make you pay attention to things (e.g. symptoms) you never paid attention to before taking the test. Sometimes, just noticing those symptoms makes you report them differently the next time. So, again, this can cause problems with your study. Testing effects can be controlled by using tests with high test-retest reliability and by having two or more groups in your study design, assuming all groups take the same tests at roughly the same time points.
Instrumentation: Sometimes, a study can have problems in how accurately an instrument measures something over time. Machines that measure results (for example, a blood test analyzer) can get glitchy and report erroneous results. This can be addressed by regularly calibrating instruments to make sure they are accurate. This is easy with machines. It’s much harder when a human is measuring the results. So, to control for instrumentation, you not only need reliable machines, you also need reliable humans. That is not impossible, but it is much easier said than done, because most humans (compared to machines) are notoriously unreliable at measuring data.
Design contamination: This occurs when participants find out about each other or about the study. Some interesting things happen when participants find out information about a study. First, some participants tend to perform as they are expected to. This is called the Rosenthal or Pygmalion effect. If a participant believes they are expected to get better, they often actually do get better on their own, much like a placebo effect. Second, some participants engage in what’s called “compensatory rivalry.” They may find out (or suspect), for example, that they are not getting “the good stuff” (the actual treatment) and are getting the placebo instead, and they may then be motivated to work harder to get better and “outdo” the “privileged” group that is getting the “good stuff.” This can also be called the “John Henry” effect, named in honor of the steel driver of American folklore who, in the 19th century, worked so hard in a competition against a steam drill that he died of overexertion (he did win, by the way, and actually outperformed the steam drill).
A third type of design contamination is the opposite of compensatory rivalry. It is called resentful demoralization: a participant finds out they aren’t getting the “good stuff,” becomes demoralized, and performs worse. It’s the idea of “why should I even try if I’m not getting the real treatment?”
These design contamination effects are all really common problems in multi-group studies, especially when you consider that it doesn’t even matter what treatments the people are actually getting; it only matters what they believe they are getting. Researchers try to control for these by blinding participants to treatment (so they never know for sure what they got until AFTER the study is done), but blinding is often pretty difficult. It also raises a lot of ethical complications. Consider, for example, a placebo-controlled surgery. A “sham” knee surgery might be one where the surgeon cuts open the knee (so there is a scar) but doesn’t actually do anything inside. While doing this makes good science (it’s the simplest way to control for threats to validity), it involves the really questionable ethics of putting someone through anesthesia and then maiming them a bit without actually making them better. Most doctors aren’t psychopaths and have no desire to offer “fake” treatments to patients, and so a lot of real-world studies have real limitations on their ability to offer a proper “placebo” treatment in a scientifically sound way.
All of the above were threats to internal validity, but what about threats to external validity? A threat to external validity is basically anything that can cause a difference between those in the experiment and those NOT in the experiment. The most prominent example I run into in my field is that research studies usually have far more dedicated resources than are available in the real world. So, in a research study, a patient will usually get a very high standard of care and follow-up. They get an extensive initial evaluation and multiple checkups by dedicated research personnel, some of whom may have the sole job of monitoring research participants. In the real world, though, doctors don’t always have the same time and resources. Instead of a 2-hour initial evaluation, a patient may get 30 minutes. Instead of weekly checkups, the patient may get monthly appointments. Now, this isn’t because real-world doctors don’t care. They do. It’s just that they don’t have special research funds that pay for all of the extra time and resources necessary to do all that work.
Another difference is that the people who qualify for a research study are usually fairly “idealized” patients: they aren’t complicated and don’t have multiple, complex conditions to treat. Research patients are usually simple, and researchers want them to be, because otherwise there is too much to confound the results of the study. But in the real world people are anything but simple, and real patients rarely present with only one isolated problem. Often, the problems of real-world patients interact in meaningful ways, but researchers rarely capture those interactions, because they purposely screen them out to keep the research simpler.