A new study (summarized here; full text available here if you’re lucky enough to be at an institution with a medical school and an institutional subscription to the relevant journal) released yesterday purports to have some implications for sex education policy. The authors — or the publicists at the University of Pennsylvania, at any rate — call it the first randomized controlled study of the effectiveness of various kinds of interventions, and it finds that the abstinence-only intervention was more effective at encouraging teens to delay sexual activity than safe-sex or abstinence-plus-safe-sex programs. The numbers aren’t overwhelming — 33.5% of adolescents reported having sex in the 24 months following their participation in the abstinence-only program, while 48.5% of the students in the other programs reported having had sex during that period — but they look compelling.
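(A back-of-the-envelope aside: the group sizes aren’t given in the summary, but you can get a rough feel for why a fifteen-point gap registers as impressive by running the standard two-proportion comparison with an assumed, purely hypothetical arm size. The sketch below does exactly that and nothing more.)

```python
import math

# Purely illustrative: the reported rates, paired with an INVENTED arm size of
# 200 adolescents per group (the actual group sizes are not given here).
def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

print(two_proportion_z(0.335, 200, 0.485, 200))  # roughly -3 with these made-up n's
```

With those invented arm sizes the z statistic lands around 3, which is exactly the sort of number that reads as impressive.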
At least, they look compelling at first glance. Despite the authors’ own admirable cautionary notes regarding the need for further research before any policy implications can be solidly grounded, the pundits seem to be lining up as expected, and deploying the results in a decontextualized manner: the researcher at the Heritage Foundation who wrote the federal guidelines for funding abstinence-only programs is (big shock here) pleased that the study validates what he always maintained, while the critics of abstinence-only (in an equally big shock) deny that the study validates the programs that they oppose. Politicized science, indeed.
The problem here is that both sides of the political discussion appear to fundamentally misunderstand the methodology involved in a study like this — and this misunderstanding permits them to draw erroneous conclusions about what the results actually mean. This is a little more serious than “correlation is not causation,” although it begins there; in fact, the issue is more like “662 African American students from four public middle schools in a city in the Northeastern United States are not a laboratory.” As Nancy Cartwright (among many other philosophers of science) has pointed out, the fundamental error involved in the interpretation of randomized controlled trials (RCTs) is that people mis-read them as though they had taken place under controlled conditions, when they actually did not; in consequence, generalizing beyond the specific trial itself is a process fraught with the potential for error.
Consider, for a moment, what makes a laboratory trial “work” as a way of evaluating causal claims. If I want to figure out what chemical compound best promotes longevity in fruit flies or mice, the first thing I do is to make sure that my entire stock of experimental subjects is as similar to one another as possible on all factors that might even potentially affect the outcome (a procedure that requires me to draw on an existing stock of theoretical knowledge). Then I work very hard to exclude things from the laboratory environment that might affect the outcome — environmental factors, changes in general nutrition, etc. And when conducting the trials, I make sure that the procedures are as similar as humanly possible across the groups of experimental subjects, again drawing on existing theory to help me decide which variations are permissible and which are not. All of this precise control is made possible by the deliberately artificial environment of the laboratory itself, and at least in principle it allows researchers to isolate, for all practical purposes, causal factors and their impact on outcomes.
Now, the problem is that the actually-existing world is not a laboratory, but a much more open system of potential causal factors interacting and concatenating in a myriad of ways. Scientific realists like Cartwright bridge the gap between the laboratory and the world by presuming — and I stress that this is a theoretical presumption, not an empirical one — that the same causal factors that operated in the laboratory will continue to exert their effects in the open system of the actual world, but this most certainly does not mean that we will observe the kinds of robust correlations in the actual world that we were able to artificially induce in the laboratory. Hence, what is required is not a correlation analysis of actual empirical cases, but detailed attention to tracing how causal factors come together in particular ways to generate outcomes. (Sections 1.2 and 1.3 of this article provide a good, if somewhat technical, account of the conceptual background involved.)
So causal inference goes from the controlled laboratory to the open actually-existing world, and we can make that move precisely to the extent that we presume that objects in the lab are not fundamentally different from objects in the world. The problem with an RCT is that it turns this logic completely on its head, seeking to isolate causal factors in the actual world instead of in the laboratory, and looking, as evidence of causation, for precisely the kind of thing that we shouldn’t expect in an open system: namely, robust cross-case correlations. Following 662 students from four middle schools over a period of several years is in no significant respect anything like putting 662 mice on a variety of diets in a lab and seeing which groups live the longest; the number of potentially important factors that might be at work in the actual world is basically a countably infinite quantity, and we have precisely no way of knowing what they are — or of controlling for them. No lab, no particular epistemic warrant for correlations, even robust ones; they might be accidental, they might be epiphenomenal, heck, they might even be the unintentional result of sampling from the tail-end of a “crazy” (i.e., not a normal) distribution. All the technical tricks in the world can’t compensate for the basic conceptual problem, which is that unless we make some pretty heroic assumptions about the laboratory-like nature of the world, an RCT tells us very little, except perhaps to suggest that we need to conduct additional research to flesh out the causal factors and processes that might have been at work in producing the observed correlation. In other words, we need better theory, not more robust correlations.
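To make the worry about “crazy” distributions a bit more concrete, here is a deliberately toy simulation. Everything in it is invented, including the assumption that the intervention gets delivered in small groups, each of which carries its own unobserved, heavy-tailed influence on the outcome; the point is only that under conditions like these, two arms with no true treatment difference can produce gaps as large as the reported one more often than the textbook sampling model would lead you to expect.

```python
import random

random.seed(7)

def arm_rate(n_groups=12, group_size=14):
    """Simulate one study arm: subjects sit in small groups, each group gets a
    heavy-tailed (Pareto) shock to its base rate; return the arm's outcome rate."""
    had_outcome, total = 0, 0
    for _ in range(n_groups):
        shock = random.paretovariate(1.2)       # heavy-tailed, unobserved group-level factor
        p = min(0.95, 0.25 + 0.04 * shock)      # group-specific outcome probability
        for _ in range(group_size):
            had_outcome += random.random() < p
            total += 1
    return had_outcome / total

trials = 5000
big_gaps = sum(abs(arm_rate() - arm_rate()) >= 0.15 for _ in range(trials))
print(f"{big_gaps / trials:.1%} of these null 'studies' show a gap of 15 points or more")
```

None of this proves anything about the study in question; it just illustrates how easily one unmodeled feature of the actual world can masquerade as a robust cross-case correlation.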
The limitations of RCTs can perhaps be even more clearly grasped if we think about the marvelous machine that is organized Major League Baseball: 30 teams playing 162 games each over the course of each six-month-long season, and doing so under pretty rigorously controlled conditions. Indeed, MLB is a kind of approximate social laboratory, where players are required to perform basically similar actions over and over again; pitchers throw thousands of pitches a season, batters can have hundreds of plate appearances, and so on. And overseeing it all is a bureaucracy working to keep the enforcement of the rules homogeneous. It’s not a perfect system — “park effects” on pitcher and batter performance are measurable, and sometimes players cheat — but on the whole it’s a lot closer to a closed system than four middle schools in the Northeastern United States. But even under such conditions, there are prediction failures of epic proportions, as when a team pays a great deal of money to acquire a player who previously performed very well (cough cough Yankees acquiring Randy Johnson cough cough) only to discover that some previously-unaccounted-for factor is now at work preventing his performance from reaching its previous heights. Or there are celebrated examples like the more or less complete collapse of a previously elite player like Chuck Knoblauch when moving from small-market Minnesota to huge-market New York — something that looked like a very robust correlation between a player and his performance turned out to be in part produced by something hitherto unknown. It works in reverse too: players who did badly someplace resuscitate their playing careers after signing with different teams, and there is precisely no perfect system for predicting which players will do that under which conditions.
My point is that if the laboratory-like environment of MLB doesn’t produce generally valid knowledge that can survive the transplantation of players from team to team — in effect, if the results of previous laboratory trials are at best imperfect predictors of future laboratory trials, and we can only determine in retrospect how good a player was by looking at his overall playing career statistics — what hope is there for an RCT conducted under far less controlled conditions? At least in baseball one can say that all of the trials take place under similar conditions, and given the absurdly large n that can be worked with if one aggregates performance data from multiple seasons of play, it is possible to develop probabilistic forecasting models that have some reasonable chance of success on average. But the practical condition of this kind of operation is the approximate closure of the environment produced by the organization of the MLB season; this is not merely a quantitative convention, but an actual set of social actions. In the absence of such practical conditions, a robust correlation counts for very little, and seems like a very thin reed on which to base public policy decisions.
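For what it’s worth, the forecasting that does work in baseball works precisely because of that aggregation: you pull any individual player’s observed numbers toward the league-wide rate in proportion to how little data you have on him. A toy version of that shrinkage logic (the specific weights and averages below are invented for illustration, not real MLB figures) looks something like this:

```python
# Toy shrinkage forecast: blend a player's own record with the league-wide rate,
# where prior_weight plays the role of the "pseudo-at-bats" the league average
# is worth. All numbers are invented for illustration.
def shrunk_forecast(hits, at_bats, league_avg=0.260, prior_weight=300):
    return (hits + league_avg * prior_weight) / (at_bats + prior_weight)

print(shrunk_forecast(hits=20, at_bats=50))    # a hot .400 over 50 at-bats -> ~.280 forecast
print(shrunk_forecast(hits=180, at_bats=600))  # a full season at .300 -> ~.287 forecast
```

Even that only succeeds on average, and only because the league keeps the trials roughly comparable from season to season; it has nothing to say in advance about the Knoblauchs of the world.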
Again, what we need is better theory, not better correlations. How do sex education programs work? What kinds of processes and mechanisms are involved in educating an adolescent about sex, and how do those work together in the actual world to generate case-specific outcomes? That’s the basis on which we ought to be having this discussion — and, not incidentally, the discussion of every other public policy issue that we mistakenly expect correlation studies conducted in the open system of the actual world to help us puzzle out. A robust correlation is neither necessary nor sufficient for a causal claim, and until we accept that, we will never avoid this kind of gross mis-use of scientific results for partisan purposes.
This is the kind of thing I try to teach people. Can I count on a sequel?