Tag: methodology (page 2 of 3)

$h•! PTJ Says #1: justifying your theory and methodology

I am going to try writing down pieces of advice that I give to students all the time, in the hopes that they might be useful for people who can’t make it to my office hours.

“The fact that no one else has approached topic X with your particular perspective is not a sufficient warrant for approaching topic X with your particular combination of theory and methodology. In order to get the reader on board, you have to basically issue a promissory note with a grammar that runs something like:

‘Here’s something odd/striking/weird/counterintuitive about X. Other scholars who have talked about X either haven’t noticed this odd/striking/etc. thing at all, or they haven’t found it odd/striking/etc. Furthermore, they haven’t done so because of something really important about their theory/methodology that — even though it generates some insights — simply prevents them from appreciating how odd/striking/etc. this thing is, let alone trying to explain it. Fortunately, there’s my alternative, which I am now going to outline in a certain amount of abstract detail; but bear with me, because there’s a mess of empirical material about topic X coming after that, and I promise you that my theoretical/methodological apparatus will prove its worth in that empirical material by a) showing you just how odd/striking/etc. that thing is, and b) explaining it in a way that other scholars haven’t been able to and won’t be able to.’

Almost no one is convinced by theory and methodology, and absolutely no one is or should be convinced by a claim that existing approaches aren’t cool enough because they aren’t like yours. The burden is on you to give the reader reasons to keep reading, and at the end of the day the only reason for theory and methodology is to explain stuff that we didn’t have good explanations for before. So you have to convince the reader that other approaches *can’t* explain that odd thing about topic X. (And if you can do this without gratuitous and out-of-context references to Thomas Kuhn and being ‘puzzle-driven,’ that’s even better, because I won’t have to make you write an essay on why basically nobody in the social sciences actually uses Kuhn correctly.)”


On Paradigms, Policy Relevance and Other IR Myths

I had every intention this evening of writing a cynical commentary on all the hoopla surrounding Open Government, Open Data and the Great Transparency Revolution. But truth be told, I am brain-dead at the moment. Why? Because I spent the last two days down in Williamsburg, VA arbitrating codes for a Teaching, Research and International Politics (TRIP) project (co-led by myself and Jason Sharman) which analyzes what the field of IR looks like from the perspective of books. It is all meant as a complement to the innovative and hard work of Michael Tierney, Sue Peterson and the TRIP founders down at William & Mary, who have sought to map the field of IR by systematically coding all published articles in the top 12 peer-reviewed disciplinary journals for characteristics such as paradigm, methodology, epistemology and policy relevance. In addition, the TRIP team has conducted numerous surveys of IR scholars in the field, the latest round capturing nearly 3000 scholars in ten countries. The project, while not immune from nit-picky criticism about its methodological choices and conclusions, has yielded several surprising results that have both confirmed and dismantled several myths about the field of IR.

So, in the spirit of recent diatribes on the field offered by Steve and Brian, I summarize a few of the initial findings of our work to serve as fodder for our navel-gazing discussion:

Myth #1: IR is now dominated by quantitative work

Truth: Depends on where you look. This is somewhat true if you confine yourself to the idea that we can know the field only by peering into the pages of IO, ISQ, APSR and the like. Between 2000 and 2008, according to a TRIP study by Jordan et al (2009), 38.8% of journal articles employed quantitative methods, while 30.4% used qualitative methods. (In IPE, however, the trend is definitely clearer: in 2006, 90% of articles used quantitative methods; see Maliniak and Tierney 2009, 20.) But the myth of quantitative dominance is dispelled when we look beyond journals. In the 2008 survey of IR scholars, 72% of scholars reported that they use qualitative methods as their primary methodology. In our initial study of books published between 2000 and 2010, Jason and I found that 58% of books use qualitative methods and only 9.3% use quantitative (the rest relying mainly on descriptive methods, policy analysis and the rare formal model).

Myth #2: In IR, it’s all about PARADIGMS.

Truth: Well, not really. As much as we kvetch about how everyone has to pay homage to realism, liberalism, constructivism (and rarely, Marxism) in order to get published, the truth is that only a minority of published IR work takes one or more of these paradigms as the chosen framework for analysis. Surveys reveal that IR scholars still think of realism as the dominant paradigm, yet realism shows up as the paradigm of choice in less than 10% of both books and articles. Liberalism is slightly more prevalent – it is the paradigm of choice in around 26% of journal articles and 20% of books. Constructivism has actually overtaken realism, but still amounts to only 11% of journal articles and 17% of books in the past decade. Instead, according to the TRIP coding scheme, most IR work is “non-paradigmatic” (meaning it takes theory seriously, but doesn’t use one of the usual paradigmatic suspects) or is “atheoretic”. [Stats alert: 45% of journal articles are non-paradigmatic and 9.5% atheoretic, whereas books are 31% non-paradigmatic and 23% atheoretic.]

So, Brian: does IR still “really like” the isms?

Myth #3: Positivism rules.

Truth: Yep, that one is pretty much on the mark. 86% of journal articles AND 85% of books between 2000 and 2010 employed a positivist methodology. Oddly, however, only 55% of IR scholars surveyed report that they see themselves as positivists. I’m going to add that one to the list of “things that make me go hmmmmm…..”

Myth #4: IR scholarship is not oriented towards policy.

Truth: Sadly, true. Only 12% of journal articles offer policy recommendations. [OK, a poor proxy, but all I had to go on from the TRIP coding system]. Books are slightly more likely to dabble in policy, with 22% offering some sort of policy prescriptions – often quite limited and lame in my humble coding experience. Curiously, scholars nonetheless perceive themselves differently: 29% of scholars say they are doing policy-oriented research. This could be entirely true if they are doing it outside the normal venues of published research in the discipline and we’re simply not capturing it in our study (blogs, anyone?). All of which raises several questions: are IR scholars really engaging in policy debates? If so, how? Where? If not, why not? (Hint: fill out the next TRIP survey in fall 2011 and we’ll find out!!)

(Note to readers: I was unable to provide a link to the draft study that Jason and I conducted on books, as it is not yet ready for prime time on the web. But if you have any questions about our project, feel free to email me).


Stuff political scientists like #5 — a Large N

I have been doing a lot of work with survey data lately, as well as some reading in critical theory. Maybe that inspired my deconstruction of the gendered language of stats. Or maybe I just like to work blue.

Your girlfriend has told you, “Honey, your data set is big enough for me. It’s OK if it doesn’t get you into the APSR.” She might tell you, “It is not the size of your p-value that matters, it is what you do with it.” A good theory can make up for a small N, she reassures you. But political scientists know the truth. Size matters. Political scientists like a large-N.

A large-N enables you to find a statistically significant relationship between any two variables, and to find evidence for any number of crazy arguments that are so surprising, they will get you published. Political scientists like to be surprised. Your theory might be dapper and well dressed, but without the large-N, political scientists will not swoon. They go crazy for those little asterisks.
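The asterisk-chasing here has a real statistical basis: hold the correlation fixed and grow N, and the test statistic grows without bound. A stdlib-only sketch using the Fisher z-transformation (the numbers are illustrative, not drawn from any actual study):

```python
import math

def fisher_z_stat(r, n):
    """z-statistic for H0: no correlation, via the Fisher transformation."""
    return math.atanh(r) * math.sqrt(n - 3)

r = 0.02  # a substantively trivial correlation
print(fisher_z_stat(r, 100))        # ~0.20: nowhere near significance
print(fisher_z_stat(r, 1_000_000))  # ~20: asterisks for everyone
```

The relationship is identical at both sample sizes; only the asterisks change.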

Some qualitative researcher might come in and show that your variables are not actually causally related, but it will be too late. You will have 200 citations on Google Scholar, and their article will be in the Social Science Research Network archive forever. Your secret is safe. Go back to Europe, qually!

Political scientists also like a large-N because it gives you degrees of freedom. You can experiment with other variables in your model without worrying about multicollinearity. You aren’t tied down to one boring variable. Political scientists like to swing.

Political scientists prefer it if the standard error in your data is smooth and consistent and does not increase as the X value rises. Consider waxing or shaving your data with simple robust standard errors if you have problems with heteroskedasticity. They also like a big coefficient that slopes upward. Doesn’t everyone? And fit, don’t forget about fit. Fit makes things more enjoyable.

It is best if your large-N data does not have a lot of measurement error. You might say, a little is natural, like when I jump in the pool, but this is not acceptable in political science. You should, however, have variation in your dependent variable. Variety is good. It keeps things spicy. When a political scientist wants to get really kinky, he or she will bootstrap the data.

It is best if your data is normally distributed, but political scientists generally forgive that. They like data of all shapes and sizes. They just close their eyes and pretend that it is symmetrical. Binomial. Fat tails. Oooh. That just sounds dirty.

Political scientists will tell you that if your dataset is not big enough, your confidence intervals will be too wide. Paradoxically, this will drain your confidence and make it harder for you to perform in the future. But don’t worry, they have drugs for that.
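The confidence-interval jab rests on the standard formula: for a sample mean, the 95% interval’s half-width shrinks with the square root of N, so you need four times the data to halve the interval. A quick sketch with illustrative values:

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Half-width of a 95% confidence interval for a sample mean."""
    return z * sd / math.sqrt(n)

print(ci_half_width(sd=10, n=100))     # 1.96
print(ci_half_width(sd=10, n=10_000))  # 0.196: 100x the data, 10x tighter
```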

Don’t leave anything to chance. Get yourself a large-N. But don’t listen to those ads on TV late at night. Those quick data fixes don’t work.


Beyond Qual and Quant

PTJ has one of the most sophisticated ways of thinking about different positions in the field of International Relations (and, by extension, the social sciences), but his approach may be too abstract for some. I therefore submit for comments the “Political Science Methodology Flowchart” (version 1.3b).

Note that any individual can take multiple treks down the flowchart.

Of Quals and Quants

Qualitative scholars in political science are used to thinking of themselves as under threat from quantitative researchers. Yet qualitative scholars’ responses to quantitative “imperialism” suggest that they misunderstand the nature of that threat. The increasing flow of data, the growing availability of computing power and easy-to-use software, and the relative ease of training new quantitative researchers make the position of qualitative scholars more precarious than they realize. Consequently, qualitative and multi-method researchers must not only stress the value of methodological pluralism but also what makes their work distinctive.

Few topics are so perennially interesting for the individual political scientist and the discipline as the Question of Method. This is quickly reduced to the simplistic debate of Quant v. Qual, framed as a battle of those who can’t count against those who can’t read. Collapsing complicated methodological positions into a single dimension obviously does violence to the philosophy of science underlying these debates. Thus, even divisions that really affect other dimensions of methodological debate, such as those that separate formal theorists and interpretivists from case-study researchers and econometricians, are lumped into this artificial dichotomy. Formal guys know math, so they must be quants, or at least close enough; interpretivists use language, ergo they are “quallys” (in the dismissive nomenclature of Internet comment boards), or at least close enough. And so elective affinities are reified into camps, among which ambitious scholars must choose.

(Incidentally, let’s not delude ourselves into thinking that multi-method work is a via media. Outside of disciplinary panels on multi-method work, in their everyday practice, quantoids proceed according to something like a one-drop rule: if a paper contains even the slightest taint of process-tracing or case studies, then it is irremediably quallish. In this, then, those of us who identify principally as multi-method stand in relation to the qual-quant divide rather as Third Way folks stand in relation to left-liberals and to all those right of center. That is, the qualitative folks reject us as traitors, while the quant camp thinks that we are all squishes. How else to understand EITM, which is the melding of deterministic theory with stochastic modeling but which is not typically labeled “multi-method”?)

The intellectual merits of these positions have been covered better elsewhere (as in King, Keohane, and Verba 1994; Brady and Collier’s Rethinking Social Inquiry; and Patrick Thaddeus Jackson’s The Conduct of Inquiry in International Relations). Kathleen McNamara, a distinguished qualitative IPE scholar, argues against the possibility of an intellectual monoculture in her 2009 article on the subject. And I think that readers of the Duck are largely sympathetic to her points and to similar arguments. But even as the intellectual case for pluralism grows stronger (not least because the standards for qualitative work have gotten better), we should recognize that it is incontestable that quantitative training makes scholars more productive (in the simple articles/year metric) than qualitative training does.

Quantitative researchers work in a tradition that has self-consciously made the transmission of the techne of data management, of data collection, and of data analysis vastly easier not only than its case-study, interpretivist, and formal counterparts but even than quant training a decade or more ago. By techne, I do not mean the high-concept philosophy of science. All of that is usually about as difficult and as rarefied as the qualitative or formal high-concept readings, and about equally useful to the completion of an actual research project–which is to say, not very, except insofar as it is shaped into everyday practice and reflected in the shared norms of the average seminar table or reviewer pool. (And it takes a long time for rarefied theories to percolate. That R^2 continues to be reported as an independently meaningful statistic even 25 years after King (1986) is shocking, but the Kuhnian generational replacement has not yet really begun to weed out such ideological deviationists.)

No, when I talk about techne, I mean something closer to the quotidian translation of the replication movement, which is rather like the business consultant notion of “best practices.” There is a real craft to learning how to manage data, and how to write code, and how to present results, and so forth, and it is completely independent of the project on which a researcher is engaged. Indeed, it is perfectly plausible that I could take most of the thousands of lines of data-cleaning and analysis code that I’ve written in the past month for the General Social Survey and the Jennings-Niemi Youth-Parent Socialization Survey, tweak four or five percent of the code to reflect a different DV, and essentially have a new project, ready to go. (Not that it would be a good project, mind you, but going from GRASS to HOMOSEX would not be a big jump.) In real life, there would be some differences in the model, but the point is simply that standard datasets are standard. (Indeed, in principle and assuming clean data, if you had the codebook, you could even write the analysis code before the data had come in from a poll–which is surely how commercial firms work.)
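To illustrate the workflow being described (not the author’s actual code), here is a hypothetical sketch. GRASS and HOMOSEX are real GSS items, but the data and the toy “analysis” below are invented solely to show how a parameterized dependent variable turns one project into another:

```python
# Hypothetical stand-in for a cleaned GSS extract (values invented).
rows = [
    {"grass": 1, "homosex": 0, "educ": 16},
    {"grass": 0, "homosex": 0, "educ": 12},
    {"grass": 1, "homosex": 1, "educ": 18},
    {"grass": 0, "homosex": 1, "educ": 14},
]

def share_by_education(data, dv, cutoff=16):
    """Share answering `dv` affirmatively, high- vs. low-education respondents."""
    hi = [r[dv] for r in data if r["educ"] >= cutoff]
    lo = [r[dv] for r in data if r["educ"] < cutoff]
    return sum(hi) / len(hi), sum(lo) / len(lo)

print(share_by_education(rows, "grass"))    # the original project
print(share_by_education(rows, "homosex"))  # "a new project," one argument later
```

The point is structural: because the analysis is a function of the DV name, swapping in a new outcome is a one-line change, which is exactly why standard datasets make quant projects so reusable.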

There is nothing quite like that for qualitative researchers. Game theory folks come close, since they can tweak models indefinitely, but of course they then have to find data against which to test their theories (or not, as the case may be). Neither interpretivists nor case-study researchers, however, can automate the production of knowledge to the same extent that quantitative scholars can. And neither of those approaches appears to be as easily taught as quant approaches.

Indeed, the teaching of methods shows the distinction plainly enough. Gary King makes the point well in an unpublished paper:

A summary of these features of quantitative methods is available by looking at how this information is taught. Across fields and universities, training usually includes sequences of courses, logically taken in order, covering mathematics, mathematical statistics, statistical modeling, data analysis and graphics, measurement, and numerous methods tuned for diverse data problems and aimed at many different inferential targets. The specific sequence of courses differ across universities and fields depending on the mathematical background expected of incoming students, the types of substantive applications, and the depth of what will be taught, but the underlying mathematical, statistical, and inferential framework is remarkably systematic and uniformly accepted. In contrast, research in qualitative methods seems closer to a grab bag of ideas than a coherent disciplinary area. As a measure of this claim, in no political science department of which we are aware are qualitative methods courses taught in a sequence, with one building on, and required by, the next. In our own department, more than a third of the senior faculty have at one time or another taught a class on some aspect of qualitative methods, none with a qualitative course as a required prerequisite.

King has grown less charitable toward qualitative work than he was in KKV. But he is on to something here: If every quant scholar has gone through the probability theory → OLS → MLE → {multilevel, hazard, Bayesian, …} sequence, what is the corresponding path for a “qually”? What could such a path even look like? And who would teach it? What books would they use? There is no equivalent of, say, Long and Freese for the qualitative researcher.

The problem, then, is that it is comparatively easy to make a competent quant researcher. But it is very hard to train up a great qualitative one. Brad DeLong put the problem plainly in his obituary of J.K. Galbraith:

Just what a “Galbraithian” economist would do, however, is not clear. For Galbraith, there is no single market failure, no single serpent in the Eden of perfect competition. He starts from the ground and works up: What are the major forces and institutions in a given economy, and how do they interact? A graduate student cannot be taught to follow in Galbraith’s footsteps. The only advice: Be supremely witty. Write very well. Read very widely. And master a terrifying amount of institutional detail.

This is not, strictly, a qual problem. Something similar happened with Feynman, who left no major students either (although note that this failure is regarded as exceptional). And there are a great many top-rank qualitative professors who have grown their own “trees” of students. But the distinction is that the qualitative apprenticeship model cannot scale, whereas you can easily imagine a very successful large-lecture approach to mastering the fundamental architecture of quant approaches or even a distance-learning class.

This is among the reasons I think that the Qual v Quant battle is being fought on terms that are often poorly chosen, both from the point of view of the qualitative researcher and also from the discipline. Quant researchers will simply be more productive than quals, and that differential will continue to widen. (This is a matter of differential rates of growth; quals are surely more productive now than they were, and their productivity growth will accelerate as they adopt more computer-driven workflows, as well. But there is no comparison between the way in which computing power increases have affected quallys and the way they have made it possible for even a Dummkopf like me to fit a practically infinite number of logit models in a day.) This makes revisions easier, by the way: a quant guy with domesticated datasets can redo a project in a day (unless his datasets are huge) but the qual guy will have to spend that much time pulling books off the shelves.

The qual-quant battles are fought over the desirability of the balance between the two fields. And yet the more important point has to do with the viability, or perhaps the “sustainability,” of qualitative work in a world in which we might reasonably expect quants to generate three to five times as many papers in a given year as a qual guy. Over time, we should expect this to lead to first a gradual erosion of quallies’ population, followed by a sudden collapse.

I want to make plain that I think this would be a bad thing for political science. The point of the DeLong piece is that a discipline without Galbraiths is a poorer one, and I think the Galbraiths who have some methods training would be much better than those who simply mastered lots and lots of facts. But a naive interpretation of productivity ratios by university administrators and funding agencies will likely lead to qualitative work’s extinction within political science.


Book Review: Codes of the Underworld

I recently finished Diego Gambetta’s Codes of the Underworld: How Criminals Communicate.  For those looking for a more academic take on signaling (particularly from a sociological point of view) it’s a great find.  As I previously mentioned, Gambetta uses the extreme case of cooperation amongst criminals to tease out more general dynamics of trust, signaling, and communication.  The Mafia can be considered a “hard-case” for theories of signaling trust; given the extreme incentives for criminals to lie and the lack of credibility they wield given the very fact that they are criminals, how is it that criminals manage to coordinate their actions and trust each other at all?  By understanding how trust works in this harsh environment we learn something about how to signal trustworthiness in broader, less restrictive environments.  As Gambetta notes:

Studying criminal communication problems, precisely because they are the magnified extreme versions of problems that we normally solve by means of institutions, can teach us something about how we might communicate, or even should communicate, when we find ourselves in difficult situations, when, say, we desperately want to be believed or keep our messages secret.

The book is a great example of studying deviant cases or outliers, particularly when the area of study is not well worn.  This is a valuable general methodological lesson.  We are typically taught to avoid outliers as they skew analysis.  However, they can be of great value in at least two circumstances: 1) Generating hypotheses in areas that have not been well studied and 2) Testing hypotheses in small-N research designs, where hard cases can establish potential effect and generalizability and easy cases suggest minimal plausibility.

Gambetta takes a number of criminal actions and views them through the lens of signaling.  This allows readers to see actions, in many cases, in completely new ways, highlighting the instrumental causes of behavior.  For example, Gambetta looks at how criminals solve the problem of identifying other criminals by selectively frequenting environments where non-criminals are not likely to go.  Since criminals cannot advertise their criminality, they face a coordination problem.  Frequenting these locations acts as a screening mechanism, since only genuine criminals are likely willing to pay the costs of frequenting them.  (This ignores the issue of undercover law enforcement, but Gambetta deals with that as well.)  Gambetta also makes the reader look at prison in a new light.  Criminals derive a number of advantages from serving time in prison, not the least of which is a signaling mechanism for communicating their credibility to other criminals (as prison time can be verified by third parties).  Additionally, many criminal organizations will require that new members have already served time before they are allowed to join.  Moreover, Gambetta explores how incompetence can work to a criminal’s advantage, since it can signal loyalty to a boss who provides the criminal’s only real means of income (a topic I discussed here).
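The screening logic has a simple formal core. With made-up payoffs (not Gambetta’s notation), a hangout works as a screen only when the cost of frequenting it is worth paying for genuine criminals and nobody else:

```python
def screen_separates(benefit, cost_criminal, cost_outsider):
    """True when only genuine criminals find the venue worth the cost:
    the benefit of being identified as a criminal exceeds the criminal's
    cost of showing up, but not the outsider's."""
    return cost_criminal < benefit < cost_outsider

# A venue that is cheap for insiders but costly for outsiders
# (risk, stigma, exposure) screens well...
print(screen_separates(benefit=10, cost_criminal=3, cost_outsider=50))  # True
# ...while a venue cheap for everyone screens nobody.
print(screen_separates(benefit=10, cost_criminal=3, cost_outsider=5))   # False
```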

Gambetta also looks at the conspicuous use of violence within prisons.  This isn’t a new topic, as any law enforcement drama will undoubtedly portray the dilemma of a new inmate who must establish their reputation for toughness and resolve or else suffer constant assaults by other inmates.  However, Gambetta makes it interesting by embedding the acts in a signaling framework.

First, Gambetta’s hypothesis regarding the importance of non-material interests is borne out by various studies.  Among others, he cites one study of prison conflict that found:

“[n]on-material interests (self-respect, honour, fairness, loyalty, personal safety and privacy) were important in every incident.”  While only some violent conflicts occur for the immediate purpose of getting or keeping resources, all of them have to do with establishing one’s reputation or correcting wrong beliefs about it.  Even “a conflict that began over the disputed ownership of some item could quickly be interpreted by both parties as a test of who could exploit whom.”

Second, Gambetta hypothesizes that we should expect to see more fights when prisoners arrive without much of a violence track record.  One observable implication is higher rates of prison violence among female prisoners and younger prisoners.  In fact, the empirical record bears this out quite nicely.  Rates of violence are inversely related to age, providing “a plausible social rather than biological explanation” for youth violence.  Additionally, Gambetta finds that, although less violent in the outside world, “women become at least as violent and often more prone to violence than men”.  Interestingly, women are less often convicted of violent offenses, suggesting that the results are not simply an artifact of selection effects.

Both points have implications for political science and international relations, given the growing use of signaling models to explain political behavior.  The issue of reputation in international relations is one that is still growing and Gambetta’s hypothesis about lack of “violence capital” fits right in to much of the current work in conflict studies.

Overall, Codes of the Underworld is a unique and thought-provoking work.  For those with a strong interest in communication and signaling, it is a must-read.

[Cross-posted at Signal/Noise]


In Praise of Falsification

For those that have not read it yet, The Atlantic recently featured an article profiling Dr. John Ioannidis, who has made a career out of falsifying many of the findings of medical research that guide clinical practice.  Ioannidis’ research should cause us all to appreciate the various biases we may bring to our own work:

[C]an any medical-research studies be trusted?

That question has been central to Ioannidis’s career. He’s what’s known as a meta-researcher, and he’s become one of the world’s foremost experts on the credibility of medical research. He and his team have shown, again and again, and in many different ways, that much of what biomedical researchers conclude in published studies—conclusions that doctors keep in mind when they prescribe antibiotics or blood-pressure medication, or when they advise us to consume more fiber or less meat, or when they recommend surgery for heart disease or back pain—is misleading, exaggerated, and often flat-out wrong. He charges that as much as 90 percent of the published medical information that doctors rely on is flawed. His work has been widely accepted by the medical community; it has been published in the field’s top journals, where it is heavily cited; and he is a big draw at conferences. Given this exposure, and the fact that his work broadly targets everyone else’s work in medicine, as well as everything that physicians do and all the health advice we get, Ioannidis may be one of the most influential scientists alive. Yet for all his influence, he worries that the field of medical research is so pervasively flawed, and so riddled with conflicts of interest, that it might be chronically resistant to change—or even to publicly admitting that there’s a problem. [my emphasis]

Unlike most famous researchers, Ioannidis is not famous for a positive discovery or finding (unless you count his mathematical proof that predicts error rates for different methodologically-framed studies).  Instead, his status has been obtained because of his ability to falsify the work of others–to take their hypotheses and empirical research and show that they are wrong.

This is highly unusual, not only in the area of medical research, but in most academic disciplines.  The article notes that researchers are incentivized to publish positive findings–preferably paradigm-altering ones–and this leads to a breakdown in the scientific method.  As Karl Popper so famously argued, knowledge accumulates through the testing of theories that are then subjected to replication by other researchers.  If the original findings are falsified–meaning that the evidence does not support the theory–the theory is scrapped and replaced with a new theory that has greater explanatory power.  Knowledge is built through the cumulative falsification of theories.  One can think about falsification as the successive chipping away at a block of stone–the more we chip away, the closer we get to the actual form.  If researchers are not incentivized to pursue falsification, we all lose as a result, since incorrect findings are not vigorously retested and challenged.  According to Ioannidis, when findings are challenged at all, it is often years–if not decades–after they have been generally accepted by research communities.
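Ioannidis’s error-rate argument (the “mathematical proof” mentioned above) reduces to a short Bayesian formula: the probability that a statistically significant finding is actually true depends on power, the significance threshold, and the prior odds that a tested hypothesis is true. A sketch with illustrative numbers:

```python
def ppv(alpha, power, prior_odds):
    """Positive predictive value of a significant finding (Ioannidis 2005):
    true positives / (true positives + false positives), in odds terms."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# Well-powered test of a reasonably plausible hypothesis: most discoveries hold up.
print(ppv(alpha=0.05, power=0.8, prior_odds=0.1))   # ~0.62
# Underpowered fishing in a long-shot field: most "findings" are false.
print(ppv(alpha=0.05, power=0.2, prior_odds=0.05))  # ~0.17
```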

It would appear that Theodore Roosevelt was not entirely correct.  The critic should, in fact, count a great deal.

[Cross-posted at Signal/Noise]


Book Blegging

Loyal Duck readers, I was hoping you might be able to help me out.

Do you have any recommendations for books about the inventive ways that people (scientists, designers, business folk, etc) have evaluated hard-to-test subjects? I am looking for something that is less about methodology per se and more about testing ideas in a practical way where either the environment or the subject matter makes testing difficult (thinking here of astrophysics, for example). I am not looking for a philosophical treatment, but rather a collection of examples highlighting the inventive ways people have gone about testing hypotheses in practice.

For example, I am thinking here of Shapiro’s famous observational test of general relativity (the Shapiro Delay), or the discovery of Neptune.

Hopefully this makes some sense. Any suggestions?

Thanks in advance!


And now for something completely different–fantasy football geekery!

[Cross-posted at Signal/Noise]

It’s that time of year again. The passing of the summer, the start of the fall. Most importantly, it signals the start of that most magical of times–the start of the fantasy football season.

This year I am in three leagues–one for work, one for “stat-geeks”, and one for family and friends. I decided to tweak the way I approached the draft this year and wanted to share a bit of the strategy with readers.

One of the biggest problems with standard pre-draft rankings, particularly those of the big fantasy hosting sites (e.g., ESPN, CBS), is that the rankings are based solely on aggregate measures of performance such as total projected points for the upcoming season. Now, of course the goal is to assemble a team with players that end up scoring lots of points throughout the season, but total points scored ignores the fact that teams compete head-to-head, week-to-week. To make the playoffs, a team has to outscore opponents on a consistent basis, accumulating wins, not just points. That means drafting players that not only score a lot of points, but score a lot of points week in and week out. When it comes to deciding which players to draft, managers would be better off selecting consistent scorers over boom-or-bust players (at least, that is my hypothesis).

Let’s take a look at two hypothetical players:

Over the course of four weeks both players score the same amount of total points. However, Player A is clearly a boom-or-bust player while Player B is more consistent week-to-week. Player A gives you a great chance to win Weeks 1 and 4, but makes it much harder to win in Weeks 2 and 3. On the other hand, Player B is the model of consistency, giving you a great chance to win each week. On most pre-draft rankings, Players A and B will look like equally valuable picks, but this is misleading.
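The comparison can be sketched with made-up weekly scores (these are purely illustrative, not from any real players): identical totals, very different week-to-week reliability.

```python
import statistics

# Hypothetical four-week scoring lines: both players total 60 points.
player_a = [25, 5, 5, 25]    # boom-or-bust
player_b = [15, 15, 15, 15]  # consistent

assert sum(player_a) == sum(player_b) == 60

# Week-to-week variability is what aggregate rankings miss.
print(statistics.pstdev(player_a))  # 10.0
print(statistics.pstdev(player_b))  # 0.0
```

Any aggregate projection treats these two as interchangeable; only a dispersion measure separates them.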

This year, I decided to see whether a player’s penchant for boom-or-bust performances was at all consistent and predictable. The initial answer seems to be yes.

I developed two metrics: one to evaluate high-scoring consistency, and one that combines predicted points with scoring consistency. The first, ConBoom, measures, weights, and then combines the number of times a player scored >=20 points, >=15 points, and <10 points per game over the course of a season. This is the foundation of the consistency metric. ConRank combines a player's ConBoom score with a weighted measure of that player's predicted total points for the upcoming season. (How am I weighting each component of the measures? I can't reveal the entire secret sauce, now can I?)

I validated the measures against the past three years of actual player data and found that ConBoom scores from one year were highly correlated with ConBoom scores the next year (.70). I am going to use the new metric to guide my drafts in all three leagues and essentially test how my teams fare against other teams over the course of the season. With data and predictions for every player, I'll be able to test the method over three league scoring systems, 32 teams, and 256 games, as well as validate the measure's predictive attributes over another year. Draft number one is tonight. Let the games begin!
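The shape of the ConBoom/ConRank idea can be sketched in code. The actual weights are the post's secret sauce, so the weights below (`w20`, `w15`, `w_bust`, `w_proj`) are placeholder assumptions, not the real ones:

```python
def con_boom(weekly_scores, w20=3.0, w15=1.5, w_bust=-2.0):
    """Reward games at >=20 and >=15 points; penalize games under 10.
    Weights here are illustrative stand-ins for the undisclosed ones."""
    booms_20 = sum(1 for s in weekly_scores if s >= 20)
    booms_15 = sum(1 for s in weekly_scores if s >= 15)
    busts = sum(1 for s in weekly_scores if s < 10)
    return w20 * booms_20 + w15 * booms_15 + w_bust * busts

def con_rank(weekly_scores, projected_points, w_proj=0.1):
    """Combine consistency with a (weighted) projected season total."""
    return con_boom(weekly_scores) + w_proj * projected_points

# A consistent scorer outranks an equal-total boom-or-bust scorer:
steady = [15, 15, 15, 15]
boomy = [25, 5, 5, 25]
print(con_boom(steady))  # 6.0
print(con_boom(boomy))   # 5.0
```

With these placeholder weights, the two busts drag the boom-or-bust player below the steady one despite his two big games, which is the behavior the metric is after.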


The Danger of Data without Theory

I came across this Chris Anderson piece from a 2008 issue of Wired via Ana Andjelic. Anderson argues that in the era of Big Data we no longer need to rely on theory and the scientific method to achieve advances in knowledge:

Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required. That’s why Google can translate languages without actually “knowing” them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

…faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

There is certainly value in sophisticated data mining and an inductive approach to research, but to dismiss the deductive approach (construct theory → deduce testable hypotheses → empirically verify or falsify them) would be shortsighted.

Modern data mining may be enough to authoritatively establish a non-random relationship, and in some cases (translation and advertising) that more than suffices for useful application. However, even the largest data sets still represent only a sample–and, therefore, an approximation–of reality. Moreover, establishing correlation still doesn’t get you to the underlying causal mechanisms that drive the relationship. Even if Google, with enough data and advanced statistical techniques, can claim that a causal relationship exists, it can’t tell you why it exists.

For some subjects, “why” may not matter–do we care why Google’s program is able to accurately translate between languages, or is the practical effect enough for us? But for others it is crucial when thinking about how to construct an intervention to alter some state of being (e.g. a medical condition, poverty, civil war, etc). Understanding causal mechanisms can also help us think through the consequences of an intervention–what are some potential side effects? Are there other, seemingly unrelated, areas that might be affected by the intervention in a negative way? When we are dealing with more interconnected, complex systems (like human physiology or society) it behooves us to go beyond relationships and understand what levers are being pulled.
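A toy simulation makes the correlation-versus-mechanism point concrete: a hidden confounder can make two variables correlate strongly even though neither causes the other, so an intervention on one would accomplish nothing. All numbers here are invented for illustration:

```python
import random

random.seed(0)

# Hidden confounder Z drives both X and Y; X has no effect on Y.
z = [random.gauss(0, 1) for _ in range(10_000)]
x = [zi + random.gauss(0, 0.3) for zi in z]
y = [zi + random.gauss(0, 0.3) for zi in z]

def corr(a, b):
    """Pearson correlation, computed by hand to stay dependency-free."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    va = sum((ai - ma) ** 2 for ai in a) / n
    vb = sum((bi - mb) ** 2 for bi in b) / n
    return cov / (va * vb) ** 0.5

print(corr(x, y))  # high (close to 0.9), yet X does not cause Y
```

A petabyte of (x, y) pairs would sharpen this correlation without ever revealing that the lever to pull is Z, not X.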

[Cross-posted at Signal/Noise]


The Oceanic Conference on International Studies (and Finally, Some Thoughts on Feminist Method)

I spent the last week at the Oceanic Conference for International Studies (OCIS) in Auckland, New Zealand, a conference that was something of a luxury for me in that I had no leadership responsibilities, and got to be in a beautiful city I’d never been to literally on the other side of the world.

While zorbing and bungee jumping would normally be the highlight of such a trip, actually, the Conference was. I didn’t know what to expect at first; after all, this was a regional conference far outside of my region – would there be any work of interest to me? I was pleasantly surprised on a number of levels, however. First, I learned a lot about things I didn’t know about. The conference opening address taught me about the history and current struggles of the Maori people in New Zealand. I got to hear a number of interesting empirical presentations about Pacific life and cultures. Second, this was hands-down the best run conference I have ever been to (and I am counting the several I have run) – the organization was perfect (literally), the accommodations were excellent and affordable, the conference facilities were amazing, adequate networking breaks were provided, and there were totally rocking (frequent and yummy) meals, snacks, and (gasp) cakes.

The thing I really got out of OCIS, though, was the panels. I don’t get to go to a lot of panels at ISA, or even at APSA sometimes – because I am frequently doing organizational work, service stuff, and professional development panels and stuff – so I’m not saying the panels @ OCIS were better than the other ones that I don’t really get to see (though I think they might have been). That said, I attended ten panels over three days (fine, nine, if you don’t count the one I was on). Highlights included: Jacqui True’s talk about gender and norms, Megan MacKenzie’s talk about the role of the family in security, Ann Tickner’s keynote on an alliance between feminist and postcolonial historical narratives, Spike Peterson’s discussion of intersectionality in feminist IR, Katrina Lee-Koo’s talk on the place of gender in Australian IR, Miranda Alison’s talk on inclusion of women combatants in peace processes, Tulia Thompson’s talk on heterogender in Fiji, talks by Tony Burke and Laura Shepherd on ethics and security, Penny Griffin’s talk on sex and global political economy, Juanita Elias’ presentation on the incorporation of feminist agendas, and Ruth Jacobsen’s discussion of the challenges of feminist data collection. The list could be longer. These presentations (and others) were high-quality and very interesting. OCIS left me not only with a reinvigorated enthusiasm for the research program that I am a part of but also with a renewed sense of desire to tackle its hard questions.

It is in that spirit that I now return to the discussion of feminist method that Patrick started a couple of weeks ago. I never ended up posting mostly because I was incredibly busy, but also because posting at that time would have required me to work through some of the methodological difficulties I had been struggling with recently, a place of discomfort for me. But the question of “is there a feminist methodology?” or (more how I would put it) “what is the appropriate methodological approach for feminist work in IR?” is an important one, and I’ll share some thoughts.

First, there is not one feminism in IR – there are diverse feminisms – liberal ones, constructivist ones, critical ones, postcolonial ones, poststructuralist ones, postmodern ones, marxist ones, etc. Those different “takes” on feminisms do have different methodological outlooks based on different epistemological assumptions.

Second, I’ve described several times that I see feminist method as a journey – one of observation, critique, revealing, reformulation, reflexivity, and action, guided by feminisms’ principles. While none of those steps are necessarily unique to feminist research, perhaps that they are linked or how they are linked is particularly feminist. If methodology is the intellectual process guiding reflection on epistemological assumptions, ontological perspective, ethical responsibilities, and method choices (Ackerly/Stern/True 2006, p.8), Patrick is right that reflexivity is a substantial part of and contribution to how to “think like a feminist” – but there’s more to it, I think.

Third, then, “thinking like a feminist” in IR, I think, has a number of key elements, which I’ll gloss over here and discuss in more detail if any readers are interested. Those include:

1) Many feminisms share with (other) critical approaches an understanding that the relationship between the knower and the known is fundamental to knowledge, and that therefore all knowledge is political, social, contextual, and intersubjective. Feminisms add, however, that a crucial part of the position(s) of knowledge(s) is their position along gendered hierarchies of social and political thought.

2) With (some) other scholars, (many) feminisms see knowledge as personal, theory as practice, and reflexivity as a key part of the research process. Unlike (most) other scholars, feminisms view that reflexivity through gendered lenses, where, in Jill Steans’ words (1998, p.5), “to look at the world through gender lenses is to focus on gender as a particular kind of power relation” and therefore to trace out the ways in which gender is not only central to understanding international processes, but the ways in which gendered assumptions shape our research on gendered assumptions.

3) Like (some) development scholars, feminist researchers have recognized a difference between power-over and empowerment. Feminists, however, both understand the key role that gender has in that distinction, and that it is to be analyzed not only in the world “out there” but also in the discipline.

4) As some feminists have recently discussed (like a 2010 forum in Politics & Gender), feminist work is likely to look to deconstruct the quantitative/qualitative divide in IR rather than taking a “side” in it – feminist stakes in epistemology and method (where they exist) are about the purpose of the tools (in service of discovering and deconstructing gender hierarchies) rather than what the tools are. That is (oversimplified, of course), it is a question of positivism/post-positivism instead of a question of quantitative/qualitative. While not all feminists/feminisms would agree, I would argue that quantitative methods could be used effectively to serve (postpositivist, epistemologically skeptical) feminist ends.

5) This does not mean that most feminisms see utility in the use of gender as a variable (which usually means “sex” as a variable in practice in the literature). Feminist questions above all inspire feminist methods. Feminist questions often ask “how do masculinities and femininities define, constitute, signify, cause, reproduce, and become reproduced by x?” rather than “are women more x?” – the former question is one about gender hierarchy; the latter is an essentialized approach to sex that assumes that gender hierarchy either does not exist or is irrelevant to answering the question. (Most) feminisms in IR prefer the first.

6) Though there are no essential tools for feminist IR, there are a group of tools feminisms have found useful: dialectical hermeneutics, ethnography, critical discourse analysis, in-depth case studies, feminist interviewing, and other tools. Good feminist method essays (like Brooke Ackerly’s in Audie Klotz and Deepa Prakash’s edited volume) and books (Ackerly/Stern/True 2006, and Ackerly/True 2010) expand on these ideas.


Methodology Lessons: DOE’s Natural-gas Overstatement

[Cross-posted at Signal/Noise]

The Wall Street Journal reported yesterday that the US Department of Energy is set to restate the data it collects on U.S. natural-gas production. The reason? The Department has learned that its methodology is seriously flawed:

The monthly gas-production data, known as the 914 report, is used by the industry and analysts as a guide for everything from making capital investments to predicting future natural-gas prices and stock recommendations. But the Energy Information Administration (EIA), the statistical unit of the Energy Department, has uncovered a fundamental problem in the way it collects the data from producers across the country—it surveys only large producers and extrapolates its findings across the industry. That means it doesn’t reflect swings in production from hundreds of smaller producers. The EIA plans to change its methodology this month, resulting in “significant” downward revision.

The gap in output between what the 914 report has been predicting and what is actually occurring has been growing larger and larger. Many analysts have long suspected the methodology underlying the reports was faulty, but the EIA has been slow to revise it. The overestimation of output has depressed prices, which are now at their lowest in seven years. Any revision to the methodology will bring about a “correction” in energy markets, and particular states will surely see their output dip significantly.

So what can we learn from this from a methodological perspective? A few things:

  1. How you cast the die matters: The research methodology that we employ for a given problem significantly impacts the results we see and, therefore, the conclusions we draw about the world. The problem with the DOE’s 914 report wasn’t simply a matter of a bad statistical model; it was the result of unrepresentative data (i.e., relying only on the large producers). This isn’t simply an issue of noisy or bad data, but of systematic bias as a result of the methodology employed by the EIA. The data itself is seemingly reliable. The problem lies with the validity of the results, caused by the decision to systematically exclude small producers and potentially influential observations from the model.
  2. Representativeness of data doesn’t necessarily increase with the volume of data: More than likely the thought went that if the EIA collected data on the largest producers, their extrapolations about the wider market would be sound–or close enough–since the largest players tend to account for the bulk of production. However, as we see with the current case, this isn’t necessarily true. At some point in history this methodology may have been sound, but it appears that changes to the industry (technology, etc.) and the increased importance of smaller companies have rendered the old methodology obsolete. Notice that the EIA’s results are probably statistically significant, but achieving significance really isn’t that difficult once your sample size gets large enough. What is more important is representativeness–is the sample you’ve captured representative of the larger population? Many assume that size and representation are tightly correlated–this is an assumption that should always be questioned and, more importantly, verified before relying on the conclusions of research.
  3. Hypothesis-check your model’s output: The WSJ article notes that a number of independent analysts long suspected a problem with the 914 reports by noticing discrepancies in related data. For example, the 914 report claimed that production increased 4% in 2009. This was despite a 60% decline in onshore gas rigs. If the 914 report is correct, would we expect to see such a sharp decline in rigs? Is this logically consistent? What else could have caused the 4% increase? The idea here is to draw various hypotheses about the world assuming your conclusions are accurate and test them–try to determine, beyond your own data and model, whether your conclusions are plausible. Too often I’ve found that businesses fail to do this (possibly because of time constraints and less of a focus on rigor), but academics often fall into the same trap.
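Point 2 can be made concrete with a toy simulation (all figures invented): extrapolating from large producers alone misses a decline concentrated among small producers and overstates industry output.

```python
# Invented industry: 50 large producers, 950 small ones.
large_last = [100.0] * 50
small_last = [3.0] * 950

# This period: large producers grow 4%, small producers fall 30%.
large_now = [p * 1.04 for p in large_last]
small_now = [p * 0.70 for p in small_last]

true_total = sum(large_now) + sum(small_now)

# Survey-only-the-large methodology: apply the large producers'
# growth rate to last period's industry-wide total.
growth_large = sum(large_now) / sum(large_last)
estimate = (sum(large_last) + sum(small_last)) * growth_large

print(true_total, estimate)  # the estimate overstates true output
```

More large-producer observations would tighten the estimate of `growth_large` without reducing the bias at all; only sampling the small producers fixes it.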


Think like a methodologist

[Cross-posted at Signal/Noise]

In keeping with Patrick’s theme on methodological discussions, I thought I would cross-post this recent piece from my personal blog.

Nathan at Flowing Data puts words to an idea I’ve had for a while, but could never figure out how to communicate. He writes, “[T]he most important things I’ve learned [in statistics courses] are less formal, but have proven extremely useful when working/playing with data.” Some of the lessons learned include:

  • [T]rends and patterns are important, but so are outliers, missing data points, and inconsistencies.
  • [I]t’s important not to get too caught up with individual data points or a tiny section in a really big dataset.
  • [D]on’t let your preconceived ideas influence the results.
  • The more you know about how the data was collected, where it came from, when it happened, and what was going on at the time, the more informative your results and the more confident you can be about your findings.
  • [A]lways ask why. When you see a blip in a graph, you should wonder why it’s there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper.
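The "always ask why" habit can even be partially automated. A minimal sketch (the threshold and data are arbitrary) that flags the blips worth asking about, using the robust median-absolute-deviation rule rather than the mean, so the outliers don't mask themselves:

```python
def flag_outliers(series, k=3.0):
    """Return indices of points more than k median-absolute-deviations
    from the median. These are candidates for 'why is that there?'"""
    vals = sorted(series)
    med = vals[len(vals) // 2]
    mad = sorted(abs(v - med) for v in series)[len(series) // 2]
    if mad == 0:
        return []  # no spread at all; nothing to flag
    return [i for i, v in enumerate(series) if abs(v - med) > k * mad]

data = [10, 11, 9, 10, 12, 48, 10, 11]  # one suspicious blip
print(flag_outliers(data))  # [5] -- now go ask why it's there
```

The code only finds the blip; the methodological work (is it a data-entry error, a real event, a collection artifact?) still has to be done by a human.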

The point is that regardless of whether you are formally trained in and choose to leverage sophisticated statistical methods, there is a great deal to be gained by thinking like a statistician. I would actually go further here and say that the statistician part is somewhat beside the point. Thinking like a methodologist is the key.

I would agree with Nathan that the most translatable skills that I learned in graduate school are methodological in nature. More specifically, there isn’t a particular technique that is most useful, but rather a mode of thinking that allows me to approach problems in a rigorous fashion. Now, rigorous does not have to equate to statistical. Rather, it encompasses all the various methods by which we try to separate wheat from chaff, fact from fiction.


Explaining, broadly understood

So Charli — who did, in fact, attend our ISA Battlestar Galactica panel in costume — has posted elsewhere about the panel. Though she calls it the “best event [she] attended at ISA this year,” she does express some disappointment about the content of the panel when compared with the panel’s title:

Unfortunately the panel turned out to be misnamed however, for none of the papers really spoke to the question of whether BSG has an impact on actual world politics. … Admittedly, the papers weren’t really trying to do that kind of explanatory work – the panel really was misnamed – so this isn’t a criticism as much as an observation.

Certainly, as one commentator over at LGM has already pointed out, there’s sufficient ambiguity in the word “explains” that it need not quite mean “influences” or “impacts.” But what’s intriguing to me, methodologically speaking, is that Charli, along with not a few others in our field, tends to go directly from “explain” to “impacts,” and in particular, to “exerts an independently-measurable causal impact on.” That says something revealing about the field, and it might just help to explain why the study of popular culture continues not to make a lot of headway in the disciplinary mainstream.

Some definitions first. In Charli’s formulation, for BSG to “explain” world politics, the show or the viewing of the show has to function as an independent variable, which means that in order for it to matter it has to be shown to be correlated in some relatively robust way with a measurable outcome. In that way, BSG could “explain” observed variance, perhaps best visible if we compared a BSG-watching-and-discussing national or international security bureaucracy with a non-BSG-watching-and-discussing one, or a BSG-watching public with a non-BSG-watching one. This does set up some tricky measurement problems — how do we count watching and discussing, how do we specify the intervening steps between the show and the outcome — and raises the specter of spurious correlation all over the place, but in principle there’s nothing altogether impossible about conducting a study like this. The challenge would be to isolate the impact of BSG, which would probably require a more elaborate theoretical account of human perception than we have at present; it’s one thing to see people in their office cubicles discussing BSG and then writing counter-terrorism policies, and another to demonstrate that their discussing BSG exercised an independent effect on the subsequent policies.

[And parenthetically, even if BSG were subsequently cited — either in the final policy statement, or in interviews after the fact — that would tell us precisely nothing except that some people who participated in the process thought and think that BSG was important. That doesn’t make them right for thinking so. If, hypothetically, BSG was referenced when the policy was concluded and presented, we’d have a heck of a time determining whether this was a strategic rhetorical move designed to appeal to the audience (think of Reagan’s use of the phrase “star wars” here) or some kind of an expression of a sincere belief. Similarly, if people interviewed later on cited BSG, or 24, or some other TV show as influential in their thinking, we’d have a heck of a time distinguishing between a) strategic re-presentation of events in the light of contemporary concerns; b) mis-remembering what happened, perhaps benignly and perhaps more craftily; and c) actual influence. So that kind of primary source evidence isn’t enough, in any case, to establish that BSG had an effect; instead, we need some hypothesis about what viewing and discussing BSG would imply behaviorally, so that we could look for the appropriate kinds of indicators not in what people said about why they did what they did, but in what they actually did.]

But fortunately for the rest of the panel, “explain” doesn’t just mean this kind of neopositivist methodological strategy. Indeed, it’s even something of a misnomer to say that an independent variable “explains” an outcome; strictly speaking, the whole statistical model explains the outcome, and the various independent variables participate in or contribute to the explanation — or, perhaps, they explain a certain portion of the observed variance. So if we wanted to be precise here, even if we were following Charli’s neopositivist implicit definition of explanation, we would have called the panel “How Battlestar Galactica Contributes to the Explanation of World Politics.” This inelegant reformulation does, however, permit a number of things to fit more easily underneath its umbrella, including the paper that I presented on that panel on “Battlestar Galactica as Methodology” in which I suggested that social scientists could learn something from BSG about the way we construct our explanations of world politics. That would not be BSG as causal factor — truth to tell, I think that the measurement and conceptualization problems with the BSG-as-IV thing that Charli proposes are actually insoluble, and pretty much everyone doing audience-reception studies agrees that drawing an independent causal connection between something that you see on TV and something that you subsequently do is pretty much impossible — but BSG as inspiration, BSG as an exploration of a certain ideal-typical formulation of values as they confront a variety of fantastic empirical situations, and BSG as a model of how social scientists ought to analyze concrete cases in terms of how the logical implications of a given set of values are or are not realized in practice. (My remarks on the panel were recorded; podcast here, if anyone wants to check them out.)

But the bigger issue here is not whether Charli or I are correct about what to do with BSG, especially since I don’t think that there’s anything illegitimate about either of these methodological strategies. The issue is that we don’t have a very good lexicon in IR for discussing what to do outside of the hypothesis-testing, IV-DV kind of explanation that Charli is proposing. I worry about that for a lot of reasons, not least because I think that a claim about the independent causal impact of popular culture is kind of foredoomed to failure, and I like studying popular culture (among other things) because I think it does matter — but it “matters” in a different way. I have written a book trying to address this situation, and I want to do something else about it too: a new department here at the Duck that I’m calling “Methodology411.” I will post something launching that department sometime in the next couple of days, when I will also explain where the name comes from.

But for the moment, let me just post an excerpt from an e-mail exchange that Charli and I had about a slightly different point concerning methodology and BSG, by way of showing that the number of methodological issues involving the study of things like popular culture is actually quite a large one, and gets larger once you plunge into actual empirics. Food for thought, and perhaps, for subsequent discussion.

Charli: I have been thinking more about your argument re. humanity using the example of the moment with Helo. The more I think about it the more I don’t buy it. I remember the problem with that scene was that Helo is in a love relationship with a Cylon and his child is Cylon. I think this really dilutes the moral strength of his argument as he appears in the episode to simply be promoting his own self-interest – not arguing in favor of sacrificing self-interest on moral ground to include “the other.” For him, Cylons are part of the “in-group” already, so it’s not exactly a hard case…

PTJ: Your analysis here makes two unfortunate conflations — conflations that put your objections on a page quite different from my original claim [from the panel]. Substantively, you conflate an analysis of Helo’s motives with an analysis of the social meaning of Helo’s actions, and methodologically, you conflate an analysis of the characters in BSG with an analysis of BSG as a whole narrative product.

Substance first. In your objection you have subtly shifted the question from the terms of Helo’s argument to the reasons that he might have for making it. This is ‘reductionism’ in the precise terms that Ken Waltz meant it: you take the social and reduce it to the individual, as though social action were merely the aggregate of individual behavior. But social action is meaningful, which means — by definition — that it includes a component that is not reducible in this way, since it involves shared sensibilities. ‘The social’ is not just a bunch of individuals, in the formulation I’m using; logically speaking, the social *precedes* the individual, and in fact ‘the individual’ is a site or a node in a structured social network. Why one individual does one particular thing, and how that thing was possible in the first place, are different kinds of questions, and they don’t preclude one another; Helo might indeed have been motivated by the kinds of concerns that you suggest motivated him, but that is strictly speaking irrelevant to my argument. ‘Sincerity’ is not an empirical phenomenon, but a normative judgment, and it doesn’t matter what Helo was really thinking; what matters is what he said (the socially sustainable vocabulary that he used), how people made sense of what he said (which might involve *their* judgments on his sincerity, but those would tell us precisely nothing at all about whether or not Helo really was or really was not sincere), and what resulted from his intervention. And in those terms, what’s fascinating about Helo’s action is that it might have involved a claim that the non-human Cylons had, in some sense, humanity — that they were what Orson Scott Card (in his brilliant novel Speaker for the Dead) would call “ramen,” or humans of another species. Motive doesn’t matter; socially meaningful action does.

Now, methodology. You seem to be focusing on this scene in isolation from the rest of the text, as though BSG as a whole could and should be treated as a novel set of empirics to be analyzed in the same way that we might analyze the historical events of our ‘regular’ world politics. But BSG is, of course, *fictional*, which to me means that it has a necessarily world-constituting quality that actual historical events don’t necessarily have. When we analyze world politics, when we are engaged in producing social-scientific accounts of world politics, we are of necessity treating world politics as the raw material from which we produce our accounts. There’s no ‘authorial intent’ to worry about, and no pre-existing plotline save the one that our methodology and our analysis helps to disclose. BSG is different: not because what Ron Moore and company think about BSG should necessarily be controlling for our interpretations, but because the story is a *story* and as such has a plot of its own. Ignoring that — which is what modern-day realists do when they mis-read Thucydides as advocating the Athenian philosophy “the strong do what they can and the weak suffer what they must,” since Thucydides then goes on to depict the tragic consequences of this bit of hubris as the book unfolds — is problematic. When I say that BSG’s value-system is critical humanist, I do not mean that any particular character is a critical humanist (although some are or become so, including Helo, and this scene is a pivotal moment when he consolidates that position), but that the series as a whole expresses critical humanism. That expression is sometimes found in dialogue, but it is more often found in plot and outcome, especially the show’s relentless demolition of every particular definition of ‘the human’ and its replacement by something even more encompassing — culminating, of course, in the revelation of ‘mitochondrial Eve’ as a Cylon/Colonial hybrid.


Randomized Controlled Trials: Just Abstain

A new study (summarized here; full text available here if you’re lucky enough to be at an institution with a medical school and an institutional subscription to the relevant journal) released yesterday purports to have some implications for sex education policy. In what the authors — or the publicists at the University of Pennsylvania, at any rate — refer to as the first randomized controlled study of the effectiveness of various kinds of interventions, the abstinence-only intervention proved more effective at encouraging teens to delay sexual activity than safe-sex or abstinence-plus-safe-sex programs. The numbers aren’t overwhelming — 33.5% of adolescents reported having sex in the 24 months following their participation in the abstinence-only program, while 48.5% of the students in other programs reported having had sex during that period of time — but they look compelling.

At least, they look compelling at first glance. Despite the authors’ own admirable cautionary notes regarding the need for further research before any policy implications can be solidly grounded, the pundits seem to be lining up as expected, and deploying the results in a decontextualized manner: the researcher at the Heritage Foundation who wrote the federal guidelines for funding abstinence-only programs is (big shock here) pleased that the study validates what he always maintained, while the critics of abstinence-only (in an equally big shock) deny that the study validates the programs that they oppose. Politicized science, indeed.

The problem here is that both sides of the political discussion appear to fundamentally misunderstand the methodology involved in a study like this — and this misunderstanding permits them to draw erroneous conclusions about what the results actually mean. This is a little more serious than “correlation is not causation,” although it begins there; in fact, the issue is more like “662 African American students from four public middle schools in a city in the Northeastern United States are not a laboratory.” As Nancy Cartwright (among many other philosophers of science) has pointed out, the fundamental error involved in the interpretation of randomized controlled trials (RCTs) is that people mis-read them as though they had taken place under controlled conditions, when they actually did not; in consequence, generalizing beyond the specific trial itself is a process fraught with the potential for error.

Consider, for a moment, what makes a laboratory trial “work” as a way of evaluating causal claims. If I want to figure out what chemical compound best promotes longevity in fruit flies or mice, the first thing I do is to make sure that my entire stock of experimental subjects is as similar to one another as possible on all factors that might even potentially affect the outcome (a procedure that requires me to draw on an existing stock of theoretical knowledge). Then I work very hard to exclude things from the laboratory environment that might affect the outcome — environmental factors, changes in general nutrition, etc. And when conducting the trials, I make sure that the procedures are as similar as humanly possible across the groups of experimental subjects, again drawing on existing theory to help me decide which variations are permissible and which are not. All of this precise control is made possible by the deliberately artificial environment of the laboratory itself, and at least in principle this precise control allows researchers to practically isolate causal factors and their impact on outcomes.

Now, the problem is that the actually-existing world is not a laboratory, but a much more open system of potential causal factors interacting and concatenating in a myriad of ways. Scientific realists like Cartwright bridge the gap between the laboratory and the world by presuming — and I stress that this is a theoretical presumption, not an empirical one — that the same causal factors that operated in the laboratory will continue to exert their effects in the open system of the actual world, but this most certainly does not mean that we will observe the kinds of robust correlations in the actual world that we were able to artificially induce in the laboratory. Hence, what is required is not a correlation analysis of actual empirical cases, but detailed attention to tracing how causal factors come together in particular ways to generate outcomes. (Sections 1.2 and 1.3 of this article provide a good, if somewhat technical, account of the conceptual background involved.)

So causal inference goes from the controlled laboratory to the open actually-existing world, and we can make that move precisely to the extent that we presume that objects in the lab are not fundamentally different from objects in the world. The problem with an RCT is that it turns this logic completely on its head, and seeks to isolate causal factors in the actual world instead of in the laboratory, and as evidence of causation it looks for precisely the kind of thing that we shouldn’t expect in an open system: namely, robust cross-case correlations. Following 662 students from four middle schools over a period of several years is in basically no significant respect anything like putting 662 mice on a variety of diets in a lab and seeing which groups live the longest; the number of potentially important factors that might be at work in the actual world is basically a countably infinite quantity, and we have precisely no way of knowing what they are — or of controlling for them. No lab, no particular epistemic warrant for correlations, even robust ones; they might be accidental, they might be epiphenomenal, heck, they might even be the unintentional result of sampling from the tail-end of a “crazy” (i.e., not a normal) distribution. All the technical tricks in the world can’t compensate for the basic conceptual problem, which is that unless we make some pretty heroic assumptions about the laboratory-like nature of the world, an RCT tells us very little, except for perhaps suggesting that we need to conduct additional research to flesh out the causal factors and processes that might have been at work in producing the observed correlation. In other words, we need better theory, not more robust correlations.
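The “crazy distribution” point is easy to make concrete with a quick simulation. The sketch below (an illustrative toy with invented parameters, not the study’s data) draws two groups from the very same heavy-tailed distribution, so there is no causal difference whatsoever, and compares the worst spurious gap between group means to what a normal distribution produces at the same sample size:

```python
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

def worst_spurious_gap(sampler, n=300, trials=500):
    """Largest gap in group means across repeated trials when BOTH
    groups are drawn from the very same distribution (no causal effect)."""
    return max(abs(mean(sampler(n)) - mean(sampler(n))) for _ in range(trials))

def normal_sampler(n):
    return [random.gauss(0, 1) for _ in range(n)]

def heavy_sampler(n):
    # Pareto with tail index ~1.1: finite mean, but infinite variance
    return [random.paretovariate(1.1) for _ in range(n)]

gap_normal = worst_spurious_gap(normal_sampler)
gap_heavy = worst_spurious_gap(heavy_sampler)
print(f"worst spurious gap, normal world:       {gap_normal:.2f}")
print(f"worst spurious gap, heavy-tailed world: {gap_heavy:.2f}")
```

The exact numbers depend on the seed; the point is that with a fat enough tail, large group differences appear routinely with no causal effect at all, which is why a robust-looking correlation carries little warrant absent assumptions about closure and distribution that the open world need not satisfy.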

The limitations of RCTs can perhaps be even more clearly grasped if we think about the marvelous machine that is organized Major League Baseball: 30 teams playing 162 games each over the course of each six-month-long season, and doing so under pretty rigorously controlled conditions. Indeed, MLB is a kind of approximate social laboratory, where players are required to perform basically similar actions over and over again; pitchers throw thousands of pitches a season, batters can have hundreds of plate appearances, and so on. And over everything is a bureaucracy working to keep things homogeneous when it comes to the enforcement of rules. It’s not a perfect system — “park effects” on pitcher and batter performance are measurable, and sometimes players cheat — but on the whole it’s a lot closer to a closed system than four middle schools in the Northeastern United States. But even under such conditions, there are prediction failures of epic proportions, as when a team pays a great deal of money to acquire a player who previously performed very well (cough cough Yankees acquiring Randy Johnson cough cough) only to discover that some previously-unaccounted-for factor is now at work preventing their performance from reaching its previous heights. Or there are celebrated examples like the more or less complete collapse of a previously elite player like Chuck Knoblauch when moving from small-market Minnesota to huge-market New York — something that looked like a very robust correlation between a player and his performance turned out to be in part produced by something hitherto unknown. It works in reverse too: players who did badly someplace resuscitate their playing careers after signing with different teams, and there is precisely no perfect system for predicting which players will do that under which conditions.

My point is that if the laboratory-like environment of MLB doesn’t produce generally valid knowledge that can survive the transplantation of players from team to team — in effect, if the results of previous laboratory trials are at best imperfect predictors of future laboratory trials, and we can only determine in retrospect how good a player was by looking at his overall playing career statistics — what hope is there for an RCT study conducted under much more uncontrolled conditions? At least in baseball one can say that all of the trials take place under similar conditions, and given the absurdly large n that can be worked with if one aggregates performance data from multiple seasons of play, it is possible to develop probabilistic forecasting models that have some reasonable chance of success on average. But the practical condition of this kind of operation is the approximate closure of the environment produced by the organization of the MLB season; this is not merely a quantitative convention, but an actual set of social actions. In the absence of such practical conditions, a robust correlation counts for very little, and seems like a very thin reed on which to base public policy decisions.
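Part of this prediction problem is purely statistical: regression to the mean. If observed performance is stable talent plus season-to-season noise, then the player with the best past season was probably also lucky, and his next season will usually look worse even though nothing about him changed. A toy simulation (illustrative numbers of my own invention, not real baseball data) makes the point:

```python
import random

random.seed(1)

def regression_rate(n_players=200, trials=500):
    """Fraction of simulated worlds in which the best past performer
    declines the following season, even though no player changes at all."""
    declines = 0
    for _ in range(trials):
        # stable "true talent" plus season-to-season noise (toy numbers)
        talent = [random.gauss(0.260, 0.015) for _ in range(n_players)]
        past = [t + random.gauss(0, 0.020) for t in talent]
        star = max(range(n_players), key=lambda i: past[i])
        next_season = talent[star] + random.gauss(0, 0.020)
        if next_season < past[star]:
            declines += 1
    return declines / trials

print(f"best past performer declined in {regression_rate():.0%} of simulated seasons")
```

Selecting on past performance selects on luck as well as talent, so even in MLB’s quasi-laboratory some apparent “collapses” need no substantive explanation at all.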

Again, what we need is better theory, not better correlations. How do sex education programs work? What kind of processes and mechanisms are involved in educating an adolescent about sex, and how do those work together in the actual world to generate case-specific outcomes? That’s the basis on which we ought to be having this discussion — and, not incidentally, the discussion of every other public policy issue that we mistakenly expect correlation studies conducted in the open system of the actual world to help us puzzle out. A robust correlation is neither necessary nor sufficient for a causal claim, and until we accept that, we will never avoid this kind of gross mis-use of scientific results for partisan purposes.


Open-ended vs. Scale Questions: A note on survey methodology

Aaron Shaw had an interesting post at the Dolores Labs blog last week that examined how using different question scales in surveys can elicit very different responses:

You can ask “the crowd” all kinds of questions, but if you don’t stop to think about the best way to ask your question, you’re likely to get unexpected and unreliable results. You might call it the GIGO theory of research design.

To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in Crowdflower’s labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.

Shaw asked the same question to both samples but altered the scale of the available answers:

Low Scale Version:
About how many hours do you spend online per day?
(a) 0 – 1 hour
(b) 1 – 2 hours
(c) 2 – 3 hours
(d) More than 3 hours

High Scale Version:
About how many hours do you spend online per day?
(a) 0 – 3 hours
(b) 3 – 6 hours
(c) 6 – 9 hours
(d) More than 9 hours

He found a (statistically) significant difference between the responses to the high- and low-scale versions of the question. More specifically, more people reported spending more than 3 hours online per day when presented with the high-scale question, while more people exposed to the low scale reported spending less than 3 hours online per day. What accounts for this? Shaw hypothesizes that it is the result of satisficing:

[…] it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we’re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don’t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn’t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.

These results illustrate a sticky problem: it’s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.

It’s an important point: how you ask a question can have a significant impact on the answers you get. Put another way, you need to pay as much attention to the design and structure of your questions (and answers) as to their content.
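One way to see how satisficing could generate Shaw’s result is a toy model (this is my own guess at the mechanism, with invented parameters, not Shaw’s actual analysis): unsure respondents pull their answer toward the midpoint of whatever scale they are shown, so identical underlying behavior yields very different shares reporting more than 3 hours per day.

```python
import random

random.seed(7)

def share_over_three(scale_midpoint, n=10_000, anchor_weight=0.5):
    """Fraction reporting >3 hours/day online when unsure respondents
    pull their answer toward the midpoint of the scale they are shown."""
    over = 0
    for _ in range(n):
        true_hours = max(0.0, random.gauss(3.0, 2.0))  # invented population
        perceived = (1 - anchor_weight) * true_hours + anchor_weight * scale_midpoint
        if perceived > 3.0:
            over += 1
    return over / n

low = share_over_three(scale_midpoint=1.5)   # rough midpoint of the low scale
high = share_over_three(scale_midpoint=4.5)  # rough midpoint of the high scale
print(f"report >3 hrs/day: low scale {low:.0%}, high scale {high:.0%}")
```

The anchor weight and population are pure inventions; vary them and the size of the gap between the two conditions moves, but its direction is always set by the scale, not the behavior.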

A number of commentators chimed in about when it is better to use scale versus open-ended questions. One major advantage that comes immediately to mind is that scale questions don’t require analysts to spend additional time coding answers before commencing with their analysis. While open-ended questions may avoid the issue of satisficing (which I am not convinced they do–respondents could easily reference their own subjective scale or notions), they do place an additional burden on the analyst. For short, small-n surveys this isn’t that big of an issue. However, once you start scaling up in terms of n and the number of questions it can become problematic. Once you get into coding there are all sorts of issues that can arise (issues of subjectivity and bias, data entry errors, etc.). Some crowdsourcing applications like Crowdflower may provide a convenient and reliable platform for coding (as I’ve mentioned before), but at some level researchers will always have to make an intelligent trade-off between scale and open-ended questions.

[Cross-posted at bill | petti]



During any Presidential Administration, there are heated debates, accusations of horrible mismanagement, and political intrigue, but they are actively papered over and downplayed by a powerful White House communications operation dedicated to protecting the image of the President. Once everyone leaves office, however…

It seems the floodgates of insider accounts that “make news” and tell heretofore unknown details about the good old days of the Bush Administration are opening, and the stream of details might be more interesting than most.

Tom Ridge, the first secretary of Homeland Security, has a memoir coming out September 1, and the tease of salacious material is the revelation that the Administration did in fact manipulate the color-coded threat level with political considerations in mind.

Former Treasury Secretary Hank Paulson is also working on his memoir, and there are sure to be others (I’m not going to do the exhaustive list, you get the idea…)

The most interesting of course are those from the principals themselves. Former President Bush is working on a memoir where he revisits the 10 most important decisions from his presidency, focusing on terrorism and how it dominated his time in office.

And of course there’s the revelation that former VP Cheney is also at work, penning a memoir (the old-fashioned way, on legal pads that someone else can type up for him…) where he breaks with Bush on some key issues. Cheney, of course, was famous for deriding those who wrote tell-all books, right up until he started writing one.

With all these memoirs, there will be the obligatory book tour and media appearances on all the major cable TV outlets. These guys need to sell books, so they will lay out some hints of juicy gossip and brilliant insight.

As interesting, I think, is the methodological question of how to use these documents as sources for the upcoming article on decision-making in the Bush White House (what–you don’t have that started yet?). On the one hand, these are valuable primary source documents, the recollections of decision-makers and participants (or at least recollections as told to their ghost-writers/assistants). For scholars writing about the massive shifts in US foreign policy of the first Bush term, it might be useful to include both Bush’s and Cheney’s views on a key decision–information that can easily be gleaned from memoirs.

However, it’s important to be careful about how one uses memoirs. I’m reminded of the exchange between Brooks and Wohlforth and English about the end of the Cold War. In advancing competing arguments about the same events using much of the same evidence, they pick a fight over how to interpret the memoirs of Gorbachev and other high party officials. Each side claims that the memoirs support its argument.

Today, memoirs are about selling books and continuing the image-making process. That said, they still reveal interesting details about a situation that won’t appear in any contemporaneous journalism or even archived memos.

What I’d really like to see–once all the “good” memoirs come out–is a discourse analysis of Bush Administration memoirs. Viewing these books as part of the construction of history rather than attempts at more accurate reconstructions of historic events would be quite the interesting project. Something to file away under the to-do list….


Academics say the darndest things….

From a recent article on social-science methodology:

For example, gravity is a trivial necessary cause of revolution, because gravity is simply always present regardless of whether or not a revolution happens.

Clearly, not everyone in my field is a science-fiction fan.

nb: someone has suggested to me that the authors mean “revolution” as in “Venus revolves around the Sun.” But gravity is certainly not a trivial cause of such revolution; given the context of the article, I’m pretty sure the authors use the term in the “grab the pitchfork and storm the castle” sense….


Some Rambling Thoughts on the Qual/Quant Pseudo-Divide

Perusing Drew Conway’s excellent blog Zero Intelligence Agents in response to his comment on a previous post, I came across this post of his, reacting to Joseph Nye and Daniel Drezner’s recent bloggingheads diavlog on the theory/policy debate.

You can watch the relevant portion above, though Conway has summarized a key point:

Drezner notes that quantitative scholars tend to have a ‘imperialistic attitude’ about their work, brushing off the work of more traditional qualitative research.

To be exact, by “quantitative scholars” Drezner was referring to those who use “statistical methods and formal models” and by “traditional qualitative research” he meant specifically “more historical / deep background knowledge that’s necessary to the policymaker.” Conway goes on to concur:

In some respect I agree. As a student in a department that covets rational choice and high-tech quantitative methods, I can assure you none of my training was dedicated to learning the classics of political science philosophy. On the other hand, what is stressed here—and in many other “quant departments”—is the importance of research design. This training requires a deep appreciation of qualitative work. If we are producing relevant work, we must ask ourselves: “How does this model/analysis apply to reality? What is the story I am telling with this model/analysis?”

I’d been wanting to put in my two cents since I saw this particular bloggingheads, so I’ll just do so now. I think there are three unnecessary conflations here.

First, between qualitative or quantitative methodologies as approaches and specific methods within either of these two approaches. Drezner is comparing large-N statistical studies to historical case studies. But case study research is only one type of qualitative work – not all other types of qualitative work are any more useful for policymakers than large-N statistical studies.

Second, I see a confusion here between qualitative methods as an approach to doing social science and interpretivism as a form of theory (and for that matter, between large-N empirical studies and abstract formal modeling). In his post, Conway equates qualitative methods not with historical descriptive work, but with political theory (or as Conway puts it, political philosophy) and interpretivism. There is a wide continuum of qual methods, some much more scientifically rigorous – that is, focused on description and explanation rather than interpretation or prescription – than others. I also think that there is a similar difference between large-N statistical studies and formal modeling – one relies on data to test theories, the other relies on abstract math and logic and is largely divorced from real-world evidence.

In both cases, I think the imperialism being described above (if any) is really the imperialism of empirical science over pure theory. I think that the imperialism of quantitative methods over qualitative methods must be judged, if it exists, against only qualitative approaches that are actually designed to be scientific. Within that context, you may be surprised how much respect these scholars have for one another’s work – though, perhaps that’s just based on my good experiences collaborating and communicating with quantoids, experiences others may not share.

Third and finally, I think researchers and their methods are being conflated here. Bloggingheads.tv is perhaps most guilty by labeling this clip “quals v. quants” as if these methods are mutually exclusive and as if scholars are defined by the methods they use. (And in fact, I just noticed I did it myself in the previous paragraph with the term “quantoids.”) But most of the doctoral dissertations I see coming out today use mixed-methods — that is, some combination of case studies and statistics. And much qualitative work, including much of my own, is actually quantitative as well. It’s qualitative insofar as I’m studying text data and using grounded theory to generate analytical categories. But it’s quantitative in the sense that I convert those categories (codes) into frequency distributions that tell us something about the objective properties in the text, and in the sense that I use mathematical inter-rater reliability measures to report just how objective those properties are.
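For readers unfamiliar with inter-rater reliability, the standard measure for two coders is Cohen’s kappa, which corrects raw agreement for the agreement the coders would reach by chance alone. A minimal sketch (the codes and the ratings here are hypothetical, invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # chance agreement from each coder's marginal code frequencies
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to ten text segments
a = ["frame", "frame", "actor", "actor", "frame",
     "actor", "frame", "frame", "actor", "frame"]
b = ["frame", "frame", "actor", "frame", "frame",
     "actor", "frame", "actor", "actor", "frame"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```

Here raw agreement is 0.8 but expected chance agreement is 0.52, so kappa lands near 0.58, well below the raw figure; that discounting is exactly the honesty the measure is designed to enforce.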

Anyway, as a self-identified qualitative scholar whose work varies between interpretivism and rigorous social science studies of text (and who therefore is quite conscious of the difference), but who is also quite open to collaborating with quantitative researchers depending on the nature of the problem I’m working on, I hate to buy into a discourse that pigeonholes IR scholars as one thing or another.

Ultimately, I think the distinction Nye and Drezner are really talking about here is not methodological. Rather, it’s between those scholars capable of translating their findings (through whatever method) into language accessible to policymakers, and those who refuse to learn those skills. As I argued once before, perhaps this process of translating is a “methodology” of its own that we should be incorporating into our doctoral curriculum as a discipline.


Being there

In the latest incarnation of the Iraq war issue in the general election, John McCain is criticizing Barack Obama because Obama hasn’t been to Iraq in some time and therefore, McCain argues, isn’t qualified to comment on Iraq policy because he hasn’t “been there” to “see it for himself.”

Rhetorically, it’s a slick move by McCain. Take a widely perceived negative, his support of the war, and turn it into a positive by emphasizing experience and criticizing Obama’s capacity for sound judgment. There was some press speculation that Obama might now need to visit Iraq as a candidate to blunt this line of attack, which plays into McCain’s hands because it means debating the issue on his turf.

This, however, raised a larger issue for me, one with implications not just for the election, but for research methods in the social sciences. Namely, how important is it to be there (or have been there) in order to make an argument and draw a defensible conclusion about a thing? We seem to have a fetish for certain types of experience, thinking it leads to insight about how certain things work. But that doesn’t always seem to be the case.

Take, for example, baseball. You’ll notice that the world of baseball analysts, managers, and team executives is replete with former players who supposedly “know the game,” having been there and played it. For a long time this kind of claim to expertise ruled the day, until the “stat-heads” came along and showed that much of what the “baseball people” thought didn’t quite work that way. Hall of Fame player Joe Morgan is celebrated by some as one of the best baseball commentators for his work on ESPN’s Sunday Night Baseball. He has also inspired a fantastic blog that revels in pointing out how foolish most of his comments are when subjected to statistical analysis. Can Bill James, who never played the game, know more about baseball than someone with a Hall of Fame career?

Back to Iraq and the election—can John McCain really “know more” about the war because he 1) served in the military and 2) has visited Iraq many times when compared to Obama who has 1) not served and 2) visited rarely, and not for some time? Does being there really matter? Can one develop and claim expertise from non-experiential research?

Now, before this becomes a stats vs. anthropology argument (as the baseball analogy might portend), I want to suggest that both McCain and Obama have an important point. It is important to be there, but being there alone does not necessarily mean that your evidence, evaluation, and conclusions are any more valid. I’m reminded of an ISA panel I attended, maybe this year, where a number of critical security scholars were discussing the state of the discipline, and one prominent senior member of the panel talked about how important it was to ‘be there,’ to get the mood of the place, to write from that perspective.

Just being there, however, doesn’t mean that you have greater access to “fact” or “Truth” than anyone else. Take McCain in Iraq. He goes on a CODEL. He meets with select troops, who are probably on their best behavior for the famous Senator. He meets with members of the Iraqi government, who probably ask him for stuff, hoping to work the levers of US political power. He tours a marketplace, with a brigade providing security. There’s no way he can get “out” to see the rest of the country, and there’s no way he can meet with many of the forward-deployed troops out on the FOBs — a more representative sample is simply impossible for him. It’s just too dangerous (and rightly, not worth the risk to him). Is it important that he goes? Sure. Does this mean that his assessment and evaluation of Iraq is fundamentally superior to Obama’s? Not really.

So, when McCain criticizes Obama, and when those in the “field” criticize those back at the desk, and those who played criticize those who haven’t, they have a very important point to make. Being there does shape and deepen your analysis about certain things in certain ways. But not everything, and not always in the most appropriate way. Just because you were there doesn’t mean you saw the whole picture while you were. Just because you were there doesn’t mean you paid attention to the things you later comment on as an expert. Just because you were “there” doesn’t mean that you are able to understand how “there” is now relevant “here.”

In the social sciences, we arbitrate these disputes with our methodology. We ask—what did you do while you were there, in the field? What did you read while sitting in your office? The methodology gives us a standard for what counts as enough knowledge about a thing or place on which to offer meaningful analysis.

In the campaign, it looks like we might have “We’re winning, can’t you see?” vs. “You were wrong then and you’re wrong now.”


© 2020 Duck of Minerva
