I am not known for being a statistics whiz. I have published quantitative work, but I am seen, rightly so, as more comfortable with qualitative work, comparing apples and oranges. Still, I had the gumption to offer advice on Twitter today about data. What and why?
GDELT was a new dataset that seemed to promise heaps of utility to those who wanted to study event data – counts of particular events of interest, handy for analyzing patterns over time. It came under fire recently for a variety of reasons. I did not use the dataset, nor do I work with event data, so I am not in any position to judge the dataset itself.
However, I do have experience working with data that have been criticized. The Minorities at Risk Project was originally an effort to assess which ethnic groups might be at risk of violence. The collection selected groups that were either mobilized or already facing discrimination. As a result, it was not so good for questions about why groups become mobilized or face discrimination, since groups that were neither were left out. For the questions I tended to ask, it was less problematic: which groups at risk tend to get more or less international support (least problematic), which groups at risk were more likely to be secessionist or irredentist (a bit problematic), and which institutions are associated with more or less ethnic conflict (more problematic).
Once the dataset was criticized by some big names, pop. It got harder to publish stuff, as reviewers scoffed at any findings emanating from the dataset. The good news for MAR fans is that this led to an NSF project that funded efforts to address the selection bias problem. The first piece addressing the new dataset has recently been accepted. The second piece is in the works, and now I am back in the business of pondering the relationships between institutions and ethnic conflict (the delays are now my fault, as I have been distracted by other projects).
Anyhow, the relevance of my experience is this: GDELT is now tarnished, which means it will be harder to get stuff published, as reviewers will be harder to convince. The peer review process depends on convincing reviewers of the importance of the question, the soundness of the research design, the quality of the data, the interpretation of the findings, and so on. Given my experience, I expect that using GDELT will be risky if you want publications in the near term. Over time, the problems might be fixed or might turn out not to be that bad. But for now, its reputation is lousy. No, this post is not going to do the dirty work of making its reputation bad; that much has already been achieved. I am just making it clear that what matters is not whether one believes the data to be spiffy but what lies in the minds of reviewers.
So, user beware. Here be dragons.
This blog has become The Duck of Saideman in recent days. I like it.
Perhaps both a strength and a weakness of the GDELT dataset is its enormity. Because it contains so much information on so many different actors in so many different sorts of situations across so many years, a significant amount of subsetting, merging, recoding, and normalizing – basically massaging of the data – is required, which makes it really difficult to speak of any one 'GDELT dataset'. The more familiar datasets in the discipline, such as the ICB, COW, MID, UCDP, and ACLED (note both event and non-event datasets in the list), are well structured and maintained – at most, users subset them by temporal or regional scope, or use only a few of their variables and merge them with unit attribute variables from other datasets (such as horizontal inequality or infant mortality). Reviewers will see these datasets used over and over in largely similar fashion.

But GDELT's automated coding of news reports renders time-series analysis difficult without normalization, since the volume of news coverage itself changes over time; moreover, GDELT's TABARI machine coder regularly errs in georeferencing events, which complicates the massaging of the data even more. To make matters worse, the dataset has so many missing values that its use as a global dataset is questionable regardless of which interpolation or imputation techniques are used.

GDELT's real utility is as a base from which to extract and create subsequent datasets. Interested in refugees and conflict? Subset the dataset to include only acts of material conflict involving refugees. Want to map the number of atrocities committed by paramilitary forces in the Syrian Civil War and determine whether it is significantly different from the number committed by the Syrian military? Subset it to include only atrocities committed by those two groups in Syria. Ultimately, even if GDELT does find a way to shed the controversy and stigma attached to it, it will face an even greater challenge – the same challenge anyone faces in using a self-coded dataset: convincing reviewers and the discipline at large that the data are what they claim to be.
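To make the subsetting-and-normalizing point concrete, here is a minimal sketch in Python/pandas. It assumes a local, header-bearing extract of the GDELT 1.0 event table (the file name gdelt_events.tsv is hypothetical) and relies on the CAMEO conventions that QuadClass 4 marks material conflict and actor type code REF marks refugee groups – treat it as an illustration, not a recipe.

```python
import pandas as pd

# Hypothetical local extract of the GDELT 1.0 daily event table
# (tab-delimited, with a header row added). Column names follow the
# published GDELT 1.0 field list.
cols = ["GLOBALEVENTID", "SQLDATE", "Actor1Type1Code", "Actor2Type1Code",
        "EventCode", "QuadClass", "ActionGeo_CountryCode"]
events = pd.read_csv("gdelt_events.tsv", sep="\t", usecols=cols,
                     dtype={"EventCode": str})

# Subset: material-conflict events (CAMEO QuadClass 4) in which either
# actor is coded as a refugee group (actor type code "REF").
refugee_conflict = events[
    (events["QuadClass"] == 4)
    & ((events["Actor1Type1Code"] == "REF")
       | (events["Actor2Type1Code"] == "REF"))
]

# Normalize: because the volume of news coverage grows over time, raw
# counts are misleading; use each day's share of all recorded events.
daily_total = events.groupby("SQLDATE").size()
daily_refugee = refugee_conflict.groupby("SQLDATE").size()
normalized = (daily_refugee / daily_total).fillna(0)
```

Even this toy example has to confront both problems flagged above: deciding exactly which codes define the subset, and normalizing away GDELT's changing coverage before any over-time comparison means anything.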