I am not known for being a statistics whiz. I have published quantitative work, but I am seen, rightly so, as more comfortable with qualitative work, comparing apples and oranges. Still, I had the gumption to offer advice on Twitter today about data. What and why?
GDELT was a new dataset that seemed to promise heaps of utility to those who wanted to study event data – counts of particular events of interest, handy for analyzing patterns over time. It came under fire recently for a variety of reasons. I did not use the dataset, nor do I work with event data, so I am not in any position to judge the dataset itself.
However, I do have experience working with data that have been criticized. The Minorities at Risk Project was originally an effort to assess which ethnic groups might be at risk of violence. The collection selected groups that were either mobilized or already facing discrimination. As a result, it was not so good for questions about why groups become mobilized or face discrimination, since groups that were neither were left out. For the questions I tended to ask, it was less problematic: which groups at risk tend to get more or less international support (least problematic), which groups at risk were more likely to be secessionist or irredentist (a bit problematic), and which institutions are associated with more or less ethnic conflict (more problematic).
Once the dataset was criticized by some big names, pop. It got harder to publish stuff, as reviewers scoffed at any findings emanating from the dataset. The good news for MAR fans is that this led to an NSF project that funded efforts to address the selection bias problem. The first piece addressing the new dataset has recently been accepted. The second piece is in the works, and now I am back in the business of pondering the relationships between institutions and ethnic conflict (the delays are now my fault, as I have been distracted by other projects).
Anyhow, the relevance of my experience is this: GDELT is now tarnished, which means it will be harder to get stuff published, as reviewers will be harder to convince. The peer review process depends on convincing reviewers of the importance of the question, the soundness of the research design, the quality of the data, the interpretation of the findings, and so on. Given my experience, I expect that using GDELT will be risky if you want publications in the near term. Over time, the problems might be fixed or might turn out not to be that bad. But for now, its reputation is lousy. No, this post is not going to do the dirty work of making its reputation bad; that much has already been achieved. I am just making it clear that what matters is not whether one believes the data to be spiffy but what lies in the minds of reviewers.
So, user beware. Here be dragons.
This blog has become The Duck of Saideman in recent days. I like it.
Perhaps both a strength and a weakness of the GDELT dataset is its enormity. Because it contains so much information on so many different actors in so many different sorts of situations across so many years, a significant amount of subsetting, merging, recoding, and normalizing – basically massaging of the data – is required, which makes it really difficult to speak of any one 'GDELT dataset'. The more familiar datasets in the discipline, such as the ICB, COW, MID, UCDP, and ACLED (note both event and non-event datasets in the list), are well structured and maintained – at most, users subset them by temporal or regional scope, or use only a few of their variables and merge them with unit attribute variables from other datasets (such as horizontal inequality or infant mortality). Reviewers will see these datasets used over and over in largely similar fashion.

But GDELT's automated coding of news reports renders time-series analysis difficult without normalization, since the volume of news coverage itself changes over time; moreover, GDELT's TABARI machine coder regularly errs in georeferencing events, which complicates the massaging of the data even more. To make matters worse, the dataset has so many missing values that its use as a global dataset is questionable regardless of which interpolation or imputation techniques are used.

GDELT's real utility is as a base from which to extract and create subsequent datasets. Interested in refugees and conflict? Subset the dataset to include only acts of material conflict involving refugees. Want to map the number of atrocities committed by paramilitary forces in the Syrian Civil War and determine whether it is significantly different from the number committed by the Syrian military? Subset it to include only atrocities committed by those two groups in Syria. Ultimately, even if GDELT does find a way to shed the controversy and stigma attached to it, it will face an even greater challenge – the same challenge anyone faces in using a self-coded dataset: convincing reviewers and the discipline at large that the data are what they claim to be.
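To make the subsetting-and-normalizing point concrete, here is a minimal sketch in Python/pandas. It assumes a local, header-bearing extract of the GDELT 1.0 event table (the file name gdelt_events.tsv is hypothetical) and relies on the CAMEO conventions that QuadClass 4 marks material conflict and actor type code REF marks refugee groups – treat it as an illustration, not a recipe.

```python
import pandas as pd

# Hypothetical local extract of the GDELT 1.0 daily event table
# (tab-delimited, with a header row added). Column names follow the
# published GDELT 1.0 field list.
cols = ["GLOBALEVENTID", "SQLDATE", "Actor1Type1Code", "Actor2Type1Code",
        "EventCode", "QuadClass", "ActionGeo_CountryCode"]
events = pd.read_csv("gdelt_events.tsv", sep="\t", usecols=cols,
                     dtype={"EventCode": str})

# Subset: material-conflict events (CAMEO QuadClass 4) in which either
# actor is coded as a refugee group (actor type code "REF").
refugee_conflict = events[
    (events["QuadClass"] == 4)
    & ((events["Actor1Type1Code"] == "REF")
       | (events["Actor2Type1Code"] == "REF"))
]

# Normalize: because the volume of news coverage grows over time, raw
# counts are misleading; use each day's share of all recorded events.
daily_total = events.groupby("SQLDATE").size()
daily_refugee = refugee_conflict.groupby("SQLDATE").size()
normalized = (daily_refugee / daily_total).fillna(0)
```

Even this toy example has to confront both problems flagged above: deciding exactly which codes define the subset, and normalizing away GDELT's changing coverage before any over-time comparison means anything.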