Jun 23, 2014
If data wants to be free, then PovcalNet, the world’s leading dataset on global poverty, is happier today: it was recently made available for bulk download by my guests on this week’s Wonkcast, CGD research fellow Justin Sandefur and research assistant Sarah Dykstra. Scraping the data was no easy task: it required devising code that queried the database for one answer at a time, 23 million times, over nine weeks, then reassembling the 8 million resulting data points into a single dataset. They then posted the dataset and a related paper online for the use of researchers around the world.
Justin and Sarah tell me that they were motivated to scrape the PovcalNet website in part because they needed the full dataset for their own research, and in part because they knew other researchers had a similar need. Lacking the full dataset, they and others previously had no option but to spend hours pointing and clicking, one number at a time, to get the specific information they needed. (The code needed to run the queries was beyond what we could manage here at CGD, so the pair turned to Sarah’s brother, independent programmer Benjamin Dykstra.)
Since individual data points were already online, albeit not in a readily accessible format, the project involved no “hacking.” I ask whether they first tried simply asking the World Bank for the dataset. Justin explains that “...the underlying raw data isn’t even available to many researchers within the Bank.”
I say that this surprises me, especially given the Bank’s open data policy.
“There’s a lively internal debate in the World Bank about whether or not this data should be public,” Justin tells me. “But not all data that the World Bank has are covered by the open data policy…it was pointed out to us that PovcalNet is not.”
The value of having the full dataset publicly available became evident soon after, when the International Comparison Project (ICP) released new Purchasing Power Parity (PPP) numbers, something that only happens every five to six years. Combining the scraped PovcalNet data with the newly updated PPP numbers, Justin and Sarah produced a startling new estimate: it seems that global poverty had fallen by half. Their blog post announcing this finding set off a fiery debate in the comments field, starting with comments from CGD non-resident fellow Martin Ravallion, who in a long career at the World Bank earned a reputation as one of the world’s leading experts on poverty measurement. These comments in turn led to revisions in the blog post.
Justin says that the entire process illustrates the importance of making research data publicly available.
“We’re living in a new era where there are a lot of people participating in this analysis and this conversation, and a million eyeballs can find lots of mistakes,” Justin says. “So let’s put all the data and the code in the public domain and open up that conversation.”
So, what exactly was the World Bank’s response to their efforts and the resulting new poverty estimates?
“Annoyance is probably the right word,” Justin says. “The stance of the research department now seems to be, reading between the lines, that ‘we don’t really trust these [new PPP] numbers, and we’ll reserve judgment on whether we should use them yet.’”
It’s an exciting story, with some unexpected twists and turns. To hear it, and learn what Justin and Sarah have planned next, tune in to the full Wonkcast.