As a numbers guy, John Abowd is a natural for unlocking the secrets of big data. But as chief scientist for the US Census Bureau, he finds himself in the unusual position of trying to conceal as well as to reveal. His work can offer wider lessons in how to address privacy concerns in an age of ubiquitous information.
Required by the US Constitution and conducted every 10 years since then-Secretary of State Thomas Jefferson led the first one in 1790, the US Census is one of the biggest and most influential undertakings of the government. It dictates how many congressional seats each state has and the allocation of as much as $1.5 trillion a year in federal spending. The private sector mines its data for everything from marketing to specific consumer segments to deciding where to locate new restaurants, bank branches and factories.
The census gathers information including the age, gender, race, and homeowner or renter status for every American in tiny geographical blocks of about 30 residents, on average. The bureau is required by law to safeguard that knowledge, but the ability to compare its anonymized data with the troves of information held by social media giants, lenders and the like makes that increasingly difficult. After Abowd took up his post in 2016, he had staff try to effectively hack the 2010 census. The result was sobering: They could precisely identify 17 percent of the respondents, or 52 million people.
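To see how published aggregates can betray individuals, consider a toy reconstruction attack in the spirit of that exercise (the block and its tabulations below are invented for illustration): a handful of summary statistics about a small block can pin down every resident’s exact age by brute force.

```python
from itertools import combinations_with_replacement

# Hypothetical published tabulations for a 3-person block (illustrative only).
TOTAL = 3            # total population
MEAN_AGE = 30        # published mean age
MEDIAN_AGE = 24      # published median age
ADULTS = 2           # residents aged 18 or older
MEAN_ADULT_AGE = 43  # published mean age of adults

solutions = []
for ages in combinations_with_replacement(range(116), TOTAL):
    adults = [a for a in ages if a >= 18]
    if (sum(ages) == MEAN_AGE * TOTAL
            and ages[TOTAL // 2] == MEDIAN_AGE       # tuples come out sorted
            and len(adults) == ADULTS
            and sum(adults) == MEAN_ADULT_AGE * ADULTS):
        solutions.append(ages)

print(solutions)  # [(4, 24, 62)] -- a unique, exact reconstruction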
In a bid to avoid the census becoming the mother of all data breaches, Abowd is fighting back with differential privacy, or DP. The technique infuses error, or “noise,” into large datasets in an effort to maintain both the privacy of individuals and the usefulness of the data in aggregate. It has been used by big tech companies to anonymize data that’s shared for research and other purposes. MIT Technology Review has labeled it one of the top 10 technologies likely to make a breakthrough this year. Success is crucial to maintaining public confidence and willingness to participate in the census. Yet even with DP, the bureau expects to publish less data from the 2020 census than from previous ones. States and cities that depend on census data worry that it won’t be accurate enough for small geographies to fairly allocate sales tax revenue or make disaster preparedness plans for the elderly and other vulnerable populations.
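The core of DP is simple to sketch. The snippet below is a minimal illustration of the standard Laplace mechanism, not the bureau’s actual disclosure avoidance system: a count is released with noise calibrated so that any one person’s presence or absence is statistically masked.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    Adding or removing one person shifts a count by at most 1 (its sensitivity),
    so noise drawn from Laplace(scale = 1 / epsilon) hides any individual's
    contribution to the published statistic.
    """
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Smaller epsilon = stronger privacy guarantee but noisier published statistics.
print([round(noisy_count(30, eps), 1) for eps in (0.1, 1.0, 10.0)])
```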
The 2020 census kicked into high gear in April and is being conducted largely online for the first time, but as of April 19 the public response rate was only a hair above 50 percent. Abowd spoke by telephone with Douglas Elliott, a partner at Oliver Wyman and a co-lead of the Oliver Wyman Forum’s Future of Data initiative.
Douglas Elliott: One of the very first things you did after coming to the Census Bureau was to focus on privacy. Why?
John Abowd: Most official statisticians don’t think you can release very much data because doing so would disclose too much information. These are data where you combine sources from different administrative records and surveys. On the other side there’s this tradition in our national census of essentially publishing every cross-tabulation you can imagine: 150 billion statistics from the 2010 census.
There was noise infused in the 2010 census data. Was it enough? If you do it the old-fashioned way, you have no idea. Newer techniques allow you to say that if you have to publish three billion tabulations to do redistricting, this is the privacy cost associated with that.
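The arithmetic behind that statement is sequential composition: privacy losses add up across releases, so a fixed global budget has to be divided among every published tabulation. A stylized sketch (the budget figure is invented, and the bureau allocates query by query rather than evenly):

```python
# Sequential composition: total privacy loss is (at most) the sum of the
# per-release losses, so publishing k tabulations at epsilon_i each "spends"
# sum(epsilon_i) of the global privacy-loss budget.
def per_tabulation_epsilon(global_budget: float, n_tabulations: int) -> float:
    """Naive even split of a global privacy-loss budget across releases."""
    return global_budget / n_tabulations

# With an illustrative global budget of 4.0 spread evenly over 3 billion
# redistricting tabulations, each release gets a uselessly tiny epsilon --
# which is why the budget must be concentrated on priority use cases instead.
print(per_tabulation_epsilon(4.0, 3_000_000_000))
```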
Were you surprised by the extent to which the personal information from the 2010 census could be reconstructed?
Abowd: I was not surprised that the 2010 census was vulnerable. I was pretty sure we had the expertise to reconstruct the database, and we did. I started showing the results to my colleagues and the scales fell from their eyes. Whether you think 52 million people is a lot or a little out of 309 million people depends entirely on your point of view, but there’s no denying that number gets your attention.
You make a point that there’s a tradeoff between the effectiveness of using the data and privacy protection. How do you decide where to draw the line?
Abowd: We explicitly instructed the team not to make that decision but to design a technology that could be explained to the chief stewards, the senior executive staff of the Census Bureau. I usually show the tradeoff between privacy loss and statistical accuracy as a curve representing an efficient frontier. You can’t get to perfect privacy and perfect accuracy, but you can easily be inside the frontier, wasting statistical accuracy or privacy loss. So we charged the team with developing a system that was as efficient as our resources could make it. It is a matter of tuning the algorithms so they don’t waste privacy loss.
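For the simplest mechanism, that frontier can be written down exactly. A Laplace-noised count with sensitivity 1 has expected absolute error of 1/epsilon, so halving the privacy loss doubles the error; any system with worse error at the same epsilon sits inside the frontier, wasting accuracy or privacy. A minimal sketch:

```python
import numpy as np

# For a Laplace-noised count (sensitivity 1), expected absolute error = 1/epsilon.
# Sweeping epsilon traces the privacy/accuracy frontier Abowd describes.
for eps in np.logspace(-2, 1, 7):  # from strong privacy (0.01) to weak (10)
    print(f"privacy loss epsilon={eps:6.2f} -> expected |error| ~ {1/eps:7.1f} persons")
```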
How did the stewards end up making the decision?
Abowd: They haven’t yet. We tested our provisional disclosure avoidance system for the 2020 census on the 2010 data and presented it to the bureau’s Scientific Advisory Committee. Their reaction was, “Oh my God, this is incredibly hard, and we don’t agree.” That’s because it’s a social choice problem. There was some distress in the user committee because everybody could find a use case that the test data wasn’t very good on. We’ve had to form groups of experts to help us tune the algorithms and select the use cases that are going to get the accuracy. My colleagues are waiting for that process to produce some recommendations.
Is this one of those things where, like with money, there are only so many dollars to spend on federal programs and everybody has different priorities?
Abowd: Certain use cases have bubbled to the top. One of them is population count. Users of the data tend to think of population count, which is official, as being exact. But it never was. The more granular the geographic entity gets – a block, for example – the noisier that population estimate has to be to protect privacy. It’s just like the problem that a cell phone company has. There are plenty of examples of people releasing cell phone data where the key feature is the geospatial location, usually of the caller. And it is widely acknowledged that you have to noise up that location to protect the privacy.
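The geographic effect is mechanical: the noise scale is set by the privacy-loss budget, not by the size of the count, so the same error that disappears in a state total can dominate a 30-person block. A rough illustration (epsilon and the populations are chosen for exposition):

```python
def expected_relative_error(population: int, epsilon: float = 0.5) -> float:
    """Expected |error| / population for a Laplace-noised count (sensitivity 1)."""
    return (1.0 / epsilon) / population

# Identical noise, very different consequences at different geographies.
for name, pop in [("state", 5_000_000), ("tract", 4_000), ("block", 30)]:
    print(f"{name:>5} (pop {pop:>9,}): ~{expected_relative_error(pop):.4%} relative error")
```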
What advice do you have for other people facing these types of issues?
Abowd: The biggest difference between the census and a new app from, say, an internet giant, is that they’re not starting from a base of established users familiar with the historical data products. So they have the opportunity to ask the question: what’s the use case? Who are the priority users of this product, and what features do they use to make their decisions? That will let you make much more intelligent choices. You don’t try to be all data to all people.
When there is a series of stakeholders intersecting census data with other datasets, how do you think about the tradeoffs?
Abowd: The person who implements privacy protections for OnTheMap, a Census web tool showing where workers are employed and where they live, teases me whenever he sees me: “All right, so you’ve got the 2020 census locked up. Have you solved the problem of what we’re going to do when we link it up to all the other databases?”
We can at best implement partial solutions right now. I think the same is happening inside the tech companies. One technology firm’s new product, which allows social science researchers to access its anonymized user data, was going to be pretty elaborate, with DP applied to everything. And when it finally rolled out, it was one very carefully delineated set of tabulations passed through one very specific set of DP mechanisms. I talked to them at both ends of that and said, “Watch out, the problem gets hairy a lot faster than you realize.” You get overconfident because you sit with the toy models and they work pretty well on toy datasets. Then you scale them and sparsity, the bane of statisticians’ existence, rears its ugly head and puts limitations on your ability to do sensible statistics.
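That sparsity problem is easy to reproduce. In the toy run below (all figures invented), a cross-tabulation with a million cells and only 500 true nonzero counts is noised at epsilon = 0.5, and the genuine counts are swamped by tens of thousands of noisy zeros:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sparse cross-tabulation: 1,000,000 cells, only 500 genuinely nonzero.
cells = np.zeros(1_000_000)
cells[rng.choice(cells.size, size=500, replace=False)] = rng.integers(1, 20, size=500)

noisy = cells + rng.laplace(scale=1.0 / 0.5, size=cells.size)  # epsilon = 0.5

print("true nonzero cells: ", int((cells > 0).sum()))   # 500
print("noisy cells above 5:", int((noisy > 5).sum()))   # tens of thousands, mostly noise
```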
There are initiatives to use things like mobile location data to contain COVID-19. Can your exercise inform us how to balance those tools with privacy protections?
Abowd: I can’t speak to COVID-19, but let me answer the more general question. When you hear a domain specialist plead for granularity along a particular dimension, be it location, diagnosis, or, in the case of medicine, identity, you need to ask what’s the statistical use, and then design a privacy system to meet that. The administrative use is a fundamentally different question. Identifiable data are used all the time. You don’t randomly give welfare benefits.
In some cases, a breach of privacy for statistical uses would be unacceptable but a breach of the same privacy for socially accepted administrative purposes wouldn’t be. (The Census Bureau is constrained by law to operate exclusively on the statistical use side.) It’s not cut and dried, and it’s not easy. And I think it’s clear that there are very different decisions being made around the world.