Open data horror stories: notes for an #ODCamp pitch/session

Post: 20 October 2017

Are there risks attached to open data publication, beyond those that arise from normal publication of information on the web? What’s the risk assessment and governance framework for open data?

Or to put it more bluntly: how badly can open data go wrong? Let’s have a discussion and crowdsource some examples.

Below are a few that are already on my radar.

Open publication of spending data by councils has led to a series of data protection incidents:

Dumfries and Galloway (reported in 2015)
Sutton (reported in 2015)
Portsmouth (reported in 2016)
West Sussex (reported in October 2017)

In 2014 DVLA’s new online Vehicle Tax Check service had to be amended after activists pointed out that people could use it to find out if their neighbours received tax concessions based on mobility-related disability. Not exactly open data, but illustrative; imagine if that had been a bulk release under an open licence.

In May 2016 an NHS trust was fined £185,000 after it mistakenly published data about 6,574 members of staff in a spreadsheet of equality and diversity metrics.

In May 2017 a “routine security review” discovered that a file containing names, email addresses and hashed passwords of registered Data.gov.uk users had been publicly accessible online since July 2015.

Earlier this year Land Registry finally admitted that its widely used Price Paid Data, released as open data in 2013, contains non-open address data. This followed an earlier admission that ICO had advised Land Registry in 2012 that house prices were probably personal data. What’s the potential fallout from that for long-term re-users of Price Paid Data?

And a couple of examples from the US:

In 2012 the City of Philadelphia, following an executive order on open data signed by the city’s mayor, launched an interactive website with several datasets including details of gun permit appeals. Two years later the city paid $1.4 million to settle a class action lawsuit from gun permit holders named in the data.

In 2014 New York City released an open dataset of 173m taxi journeys. But the data hadn’t been properly anonymised – analysts were able to extract personally identifiable information about drivers.
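One common way this kind of anonymisation fails, and reportedly what happened with the taxi data, is that identifiers are replaced with an unsalted hash. If the set of possible identifiers is small, anyone can hash every candidate and reverse the mapping. The sketch below is illustrative only: the four-character identifier format and the function names are my own assumptions, not the real licence scheme or anything taken from the dataset.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def pseudonymise(identifier):
    """Naive 'anonymisation': an unsalted MD5 hash of the identifier."""
    return hashlib.md5(identifier.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every possible identifier and map each hash back to its input.

    Assumes a hypothetical format of digit, letter, digit, digit
    (for example "7B42"), which gives only 26,000 candidates.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[pseudonymise(identifier)] = identifier
    return table


lookup = build_lookup()

# A value as it might appear in a published "anonymised" column.
released_value = pseudonymise("7B42")

# Reversing it is a single dictionary lookup.
recovered = lookup[released_value]
print(len(lookup), recovered)
```

The point is not MD5 specifically. Any deterministic, unsalted transformation of a small identifier space can be enumerated in the same way, so a publisher's risk assessment needs to consider the size of that space and not only the strength of the hash.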

Most of the above examples relate to inadvertent exposure of personal data, and arguably the same risks arise from access to information releases under FOI/EIR. But can open data publication exacerbate those risks?

Open data is bulk, perpetual, and specifically intended for distribution and re-use. By design there’s no way to practically track down and notify all the recipients of an open dataset. (Is Article 19 of GDPR relevant here?)

Some open data is potentially viral. What are the liability implications of unrecognised releases, not only of personal data but also of data that could undermine environmental protection, breach confidentiality, or assist criminal acts?

If these are real concerns, what’s our best practice for managing and risk-assessing open data publication? And what is the responsibility of publishers to re-users of open data, who may not have sufficient information about how the data was produced to judge those risks themselves?