Too often we lose track of useful datasets when websites close, projects are disbanded, or simply because publishers have a newer version they want us to adopt and fail to recognise the value in maintaining a time series.
This post contains examples and tips for retrieving copies of old public datasets that have seemingly disappeared from the everyday web.
The UK Government Web Archive
National Archives stores snapshots of central government websites, as well as sites belonging to many other public bodies, in the UK Government Web Archive. This is our most valuable tool for recovering neglected public data.
NA's web archive is similar to the Internet Archive's Wayback Machine, but it's more methodically curated and tends to dig deeper into the content of sites. If a data file was linked directly from a webpage, there's a good chance NA can provide a copy of that file.
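As a rough sketch of how an archived copy can be addressed: Wayback-style archives usually build the address from a base, a 14-digit timestamp and the original URL. The base path below, the "nearest snapshot" behaviour and the `*` form for listing captures are assumptions rather than anything confirmed here, and the example file URL is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Assumed base address and path layout for the UK Government Web Archive.
# Check it against a real archived page before relying on it.
ARCHIVE_BASE = "https://webarchive.nationalarchives.gov.uk/ukgwa"


def archive_url(original_url: str, when: Optional[datetime] = None) -> str:
    """Build an archive address for original_url.

    With a datetime, the address asks for a snapshot at that timestamp
    (Wayback-style archives typically redirect to the nearest capture).
    Without one, "*" is used, which conventionally lists all captures.
    """
    stamp = when.strftime("%Y%m%d%H%M%S") if when else "*"
    return f"{ARCHIVE_BASE}/{stamp}/{original_url}"


# Hypothetical data file linked from a government page.
print(archive_url("http://example.org/data/payments.xls", datetime(2016, 6, 1)))
print(archive_url("http://example.org/data/payments.xls"))
```

Pasting the resulting address into a browser is usually enough; the function is only useful when there are many files or dates to try.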
For example Defra publishes the last two years of Common Agricultural Policy (CAP) Payments data as Excel workbooks. Every year another year of data is added and an earlier year of data is removed.
But National Archives has copies of that page back to 2010. We can retrieve the Excel workbooks for earlier years – 2014 and 2015, 2012 and 2013, 2010 and 2011, 2008 and 2009.
Similarly, every month Companies House publishes its Free Company Data Product, a dataset of basic information about live companies on its register. The previous month's data files are removed.
(Or the links are removed. Sometimes Companies House leaves earlier versions on the server, so you can try playing with the URLs.)
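"Playing with the URLs" can be as simple as swapping the date embedded in a file name. The sketch below generates candidate addresses for earlier monthly releases. The URL pattern is hypothetical and the assumption that every release is dated the first of the month is mine; replace both with whatever the live download link actually looks like.

```python
from datetime import date

# Hypothetical pattern, for illustration only: a file name that embeds
# the release date. Copy the real pattern from the current download link.
PATTERN = "https://download.example.org/BasicCompanyData-{d:%Y-%m-%d}.zip"


def earlier_release_urls(latest: date, months: int, pattern: str = PATTERN) -> list[str]:
    """Return candidate URLs for the `months` monthly releases before `latest`.

    Assumes each release is dated the first day of its month.
    """
    urls = []
    year, month = latest.year, latest.month
    for _ in range(months):
        # Step back one calendar month, rolling over the year boundary.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        urls.append(pattern.format(d=date(year, month, 1)))
    return urls


for url in earlier_release_urls(date(2017, 2, 1), 3):
    print(url)
```

Each candidate can then be tried directly against the publisher's server, or looked up in the web archive if the server no longer has it.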
National Archives doesn't save copies of the Companies House website often enough to capture every monthly release of the data, but we can still use the web archive to retrieve dozens of earlier versions back to 2012.
Also once a month BEIS publishes the latest extract from its Renewable Energy Planning Database (REPD), and removes the previous version.
Earlier versions are available from BEIS "on request", but dozens of extracts are also available from the web archive – including the October 2014 release that was the last to include projects with a capacity between 10kW and 1MW.
For more old data on renewable energy projects, we can use the web archive to poke around the Renewable Energy STATisticS database (RESTATS), a DECC website that was closed in 2014.
BEIS's current website includes an information sheet about a useful windspeed database originally developed by the Department of Trade and Industry (DTI) "at some point before 2001". The link to the database no longer works, but fortunately there's a full copy of the database available from NA's web archive.
Sometimes unpublishing public data can be controversial. DfE was criticised in 2014 for removing figures on Sure Start centres from a government website as part of a "data cleanse" exercise. However old lists of Sure Start centres remain available via the web archive.
Data lost to site design
Unfortunately some old data is more difficult to retrieve because the design of the sites that originally hosted it has confounded National Archives' ability to scrape the files (or at least serve them back to the public).
ONS's Neighbourhood Statistics site, shut down unceremoniously in 2017, illustrates this problem. Most of the data files are inaccessible from the NA archive.
Copies of DCLG's late lamented Places Database are also difficult to navigate.
The Sitefinder Mobile Phone Base Station Database, once the subject of a long-running legal battle to secure its release under access to information laws, is not available from the NA archive due to access controls on the original Ofcom site. (However I've published a copy elsewhere.)
Data.gov.uk
Regular users of the Data.gov.uk catalogue will know that publishers don't always keep their metadata records up to date, and many links to datasets or landing pages no longer work.
Fortunately most Data.gov.uk records are periodically captured by National Archives, so it is often possible to trace a working link to a copy of the data – particularly as the old Data.gov.uk platform used to cache some data files.
For example the resource link in the live Data.gov.uk record for Bristol City Council's famous Abandoned Shopping Trolleys in Bristol rivers dataset returns a 404 error, but the CSV is still available from an archived copy of the page.
The historical Data.gov.uk dumps are useful for finding the original URLs of old metadata records, including for datasets that have been withdrawn from the catalogue.
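A minimal sketch of searching a dump for original URLs, assuming the dump is a JSON Lines file with one CKAN-style record per line and that each record has a `title` and a `resources` list whose items carry a `url`. Those field names and the file layout are assumptions, and the sample records are made up, so check them against the dump you download.

```python
import io
import json
from typing import Iterable, Iterator


def find_resource_urls(lines: Iterable[str], keyword: str) -> Iterator[tuple[str, str]]:
    """Yield (title, resource URL) for records whose title contains keyword."""
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        title = record.get("title") or ""
        if needle not in title.lower():
            continue
        # Withdrawn or incomplete records may have no resources at all.
        for resource in record.get("resources") or []:
            url = resource.get("url")
            if url:
                yield title, url


# Made-up sample standing in for a real dump file.
sample = io.StringIO(
    '{"title": "Abandoned trolleys", "resources": [{"url": "http://example.org/trolleys.csv"}]}\n'
    '{"title": "Street lights", "resources": []}\n'
)
for title, url in find_resource_urls(sample, "trolleys"):
    print(title, url)
```

For a real dump, pass an open file handle instead of the sample. The URLs that come back are the ones to look up in the web archive.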
Other useful archives
Some datasets are sufficiently well-recognised that third parties have made copies of older versions available even when the original publishers do not.
mySociety maintains a quite comprehensive cache of past releases of OS OpenData products.
The UK Data Service provides older versions of the ONS Postcode Directory, additional to those available from the ONS Geography portal.
The memory hole
It's useful to know how to find data that has been removed from its original published location and moved to an archive. However archived data is inherently more difficult to discover than data that is maintained in an obvious or recognisable location on the web.
For many users, data that is not discoverable via Google might as well not exist.
Publishers of open data should think carefully before they withdraw previously released datasets from general circulation. For analytic purposes in particular, the latest version of a dataset is not the only version that has potential for re-use.