This post is an update on DLUHC's efforts to add Unique Property Reference Numbers (UPRNs) to bulk data for domestic Energy Performance Certificates (EPCs).
The EPC data contains energy performance ratings for residential properties in England and Wales, along with a range of building attributes related to energy efficiency. UPRNs are identifiers derived from Ordnance Survey's AddressBase.
Adding UPRNs makes the EPC data more useful because records can be geolocated at address level. UPRNs enable grouping of EPC records that relate to the same property. UPRNs also make it easier to combine EPC data with other datasets that contain UPRNs.
I wrote about this DLUHC work in March 2022. At that time, DLUHC had released bulk data for EPCs lodged to the end of 2021, and had added UPRNs to 92.4% of the records.
DLUHC has recently released updated bulk data for EPCs lodged to the end of June 2022. The proportion of EPC records with UPRNs has increased to 97.8%.
Progress in matching EPCs to UPRNs
The chart below shows the percentage of EPC records with a UPRN, for EPCs lodged in each quarter from Q4 2008 to Q2 2022.
For comparison, I have also shown the percentages from the earlier release of data to Q4 2021. DLUHC has added UPRNs to more than 1.27 million EPC records that lacked UPRNs in the earlier release.
The chart below is an alternative view of the same comparison between releases, showing percentages of EPC records lodged in each month that do not have a UPRN in the bulk data.
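For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how the percentage of records with a UPRN could be calculated per quarter. It is not the code used to produce the charts. The field names `LODGEMENT_DATE` and `UPRN` are assumptions about the bulk data columns, and the sample records are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def quarter_label(iso_date):
    """Turn an ISO date string such as '2022-05-01' into a label such as '2022 Q2'."""
    d = date.fromisoformat(iso_date)
    return f"{d.year} Q{(d.month - 1) // 3 + 1}"


def uprn_coverage_by_quarter(records):
    """Return {quarter label: percentage of records with a non-empty UPRN}.

    `records` is an iterable of dicts, for example rows from csv.DictReader.
    The column names used here are assumed, not confirmed by the post.
    """
    totals = defaultdict(int)
    matched = defaultdict(int)
    for rec in records:
        quarter = quarter_label(rec["LODGEMENT_DATE"])
        totals[quarter] += 1
        if (rec.get("UPRN") or "").strip():
            matched[quarter] += 1
    return {q: 100.0 * matched[q] / totals[q] for q in sorted(totals)}


# Made-up sample records, for illustration only.
sample = [
    {"LODGEMENT_DATE": "2021-11-03", "UPRN": "100000000001"},
    {"LODGEMENT_DATE": "2021-12-15", "UPRN": ""},
    {"LODGEMENT_DATE": "2022-04-20", "UPRN": "100000000002"},
    {"LODGEMENT_DATE": "2022-05-01", "UPRN": "100000000003"},
]
print(uprn_coverage_by_quarter(sample))
```

Running the same function over each release and comparing the results quarter by quarter gives the kind of side-by-side view described above.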
Why are there still missing UPRNs?
We can see from the charts that DLUHC has reduced the percentage of missing UPRNs across the dataset, and that the month-to-month percentages for older EPCs are now more consistent.
DLUHC has provided a general explanation of its approach to allocation of UPRNs. However, it is unclear what additional techniques DLUHC has applied to improve the allocation of UPRNs between the releases of EPC bulk data.
As mentioned in my earlier post, there are several common reasons why a UPRN might not have been allocated to an EPC record:
- the property no longer exists,
- the lodged address was incomplete or in a non-standard format, or
- the UPRN exists but is not yet in OS AddressBase.
Possibly, DLUHC has used older versions of AddressBase, or address data from other authoritative sources, to improve matching of EPC records where the addresses are non-standard, or to better identify properties that no longer exist.
It is also possible that the lag between creation of UPRNs for new dwellings and their availability in AddressBase has been reduced, though clearly that factor remains the primary reason for the higher percentages of missing UPRNs for EPCs lodged in recent months.
The charts above suggest lower rates of success adding UPRNs to records for EPCs lodged in late 2020 and early 2021. However, that may be an anomaly introduced in production – perhaps two or more teams were working to allocate UPRNs to different cuts of the bulk data.
Has the quality of address matching changed?
In another earlier post, I provided examples of types of error I had found in DLUHC's allocation of UPRNs in EPC records.
Based on a comparison of the two releases of the bulk data, DLUHC has not made any changes in the Q2 2022 release to UPRNs that had already been added to records in the Q4 2021 release. In the latest release, DLUHC has added UPRNs to EPC records that lacked them, but has not corrected any UPRNs allocated previously.
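A comparison like this can be sketched in a few lines. The example below assumes each release has been loaded into a dictionary that maps a stable record key (the bulk data's `LMK_KEY` column would be one candidate, though that is an assumption) to the UPRN string, with an empty string where no UPRN was allocated. The function name and category labels are my own, for illustration.

```python
from collections import Counter


def compare_releases(earlier, later):
    """Classify what happened to each record's UPRN between two releases.

    Both arguments map a record key to a UPRN string ('' if missing).
    Records that appear only in the later release are ignored here.
    """
    counts = Counter()
    for key, old_uprn in earlier.items():
        if key not in later:
            counts["absent_from_later"] += 1
            continue
        new_uprn = later[key]
        if not old_uprn and new_uprn:
            counts["added"] += 1          # UPRN filled in since the earlier release
        elif not old_uprn:
            counts["still_missing"] += 1  # no UPRN in either release
        elif old_uprn == new_uprn:
            counts["unchanged"] += 1      # previously allocated UPRN left as-is
        elif not new_uprn:
            counts["removed"] += 1        # UPRN withdrawn
        else:
            counts["changed"] += 1        # UPRN corrected or otherwise altered
    return counts


# Made-up keys and UPRNs, for illustration only.
earlier = {"a": "1", "b": "", "c": "3", "d": ""}
later = {"a": "1", "b": "2", "c": "3", "d": "", "e": "5"}
print(compare_releases(earlier, later))
```

A result with records in `added` but none in `changed` or `removed` would correspond to the pattern described above: new UPRNs allocated, existing ones untouched.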
It is possible that DLUHC has improved its rates of matching UPRNs to EPCs by changing the parameters in its address-matching algorithm(s). If DLUHC is fuzzy-matching the address strings in a more lenient manner, that would increase the number of correct matches – but would also increase the number of 'false positives', i.e. records where the algorithm has allocated the wrong UPRN.
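To illustrate the trade-off, here is a minimal sketch of threshold-based fuzzy matching using `difflib.SequenceMatcher` from the Python standard library. This is not DLUHC's algorithm, which has not been published in detail. The `normalise` and `best_match` helpers, the threshold values, and the addresses and UPRNs are all hypothetical.

```python
from difflib import SequenceMatcher


def normalise(address):
    """Upper-case, drop commas and collapse whitespace."""
    return " ".join(address.upper().replace(",", " ").split())


def best_match(lodged, candidates, threshold):
    """Return (uprn, score) for the most similar candidate, or None.

    `candidates` maps UPRN -> reference address. A match is accepted only
    if its similarity ratio (0.0 to 1.0) is at least `threshold`.
    """
    lodged_n = normalise(lodged)
    scored = [
        (SequenceMatcher(None, lodged_n, normalise(addr)).ratio(), uprn)
        for uprn, addr in candidates.items()
    ]
    score, uprn = max(scored)  # ties are broken arbitrarily, by UPRN string
    return (uprn, score) if score >= threshold else None


# Hypothetical reference addresses for two flats in one building.
candidates = {
    "1": "FLAT 1 10 HIGH STREET",
    "2": "FLAT 2 10 HIGH STREET",
}

# The lodged address omits the flat number.
print(best_match("10 High Street", candidates, threshold=1.0))  # strict
print(best_match("10 High Street", candidates, threshold=0.5))  # lenient
```

With the strict threshold the incomplete address is left unmatched. With the lenient threshold it is allocated to one of the flats even though the two candidates score equally, which is the kind of false positive a more lenient matching rule can introduce.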
In the latest release of the EPC data, I have noticed that quite a few rental flats in the same building have been allocated the same UPRN – even though each flat has a separate EPC and, according to AddressBase, its own UPRN. For example, Flats A through E at St Giles Court on Southampton Street in Reading RG1 2QL have been allocated UPRN 310086374. That UPRN actually only applies to Flat D.
There may also be some cases where DLUHC has matched a group of EPC records to a parent UPRN, where individual UPRNs exist and would be more accurate. For example, Ivy House at 279 Seven Sisters Road in Hackney N4 2DE is a property containing more than 90 studio rooms over two floors. Each room has its own EPC record, and its own UPRN in AddressBase, but DLUHC has allocated a single UPRN (for 279 Seven Sisters Road) to all of the EPC records.
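Cases like these can be surfaced by grouping records by UPRN and looking for UPRNs attached to more than one distinct address. The sketch below assumes each record has `UPRN` and `ADDRESS` fields (assumed column names) and uses made-up data. The function name is my own. Comparing distinct addresses rather than counting records matters, because a single property can legitimately have several EPCs lodged over time.

```python
from collections import defaultdict


def shared_uprns(records, min_addresses=2):
    """Return {uprn: sorted distinct addresses} for UPRNs that have been
    allocated to at least `min_addresses` different address strings."""
    addresses_by_uprn = defaultdict(set)
    for rec in records:
        uprn = (rec.get("UPRN") or "").strip()
        if uprn:
            # Light normalisation so trivial case/spacing differences
            # do not count as different addresses.
            addresses_by_uprn[uprn].add(" ".join(rec["ADDRESS"].upper().split()))
    return {
        uprn: sorted(addresses)
        for uprn, addresses in addresses_by_uprn.items()
        if len(addresses) >= min_addresses
    }


# Made-up records, for illustration only.
records = [
    {"UPRN": "111", "ADDRESS": "Flat A, Example Court"},
    {"UPRN": "111", "ADDRESS": "Flat B, Example Court"},
    {"UPRN": "222", "ADDRESS": "1 Sample Road"},
    {"UPRN": "222", "ADDRESS": "1  sample road"},  # same property, later EPC
    {"UPRN": "", "ADDRESS": "2 Sample Road"},
]
print(shared_uprns(records))
```

The output is a list of candidates for manual checking against AddressBase, not a list of confirmed errors: some shared UPRNs will reflect genuine variations in how the same address was lodged.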
In practice, avoiding errors in the allocation of UPRNs to the EPC bulk data is likely to be an insurmountable challenge. The benefits of adding correct UPRNs to the vast majority of records far outweigh the risks from having some incorrect UPRNs in the data. However, re-users of the EPC data should be mindful of these data quality issues, as they will affect some use cases more than others.