Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Qualitative Data, 2024-2025

Haaker, Maureen and Magder, Cristina and Zahid, Hina and Kasmire, Jools and Ogwayo, Melissa (2025). Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Qualitative Data, 2024-2025. [Data Collection]. Colchester, Essex: UK Data Service. 10.5255/UKDA-SN-857983

The growing discourse around synthetic data underscores its potential not only for addressing data challenges in a fast-changing landscape but also for fostering innovation and accelerating advancements in data analytics and artificial intelligence. From optimising data sharing and utility (James et al., 2021), to sustaining and promoting reproducibility (Burgard et al., 2017), to mitigating disclosure (Nikolenko, 2021), synthetic data has emerged as a solution to various complexities of the data ecosystem.

The project proposes a mixed-methods approach and seeks to explore the operational, economic, and efficiency aspects of using low-fidelity synthetic data from the perspectives of data owners and Trusted Research Environments (TREs).

The essence of the challenge lies in understanding the tangible and intangible costs associated with creating and sharing low-fidelity synthetic data, alongside measuring its utility and acceptance among data producers, data owners and TREs. The broader aim of the project is to foster a nuanced understanding that could potentially catalyse a shift towards a more efficient and publicly acceptable model of synthetic data dissemination.

This project is centred around three primary goals:
1. to evaluate the comprehensive costs incurred by data owners and TREs in the creation and ongoing maintenance of low-fidelity synthetic data, including the initial production of synthetic data and subsequent costs;
2. to assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability; and
3. to measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing impacts on resources, secure environment usage load, and the uptake dynamics between synthetic and real datasets by researchers.

Commencing in March 2024, the project will begin with stakeholder engagement, forming an expert panel and aligning collaborative efforts with parallel projects. Following a robust literature review, the project will collect data through a targeted survey with data creators, case studies with data owners and providers of synthetic data, and a focus group with TRE representatives. The insights collected from these activities will be analysed and synthesised to draft a comprehensive report setting out the findings and practical recommendations for scaling up the production and dissemination of low-fidelity synthetic data as applicable.

The potential applications and benefits of the proposed work are diverse. The project aims to provide a solid foundation for data owners and TREs to make informed decisions regarding synthetic data production and sharing. Furthermore, the findings could significantly influence future policy concerning data privacy, thereby having a broader impact on the research community and public perception. By fostering a deeper understanding and establishing a dialogue among key stakeholders, this project strives to bridge the existing knowledge gap and push the domain of synthetic data into a new era of informed and efficient usage. Through meticulous data collection and analysis, the project aims to unravel the intricacies of low-fidelity synthetic data and pave the way for an efficient, cost-effective, and publicly acceptable framework of synthetic data production and dissemination.

Data description (abstract)

This collection comprises interview and focus group data gathered in 2024-2025 as part of a project aimed at investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations.

The interviews included 4 case studies of UK-based organisations that had piloted work generating and disseminating synthetic datasets, including the Ministry of Justice, NHS England, the project team working in partnership with the Department for Education, and the Office for National Statistics. The collection also includes 2 focus groups with Trusted Research Environment (TRE) representatives who had published or were considering publishing synthetic data.

The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only.

The aims of the case studies and focus groups were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations actively producing, planning, or not engaging with synthetic data.

The interviews covered 5 key themes: organisational background; infrastructure, operational costs, and resourcing; challenges of sharing synthetic data; benefits and use cases of synthetic data; and organisational policy and procedures.

The data offers exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.

The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.

Data creators:
Creator Name | Affiliation | ORCID (as URL)
Haaker, Maureen | University of Essex | http://orcid.org/0000-0002-9487-5590
Magder, Cristina | University of Essex | https://orcid.org/0000-0001-5937-8188
Zahid, Hina | University of Essex | http://orcid.org/0000-0002-0669-9911
Kasmire, Jools | University of Manchester | http://orcid.org/0000-0003-2684-6330
Ogwayo, Melissa | University of Essex | https://orcid.org/0009-0003-2127-6196
Sponsors: ESRC
Grant reference: ES/Z502467/1
Topic classification: Science and technology; Education
Keywords: DATA, DISSEMINATION OF INFORMATION, DATA PRIVACY, DATA PROTECTION, RESEARCH METHODOLOGY, SCIENTIFIC INNOVATION, COSTS, ORGANIZATIONS
Project title: Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers
Grant holders: Cristina Magder, Julia Kasmire, Hina Zahid, Maureen Haaker
Project dates: From 7 April 2024 to 6 April 2025
Date published: 04 Aug 2025 08:27
Last modified: 09 Sep 2025 10:20
