Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Qualitative Data, 2024-2025

Haaker, Maureen and Magder, Cristina and Zahid, Hina and Kasmire, Jools and Ogwayo, Melissa (2025). Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Qualitative Data, 2024-2025. [Data Collection]. Colchester, Essex: UK Data Service. 10.5255/UKDA-SN-857983

The growing discourse around synthetic data underscores its potential not only for addressing data challenges in a fast-changing landscape but also for fostering innovation and accelerating advancements in data analytics and artificial intelligence. From optimising data sharing and utility (James et al., 2021), to sustaining and promoting reproducibility (Burgard et al., 2017), to mitigating disclosure (Nikolenko, 2021), synthetic data has emerged as a solution to various complexities of the data ecosystem.

The project proposes a mixed-methods approach and seeks to explore the operational, economic, and efficiency aspects of using low-fidelity synthetic data from the perspectives of data owners and Trusted Research Environments (TREs).

The essence of the challenge is in understanding the tangible and intangible costs associated with creating and sharing low-fidelity synthetic data, alongside measuring its utility and acceptance among data producers, data owners and TREs. The broader aim of the project is to foster a nuanced understanding that could potentially catalyse a shift towards a more efficient and publicly acceptable model of synthetic data dissemination.

This project is centred around three primary goals:
1. to evaluate the comprehensive costs incurred by data owners and TREs in the creation and ongoing maintenance of low-fidelity synthetic data, including the initial production of synthetic data and subsequent costs;
2. to assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability; and
3. to measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing impacts on resources, secure environment usage load, and the uptake dynamics between synthetic and real datasets by researchers.

Commencing in March 2024, the project will begin with stakeholder engagement, forming an expert panel and aligning collaborative efforts with parallel projects. Following a robust literature review, the project will collect data through a targeted survey with data creators, case studies with data owners and providers of synthetic data, and a focus group with TRE representatives. The insights collected from these activities will be analysed and synthesised to draft a comprehensive report setting out the findings and recommendations for scaling up the production and dissemination of low-fidelity synthetic data as applicable.

The potential applications and benefits of the proposed work are diverse. The project aims to provide a solid foundation for data owners and TREs to make informed decisions regarding synthetic data production and sharing. Furthermore, the findings could significantly influence future policy concerning data privacy, thereby having a broader impact on the research community and public perception. By fostering a deeper understanding and establishing a dialogue among key stakeholders, this project strives to bridge the existing knowledge gap and push the domain of synthetic data into a new era of informed and efficient usage. Through meticulous data collection and analysis, the project aims to unravel the intricacies of low-fidelity synthetic data and pave the way for an efficient, cost-effective, and publicly acceptable framework of synthetic data production and dissemination.

Data description (abstract)

This collection comprises interview and focus group data gathered in 2024-2025 as part of a project aimed at investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations.

The interviews included 4 case studies of UK-based organisations that had piloted work generating and disseminating synthetic datasets, including the Ministry of Justice, NHS England, the project team working in partnership with the Department for Education, and the Office for National Statistics. It also includes 2 focus groups with Trusted Research Environment (TRE) representatives who had published or were considering publishing synthetic data.

The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only.

The aims of the case studies and focus groups were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations actively producing, planning, or not engaging with synthetic data.

The interviews covered 5 key themes: organisational background; infrastructure, operational costs, and resourcing; challenges of sharing synthetic data; benefits and use cases of synthetic data; and organisational policy and procedures.

The data offers exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.

The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.

Data creators:

Creator Name	Affiliation	ORCID (as URL)
Haaker Maureen	University of Essex	http://orcid.org/0000-0002-9487-5590
Magder Cristina	University of Essex	https://orcid.org/0000-0001-5937-8188
Zahid Hina	University of Essex	http://orcid.org/0000-0002-0669-9911
Kasmire Jools	University of Manchester	http://orcid.org/0000-0003-2684-6330
Ogwayo Melissa	University of Essex	https://orcid.org/0009-0003-2127-6196

Sponsors:

ESRC

Grant reference:

ES/Z502467/1

Topic classification:

Science and technology
Education

Keywords:

DATA, DISSEMINATION OF INFORMATION, DATA PRIVACY, DATA PROTECTION, RESEARCH METHODOLOGY, SCIENTIFIC INNOVATION, COSTS, ORGANIZATIONS

Project title:

Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers

Grant holders:

Cristina Magder, Julia Kasmire, Hina Zahid, Maureen Haaker

Project dates:

From	To
7 April 2024	6 April 2025

Date published:

04 Aug 2025 08:27

Last modified:

09 Sep 2025 10:20

Coverage and Methodology

Collection period:

Date from:	Date to:
20 November 2024	31 January 2025

Country:

United Kingdom

Spatial unit:

No Spatial Unit

Data collection method:

This study employs a qualitative, collective case study methodology. Semi-structured interviews were conducted between November 2024 and January 2025 with senior representatives from each chosen case study organization, selected based on their roles in overseeing synthetic data provision. The selected cases represent a diverse range of approaches to synthetic data dissemination in the UK. Each case differs in terms of data creation processes, access conditions, and request mechanisms. The cases examined in this study include:
1. NHS England: This organization has piloted a project to generate synthetic data from the Hospital Episode Statistics (HES) dataset. An interface has been developed that enables users to create and access synthetic datasets independently. Additionally, three synthetic datasets are openly available for public download.
2. Ministry of Justice (MoJ): Through the Data First project, a collaboration with ADR UK, the MoJ provides synthetic data via the UK Data Service. This model facilitates registered access to synthetic datasets alongside supporting documentation.
3. Office for National Statistics (ONS): ONS, in collaboration with the Integrated Data Service (IDS), employs generative adversarial networks (GANs) to create and distribute synthetic data. This initiative gained recognition by winning the NIST 2018 Differential Privacy Synthetic Data Challenge, demonstrating an innovative model that integrates data owners and data users.
4. Department for Education (DfE): In partnership with University College London (UCL), DfE has developed a low-fidelity synthetic dataset based on the Longitudinal Educational Outcomes (LEO) dataset. This initiative is ongoing, with the synthetic dataset intended for future availability through the ONS Secure Research Service.

This study also employed a focus group with Trusted Research Environment (TRE) representatives to examine the operational dimensions of synthetic data usage within secure environments. The focus was on the practical implications of disseminating synthetic data, as well as the challenges and opportunities it presents for TREs. Participants were recruited via an email invitation distributed through TRE networks. A self-selection process was employed, ensuring that participants were actively involved in synthetic data creation, dissemination, and/or usage. The focus group, conducted on 11 December 2024, consisted of six participants representing a range of professional roles and levels of experience. The session was conducted virtually and lasted approximately 120 minutes.

Observation unit:

Organization, Group

Kind of data:

Text

Type of data:

Qualitative and mixed methods data

Resource language:

English

Access and Administration

Related Resources

Website

Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers
