Evaluating Re-Identification Risks scores in Publicly Available Clinical Trial Datasets: Insights and Implications

Aryelly Rodriguez*, Linda Jane Williams, Steff C Lewis, Pamela Sinclair, Sandra Eldridge, Tracy Jackson, Christopher J Weir

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Background
The motivations to share anonymised datasets from clinical trials within the scientific community are increasing. Many anonymised datasets are now publicly available for secondary research. However, it is uncertain whether they pose a privacy risk to the involved participants.
Methods
We located a broad sample of publicly available, de-identified/anonymised randomised clinical trial datasets from human participants and contacted their owners to request access, following their local procedures. We classified the personal data within these datasets into unique direct identifiers, such as date of birth, and indirect identifiers: personal data that, on their own, do not identify an individual but may do so when combined with each other, such as sex, age and race. Combining indirect identifiers forms strata, and adding more identifiers increases granularity by dividing the data into a larger number of smaller strata. The re-identification risk score equations evaluate membership in these strata in three ways: first, by measuring the proportion of participants in strata above predetermined risk threshold levels (Ra); second, by locating the smallest stratum (Rb); third, by estimating the average membership across all strata in a dataset (Rc). The risk scores range from 0 (lowest risk) to 1 (highest risk). They are not intended to re-identify individuals in the datasets, and they are used for routinely collected health records. If a dataset contained a direct identifier, it automatically scored 1 in all metrics. Conversely, if a dataset contained no direct identifier and at most one indirect identifier, it automatically scored 0 in all metrics. Finally, we explored which characteristics of the datasets were associated with the risk scores and compared the risk scores and their usability.
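The abstract does not give the risk score equations, so the following Python sketch is illustrative only. It assumes one common formulation of stratum-based scores: Ra as the share of participants whose stratum risk (1 / stratum size) exceeds a threshold, Rb as 1 / (size of the smallest stratum), and Rc as (number of strata) / (number of participants). The function name, parameter names, threshold value and example records are all hypothetical and are not taken from the study.

```python
from collections import Counter


def risk_scores(records, indirect_identifiers, has_direct_identifier=False,
                threshold=0.2):
    """Return (Ra, Rb, Rc) for a list of participant records (dicts).

    Assumed formulation, not confirmed by the abstract:
      Ra = proportion of participants in strata whose risk 1/size > threshold
      Rb = 1 / size of the smallest stratum
      Rc = number of strata / number of participants
    """
    if not records:
        raise ValueError("records must not be empty")

    # Rules stated in the abstract: any direct identifier scores 1 everywhere;
    # no direct identifier and at most one indirect identifier scores 0.
    if has_direct_identifier:
        return 1.0, 1.0, 1.0
    if len(indirect_identifiers) <= 1:
        return 0.0, 0.0, 0.0

    # A stratum is one unique combination of indirect identifier values.
    strata = Counter(
        tuple(record[name] for name in indirect_identifiers)
        for record in records
    )
    n = len(records)

    ra = sum(size for size in strata.values() if 1 / size > threshold) / n
    rb = 1 / min(strata.values())
    rc = len(strata) / n
    return ra, rb, rc


# Hypothetical toy dataset: six participants, two indirect identifiers.
toy_records = [
    {"sex": "F", "age_band": "40-49"},
    {"sex": "F", "age_band": "40-49"},
    {"sex": "F", "age_band": "40-49"},
    {"sex": "M", "age_band": "40-49"},
    {"sex": "M", "age_band": "40-49"},
    {"sex": "M", "age_band": "50-59"},
]
toy_scores = risk_scores(toy_records, ["sex", "age_band"], threshold=0.4)
```

In this toy example the strata have sizes 3, 2 and 1. With the arbitrary threshold of 0.4, the strata of size 2 and 1 exceed it, so Ra = 3/6 = 0.5; the smallest stratum holds one participant, so Rb = 1.0; and three strata over six participants gives Rc = 0.5. Adding a third identifier would split the strata further and could only keep these scores the same or raise them, which is the granularity effect the Methods describe.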
Results
Seventy datasets from 14 data sources were analysed. Thirty-one datasets were shared with minimal restrictions (open access), while 39 were shared with varying levels of restrictions before access was granted (controlled access). Datasets had, on average, four identifiers and mean risk scores ranging from 0.47 to 0.91. The most common pieces of information present in the datasets that, when combined, may indirectly identify a participant were sex (80%) and age (72.9%).
Conclusions
This study confirms that clinical trial datasets are rich in personal details and that using re-identification risk scores as a measure of this richness is feasible. These scores could inform the anonymisation of clinical trial datasets, particularly their level of granularity, before they are released for secondary research. We propose a strategy for employing these scores in the decision-making process for releasing clinical trial datasets.
Original language: English
Number of pages: 18
Journal: Clinical Trials
Early online date: 22 Aug 2025
Publication status: E-pub ahead of print - 22 Aug 2025

Keywords / Materials (for Non-textual outputs)

  • Clinical Trials as Topic/methods
  • Data anonymisation
  • Re-identification
  • De-identification
  • Data sharing
  • Re-identification risk
