That sunlight is the best disinfectant has become a truism in recent years, in science as much as in professional life in general. As concern has risen about the so-called reproducibility crisis in biomedicine, publication of raw data has been touted as the obvious solution.
Such an “open science” approach makes a lot of sense. At the very least, it makes selective reporting of results easier to detect, ending a practice that has, in some cases, distorted science’s evidence base and impeded progress. Other harmful behaviours, such as hypothesising after results are known or even flat-out research fraud, may also be uncovered earlier.
Open science is not, however, the panacea that some imagine, and it comes with its own challenges.
Biomedical publications encompass a wide spectrum of research activities. At one end, studies using cell lines and animals present few ethical impediments to providing the raw data underlying figures, or the original images of, for example, western blots. At the other end, however, are epidemiological studies and clinical trials. These draw on data from healthy individuals and patients, who expect medical confidentiality to be respected.
In a recent letter in The Lancet, a group of European researchers observed that sharing even apparently innocuous biological data, such as serum values, requires “extreme prudence”. In combination with other patient data that may also be necessary for replication of the original analysis, such as age, gender and geographic location, it may permit the identification of individuals, violating data protection rules and ethical codes. “The larger the number of variables, the greater the risk with modern technology,” the authors note.
The risk is particularly acute regarding clinical trials that contain data at the level of individual participants. Yet pressure to open up even these datasets continues to increase. Late last year, for instance, the US National Institutes of Health released a draft policy on data-sharing that will require that all funded investigators make their datasets available to colleagues.
The US Environmental Protection Agency is also proposing to ignore any for major environmental regulations unless scientists disclose all of their raw data, including information obtained from individual medical records. Meanwhile, in the UK, a new “” for the reporting of clinical trials has recently been unveiled by the NHS’ Health Research Authority, following a recommendation by the Commons Science and Technology Select Committee.
Another concern about open science is that shared data might be used to distort the evidence. For instance, data from a vaccine trial might be misused by an antivaxxer organisation, or a study into the use of new electronic nicotine delivery devices might be cherrypicked by a tobacco company. However, a November event on open science convened by the US National Academy of Sciences – and attended by journal editors, funders, patients and administrators of data-sharing platforms – heard repeatedly that such concerns had not yet been realised.
This may be due to the gatekeeping of open clinical data applied by the sharing platforms. One such platform, for example, requires applicants to submit “a quality research proposal”, which must contain “all the information needed for clinical trial sponsors and independent review panels to make a decision about the request”.
Moreover, as suggested in the Lancet letter, it would not be too difficult to pseudonymise individual data in order to make it legally open and accessible. Yes, researchers would need to obtain consent from participants and explain the risk of backwards identification, but this is something that many participants could probably live with. At the NAS event, held in Washington DC, a patient representative was relatively sanguine about the use of his experimental data, pointing out that many patients entered trials for altruistic reasons and would expect data to be used in a way that has the maximum possible beneficial impact.
This underlines the flaw in the current default position on clinical trials and epidemiological studies, which holds that making the raw data widely available is dangerous. Instead, opening up these participant-derived data should be regarded as a worthwhile process that needs to be supported and managed carefully.
In the absence of explicit, advance consent from patients, making anonymised clinically derived data openly available is probably a step too far for most institutions. However, they should actively facilitate the anonymisation and pseudonymisation of patient and participant data underlying their published outputs. And they should establish clear operating procedures on how and with whom these data can be shared.
Moreover, when data are not derived from human subjects, researchers should be routinely expected to upload their underlying raw data and metadata into a database as soon as their publication is listed on their institution’s repository. This should be done in accordance with the “FAIR” principles of being findable, accessible, interoperable and reusable.
Such an approach would allow institutions to audit data published in their name and enable rapid screening for research integrity issues.
You may not read it in bacteriology textbooks, but sunlight really does have the potential to eradicate most pathological behaviour – if only we fully open science’s curtains.
Jonathan Grigg is professor of paediatric respiratory and environmental medicine at Queen Mary University of London’s School of Medicine and Dentistry, where he is also deputy dean for research integrity.