Artificial intelligence is not yet ready to fully assess the quality of research outputs, but should be explored for use in certain parts of the Research Excellence Framework (REF) in future, according to a study.
Commissioned by the UK’s four higher education funding bodies, the paper recommended pilot testing to determine whether AI score predictions could complement the manual process in limited ways.
As part of a wider Future Research Assessment Programme, the AI system designed at the University of Wolverhampton used machine learning to predict scores for journal articles by identifying patterns in the scores given by human reviewers.
If successful, it was thought automation could help reduce the cost of the “labour-intensive” REF process, which consumes a “substantial amount” of the time of the more than 1,000 experts who review outputs in subpanels over the course of a year.
Researchers found that the accuracy of the system varied substantially between units of assessment and application strategies, with predictions as “poor as guessing” in some cases but up to 85 per cent accurate in others.
The study tested five different strategies for applying machine-learning predictions, based on 1,000 properties extracted from each article, in what was the first full-scale evaluation of AI for a national research evaluation system.
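To make the approach concrete, here is a minimal illustrative sketch, not the study’s actual pipeline: a classifier is trained on numeric article properties to predict human peer-review quality scores (1* to 4*), and accuracy is measured as the share of articles where the prediction matches the human score exactly. The feature set, score generation, and model choice here are all invented stand-ins for illustration only.

```python
# Illustrative sketch only: predicting peer-review quality scores (1-4)
# from numeric article properties with a standard classifier.
# All data below is synthetic; the real study used ~1,000 properties
# per article and scores assigned by REF subpanel reviewers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for article properties (e.g. citation counts,
# journal metrics, text statistics).
n_articles, n_features = 2000, 20
X = rng.normal(size=(n_articles, n_features))

# Synthetic "human" scores 1-4, loosely correlated with one feature so
# the model has only a shallow signal to learn from.
y = np.clip(
    np.round(2.5 + X[:, 0] + rng.normal(scale=0.8, size=n_articles)),
    1, 4,
).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Exact-match accuracy: how often the prediction equals the human score.
accuracy = model.score(X_test, y_test)
print(f"exact-match accuracy: {accuracy:.2f}")
```

A real evaluation would compare this accuracy across units of assessment and against a guessing baseline, which is where the study found the wide spread from chance-level to 85 per cent.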
The paper suggested that the advantages of using AI in research evaluation were its potential for future improvement and increased objectivity.
This would mean that the same output from multiple institutions in the same unit of assessment would always get the same score and could reduce human bias.
However, the study warned the AI might be biased against some types of research that score badly on traditional metrics, such as humanities-oriented contributions to medicine.
The Statistical Cybermetrics and Research Evaluation Group also warned that involving AI would make the evaluation more complex and less understandable, and that any incorrect predictions might cause the REF to lose credibility.
Therefore, despite an appetite among panel members to reduce their “considerable burden”, the paper concluded that AI should only be used to support peer review and not usurp it.
“Peer review is at the heart of REF and AI systems cannot yet replace human judgements,” it said.
“They can currently only exploit shallow attributes of articles to guess their quality and are not capable of assessing any meaningful aspects of originality, robustness and significance.”
Researchers said AI predictions were not accurate enough to replace peer review scores, or to reduce the number of peer reviewers within a subpanel. A separate review also concluded that an all-metric approach to the REF should be avoided.
But pilot testing should be used to assess whether using AI predictions and prediction probabilities alongside, or instead of, bibliometric data would be helpful for any units of assessment.
This would include helping “mop up difficult scoring decisions” near the end of the assessment period, to gain interdisciplinary input, as a tiebreaker, or to cross-check final scores.