
Assessing and Comparing Free Large Language Models’ Responses to a Clinical Case: Accuracy, Safety, and Reliability

Cicolini, Giancarlo
2025-01-01

Abstract

This study examines how five freely available Large Language Models (LLMs), namely Zephyr, Mistral 7B, LLAMA 2 7B, ChatGPT 3.5, and Copilot Precise, perform when faced with a nursing clinical scenario involving a neuropsychiatric emergency. Their responses were evaluated against established guidelines through a Delphi consensus, using a 5-point Likert scale to rate safety, accuracy, reliability, and potential for improvement. The findings underscore the primary importance of the safety and accuracy metrics. LLAMA 2 7B exhibits balanced but poor performance, scoring 3 out of 5 in Safety, Accuracy, and References, and 4 out of 5 in Improvement suggestions. ChatGPT 3.5 performs adequately in Safety, Accuracy, and References, scoring 4 out of 5 in each, which indicates proficiency in generating accurate, reliable content and ensuring patient safety, though its enhancement suggestions leave room for improvement (3 out of 5). Copilot Precise shows a distinct profile, with balanced scores of 3 out of 5 in Safety, Accuracy, and Improvements and a perfect 5 out of 5 only in References, highlighting its high accuracy in generating references. Reliability was reported in terms of both reference precision criteria and consistency over time, computed through automated assessment. These preliminary results underscore the importance of developing language models that prioritize safety and precision in clinical decision-making scenarios. Further studies should aim to improve the accuracy and dependability of these models by examining a wider range of scenarios and incorporating real-time feedback from experts, which would enhance their usefulness in clinical environments.
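The scores reported in the abstract can be tabulated for a quick side-by-side comparison. The following is a minimal sketch, not the authors' analysis: the unweighted mean across criteria and the helper names `mean_score` and `best_on` are illustrative assumptions, and Zephyr and Mistral 7B are omitted because the abstract reports no scores for them.

```python
from statistics import mean

# Scores exactly as reported in the abstract (5-point Likert scale).
# Zephyr and Mistral 7B are left out: the abstract gives no scores for them.
SCORES = {
    "LLAMA 2 7B": {"Safety": 3, "Accuracy": 3, "References": 3, "Improvement": 4},
    "ChatGPT 3.5": {"Safety": 4, "Accuracy": 4, "References": 4, "Improvement": 3},
    "Copilot Precise": {"Safety": 3, "Accuracy": 3, "References": 5, "Improvement": 3},
}


def mean_score(model: str) -> float:
    """Unweighted mean over the four criteria (an assumption, not the paper's metric)."""
    return mean(SCORES[model].values())


def best_on(criterion: str) -> list[str]:
    """Return every model that reaches the top score on one criterion."""
    top = max(scores[criterion] for scores in SCORES.values())
    return [model for model, scores in SCORES.items() if scores[criterion] == top]


for model in SCORES:
    print(f"{model}: mean {mean_score(model):.2f}")
for criterion in ("Safety", "Accuracy", "References", "Improvement"):
    print(f"Top on {criterion}: {', '.join(best_on(criterion))}")
```

Because the abstract singles out safety and accuracy as the most important metrics, an unweighted mean probably understates their role. A weighted aggregate would be a natural variation, though the abstract does not specify any weights.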
2025
ISBN: 9783031897030
ISBN: 9783031897047

Use this identifier to cite or link to this document: https://hdl.handle.net/11564/867177