Article Text
Abstract
Background and Importance Assessing ChatGPT’s performance on the Health Training exam for Pharmacy specialisation (FIR) is important for gauging the role of AI in healthcare education.
Aim and Objectives To assess ChatGPT’s ability to respond to and potentially pass the Health Training exam for Pharmacy specialisation (FIR).
Material and Methods A multidisciplinary team consisting of hospital pharmacists, physicians and biomedical engineers selected an exam version for the 2022 session. One question was excluded due to the presence of an image. A brief introduction, providing context about the FIR exam and its contents, was added at the beginning of the conversation.
ChatGPT’s performance, defined as the percentage of correct answers, was evaluated through three different approaches:
1. Two sets of 50 randomly selected questions were manually entered into the OpenAI web interface within a single conversation.
2. A total of 209 questions, each with its four possible answers, were submitted from a spreadsheet via the Application Programming Interface (API) for Python.
3. Questions were submitted as open-ended items, without their predefined answer options, via the API for Python, followed by the application of Natural Language Processing (NLP). NLP assessed the similarity between the API-generated responses and the actual answers, providing a more accurate evaluation of ChatGPT’s human-like performance in a multiple-choice exam. The similarity metric compared feature vectors of sentences and produced a value representing the degree of similarity, with a maximum value of 1 signifying a perfect match and thus a correct answer.
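The similarity comparison in the third approach can be sketched as a cosine similarity between bag-of-words feature vectors of the model's free-text response and each official answer option. This is an illustrative sketch only: the function names are ours, and the actual feature extraction used in the study is not specified in the abstract.

```python
import math
import re
from collections import Counter

def vectorise(text):
    """Bag-of-words feature vector (token -> count): a simple stand-in
    for whatever sentence features the study's NLP pipeline used."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity of two texts' feature vectors; 1.0 = perfect match."""
    va, vb = vectorise(a), vectorise(b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def pick_answer(model_response, options):
    """Select the multiple-choice option most similar to the free-text response."""
    return max(options, key=lambda opt: cosine_similarity(model_response, opt))
```

A generated response identical to one of the options yields a similarity of 1, which under the scheme described above counts as a correct answer.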
Correct answers were worth 3 points, while incorrect ones incurred a deduction of 1 point. In the 2022 call, a minimum score of 97 points was required to be eligible for the allocation of FIR positions.
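The marking scheme reduces to a simple formula; the sketch below takes the +3/−1 weights and the 97-point cut-off from the text, while the function name is an illustrative assumption.

```python
def fir_score(correct, incorrect, pass_mark=97):
    """FIR marking: +3 per correct answer, -1 per incorrect answer.
    Returns the total score and whether it reaches the 2022 cut-off."""
    score = 3 * correct - incorrect
    return score, score >= pass_mark
```

For example, 50 correct and 10 incorrect answers give 3*50 - 10 = 140 points, above the 97-point threshold.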
Results Using the manual input method, ChatGPT achieved 60% and 66% accuracy on the two sets of 50 randomly selected questions (equivalent to scores of 280 and 328 points, respectively). The second method yielded a success rate of 45.5–49.0%, equating to 164–192 points. The third method produced values of 50.2–52.6% (200–220 points).
Conclusion and Relevance The findings demonstrate ChatGPT’s variable ability to provide correct responses to FIR questions depending on the methodology employed. Regardless of the approach, ChatGPT consistently achieved the minimum score required for participation in the allocation of FIR positions.
Conflict of Interest No conflict of interest.