Analyzing ChatGPT-3.5’s Performance on the Polish Specialty Exam in Palliative Medicine: Strengths and Limitations
DOI: https://doi.org/10.69139/gk8sm619

Keywords: ChatGPT-3.5, artificial intelligence, large language models, medical specialty examination, palliative medicine

Abstract
Background: Artificial intelligence (AI) has growing applications in medicine. This study examines whether the ChatGPT-3.5 language model can pass the Polish Specialty Examination (PES) in Palliative Medicine and evaluates its accuracy in various medical question categories. Additionally, the study highlights limitations AI faces in this medical context.
Material and methods: A total of 120 PES questions from the spring 2023 session were used, presented in single-choice format with four options. Each question was posed to ChatGPT five times, and an answer was scored as correct if the model responded correctly in at least three of the five attempts. A per-question confidence score was calculated from the frequency of correct responses across those attempts. Questions were also categorized according to Bloom’s taxonomy to assess their cognitive complexity.
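A minimal sketch of this scoring rule, written in Python for illustration (the function, variable names, and example responses are our own assumptions, not the study’s code):

ATTEMPTS = 5        # each question was asked five times
MAJORITY = 3        # scored correct if at least 3 of 5 attempts were correct
PASS_MARK = 0.60    # minimum PES passing score

def score_question(responses: list[str], key: str) -> tuple[bool, float]:
    """Score one single-choice question from five model responses.

    Returns (is_correct, confidence), where confidence is the fraction
    of attempts that matched the answer key.
    """
    hits = sum(r == key for r in responses)
    return hits >= MAJORITY, hits / ATTEMPTS

# Hypothetical example: the model answered 'A' in 4 of 5 attempts; the key is 'A'.
correct, confidence = score_question(["A", "A", "C", "A", "A"], "A")
print(correct, confidence)        # True 0.8

# Exam-level result: fraction of questions scored correct vs. the threshold.
per_question = [True] * 64 + [False] * 56   # illustrative only: 64/120 = 53.33%
exam_score = sum(per_question) / len(per_question)
print(exam_score >= PASS_MARK)    # False: 0.5333 < 0.60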
Results: The minimum passing score for the PES was 60% (72 of 120 questions), while ChatGPT achieved 53.33% (64 of 120), falling short of the required threshold. Performance varied by question type, with higher accuracy on clinical-management (63.16%) and memory-based (54.32%) questions than on critical-thinking questions (51.28%). The model’s confidence scores were notably higher for the questions it answered correctly.
Conclusion: ChatGPT-3.5’s performance on the PES in Palliative Medicine reflects its access to broad knowledge bases, but it lacks the practical experience essential to medical practice. Human expertise, which integrates social, emotional, and cultural insight, remains critical for personalized patient care, underscoring the enduring advantage of human clinicians in palliative care settings.
License
Copyright (c) 2026 Piotr Dudek, Dominika Kaczyńska, Natalia Denisiewicz, Adam Mitręga, Michał Bielówka, Łukasz Czogalik, Marcin Rojek (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
