Analyzing ChatGPT-3.5’s Performance on the Polish Specialty Exam in Palliative Medicine: Strengths and Limitations
DOI: https://doi.org/10.69139/gk8sm619

Keywords: ChatGPT-3.5, artificial intelligence, large language models, medical specialty examination, palliative medicine

Abstract
Background: Artificial intelligence (AI) has growing applications in medicine. This study examines whether the ChatGPT-3.5 language model can pass the Polish Specialty Examination (PES) in Palliative Medicine and evaluates its accuracy in various medical question categories. Additionally, the study highlights limitations AI faces in this medical context.
Material and methods: A total of 120 PES questions from the spring 2023 session were used, presented in single-choice format with four options. Each question was posed to ChatGPT five times, and an answer was scored as correct if the model responded correctly in at least three of the five attempts. A per-question confidence score was calculated from the frequency of correct responses across those attempts. Questions were also categorized according to Bloom’s taxonomy to assess their cognitive complexity.
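A minimal sketch of this scoring rule, written in Python for illustration (the function, variable names, and example responses are our own assumptions, not the study’s code):

ATTEMPTS = 5        # each question was asked five times
MAJORITY = 3        # scored correct if at least 3 of 5 attempts were correct
PASS_MARK = 0.60    # minimum PES passing score

def score_question(responses: list[str], key: str) -> tuple[bool, float]:
    """Score one single-choice question from five model responses.

    Returns (is_correct, confidence), where confidence is the fraction
    of attempts that matched the answer key.
    """
    hits = sum(r == key for r in responses)
    return hits >= MAJORITY, hits / ATTEMPTS

# Hypothetical example: the model answered 'A' in 4 of 5 attempts; the key is 'A'.
correct, confidence = score_question(["A", "A", "C", "A", "A"], "A")
print(correct, confidence)        # True 0.8

# Exam-level result: fraction of questions scored correct vs. the threshold.
per_question = [True] * 64 + [False] * 56   # illustrative only: 64/120 = 53.33%
exam_score = sum(per_question) / len(per_question)
print(exam_score >= PASS_MARK)    # False: 0.5333 < 0.60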
Results: The minimum passing score for the PES was 60% (72 of 120 questions), while ChatGPT achieved 53.33% (64 of 120), falling short of the required threshold. Performance varied by question type, with higher accuracy on clinical-management (63.16%) and memory-based (54.32%) questions than on critical-thinking questions (51.28%). The model’s confidence scores were notably higher for the questions it answered correctly.
Conclusion: ChatGPT-3.5’s performance on the PES in Palliative Medicine reflects its access to broad knowledge bases, but it lacks the practical experience essential to medical practice. Human expertise, which integrates social, emotional, and cultural insight, remains critical for personalized patient care, underscoring the enduring advantage of human clinicians in palliative care settings.
License
Copyright (c) 2026 Piotr Dudek, Dominika Kaczyńska, Natalia Denisiewicz, Adam Mitręga, Michał Bielówka, Łukasz Czogalik, Marcin Rojek (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
