This study aims to assess the validity of algorithms used to identify cases of myocarditis and deep vein thrombosis from hospital discharge forms. Notably, SeValid introduces an innovative approach by estimating the sensitivity of ‘narrow’ algorithms through the positive predictive value assessment of ‘possible’ algorithms.
Oral presentation
An experiment from SeValid: Integration of Artificial Intelligence in the Clinical Validation Pipeline of VAC4EU
Hyeraci G, Lippi M, Nardoni V... ICPE 2025 Washington, DC
Background: Validation of study outcomes is crucial to assess the accuracy and reliability of results generated by real-world studies, and has the potential to provide input for quantitative bias analysis. VAC4EU has established a pipeline to estimate outcome positive predictive value (PPV). The SeValid project, co-funded by ARS Toscana and VAC4EU, aims to develop a comprehensive strategy to facilitate validation studies and reduce misclassification bias. Large language models (LLMs) hold the potential to provide a second assessment and/or conduct validation on larger samples.
Objectives: To integrate the VAC4EU validation pipeline, based on human assessment, with an LLM, using an algorithm to identify myocarditis events in hospital discharge records as a case study.
Methods: A 16-item clinical questionnaire, designed by VAC4EU, was used to verify whether an event is a true myocarditis case based on the Brighton Collaboration definition. The medical charts (MC) of an event are classified as a case of myocarditis (C), as a non-case (NC) or as non-assessable (NA). Dummy MC provided by VAC4EU were used both to train a human assessor (H) and to design prompts for an open-source LLM (Gemma2 9B). The prompts were refined with additional dummy MC. Real events were extracted from the hospital discharge records of a hospital in Florence (Italy) recorded between July 2022 and June 2024, using the following algorithm: diagnosis codes in any position match ≥ 1 of the ICD9CM diagnosis codes 09382, 1303, 03282, 03643, 07423, 3912, 3980, 422, 4290. For each event, H and the LLM accessed the MC to fill out the questionnaire. Concordance between H and the LLM was assessed. A medical expert provided a third opinion on MC with discordant results. The reference standard (RS) was formed by either the concordant assignments of H and the LLM, or the third opinion. The PPV of the algorithm was calculated using H, the LLM and the RS. The RS was also used to assess the validity of H and the LLM.
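The event-selection rule and the construction of the reference standard described in the Methods can be illustrated with a minimal Python sketch. It is not the VAC4EU pipeline code. The function names `is_candidate_event` and `reference_standard` are hypothetical, and the sketch assumes that "match" means a prefix match on codes written without the decimal point (so that 4220 matches the category 422), which the abstract does not state.

```python
# ICD9CM diagnosis codes listed in the Methods, written without the decimal point.
MYOCARDITIS_CODES = (
    "09382", "1303", "03282", "03643", "07423",
    "3912", "3980", "422", "4290",
)


def is_candidate_event(diagnosis_codes):
    """Return True if any diagnosis code, in any position, matches a listed code.

    Assumption: a match is a prefix match, so "4220" matches category "422".
    """
    normalized = [code.replace(".", "").strip() for code in diagnosis_codes]
    # str.startswith accepts a tuple of prefixes.
    return any(code.startswith(MYOCARDITIS_CODES) for code in normalized)


def reference_standard(h, llm, expert=None):
    """Combine the two assessments ("C", "NC" or "NA") into the reference standard.

    Concordant assignments are kept; discordant ones use the expert's third opinion.
    """
    if h == llm:
        return h
    if expert is None:
        raise ValueError("a discordant pair needs a third opinion")
    return expert


# A discharge record with myocarditis coded in a secondary position is selected.
print(is_candidate_event(["4280", "422.91"]))
# A discordant pair (C, NA) is resolved by the expert's opinion.
print(reference_standard("C", "NA", expert="C"))
```

Under the prefix-match assumption, a three-digit entry such as 422 selects all of its subcodes, while the five-digit entries select only themselves.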
Results: There were 38 events. Out of 38 pairs of assessments (H,LLM), 28 (74%) were concordant, all (C,C). Discordant pairs were: (C,NA) (N = 6, 16%), (C,NC) (N = 3, 8%) and (NC,C) (N = 1, 3%). The algorithm's PPV was 100% and 91% based on H and the LLM, respectively. The RS had 36 C, 2 NC, 0 NA (PPV = 95%). The sensitivity of H and the LLM was 97% and 78%, respectively, while the two true NC were misclassified by both (specificity = 0).
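The concordance percentages and the RS-based PPV reported above can be reproduced from the stated counts with a short Python sketch. The helper name `ppv` is hypothetical, and the sketch assumes that PPV is computed as C / (C + NC), with non-assessable events left out of the denominator. With 0 NA in the RS, that assumption does not affect the RS figure.

```python
# Counts of (H, LLM) assessment pairs as reported in the Results.
pairs = {("C", "C"): 28, ("C", "NA"): 6, ("C", "NC"): 3, ("NC", "C"): 1}
total = sum(pairs.values())

# Share of each pair type, rounded to whole percentages.
shares = {pair: round(100 * n / total) for pair, n in pairs.items()}


def ppv(counts):
    """PPV as confirmed cases over assessable events (assumption: NA excluded)."""
    assessable = counts["C"] + counts["NC"]
    return counts["C"] / assessable


# Reference standard classification as reported in the Results.
rs_counts = {"C": 36, "NC": 2, "NA": 0}
rs_ppv = round(100 * ppv(rs_counts))

print(total)   # 38
print(shares)  # {('C', 'C'): 74, ('C', 'NA'): 16, ('C', 'NC'): 8, ('NC', 'C'): 3}
print(rs_ppv)  # 95
```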
Conclusions: The LLM proved overly cautious, and often provided wrong NA or NC results. The RS contained two NC, which were likely complex to classify, as neither H nor the LLM identified them. This experiment provides useful hints to improve the pipeline in future applications. The LLM prompts should be refined iteratively, possibly using real MC. The study highlights risks and opportunities for LLMs supporting outcome validation and improving the generation of high-quality real-world evidence based on real-world data.
2025-08-25