From: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis
Method | Accuracy (95% CI) | Sensitivity (recall) (95% CI) | Specificity (95% CI) |
---|---|---|---|
Zero-shot method | 0.862 [0.833, 0.888] | 0.125 [0.035, 0.290] | 0.902 [0.875, 0.924] |
Hybrid method | 0.954 [0.935, 0.969] | 0.065 [0.008, 0.214] | 1.000 [0.994, 1.000] |
ChatGPT (v4.0) | 0.848 [0.776, 0.905] | 0.812 [0.636, 0.928] | 0.860 [0.776, 0.921] |
ChatGPT (v3.5) | 0.591 [0.502, 0.676] | 0.969 [0.838, 0.999] | 0.470 [0.369, 0.572] |
Google PaLM 2 | 0.890 [0.802, 0.949] | 0.647 [0.383, 0.858] | 0.954 [0.871, 0.990] |
Meta Llama 2 | 0.636 [0.548, 0.718] | 1.000 [0.891, 1.000] | 0.520 [0.418, 0.621] |
Majority voting | 0.720 [0.635, 0.794] | 1.000 [0.891, 1.000] | 0.630 [0.528, 0.724] |
LCA model | 0.841 [0.778, 0.904] | 1.000 [0.891, 1.000] | 0.790 [0.710, 0.870] |
Results from the latest LLM models | | | |
 ChatGPT-3.5-Turbo | 0.667 [0.580, 0.747] | 0.970 [0.840, 0.999] | 0.570 [0.467, 0.669] |
 ChatGPT-4-Turbo | 0.840 [0.766, 0.898] | 0.560 [0.374, 0.734] | 0.930 [0.861, 0.971] |
 Gemini-1.0-pro | 0.819 [0.743, 0.881] | 0.380 [0.215, 0.568] | 0.960 [0.901, 0.989] |
 Llama 3 | 0.894 [0.829, 0.941] | 0.970 [0.840, 0.999] | 0.870 [0.788, 0.929] |
 Claude 3 Opus | 0.857 [0.785, 0.912] | 0.910 [0.755, 0.982] | 0.840 [0.753, 0.906] |
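
The accuracy, sensitivity, and specificity above are standard confusion-matrix proportions against the human-screened reference labels, with 95% confidence intervals in brackets. As a rough illustration only, the sketch below computes these proportions with exact (Clopper-Pearson) binomial intervals and a simple majority vote over several models' include/exclude decisions. The metric definitions are standard, but the CI method, the function names (`screening_metrics`, `majority_vote`), and the example data are assumptions for illustration, not taken from the paper.

```python
from scipy.stats import beta


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion k/n."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper


def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), and specificity with exact 95% CIs.

    y_true, y_pred: sequences of 0/1 labels (1 = include, 0 = exclude).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    return {
        "accuracy": ((tp + tn) / n, clopper_pearson(tp + tn, n)),
        "sensitivity": (tp / (tp + fn), clopper_pearson(tp, tp + fn)),
        "specificity": (tn / (tn + fp), clopper_pearson(tn, tn + fp)),
    }


def majority_vote(votes_per_model):
    """Combine per-abstract 0/1 votes from several models into one decision.

    Requires a strict majority to include; ties count as exclude (an assumption).
    """
    n_models = len(votes_per_model)
    return [int(sum(votes) > n_models / 2) for votes in zip(*votes_per_model)]


# Hypothetical example: three models voting on five abstracts.
combined = majority_vote([[1, 0, 1, 1, 0], [1, 1, 0, 1, 0], [0, 0, 1, 1, 1]])
print(screening_metrics([1, 0, 1, 1, 0], combined))
```

The exact interval is one common choice for proportions over small screening sets; other methods (e.g. Wilson) would give slightly different bounds, and the paper's exact procedure is not restated here.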