From: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis
| Method | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
|---|---|---|---|
| Zero-shot method | 0.882 [0.860, 0.902] | 0.384 [0.272, 0.505] | 0.923 [0.904, 0.940] |
| Hybrid method | 0.951 [0.936, 0.964] | 0.708 [0.589, 0.810] | 0.971 [0.958, 0.981] |
| ChatGPT (v4.0) | 0.913 [0.861, 0.951] | 0.932 [0.847, 0.977] | 0.900 [0.824, 0.951] |
| ChatGPT (v3.5) | 0.711 [0.637, 0.777] | 0.315 [0.211, 0.434] | 1.000 [0.964, 1.000] |
| Google PaLM 2 | 0.569 [0.486, 0.648] | 0.000 [0.000, 0.054] | 1.000 [0.958, 1.000] |
| Meta Llama 2 | 0.827 [0.762, 0.880] | 0.808 [0.699, 0.891] | 0.840 [0.753, 0.906] |
| Majority voting | 0.884 [0.827, 0.928] | 0.808 [0.699, 0.891] | 0.940 [0.874, 0.978] |
| LCA model | 0.879 [0.830, 0.928] | 0.945 [0.893, 0.997] | 0.830 [0.756, 0.904] |
| Results from the latest LLM models | | | |
| ChatGPT-3.5-Turbo | 0.753 [0.682, 0.816] | 0.470 [0.352, 0.590] | 0.960 [0.901, 0.989] |
| ChatGPT-4-Turbo | 0.876 [0.817, 0.921] | 0.760 [0.646, 0.852] | 0.960 [0.901, 0.989] |
| Gemini-1.0-pro | 0.927 [0.877, 0.961] | 0.950 [0.872, 0.987] | 0.910 [0.836, 0.958] |
| Llama 3 | 0.920 [0.869, 0.955] | 0.960 [0.886, 0.992] | 0.890 [0.812, 0.944] |
| Claude 3 Opus | 0.804 [0.737, 0.861] | 0.550 [0.429, 0.667] | 0.990 [0.946, 1.000] |
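The excerpt does not show how the metrics and intervals were computed. As a minimal sketch only, the Python below derives accuracy, sensitivity, and specificity from per-abstract include/exclude predictions against the human-screened reference labels, with exact Clopper-Pearson 95% intervals; the interval method is an assumption (it is consistent with the degenerate bounds at 0.000 and 1.000 in rows such as Google PaLM 2, but the paper's actual procedure is not stated here), and the function names, the `1 = include` label convention, and the majority-vote helper are hypothetical illustrations rather than the authors' code.

```python
from scipy.stats import beta


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion k/n.

    Assumed interval method; chosen only because it yields hard 0/1 bounds
    like those seen in the table.
    """
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi


def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity (each with a 95% CI) for binary
    screening decisions, where 1 = include and 0 = exclude."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    return {
        "accuracy": ((tp + tn) / n, clopper_pearson(tp + tn, n)),
        "sensitivity": (tp / (tp + fn), clopper_pearson(tp, tp + fn)),
        "specificity": (tn / (tn + fp), clopper_pearson(tn, tn + fp)),
    }


def majority_vote(*model_preds):
    """Hypothetical ensemble row: include an abstract when more than half of
    the individual models vote to include it."""
    return [1 if sum(votes) * 2 > len(votes) else 0 for votes in zip(*model_preds)]
```

An ensemble row such as "Majority voting" could then be scored by passing `majority_vote(preds_gpt4, preds_palm2, preds_llama2)` into `screening_metrics`; Wilson or normal-approximation intervals would be straightforward substitutes if the authors used those instead.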