Source: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis
Method | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
---|---|---|---|
Zero-shot method ^a,b | 0.839 [0.820, 0.858] | 0.304 [0.246, 0.368] | 0.937 [0.922, 0.950] |
Hybrid method ^b | 0.961 [0.950, 0.970] | 0.843 [0.790, 0.888] | 0.982 [0.974, 0.989] |
ChatGPT (v4.0) | 0.945 [0.904, 0.972] | 0.930 [0.861, 0.971] | 0.960 [0.901, 0.989] |
ChatGPT (v3.5) | 0.905 [0.856, 0.942] | 0.940 [0.874, 0.978] | 0.870 [0.788, 0.929] |
Google PaLM 2 ^c | 0.900 [0.850, 0.938] | 0.850 [0.765, 0.914] | 0.950 [0.887, 0.984] |
Meta Llama 2 | 0.780 [0.716, 0.835] | 0.950 [0.887, 0.984] | 0.610 [0.507, 0.706] |
Majority voting | 0.915 [0.867, 0.950] | 0.960 [0.901, 0.989] | 0.870 [0.788, 0.929] |
LCA model ^d | 0.945 [0.904, 0.972] | 0.930 [0.861, 0.971] | 0.960 [0.901, 0.989] |
**Results from the latest LLM models** | | | |
ChatGPT-3.5-Turbo | 0.870 [0.815, 0.913] | 0.830 [0.742, 0.898] | 0.910 [0.836, 0.958] |
ChatGPT-4-Turbo | 0.830 [0.771, 0.879] | 0.670 [0.569, 0.761] | 0.990 [0.946, 1.000] |
Gemini-1.0-pro | 0.870 [0.815, 0.913] | 0.750 [0.653, 0.831] | 0.990 [0.946, 1.000] |
Llama 3 | 0.910 [0.861, 0.946] | 0.930 [0.861, 0.971] | 0.890 [0.812, 0.944] |
Claude 3 Opus | 0.920 [0.873, 0.954] | 0.900 [0.824, 0.951] | 0.940 [0.874, 0.978] |
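
For readers who want to reproduce or sanity-check figures like these, the sketch below computes accuracy, sensitivity, and specificity from binary include/exclude screening decisions and attaches 95% confidence intervals. The Wilson score interval and the inclusion-leaning tie-break in `majority_vote` are assumptions made for illustration (the table does not state which CI method or voting rule was used), and the function names are hypothetical.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (an assumed CI
    method; the table does not report how its 95% CIs were computed)."""
    p_hat = k / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return p_hat, centre - half, centre + half

def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity (with CIs) from binary
    screening decisions: 1 = include the abstract, 0 = exclude it."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": wilson_ci(tp + tn, tp + tn + fp + fn),
        "sensitivity": wilson_ci(tp, tp + fn),  # recall on truly eligible abstracts
        "specificity": wilson_ci(tn, tn + fp),  # proportion of correct exclusions
    }

def majority_vote(model_predictions):
    """Combine include/exclude decisions from several models by simple
    majority; breaking ties toward inclusion is an assumption, not a
    detail reported in the table."""
    return [int(sum(votes) * 2 >= len(votes)) for votes in zip(*model_predictions)]
```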