
Table 3 Results on the Meijboom 2021 database

From: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

| Method | Accuracy (95% CI) | Sensitivity (recall) (95% CI) | Specificity (95% CI) |
|---|---|---|---|
| Zero-shot method | 0.862 [0.833, 0.888] | 0.125 [0.035, 0.290] | 0.902 [0.875, 0.924] |
| Hybrid method | 0.954 [0.935, 0.969] | 0.065 [0.008, 0.214] | 1.000 [0.994, 1.000] |
| ChatGPT (v4.0) | 0.848 [0.776, 0.905] | 0.812 [0.636, 0.928] | 0.860 [0.776, 0.921] |
| ChatGPT (v3.5) | 0.591 [0.502, 0.676] | 0.969 [0.838, 0.999] | 0.470 [0.369, 0.572] |
| Google PaLM 2 | 0.890 [0.802, 0.949] | 0.647 [0.383, 0.858] | 0.954 [0.871, 0.990] |
| Meta Llama 2 | 0.636 [0.548, 0.718] | 1.000 [0.891, 1.000] | 0.520 [0.418, 0.621] |
| Majority voting | 0.720 [0.635, 0.794] | 1.000 [0.891, 1.000] | 0.630 [0.528, 0.724] |
| LCA model | 0.841 [0.778, 0.904] | 1.000 [0.891, 1.000] | 0.790 [0.710, 0.870] |
| *Results from the latest LLM models* | | | |
| ChatGPT-3.5-Turbo | 0.667 [0.580, 0.747] | 0.970 [0.840, 0.999] | 0.570 [0.467, 0.669] |
| ChatGPT-4-Turbo | 0.840 [0.766, 0.898] | 0.560 [0.374, 0.734] | 0.930 [0.861, 0.971] |
| Gemini-1.0-pro | 0.819 [0.743, 0.881] | 0.380 [0.215, 0.568] | 0.960 [0.901, 0.989] |
| Llama 3 | 0.894 [0.829, 0.941] | 0.970 [0.840, 0.999] | 0.870 [0.788, 0.929] |
| Claude 3 Opus | 0.857 [0.785, 0.912] | 0.910 [0.755, 0.982] | 0.840 [0.753, 0.906] |
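
The metrics reported above (accuracy, sensitivity/recall, specificity) are proportions, and the bracketed ranges are 95% confidence intervals. The sketch below shows one way such figures can be derived from a screening confusion matrix, assuming exact (Clopper-Pearson) binomial intervals; the counts and the `metric_with_ci` helper are illustrative placeholders, not the study's actual data or code.

```python
# Minimal sketch: point estimates and exact (Clopper-Pearson) 95% CIs for
# accuracy, sensitivity, and specificity from hypothetical screening counts.
from statsmodels.stats.proportion import proportion_confint

def metric_with_ci(successes, total, alpha=0.05):
    """Return a proportion and its exact binomial confidence interval."""
    estimate = successes / total
    low, high = proportion_confint(successes, total, alpha=alpha, method="beta")
    return estimate, low, high

# Hypothetical confusion-matrix counts (not from the paper):
# TP = included abstracts flagged include, FN = included abstracts flagged exclude,
# TN = excluded abstracts flagged exclude, FP = excluded abstracts flagged include.
tp, fn, tn, fp = 31, 1, 87, 13

metrics = {
    "Accuracy": metric_with_ci(tp + tn, tp + fn + tn + fp),
    "Sensitivity": metric_with_ci(tp, tp + fn),   # recall on relevant studies
    "Specificity": metric_with_ci(tn, tn + fp),   # correct exclusions
}

for name, (est, low, high) in metrics.items():
    print(f"{name}: {est:.3f} [{low:.3f}, {high:.3f}]")
```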