
Table 4 Results on the Menon 2022 database

From: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

 

| Method | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
|---|---|---|---|
| Zero-shot method | 0.882 [0.860, 0.902] | 0.384 [0.272, 0.505] | 0.923 [0.904, 0.940] |
| Hybrid method | 0.951 [0.936, 0.964] | 0.708 [0.589, 0.810] | 0.971 [0.958, 0.981] |
| ChatGPT (v4.0) | 0.913 [0.861, 0.951] | 0.932 [0.847, 0.977] | 0.900 [0.824, 0.951] |
| ChatGPT (v3.5) | 0.711 [0.637, 0.777] | 0.315 [0.211, 0.434] | 1.000 [0.964, 1.000] |
| Google PaLM 2 | 0.569 [0.486, 0.648] | 0.000 [0.000, 0.054] | 1.000 [0.958, 1.000] |
| Meta Llama 2 | 0.827 [0.762, 0.880] | 0.808 [0.699, 0.891] | 0.840 [0.753, 0.906] |
| Majority voting | 0.884 [0.827, 0.928] | 0.808 [0.699, 0.891] | 0.940 [0.874, 0.978] |
| LCA model | 0.879 [0.830, 0.928] | 0.945 [0.893, 0.997] | 0.830 [0.756, 0.904] |
| Results from the latest LLM models | | | |
| ChatGPT-3.5-Turbo | 0.753 [0.682, 0.816] | 0.470 [0.352, 0.590] | 0.960 [0.901, 0.989] |
| ChatGPT-4-Turbo | 0.876 [0.817, 0.921] | 0.760 [0.646, 0.852] | 0.960 [0.901, 0.989] |
| Gemini-1.0-pro | 0.927 [0.877, 0.961] | 0.950 [0.872, 0.987] | 0.910 [0.836, 0.958] |
| Llama 3 | 0.920 [0.869, 0.955] | 0.960 [0.886, 0.992] | 0.890 [0.812, 0.944] |
| Claude 3 Opus | 0.804 [0.737, 0.861] | 0.550 [0.429, 0.667] | 0.990 [0.946, 1.000] |
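For reference, accuracy, sensitivity, and specificity in a table like this are proportions derived from a 2x2 confusion matrix of each method's screening decisions against the human reference labels. The sketch below shows one common way to reproduce such point estimates with binomial 95% confidence intervals; the interval method actually used by the authors is not stated in this table, so the Clopper-Pearson (exact) intervals and the example counts are assumptions for illustration only, not values from the Menon 2022 database.

```python
# Minimal sketch: metrics with exact (Clopper-Pearson) 95% CIs from a confusion matrix.
# Assumption: the paper's CI method is not specified here; this is one standard choice.
from scipy.stats import beta


def clopper_pearson(successes: int, trials: int, alpha: float = 0.05):
    """Two-sided exact binomial confidence interval for a proportion."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    hi = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lo, hi


def screening_metrics(tp: int, fp: int, tn: int, fn: int):
    """Accuracy, sensitivity, specificity (point estimate, CI low, CI high)."""
    n = tp + fp + tn + fn
    return {
        "accuracy": ((tp + tn) / n, *clopper_pearson(tp + tn, n)),
        "sensitivity": (tp / (tp + fn), *clopper_pearson(tp, tp + fn)),
        "specificity": (tn / (tn + fp), *clopper_pearson(tn, tn + fp)),
    }


# Hypothetical counts, for illustration only:
print(screening_metrics(tp=68, fp=10, tn=90, fn=5))
```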