Table 2 Results on the Bannach-Brown 2016 database

From: Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

| Method | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
|---|---|---|---|
| Zero-shot method^a,b | 0.839 [0.820, 0.858] | 0.304 [0.246, 0.368] | 0.937 [0.922, 0.950] |
| Hybrid method^b | 0.961 [0.950, 0.970] | 0.843 [0.790, 0.888] | 0.982 [0.974, 0.989] |
| ChatGPT (v4.0) | 0.945 [0.904, 0.972] | 0.930 [0.861, 0.971] | 0.960 [0.901, 0.989] |
| ChatGPT (v3.5) | 0.905 [0.856, 0.942] | 0.940 [0.874, 0.978] | 0.870 [0.788, 0.929] |
| Google PaLM 2^c | 0.900 [0.850, 0.938] | 0.850 [0.765, 0.914] | 0.950 [0.887, 0.984] |
| Meta Llama 2 | 0.780 [0.716, 0.835] | 0.950 [0.887, 0.984] | 0.610 [0.507, 0.706] |
| Majority voting | 0.915 [0.867, 0.950] | 0.960 [0.901, 0.989] | 0.870 [0.788, 0.929] |
| LCA model^d | 0.945 [0.904, 0.972] | 0.930 [0.861, 0.971] | 0.960 [0.901, 0.989] |
| *Results from the latest LLM models* | | | |
| ChatGPT-3.5-Turbo | 0.870 [0.815, 0.913] | 0.830 [0.742, 0.898] | 0.910 [0.836, 0.958] |
| ChatGPT-4-Turbo | 0.830 [0.771, 0.879] | 0.670 [0.569, 0.761] | 0.990 [0.946, 1.000] |
| Gemini-1.0-pro | 0.870 [0.815, 0.913] | 0.750 [0.653, 0.831] | 0.990 [0.946, 1.000] |
| Llama 3 | 0.910 [0.861, 0.946] | 0.930 [0.861, 0.971] | 0.890 [0.812, 0.944] |
| Claude 3 Opus | 0.920 [0.873, 0.954] | 0.900 [0.824, 0.951] | 0.940 [0.874, 0.978] |

^a Zero-shot is based on OpenAI's babbage embedding; the hybrid method then trained the model on curated labels for the top 10% of cases identified by zero-shot

^b Performance summaries for the zero-shot and hybrid methods are based on all 230 positive abstracts and all 1258 negative abstracts

^c Google PaLM 2 may return empty responses for some abstracts; we treated all null outputs as missing. "Majority voting" refers to majority voting without the decisions from Google PaLM 2

^d The LCA model, which can be seen as a more sophisticated version of "majority voting," is explained in the "Beyond majority voting" section. These notes also apply to Tables 3 and 4
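
The table does not state how the 95% CIs were computed, but the reported bounds are consistent with exact (Clopper-Pearson) binomial intervals. Below is a minimal sketch of that reconstruction; assuming that interval method is an inference, and the count 70/230 is back-calculated from the zero-shot sensitivity of 0.304 and the 230 positive abstracts in note b, not reported directly.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion k/n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Consistency check against the zero-shot sensitivity row:
# 70/230 ~= 0.304, and the interval comes out close to [0.246, 0.368].
print(70 / 230, clopper_pearson(70, 230))
```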
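Note c describes the voting rule only briefly: PaLM 2's null outputs are treated as missing, and the ensemble votes over the remaining models. A minimal sketch of that rule follows; the tie branch is an illustrative assumption, since with the three voters used here (PaLM 2 excluded) a tie cannot occur.

```python
def majority_vote(decisions):
    """Majority vote over include (1) / exclude (0) decisions.

    None marks a missing vote (e.g., an empty PaLM 2 response) and is
    dropped before counting, matching note c above.
    """
    valid = [d for d in decisions if d is not None]
    if not valid:
        return None                   # no usable votes for this abstract
    includes = sum(valid)
    if 2 * includes == len(valid):
        return 1                      # tie -> include (assumed; conservative for screening)
    return 1 if 2 * includes > len(valid) else 0

# Example: three models vote, the null PaLM 2 output is ignored.
print(majority_vote([1, 0, 1, None]))   # -> 1 (include)
```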
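Note d defers the LCA specification to the "Beyond majority voting" section. As a rough illustration of the idea only, the sketch below fits a generic two-class latent class model (in the style of Dawid-Skene) to binary screening votes by EM; it is not the paper's exact model, and the parameter names are assumptions.

```python
import numpy as np

def lca_em(votes, n_iter=200):
    """Fit a two-class latent class model to binary screening votes by EM.

    votes: (n_abstracts, n_models) array of 0/1 include decisions.
    Returns per-abstract posterior inclusion probabilities and each
    model's estimated sensitivity and specificity. Initializing the
    accuracy parameters above 0.5 anchors the "include" class and
    avoids label switching.
    """
    n, m = votes.shape
    eps = 1e-6
    pi = 0.5              # prevalence of the "include" class
    se = np.full(m, 0.8)  # per-model sensitivity
    sp = np.full(m, 0.8)  # per-model specificity
    for _ in range(n_iter):
        se = np.clip(se, eps, 1 - eps)
        sp = np.clip(sp, eps, 1 - eps)
        # E-step: posterior P(relevant | votes) per abstract, assuming
        # conditional independence of models given the latent class.
        log_inc = np.log(pi) + votes @ np.log(se) + (1 - votes) @ np.log(1 - se)
        log_exc = np.log(1 - pi) + (1 - votes) @ np.log(sp) + votes @ np.log(1 - sp)
        r = 1.0 / (1.0 + np.exp(log_exc - log_inc))
        # M-step: re-estimate prevalence and per-model accuracy.
        pi = float(np.clip(r.mean(), eps, 1 - eps))
        se = r @ votes / r.sum()
        sp = (1 - r) @ (1 - votes) / (1 - r).sum()
    return r, se, sp

# Toy usage (far too small for a real fit): five models, two abstracts.
votes = np.array([[1, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0]])
posterior, sens, spec = lca_em(votes)
```

A decision rule such as `posterior > 0.5` then plays the role that a simple vote count plays in majority voting, which is why the note frames LCA as a more sophisticated version of that ensemble.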