Table 1 Performance of model gpt-4-1106 at different probability cutoffs. Performance is measured by specificity, sensitivity (recall, R), precision (P), F-measure (the harmonic mean of P and R), and the work saved over sampling (WSS) metric of Cohen et al. [21]

From: Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis

| Relevance probability scores | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity = TN / (TN + FP) | 0.00 | 0.11 | 0.29 | 0.32 | 0.49 | 0.55 | 0.55 | 0.62 | 0.79 | 0.97 | 1.00 |
| Recall (Sensitivity) R = TP / (TP + FN) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.96 | 0.49 | 0.05 |
| Precision (P) = TP / (TP + FP) | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 | 0.05 | 0.17 | 0.35 |
| F-measure F = 2 × P × R / (P + R) | 0.02 | 0.03 | 0.03 | 0.03 | 0.04 | 0.05 | 0.05 | 0.06 | 0.10 | 0.25 | 0.09 |
| WSS = (TN + FN) / N - (1.0 - R) | 0 | 0.11 | 0.28 | 0.32 | 0.48 | 0.55 | 0.55 | 0.61 | 0.75 | 0.48 | 0.04 |
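
The formulas in the table can be written as a short Python sketch. This is an illustration, not the authors' code: the names `screening_metrics` and `confusion_at_cutoff` are hypothetical, the example counts are made up, and the rule that a record is included when its score is greater than or equal to the cutoff is an assumption (it is consistent with recall being 1.00 at cutoff 0, but the table does not state it).

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ScreeningMetrics:
    specificity: float
    recall: float
    precision: float
    f_measure: float
    wss: float


def confusion_at_cutoff(
    scores: Sequence[float], relevant: Sequence[bool], cutoff: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for one cutoff.

    Assumption: a record is predicted relevant when score >= cutoff.
    """
    tp = fp = tn = fn = 0
    for score, is_relevant in zip(scores, relevant):
        predicted = score >= cutoff
        if predicted and is_relevant:
            tp += 1
        elif predicted and not is_relevant:
            fp += 1
        elif not predicted and not is_relevant:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> ScreeningMetrics:
    """Compute the five metrics using the formulas given in the table rows."""
    n = tp + fp + tn + fn
    # Guard each ratio against an empty denominator (returning 0.0 is a choice
    # made for this sketch, not something the table specifies).
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # WSS = (TN + FN) / N - (1.0 - R)
    wss = (tn + fn) / n - (1.0 - recall) if n else 0.0
    return ScreeningMetrics(specificity, recall, precision, f_measure, wss)


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not taken from the study.
    m = screening_metrics(tp=8, fp=30, tn=60, fn=2)
    print(f"specificity={m.specificity:.2f} recall={m.recall:.2f} "
          f"precision={m.precision:.2f} F={m.f_measure:.2f} WSS={m.wss:.2f}")
```

Calling `confusion_at_cutoff` once per cutoff from 0 to 1 and passing the counts to `screening_metrics` would produce one column of the table per cutoff. WSS combines the fraction of records excluded from manual screening with the recall lost, which is why it peaks at an intermediate cutoff (0.8 in the table) and does not increase monotonically the way specificity does.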