Description
Write a 3-page paper in APA 7 format answering the following questions and utilizing the attached references.

The balance between ethical considerations and technological advancement is an ongoing topic of discussion. Some express concerns about privacy and efficiency, while others appreciate the potential for improved diagnostics. What are the implications of using AI for decision-making in healthcare, and how can a balance be struck between algorithmic assistance and human expertise to ensure optimal patient outcomes? What is the method used? Why is it important, and what value will my study bring to the table? What is the gap in knowledge (what is not yet known)? Utilize the following questions as the "gap":
1. The role of healthcare professionals in collaborative decision-making processes, and methods to integrate human expertise into AI decision-making processes, creating a symbiotic relationship between technology and healthcare professionals.
2. Long-term effects on patient outcomes.
3. Real-world impact of AI-driven decisions on treatment effectiveness.
4. Patient perspectives and trust in AI decisions.
5. Developing explainability and interpretability for AI-driven healthcare decisions. How can we make AI algorithms more transparent and understandable to healthcare professionals and patients?
Unformatted Attachment Preview
www.nature.com/npjdigitalmed

ARTICLE  OPEN

The impact of inconsistent human annotations on AI driven clinical decision making

Aneeta Sylolypavan1, Derek Sleeman2, Honghan Wu1,3 ✉ and Malcolm Sim4

1Institute of Health Informatics, University College London, London, United Kingdom. 2School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, Scotland, UK. 3Alan Turing Institute, London, United Kingdom. 4School of Medicine, Nursing and Dentistry, University of Glasgow, Aberdeen, Scotland, UK. ✉email: [email protected]
In supervised learning model development, domain experts are often used to provide the class labels (annotations). Annotation
inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., medical
image, diagnostics, or prognostic status), due to inherent expert bias, judgments, and slips, among other factors. While their
existence is relatively well known, the implications of such inconsistencies are largely understudied in real-world settings when supervised learning is applied to such 'noisy' labelled data. To shed light on these issues, we conducted extensive experiments and
analyses on three real-world Intensive Care Unit (ICU) datasets. Specifically, individual models were built from a common dataset,
annotated independently by 11 Glasgow Queen Elizabeth University Hospital ICU consultants, and model performance estimates
were compared through internal validation (Fleiss’ κ = 0.383 i.e., fair agreement). Further, broad external validation (on both static
and time series datasets) of these 11 classifiers was carried out on a HiRID external dataset, where the models’ classifications were
found to have low pairwise agreements (average Cohen’s κ = 0.255 i.e., minimal agreement). Moreover, they tend to disagree more
on making discharge decisions (Fleiss’ κ = 0.174) than predicting mortality (Fleiss’ κ = 0.267). Given these inconsistencies, further
analyses were conducted to evaluate the current best practices in obtaining gold-standard models and determining consensus. The
results suggest that: (a) there may not always be a “super expert” in acute clinical settings (using internal and external validation
model performances as a proxy); and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal
models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for
determining consensus achieves optimal models in most cases.
npj Digital Medicine (2023)6:26 ; https://doi.org/10.1038/s41746-023-00773-3
INTRODUCTION
Classical supervised machine learning assumes the labels of
training examples are all correct, ignoring the presence of class
noise and inaccuracies1. In healthcare, this assumption may not
hold even when highly experienced clinicians provide these
labels, due to the degree of noise, observer subjectivity and bias
involved. If neglected in the training of a Machine Learning
Decision-Support-System (ML-DSS), annotation inconsistencies
may result in an arbitrarily partial version of the ground truth, and in subsequent unpredictable clinical consequences, including
erroneous classifications2–4.
Ideally, class labels are obtained via a knowledge acquisition process, which involves choosing the appropriate "gold-standard" on which to base these ground-truth class labels, in order to build a Knowledge-Based System (KBS). Within the healthcare and biomedical setting,
clinical domain experts are often used to provide these labels5.
However, in many clinical areas, these ground truths are hard to
find and define, due to the pathophysiological, diagnostic and
prognostic uncertainties inherent to medicine2,6.
Cognitive Psychology has shown experimentally that humans (and hence experts) make "slips", for example due to cognitive overload and biases. On the other hand, the field of
expert systems and KBS has assumed that for (most) disciplines
“slip-free” highly skilled experts exist, and the key task is how such
experts can be objectively or subjectively identified. However,
increasing evidence from the literature shows that, on common sets of (e.g., classification) tasks, groups of experts often significantly disagree with each other5,7,8. In 2021, Kahneman et al.9 published a major contribution to this topic, Noise: A Flaw in Human Judgment, which convincingly makes the case that fellow experts in many disciplines do differ. These authors9 distinguish between judgments and opinions: with the former, experts are expected to provide a response from a (fixed) set of alternatives, whereas opinions are much more open-ended. In
this paper, we deal with tasks that require the various experts to
make judgments.
There are four main sources of annotation inconsistencies2,8,10–17: (a) Insufficient information to perform reliable
labelling (e.g., poor quality data or unclear guidelines); (b)
Insufficient domain expertise; (c) Human error (i.e., slips & noise);
(d) Subjectivity in the labelling task (i.e., judgment & bias). In this
study, where highly experienced clinical annotators were used and the labelling task was well understood, with 60 instances to annotate, we believe the main source of inconsistency investigated is interrater variability resulting from observer bias, judgment, and noise. Throughout this paper, we define 'noise' as
system noise, i.e. unwanted variability in judgments that should
ideally be identical9.
Kahneman et al.9 note that between-person noise (i.e., interrater variability) in the medical profession is most common when clinicians are required to make judgments, as opposed to following a routine or largely mechanical diagnosis (i.e., consisting of set tests or quantitative rules); Kahneman et al. outline a series of examples. Jain et al.18 found that in diagnosing breast proliferative lesions, agreement amongst pathologists was only 'fair' (Fleiss' κ = 0.34). Regier et al.19 showed highly
trained specialist psychiatrists only agreed on a diagnosis of 'major depressive disorder' 4–15% of the time (Fleiss' κ = 0.28)20.

a ICU-PSS annotation categories:
A: Normal physiological parameters without use of drugs like adrenaline, only small amounts of fluids, and low levels of inspired oxygen.
B: Relatively stable (i.e., near normal physiological parameters) with low levels of support.
C: Either more stable than patients in category D or the same level of stability but on lower levels of support (e.g., fluids, drugs and inspired oxygen).
D: Patient more stable than those in category E but is likely to be receiving considerable amounts of support (e.g., fluid boluses, drugs such as adrenaline, and possibly high levels of oxygen).
E: Physiological parameters (e.g., blood pressure and heart rate) having extreme values (low or high) and likely to be receiving high levels of support.

b Example instances of a QEUH ICU annotated dataset:
PseudoID | Timepoint        | Adrenaline | Noradrenaline | FiO2 | SpO2 | MAP | HR | Annotation
01       | 21/08/2017 08:00 | 0          | 0.6           | 0.3  | 98   | 70  | 56 | C
02       | 16/07/2017 02:00 | 0          | 0             | 0.30 | 99   | 94  | 98 | A

Fig. 1 Description of the QEUH annotated training data. a ICU-PSS annotation categories. b Example instances of a QEUH ICU annotated dataset.

Halford
et al.21 showed minimal agreement among EEG experts for the
identification of periodic discharges in continuous ICU EEG
recordings (average pairwise Cohen’s κ = 0.38). Moor et al.22
describe the significant issue of disagreement on the definition of sepsis, a leading cause of death in ICUs worldwide. Zhang
et al.23 investigated Emergency Department (ED) clinicians' referrals to inpatient teams and found that for 39.4% of admissions, patients were admitted to a different inpatient team than the one initially referred to by the ED. Xia and Yetisgen-Yildiz24 showed almost no
agreement between clinical annotators identifying pneumonia
from chest x-ray reports (Cohen’s κ = 0.085), and that “medical
training alone is not sufficient for achieving high inter-annotator
agreement”. The presence of noise is clearly pervasive across a
variety of medical domains, including ICU settings.
Using such clinicians to establish the Knowledge Base results in
a ‘shifting’ ground truth, depending on which expert(s) are used.
Label noise in training data has been shown empirically to result
in4,11,25–28: decreased classification accuracy, increased complexity
of inferred models (e.g., increased size of decision trees), an increased number of training samples needed, and difficulty in
feature selection. To the best of our knowledge, this paper is one
of the first studies that investigates biases/inconsistencies among
a sizeable number (11) of clinicians in acute clinical decision-making scenarios (ICU settings), using an external validation
dataset.
Frequently, two approaches are used to address class label
noise in ML development. The first involves utilising data
cleansing methods, where noisy labels are identified and
relabelled/removed before training. The second involves using
label noise-tolerant algorithms, where label noise is accounted for
during learning10,12,29. However, applying these methods may
result in the loss of subtle and potentially important differences
between annotators’ class labels. (This latter issue is addressed in
the Further work section). There is some informative literature
discussing methods to improve the quality of clinical labels,
including establishing clear annotation guidelines24 and modelling annotation errors of the human experts30. However, most of
this literature considers image classification tasks – there is a lack
of empirical studies around improving the quality of symbolic
labels within medical annotation tasks.
The aim of this study is to assess the (in)consistency of human
annotations for AI model development and the impact on real-world clinical decision-making in ICU settings. The overall class
label quality is strongly impacted by disagreements between
annotators. The focus of this study is on investigating the impact
and effective utilisation of experts’ disagreements (via their
annotations) in developing ML models rather than resolving the
deviation of their judgments for forming a “ground-truth”. We
conduct extensive experiments demonstrating how differences in
judgments between clinical expert annotators may lead to
classification models with varying performance (therefore varying
clinical utility), and how to obtain an optimal consensus from such
differences, to facilitate AI driven clinical decision-making.
Specifically, Sleeman et al.5,7 reported clinical experts sometimes
disagree when labelling the severity of an Intensive Care Unit (ICU)
patient on a five-point scale (A-E), based on the values of six
clinical variables. The current study addresses the question: ‘What
are the implications of these differences in judgment on the
resulting classifier model performance and real-world ICU clinical
decision-making?’ We therefore proposed the hypothesis that the
M classifiers, derived from datasets individually labelled by M
clinical experts, produce consistent classifications when applied to
a relevant external dataset. The objectives of this study are to: 1)
Build classifiers from the 11 individually annotated Queen
Elizabeth University Hospital (QEUH) ICU datasets. 2) Evaluate
the classifiers’ performances on real-world discharge outcomes
(discharged alive from ICU and died in ICU) in an external ICU
dataset: HiRID. 3) Assess various approaches for dealing with
annotation inconsistencies, as these frequently create sub-optimal
AI models.
RESULTS
This study focuses on a scenario of using AI technologies for
facilitating a clinical decision-making problem that ICU consultants
encounter on a day-to-day basis, as described below.
Clinical question
Can we use a five-point ICU Patient Scoring System (ICU-PSS) scale (A-E) to address the question "How ill is the patient?", where E represents severe cardiovascular instability and A represents a relatively stable patient? Figure 1a provides a description of the
ICU-PSS scale and Supplementary Table 1 contains further details.
The training dataset was obtained from the Glasgow Queen
Elizabeth University Hospital (QEUH) ICU patient management
system. It contains 60 data instances described by six clinical
features: two drug variables (Adrenaline and Noradrenaline) and
four physiological parameters (FiO2, SpO2, mean arterial pressure
(MAP) and heart rate (HR)). Note, the six variables are those which
clinicians regularly use in the ICU to assess how ill a particular
patient is. Example annotations are shown in Fig. 1b. The QEUH
dataset may contain trauma and non-trauma ICU patient data.
Our main aim is to assess the (in)consistency of human
annotations for AI model development and the impact on real-world clinical decision-making in ICU settings. This is broken down
into the following aspects.
i. Evaluation setup: (a) ML models are developed using the
QEUH annotated datasets; (b) external validation datasets
are prepared, and all model performance assessments are to
be conducted on these datasets.
ii. Consistency quantification: We choose Cohen’s κ scale31,32
and Fleiss’ κ33,34 to measure the extent to which annotators’
AI models assign the same category to the same instance.
Higher values on these scales suggest stronger levels of
agreement. Cohen’s scale can be summarized as: 0.0–0.20
(None); 0.21–0.39 (Minimal); 0.40–0.59 (Weak); 0.60–0.79
(Moderate); 0.80–0.90 (Strong); > 0.90 (Almost Perfect). (A brief computational sketch of these agreement measures is given below.)
iii. Impact on real-world decision-making: we chose two real
ICU decision-making scenarios, both of which are binary
classification tasks. First, whether a patient should be
discharged from ICU in the next hour; second, whether a
patient is going to die in ICU within the next hour. We
investigate two methods of external validation – one using
hourly snapshots of patient data (i.e., static data) and
another using time series data (i.e., temporal data).
iv. Evaluate current "best practices" of obtaining the gold-standard: we evaluate (a) whether there is a "super expert" whose judgment should be used as the gold-standard when disagreements occur; and (b) whether a consensus can be obtained from all expert judgments to achieve the gold-standard.
An overview of the experimental approach described above is
found in Fig. 2.
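As a minimal illustration (not the authors' code) of how these agreement measures can be computed, the sketch below uses scikit-learn's cohen_kappa_score for pairwise Cohen's κ and statsmodels' fleiss_kappa for the overall Fleiss' κ; the small label matrix is a placeholder, not the study's annotations.

# Minimal sketch: quantifying inter-annotator agreement with Cohen's kappa
# (pairwise) and Fleiss' kappa (all annotators). The label matrix below is an
# illustrative placeholder, not the QEUH annotations.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = instances, columns = annotators; labels are the ICU-PSS categories A-E
annotations = np.array([
    ["A", "A", "B"],
    ["C", "B", "B"],
    ["E", "E", "D"],
    ["B", "B", "B"],
    ["D", "C", "D"],
])

# Pairwise Cohen's kappa, averaged over all annotator pairs
pair_scores = [
    cohen_kappa_score(annotations[:, i], annotations[:, j])
    for i, j in combinations(range(annotations.shape[1]), 2)
]
print("mean pairwise Cohen's kappa:", np.mean(pair_scores))

# Fleiss' kappa over all annotators: aggregate_raters converts the
# instance-by-annotator matrix into instance-by-category counts
counts, _ = aggregate_raters(annotations)
print("Fleiss' kappa:", fleiss_kappa(counts))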
Quantifying the consistency of expert judgments
Recall that the central hypothesis for this study is: the M classifiers,
derived from the datasets individually labelled by M clinical
experts, produce identical classifications when applied to a
relevant external dataset.
Decision tree (DT) and random forest (RF) classifiers were built
from the QEUH annotated datasets, in part as both are popular
choices in clinical machine learning literature. DT was selected as
the resulting tree plots can be used to infer the decision-making
process of the learnt models, as well as compare the different
complexities between annotator models. RF was used to compare
whether more powerful models (compared to DT) would make the
inconsistency less significant – which we show in later subsections
is not the case.
11 classifiers were derived from each of the 11 consultants’
annotated datasets, which contained data for 6 clinical variables
(Adrenaline, Noradrenaline, FiO2, SpO2, MAP, HR) and the severity
class labels (A-E). The annotation labelling (A-E) across the 60
training instances differs across the 11 annotators, as shown in
Fig. 3a. Note, we tried class-balancing techniques to balance the
class labels within the annotated datasets prior to training,
however this did not result in a significant performance difference
(see Supplementary Table 2). Therefore, we decided to build
classifiers using the original annotated datasets. The 11 consultants who annotated the QEUH datasets were randomly
assigned anonymous code names (C1-C11) following the annotation exercise in the previous Sleeman et al.5 study. These code
names are referred to throughout this paper. Each consultant’s
corresponding RF classifier is referred to as Cn-RF, where n refers to consultants 1–11.
The trained models predict ICU-PSS labels (A-E) for a patient,
indicating their level of severity. A standard internal validation
experiment across multiple annotated datasets involves first
establishing a ground truth, most likely through taking a majority
vote across all annotators for each instance. Then each trained
consultant model would be run against this ground truth to
establish internal validation performance. We developed and
utilised a different method, more relevant to this study, where
each trained model was run against the original annotations it
learnt from – thus, these internal validation results indicate the
‘learnability’ of the original annotated datasets, i.e., how well the
associations between the attribute variables and provided
annotations can be learnt, and in turn how easily the annotator’s
decision-making can be reproduced. These internal validation F1 (micro) scores range between 0.50 and 0.77 across the 11 RF
classifiers, as seen in Fig. 5a. The feature importance across the six
predictive variables differs across the classifiers, as shown in Fig. 4.
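The 'learnability' check described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' exact code; the column names (the six clinical variables plus an 'Annotation' label column) are assumed.

# Sketch of the per-annotator 'learnability' check: fit a Random Forest on one
# consultant's annotations and score it with k-fold cross-validated F1 (micro)
# on those same annotations. Column names are assumed for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["Adrenaline", "Noradrenaline", "FiO2", "SpO2", "MAP", "HR"]

def learnability_score(annotated_df: pd.DataFrame, label_col: str = "Annotation",
                       folds: int = 5, seed: int = 0) -> float:
    """Cross-validated F1 (micro) of an RF trained on one annotator's labels."""
    X = annotated_df[FEATURES]
    y = annotated_df[label_col]          # ICU-PSS labels A-E from this annotator
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring="f1_micro")
    return scores.mean()

# e.g. one score per consultant, given a dict {"C1": df_c1, ..., "C11": df_c11}
# learnability = {name: learnability_score(df) for name, df in annotated_sets.items()}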
[Fig. 2 flowchart elements: QEUH training dataset; model development (Decision Tree, Random Forest) with internal validation (k-fold cross-validation); external validation dataset preparation (HiRID, MIMIC-III); consistency quantification (Cohen's κ, Fleiss' κ); Experiment 1: classify patient discharge status on a static HiRID validation dataset; Experiment 2: classify patient discharge status and compare performance on static vs time-series HiRID validation datasets; analysis of current practices (super-expert, majority-vote); novel consensus-seeking method (evaluate learnability before seeking consensus); investigating patterns of change (feature importances, temporal HiRID validation dataset).]
Fig. 2 Overview of the experimental approach, outlining the dataflow and key analytical steps. The left component (with three boxes)
illustrates the model derivation including dataset, models and internal validation methods. The top component with two green boxes
denotes the external validation dataset selection and preparation. The middle component (circled by a dashed line) shows the external
validation experiments. The right component (with four pink boxes) describes the external validation experiment details including
inconsistent measurements, consensus seeking methods and decision making considering changing patterns.
Fig. 3 Distributions of the 11 consultants’ annotations on the training dataset and predicted labels on the external validation dataset.
a Annotation distributions across all consultants’ (C1-C11) labelled QEUH training datasets. b Predicted label distributions across the
consultants’ RF multiclass models, run on the HiRID validation dataset. c Pairwise Cohen’s κ values across all consultant pairs for the predicted
labels made by the multiclass RF models on the external HiRID validation dataset.
With all the external validation experiments, the focus is on
predicting the two extreme clinical scenarios (discharged alive
from ICU or died in ICU). In this first external validation
experiment, the trained models were run on a HiRID test dataset,
to predict severity labels (A-E) on 2600 instances containing data
for the same 6 clinical variables (1300 of these instances correspond to patients who were discharged alive from the ICU, and a further 1300 to patients who died in the ICU). As our focus is a
binary (discharge status) classification task, we mapped the
multiclass A-E severity label classifications to binary discharged/
died classifications as follows:
● In the last hour before a patient is discharged (alive) from ICU, their classification on the ICU-PSS scale is 'A'.
● In the last hour before a patient dies in ICU, their classification on the ICU-PSS scale is 'E'.
Note, in the HiRID dataset, not all patients with an ‘A’
classification were discharged within the next hour. Similarly,
not all patients with an ‘E’ classification died within the following
hour; many patients upon arrival to ICU are extremely ill and are
often rated as an ‘E’.
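As an illustration of this construction (not the actual HiRID schema or the authors' preprocessing code), the following sketch selects, for each patient, the last hourly observation recorded within the hour preceding the discharge or death event; the column names and event-table layout are assumptions.

# Illustration only (assumed column names, not the HiRID schema): keep, for each
# patient, the last hourly observation within the hour preceding the discharge
# or death event, yielding the two outcome groups used for external validation.
import pandas as pd

def build_outcome_groups(obs: pd.DataFrame, outcomes: pd.DataFrame) -> pd.DataFrame:
    """obs: hourly rows with ['patient_id', 'time'] plus the six clinical variables.
    outcomes: one row per patient with ['patient_id', 'event_time', 'outcome'],
    where outcome is 'discharged_alive' or 'died'. All column names are assumed."""
    merged = obs.merge(outcomes, on="patient_id")
    # observations falling within the hour before the discharge/death event
    in_last_hour = (merged["event_time"] - merged["time"]).between(
        pd.Timedelta(0), pd.Timedelta(hours=1))
    return (merged[in_last_hour]
            .sort_values("time")
            .groupby("patient_id")
            .tail(1))          # one labelled row per patient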
The predicted labels across the 2600 HiRID test instances differ
across the annotators, as shown in Fig. 3b. It is clear from
reviewing this diagram that there is a great deal of variation in the
classifications of the experts’ models, with only a few models
having comparable labels. The corresponding pairwise inter-annotator agreements (IAAs) for these A-E predicted labels, using Cohen's scale, range from −0.01 (None) to 0.48 (Weak) across the annotator models, and are shown in Fig. 3c. The
average pairwise Cohen’s κ score is 0.255 (Minimal agreement).
Fleiss’ κ for these predicted labels is 0.236 (Fair agreement). Note,
IAA is used as an abbreviation for “Inter-Annotator Agreement”
throughout this paper.
These results were obtained using the Random Forest
classifiers35, trained on the 11 consultants’ annotated datasets.
The corresponding classifiers obtained using the Decision Tree
algorithm25 gave comparable results, see ref. 36. Classifiers trained
using XGBoost and SVM also gave comparable results to the RF
models, as shown in Supplementary Fig. 3.
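A hedged sketch of such a model-family comparison is given below; it is not the authors' code, and it assumes a feature matrix X of the six clinical variables and integer-encoded ICU-PSS labels. The third-party xgboost package provides the scikit-learn-compatible XGBClassifier used here.

# Sketch: fit the same annotated dataset with different model families (as in
# the RF/DT/XGBoost/SVM comparison above) and compare cross-validated F1 (micro).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # third-party package

def compare_families(X, y_encoded, folds=5):
    """y_encoded: integer-encoded ICU-PSS labels (XGBoost expects numeric classes)."""
    models = {
        "DT": DecisionTreeClassifier(random_state=0),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "XGB": XGBClassifier(eval_metric="mlogloss", random_state=0),
        "SVM": SVC(kernel="rbf", random_state=0),
    }
    return {name: cross_val_score(m, X, y_encoded, cv=folds, scoring="f1_micro").mean()
            for name, m in models.items()}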
Investigating inter-annotator agreement across the ICU
discharge status classifications
Further, we consider the actual decisions which the classifiers from
the 11 QEUH consultants made concerning the HiRID validation
dataset which you will recall, contained 1300 instances which
correspond to the patient being discharged alive in the next hour
(i.e., ICU-PSS label ‘A’, as outlined in the mapping above) and 1300
instances where the patient died in the ICU within the following
hour (i.e., ICU-PSS label ‘E’). These results are summarised in Fig.
5a. Recall, the trained classifiers predict ICU-PSS classification
labels (A-E) for a patient, indicating their level of severity. In this
first external validation experiment, we treat the trained models as
predicting three classes: CL1 = A, CL2 = B/C/D, & CL3 = E. The
external validation F1 scores reported in Fig. 5a are calculated
using the F1 micro average – computing a global average F1 score
by counting the sums of the True Positives, False Negatives, and
False Positives. F1 score37 is the harmonic mean of the precision
and sensitivity of the classifier, where a higher score indicates a
higher performing model.
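The grouping and scoring just described can be sketched as follows; the label arrays are placeholders rather than HiRID predictions.

# Sketch of the three-class grouping used in the first external validation
# experiment (CL1 = 'A', CL2 = 'B'/'C'/'D', CL3 = 'E') and the F1 micro score.
# The example arrays are placeholders, not HiRID predictions.
from sklearn.metrics import f1_score

GROUP = {"A": "CL1", "B": "CL2", "C": "CL2", "D": "CL2", "E": "CL3"}

def to_three_classes(labels):
    return [GROUP[l] for l in labels]

# true labels: 'A' = discharged alive in the next hour, 'E' = died in the next hour
y_true = ["A", "A", "E", "E", "A", "E"]
y_pred = ["A", "B", "E", "C", "A", "E"]          # a model's ICU-PSS predictions

score = f1_score(to_three_classes(y_true), to_three_classes(y_pred), average="micro")
print("F1 (micro):", score)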
Figure 5a reports the number of correctly classified “Discharged
Alive” and “Discharged Dead” labels across all 11 classifiers. These
results suggest that C10 is the ‘most reluctant’ to discharge
patients, with the lowest number of correct “Discharged Alive”
classifications, referring to the number of correctly predicted
admissions discharged alive within 1 h. In contrast, C2 and C4 are
the ‘most likely’ to discharge patients, with the highest number of
correct “Discharged Alive” cases.
Scenario 1: Patients discharged alive from ICU. Focusing only on
the instances where the patient was discharged alive, we observe
the average pairwise inter-annotator agreement (Cohen’s κ) is 0.21
(Minimal agreement). Fleiss’ κ for these predicted labels is 0.174
(Slight agreement).
Scenario 2: Patients died in ICU. Focusing now on the instances
where the patient died in ICU, we observe the average pairwise
inter-annotator agreement (Cohen’s κ) is 0.28 (Minimal agreement). Fleiss’ κ for these predicted labels is 0.267 (Fair agreement).
This suggests clinical domain experts agree more when
predicting mortality, compared to making discharge decisions.
Note, due to the low number of ‘E’ labels across the annotated
datasets, limited insights and comparisons can be inferred for
these predicted “died” labels. In future related studies we will
acquire more class-balanced datasets to address this issue.
Figure 5b shows an example of one consultant's (C1) confusion
matrix plot, outlining the distribution of the RF predicted labels
when run on the HiRID validation dataset. Predicted labels 0–4
correspond to ICU-PSS labels A-E, respectively. True label = 0
corresponds to the patient being discharged alive from ICU within
the next hour (i.e., ICU-PSS label ‘A’); and true label = 4
corresponds to the patient having died in ICU within the following
hour (i.e., ICU-PSS label 'E'). This confusion matrix shows C1-RF correctly classified the patient as 'Discharged Alive' for 337 cases, and correctly classified the patient as 'Discharged Dead' for 229 cases. The trained models were treated as predicting three classes: CL1 = A, CL2 = B/C/D, & CL3 = E.

Fig. 4 Feature importance distributions across the Random Forest models, trained on the 11 consultants' (C1-C11) QEUH annotated datasets. The x-axis lists the 11 classifiers and the y-axis is the importance value, with a range from 0 to 1, where 1 denotes the largest importance.
As the QEUH training data consists of hourly snapshots of
patient physiological/pharmacological readings, we ran this
external validation experiment with a HiRID validation dataset
containing similarly static data. However, Fig. 5a shows the
external validation performance is significantly lower than the
internal validation performance. This could indicate that extreme
decision-making at ICUs (predicting discharge/death) may require
continuous monitoring (i.e., using time series data) – this is
explored further in the later subsection ‘Assessing Time Series
External Validation Methods’. Additionally, the annotation distributions shown in Fig. 3a suggest that human annotators may be
less likely to choose extreme label categories (i.e., A or E) when
presented with a multiclass labelling task, which in turn results in
poor performance when predicting these scenarios.
For the classifiers that had high internal validation performance (C2-RF, C4-RF, C8-RF), we can infer that these consultants’
annotated datasets were highly learnable (recall, ‘learnability’
indicates how well the associations between the input variables
and provided annotations can be learnt, and in turn how easily
the annotator’s clinical rationale can be reproduced). Despite
having similarly high internal validation performance, consultants C2 and C8 differ in their initial QEUH annotation
distributions and subsequent feature importance distributions,
as outlined in Fig. 3a and Fig. 4, resulting in differing
distributions in their predicted labels on the HiRID validation
dataset. As shown in Figs. 6a and 6b, the C2 QEUH annotated
dataset consists of 3.3% ‘C’ labels and 10.0% of ‘E’ labels,
whereas the C8 annotated dataset consists of 36.7% ‘C’ labels
and 1.7% ‘E’ labels. The inferred C2-RF classifier predicted labels
consists of 1.4% ‘C’ labels and 11.2% ‘E’ labels, whereas the
inferred C8-RF classifier predicted labels consists of 12.5% ‘C’
labels and 1.5% ‘E’ labels. Overall, the C2-RF and C8-RF classifiers
have minimal agreement across their classifications when run on
the HiRID dataset (pairwise Cohen’s κ = 0.27).
a Internal and external validation performances of the consultants' RF models, with the number of correctly classified 'Discharged Alive' and 'Discharged Dead' labels on the HiRID external dataset:

Annotator | Internal Val. F1 micro | External Val. F1 micro | Correct 'Discharged Alive' | Correct 'Discharged Dead'
C1        | 0.567                  | 0.218                  | 337                        | 229
C2        | 0.717                  | 0.501                  | 1064                       | 239
C3        | 0.617                  | 0.379                  | 967                        | 19
C4        | 0.700                  | 0.425                  | 1070                       | 34
C5        | 0.583                  | 0.198                  | 308                        | 207
C6        | 0.600                  | 0.148                  | 375                        | 9
C7        | 0.550                  | 0.346                  | 900                        | 0
C8        | 0.767                  | 0.376                  | 970                        | 7
C9        | 0.517                  | 0.350                  | 556                        | 354
C10       | 0.650                  | 0.183                  | 156                        | 320
C11       | 0.500                  | 0.208                  | 329                        | 213

Fig. 5 Comparison of internal and external validation performances of the RF models across all 11 consultants (C1-C11). a Internal and external validation performances of the consultants' RF models. For each classifier, the number of correctly classified "Discharged Alive" and "Discharged Dead" labels on the HiRID external dataset are reported. b External validation confusion matrix plot for Consultant 1, showing the HiRID dataset true labels and RF model predicted labels across the five classes (A-E): 0 = ICU-PSS label 'A', 4 = ICU-PSS label 'E'.
a QEUH Annotated Label | C2 (%) | C4 (%) | C8 (%)
A | 55.1 | 46.7 | 40.0
B | 20.0 | 28.3 | 16.7
C | 3.3  | 13.3 | 36.7
D | 11.7 | 6.7  | 5.0
E | 10.0 | 5.0  | 1.7

b HiRID Predicted Label | C2-RF (%) | C4-RF (%) | C8-RF (%)
A | 71.0 | 72.0 | 72.0
B | 6.0  | 13.5 | 13.5
C | 1.4  | 12.5 | 12.5
D | 10.3 | 0.4  | 0.4
E | 11.2 | 1.5  | 1.5

Fig. 6 QEUH annotations across the highly learnable expert-labelled datasets and resulting RF predicted label distributions. a Annotation distributions across the QEUH labelled datasets for C2, C4 & C8. b Predicted label distributions generated by classifiers C2-RF, C4-RF & C8-RF when run on the HiRID validation dataset.
Analysis of current practices on obtaining Gold-standard
In this subsection, we evaluate two types of best practices in
obtaining a gold-standard from multiple domain experts:
(a) Super expert: use a more senior annotator’s labels or use
decisions from an adjudicator when disagreements happen; (b)
Majority-Vote: Seek consensus from all different judgments as the
ground-truth38–40.
Regarding the “super expert” assumption, we could not make
this assessment directly, as we do not know which annotators are
more senior, due to the anonymization of the dataset. To work
around this, we use the correlation between internal and external
model performances as a proxy indicator. This is because, if the
super-expert assumption holds, one could assume that models
with higher (or lower) performance internally are likely to have
higher (or lower) performances in external validations. Figure 5a
lists the internal and external validation results. The Pearson
correlation between the two results is 0.51, meaning they are not
strongly associated. The results of this analysis suggest that the super-expert assumption, i.e., that the gold-standard can always be provided by the most senior colleague, is not always true. We observe that even the well-performing models in internal validation do not perform as well on external datasets (e.g., C4-RF and C8-RF). In fact, the initial annotations of the QEUH dataset show similar levels of disagreement amongst the consultants as
shown on the HiRID validation dataset. As we show later, a
superior model can often be achieved by considering diverse
judgments in a selective majority-vote approach.
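For reference, the reported correlation can be reproduced directly from the per-consultant values in Fig. 5a; the short sketch below assumes those values are transcribed in consultant order C1-C11.

# Pearson correlation between internal and external validation F1 (micro),
# using the per-consultant values reported in Fig. 5a (C1-C11, in order).
import numpy as np

internal_f1 = [0.567, 0.717, 0.617, 0.700, 0.583, 0.600,
               0.550, 0.767, 0.517, 0.650, 0.500]
external_f1 = [0.218, 0.501, 0.379, 0.425, 0.198, 0.148,
               0.346, 0.376, 0.350, 0.183, 0.208]

r = np.corrcoef(internal_f1, external_f1)[0, 1]
print(f"Pearson r = {r:.2f}")   # approximately 0.51, as reported in the text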
Additionally, we investigated taking a consensus of all experts’
annotations (a common practice). Figure 5a shows the varied
internal validation performance across the QEUH datasets,
indicating a difference in learnability across the 11 annotated
datasets. The models with higher internal validation performance
indicate easier learnability (e.g., C8), which potentially reflects
more consistent annotation rules and a simpler decision-making
process. Models with lower internal performance indicate a poorer
learnability, with potentially less consistent / more complex
classification rules (e.g., C7).
To assess the reliability of taking a consensus, we compared the
external validation performance of a consensus Majority Vote (MV)
model, built from the majority-vote labels across all 11 annotated
datasets, to a Top Majority Vote (TMV) model, built from the
majority-vote labels across the top-performing consultant models
(where internal validation F1 micro > 0.7). Figure 7 shows TMV (F1
micro = 0.438) performs significantly better than MV (F1 micro =
0.254). In fact, TMV outperforms almost all the consultant models.
This indicates it is important to assess the learnability of each domain expert's judgments before creating a consensus, because poorly learnable (expert) judgments often lead to poor performance.
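A minimal sketch of the two consensus strategies, a majority vote (MV) over all annotators versus a 'top' majority vote (TMV) restricted to annotators whose datasets exceed a learnability threshold, is given below; the toy labels, learnability scores and the 0.7 threshold are illustrative, not the study's data.

# Sketch of the two consensus-seeking strategies compared above: plain majority
# vote (MV) over all annotators vs. a 'top' majority vote (TMV) restricted to
# annotators whose annotated datasets are highly learnable. Illustrative only.
from collections import Counter

def majority_vote(labels_per_annotator):
    """labels_per_annotator: dict {annotator: [label for each instance]}."""
    columns = zip(*labels_per_annotator.values())   # per-instance label tuples
    return [Counter(col).most_common(1)[0][0] for col in columns]

def top_majority_vote(labels_per_annotator, learnability, threshold=0.7):
    """Keep only annotators whose internal-validation F1 micro exceeds the threshold."""
    kept = {a: labs for a, labs in labels_per_annotator.items()
            if learnability[a] > threshold}
    return majority_vote(kept)

# Example with toy values; the resulting consensus labels would then be used to
# train a single MV or TMV model.
labels = {"C1": ["A", "B", "E"], "C2": ["A", "C", "E"], "C3": ["B", "C", "D"]}
learn = {"C1": 0.57, "C2": 0.72, "C3": 0.55}
print(majority_vote(labels))                 # consensus from all annotators
print(top_majority_vote(labels, learn))      # consensus from C2 only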
across the period of time prior to assessment (e.g., across the
previous 5–10 h). We, therefore, incorporated a time-series
component into this second external validation experiment and
investigated how this impacts the performance of the QEUH
classifiers. We believe this experiment is a more clinically relevant
assessment of the expert models, as it provides the more realistic
task of classifying discharge status given patient parameter
readings over a period of time (rather than a single snapshot).
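The excerpt does not specify how the temporal data were encoded, so the sketch below is only one plausible illustration: summarising each variable over the preceding hours (here, a 5-hour rolling window) so that a static classifier can see simple patterns of change. The column names and windowing choices are assumptions, not the study's temporal setup.

# Illustration only: summarise each variable over the preceding hours with its
# rolling mean and its deviation from that mean. Column names and the window
# length are assumptions.
import pandas as pd

FEATURES = ["Adrenaline", "Noradrenaline", "FiO2", "SpO2", "MAP", "HR"]

def add_window_features(df: pd.DataFrame, hours: int = 5) -> pd.DataFrame:
    """df: hourly rows per patient with ['patient_id', 'time'] plus FEATURES."""
    df = df.sort_values(["patient_id", "time"]).copy()
    means = (df.groupby("patient_id")[FEATURES]
               .rolling(window=hours, min_periods=1)
               .mean()
               .reset_index(level=0, drop=True)
               .reindex(df.index))
    df[[f"{c}_mean_{hours}h" for c in FEATURES]] = means.to_numpy()
    df[[f"{c}_delta_{hours}h" for c in FEATURES]] = df[FEATURES].to_numpy() - means.to_numpy()
    return df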
Within this second external validation experiment, we compared the performance of DT classifiers, trained on the QEUH
annotated datasets, on