Computer Science Question

Description

For this assignment, you will examine five peer-reviewed quantitative research articles with different data collection and analysis plans and write a paper about the data analysis conducted in each article. Address all components for each article before moving on to the next article.


Your paper should address the following components:

Describe the data analysis method(s) used.
Evaluate the appropriateness of the data analysis method (hint: focus on the extent to which it addressed the research questions and the limitations of the method).
Provide a perspective on the amount of detail provided by the researcher (hint: focus on statistical assumption tests, discussion of data issues and cleaning (e.g., missing values and outliers), criteria for assessing statistical significance, and whether the conclusions are aligned with the statistical results).
Assess the reproducibility of the study.

Length: 10-12 pages, not including title and reference pages

References: Include a minimum of 8 scholarly resources


Unformatted Attachment Preview

Harvard Data Science Review • Issue 4.2, Spring 2022
Data Quality in Electronic
Health Record Research:
An Approach for Validation
and Quantitative Bias
Analysis for Imperfectly
Ascertained Health
Outcomes Via Diagnostic
Codes
Neal D. Goldstein1 Deborah Kahal2,3 Karla Testa3,4 Ed J. Gracely1,5
Igor Burstyn1,6
1Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel
University, Philadelphia, Pennsylvania, United States of America,
2William J. Holloway Community Program, ChristianaCare, Wilmington, Delaware, United States
of America,
3Sydney Kimmel College of Medicine, Thomas Jefferson University, Philadelphia, Pennsylvania,
United States of America,
4Westside Family Healthcare, Wilmington, Delaware, United States of America,
5Department of Family, Community, and Preventive Medicine, College of Medicine, Drexel
University, Philadelphia, Pennsylvania, United States of America,
6Department of Environmental and Occupational Health, Dornsife School of Public Health, Drexel
University, Philadelphia, Pennsylvania, United States of America
Published on: Apr 28, 2022
DOI: https://doi.org/10.1162/99608f92.cbe67e91
License: Creative Commons Attribution 4.0 International License (CC-BY 4.0)
ABSTRACT
It is incumbent upon all researchers who use the electronic health record (EHR), including data scientists, to
understand the quality of such data. EHR data may be subject to measurement error or misclassification that
have the potential to bias results, unless one applies the available computational techniques specifically created
for this problem. In this article, we begin with a discussion of data-quality issues in the EHR focusing on health
outcomes. We review the concepts of sensitivity, specificity, positive and negative predictive values, and
demonstrate how the imperfect classification of a dichotomous outcome variable can bias an analysis, both in
terms of prevalence of the outcome, and relative risk of the outcome under one treatment regime (aka
exposure) compared to another. This is then followed by a description of a generalizable approach to
probabilistic (quantitative) bias analysis using a combination of regression estimation of the parameters that
relate the true and observed data and application of these estimates to adjust the prevalence and relative risk
that may have existed if there was no misclassification. We describe bias analysis that accounts for both
random and systematic errors and highlight its limitations. We then motivate a case study with the goal of
validating the accuracy of a health outcome, chronic infection with hepatitis C virus, derived from a diagnostic
code in the EHR. Finally, we demonstrate our approaches on the case study and conclude by summarizing the
literature on outcome misclassification and quantitative bias analysis.
Keywords: electronic health record, data quality, bias, validation, hepatitis C, International Classification of
Diseases
1. Introduction
Electronic health records (EHRs) are an appealing source of health information for researchers, including data
scientists. EHRs capture data recorded during a health encounter, including patient demographics, laboratory
orders and results, medical imaging reports, physiologic measurements, medication records, caregiver and
procedure notes, and diagnosis and procedural codes (Pollard et al., 2016). The EHR itself can be considered
an open cohort representing patients who have engaged with the health care system, or more specifically, the
catchment of the EHR (Gianfrancesco & Goldstein, 2021). As such, the EHR contains a depth of information
on a breadth of individuals.
In any application of EHR data for secondary analysis, there is a need to understand the quality of the data.
After all, EHRs were not originally designed for research. They were intended for medical record keeping,
scheduling, and billing purposes (Hersh, 1995). At one extreme, the researcher may treat such data at face
value, and assume completeness and accuracy. At the other extreme, the researcher may view the data as
wholly unusable, and discard it from analysis completely. Both approaches are far from ideal. Treating data at
face value leaves the analysis prone to information bias: either mismeasurement of continuous data or
misclassification of categorical data. Treating data as unusable omits potentially vital information from the
analysis. This introduces the possibility of information or selection bias when omitted records are
systematically different from the retained ones, and at the very least, it needlessly reduces precision of
estimates.
Assuming we wish to retain as much data as possible for analysis, accuracy of these data must be determined.
Many researchers in the United States use International Classification of Diseases (ICD) codes for ascertaining
clinical morbidities. Researchers have found that while presence of a code is a likely indicator of true disease
status, the absence of such a code is less reliable for capturing the absence of disease (Goff et al., 2012;
Schneeweiss & Avorn, 2005). In other words, specificity of ICD codes is high, while sensitivity is low. This is
further compounded by differences in coding standards by clinical specialty (Gianfrancesco et al., 2019), the
use of ‘rule out’ diagnostic codes (Burles et al., 2017), as well as the theoretical concern of ‘upcoding,’ or
recording wrong diagnoses for the purposes of greater reimbursement (Hoffman & Podgurski, 2013).
The extent and impact of misclassification of a health outcome can be understood through a validation study
with accompanying quantitative bias analysis. While these methods are well known in fields like epidemiology,
they are nonetheless infrequently used (Hunnicutt et al., 2016; Lash et al., 2014). An instructive summary of
common measurement error and misclassification in epidemiology was given in Funk & Landi (2014) and
readers are advised to refer to it for a broader overview of the topic as it applies to errors in both outcomes and
covariates. We also acknowledge that this problem pervades many fields and there is a rich literature outside of
epidemiology and biostatistics worth consulting; for example, see Blackwell et al. (2017) and Schmidt and
Hunter (1996).
It is our intention with this article to demonstrate an approach to a validation study and quantitative bias
analysis for outcome misclassification assessed via diagnostic codes, motivated from a real-world case study of
data derived from the EHR. We seek to connect theory with an applied example and provide a generalizable
algorithm for those faced with similar outcome misclassification problems when using EHR data.
2. Theoretical Impact of Errors in Diagnosis on Analysis
We turn briefly to theory to show how misclassification may bias estimates. We demonstrate this for both
calculation of prevalence and relative risk (RR), where we assume a binary exposure Z and misclassification is
independent of Z (i.e., nondifferential outcome misclassification). We also assume that the outcome is observed
as binary W (i.e., ICD code present or absent in the EHR) and relates to X (the true health outcome) through its
sensitivity (SN) and specificity (SP). SN and SP do not depend on the prevalence of X and fully describe the
misclassification probabilities.
The observed probability (prevalence) of the outcome under the above-specified conditions is p = p(W = 1) =
rSN + (1-r)(1-SP), where r = p(X = 1) is the true prevalence. In other words, the observed prevalence is made
up of true cases that are detected plus uninfected individuals who are falsely identified as a case.
Obviously, true and observed prevalence are not guaranteed to be the same, and any analysis that relies on
quantifying the number of affected people may be wrong. Standard errors (SE) of estimates may also be
affected by misclassification, because they are equal to (r(1-r)/n)^0.5 for the true and (p(1-p)/n)^0.5 for the observed
prevalence, in a sample of size n. For example, suppose in a study of n = 100 subjects, a perfect diagnostic test is
expected to estimate prevalence as r = 10% (SE 3.0%). However, if an imperfect test with SN = 0.7 and SP =
0.9 were applied, the observed prevalence would be expected to be p = 16% (SE 3.7%). Counterintuitive examples
abound: in the above scenario, if the true prevalence is 25%, then we expect no bias in the estimate of
prevalence or its standard error, because the number of true cases missed exactly equals the number of
uninfected individuals falsely classified as infected. The observed and true prevalence are equal when r = (1-SP)/(2-SN-SP),
leading to a situation where the correct population average rate of diagnosis is obtained even though many
wrong people were diagnosed!
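To make the arithmetic in this example concrete, the short Python sketch below simply evaluates the prevalence and standard-error expressions given above with the values quoted in the text (n = 100, SN = 0.7, SP = 0.9). The function name is ours, introduced only for illustration.

```python
# Expected observed prevalence and standard errors under misclassification:
# p = r*SN + (1 - r)*(1 - SP);  SE = sqrt(prevalence*(1 - prevalence)/n)
from math import sqrt

def observed_prevalence(r, sn, sp):
    """Expected apparent prevalence given true prevalence r, sensitivity sn, specificity sp."""
    return r * sn + (1 - r) * (1 - sp)

n, sn, sp = 100, 0.7, 0.9

for r in (0.10, 0.25):
    p = observed_prevalence(r, sn, sp)
    se_true = sqrt(r * (1 - r) / n)
    se_obs = sqrt(p * (1 - p) / n)
    print(f"true r={r:.2f} (SE {se_true:.3f})  ->  observed p={p:.2f} (SE {se_obs:.3f})")

# true r=0.10 (SE 0.030)  ->  observed p=0.16 (SE 0.037)
# true r=0.25 (SE 0.043)  ->  observed p=0.25 (SE 0.043)  # no bias at r = (1-SP)/(2-SN-SP)
```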
If one is interested in how prevalence varies by a group membership, we need to introduce notation for such
group or exposure. The observed probability of outcome after treatment Z = i (i = 1 for treated and i = 0 for
untreated) is pi = p(Wi = 1) = riSN + (1-ri)(1-SP), where ri = p(Xi = 1), that is, the true probability of outcome
under the ith treatment. It follows that the observed RR is expected to be RR* = (r1SN + (1-r1)(1-SP)) / (r0SN +
(1-r0)(1-SP)), which is not always equal to the true RR of r1/r0 (Green, 1983). Although it is difficult to intuit the
SP is nearly perfect, regardless of sensitivity. When Z confers no change in risk of X, the estimate of RR under
nondifferential outcome misclassification is unbiased, but the observed RR* = 1 does not imply true RR = 1.
This demonstrates that one is never justified in claiming that there is a true proportional increase in risk when
none is observed only because the outcome is misclassified.
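The attenuation of the observed RR can be checked numerically. The sketch below evaluates the RR* expression from this paragraph; the true risks r1 = 0.20 and r0 = 0.10 are hypothetical values chosen for illustration, not figures from the article, while SN and SP are reused from the prevalence example above.

```python
# Apparent (observed) risk ratio under nondifferential outcome misclassification.
def apparent_rr(r1, r0, sn, sp):
    """RR* = (r1*SN + (1-r1)*(1-SP)) / (r0*SN + (1-r0)*(1-SP))."""
    p1 = r1 * sn + (1 - r1) * (1 - sp)
    p0 = r0 * sn + (1 - r0) * (1 - sp)
    return p1 / p0

r1, r0 = 0.20, 0.10                               # hypothetical true risks; true RR = 2.0
print(apparent_rr(r1, r0, sn=0.7, sp=0.9))        # ~1.38: biased toward the null
print(apparent_rr(r1, r0, sn=0.7, sp=1.0))        # 2.0: unbiased when SP is perfect
```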
The matters become more complex when SN and SP vary by the exposure, that is, there is differential outcome
misclassification with respect to the exposure. Overall, it is not advisable to guess the impact of such
misclassification on the bias in the estimate of an exposure’s effect (and even less advisable to predict impact
on uncertainty in the estimates and hypothesis tests). There is always a legitimate uncertainty in practice as to
whether misclassification is nondifferential, because SN and SP are estimated from validation studies such as
ours (typically expensive and therefore small, with nonignorable sampling errors) and not known as
constants. Within the range of uncertainty bounds on SN and SP, differential misclassification becomes the
most defensible default assumption. Even if data are acquired in a manner that precludes the flow of
information between evaluation of outcome and assigned exposure (e.g., done by independent care providers in
EHR), differential misclassification can arise by chance alone or due to categorization of truly continuous
metrics. A related, albeit distinct, concept, is that of dependent misclassification, where multiple variables
under study are misclassified, and their probability of being misclassified is dependent upon the correct
classification of another variable (Brennan et al., 2021; Lash et al., 2009).
In many real-world problems, estimation of RR is not the final aim, but investigators are rather interested in the
burden of a particular disease and how it can be related to treatment, or unevenly distributed among subgroups
of people. To answer these types of questions, investigators need to know both RR and prevalence of
outcomes. When there is bias in the estimate of RR and the disease is rare, the benefit of treatment in terms of,
for example, expected proportion of people cured, can be severely biased (Hsieh, 1991). The bias is potentially
even more severe and difficult to anticipate if we are trying to estimate impact of misclassified diagnosis on
chance of some distal outcome, such as costly or hazardous treatment or complications of disease: both the
effect estimate and prevalence become biased, often leading to substantial undercounting of attributable
fractions (Burstyn et al., 2010; Wong et al., 2021).
These matters have been extensively covered in the epidemiology literature for some time (Copeland et al.,
1977) yet remain germane to modern analysis and interpretation of EHR data (Desai et al., 2020; Funk &
Landi, 2014). In short, the only certain way to not be misled by bias due to misclassification of the diagnosis is
to account for it in data analysis, replacing qualitative judgment on bias due to imperfections of data with
calculations that capture resultant uncertainty and, ideally, then adjusts for it.
3. Generalizable Approach for Describing and Quantifying
Outcome Misclassification
Continuing with the earlier notation, we let W and X be the measurements of the error-prone binary EHR-derived diagnosis and the perfectly measured true health outcome, respectively. X is obtained through
validation. In order to proceed, one needs to have an idea of the accuracy of the EHR diagnostic code, which
may come from intuition or expert opinion, existing literature, or a de novo validation study. A validation study
may occur internally, on a subset of the overall patient sample, or externally, from a different set of patients
altogether, provided they are exchangeable with the clinical data under analysis. Sometimes, validation studies
arise naturally and only need to be recognized within existing data, as is indeed the case in our illustrative case
study presented in sections 4–6. Figure 1 depicts the situation where W is known in the EHR, but X is not.
Figure 1. Diagnosis of a health outcome in the electronic health record and its
relation to the truth. ‘X’ denotes a patient’s true health outcome while ‘W’
denotes a (potentially incorrect) diagnostic code in the electronic health
record. EHR = electronic health record; PPV = positive predictive value; NPV =
negative predictive value.
† Dashed lines indicate that the true status is unknown to the researcher.
To arrive at the needed accuracy parameters (i.e., PPV, NPV, SN, SP, and their complements) one could
conduct a validation study or identify whether a subcohort of individuals already exists in the EHR where X is known
for both cases of W (=0 or 1). Provided the subcohort is exchangeable with the full cohort, we can estimate
these accuracy parameters via logistic regression (Cai et al., 2015). The estimates of PPV, NPV, false omission
rate (FOR; the complement of NPV), and false discovery rate (FDR; the complement of PPV) are obtained
from:
logit(X) = β0 + β1 W     (1)
where β0 and β1 are parameters, and X and W are the true (measured only in the validation study) and observed (measured on
everyone) binary variables, respectively. To estimate PPV, we compute expit(β0* + β1*), and to estimate FOR, expit(β0*),
where expit(β) = exp(β) / [1 + exp(β)]. NPV = 1 – FOR and FDR = 1 – PPV. Superscript * denotes
estimates obtained from the regression. Precision is estimated by bootstrapping and is conventionally expressed as a
95% confidence interval (CI), although more direct options are available in some statistical platforms, such as
maximum likelihood estimation (MLE) implemented in PROC LOGISTIC in SAS (Cary, NC). For sparse data,
one can substitute exact logistic regression (Wilson & Lorenz, 2015) or Firth’s logistic regression (Puhr et al.,
2017), but bootstrapping is a sensible default approach. The estimates of SN, SP, false positive rate (FPR; the
complement of SP), and false negative rate (FNR; the complement of SN) are obtained by swapping the
regressors in Equation 1:
logit(W) = α0 + α1 X     (2)
where α0 and α1 are parameters, and X and W are the true and observed binary variables, respectively. To estimate
SN, we compute expit(α0* + α1*), and to estimate FPR, expit(α0*). SP = 1 – FPR and FNR = 1 – SN.
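As one possible implementation of Equations 1 and 2, the following Python sketch (using numpy and statsmodels) fits both validation regressions on simulated validation data and converts the coefficients into predictive values and classification probabilities, with a simple nonparametric bootstrap for the PPV. The simulated sample size, prevalence, SN, and SP are assumptions made purely for illustration and are not values from the article.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(42)

# --- simulate a validation subcohort (assumed values, for illustration only) ---
n, true_prev, SN, SP = 2000, 0.15, 0.70, 0.95
X = rng.binomial(1, true_prev, n)                      # true outcome (known in the validation data)
W = np.where(X == 1, rng.binomial(1, SN, n),           # imperfect EHR-derived diagnostic code
             rng.binomial(1, 1 - SP, n))

# --- Equation 1: logit(X) = b0 + b1*W  ->  PPV, FOR, NPV, FDR ---
fit1 = sm.Logit(X, sm.add_constant(W)).fit(disp=0)
b0, b1 = fit1.params
PPV, FOR = expit(b0 + b1), expit(b0)
print(f"PPV={PPV:.3f}  NPV={1 - FOR:.3f}  FDR={1 - PPV:.3f}  FOR={FOR:.3f}")

# --- Equation 2: logit(W) = a0 + a1*X  ->  SN, FPR, SP, FNR ---
fit2 = sm.Logit(W, sm.add_constant(X)).fit(disp=0)
a0, a1 = fit2.params
SN_hat, FPR = expit(a0 + a1), expit(a0)
print(f"SN={SN_hat:.3f}  SP={1 - FPR:.3f}")

# --- nonparametric bootstrap 95% CI for PPV (resample validation records with replacement) ---
boot = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    f = sm.Logit(X[idx], sm.add_constant(W[idx])).fit(disp=0)
    boot.append(expit(f.params[0] + f.params[1]))
print("PPV 95% CI:", np.percentile(boot, [2.5, 97.5]))
```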
To consider misclassification differential with respect to a covariate Z, the validation logistic model includes
parameters specific to each combination of W and Z. For example, for binary Z, we may construct the
following validation logistic regression model of (X|W,Z):
logit(X) = β00 + β10 W + β01 Z + β11 W × Z     (3)
leading to the following estimates of the four required predictive values (for strata defined by the value of
exposure Z in the second digit of the subscript): FOR0 = expit(β00*), FOR1 = expit(β00* + β01*), PPV0 =
expit(β00* + β10*), PPV1 = expit(β00* + β10* + β01* + β11*). NPV = 1 – FOR and FDR = 1 – PPV.
Again, by swapping X and W regressors, one can arrive at estimates of SN and SP:
logit(W) = α00 + α10 X + α01 Z + α11 X × Z     (4)
leading to the estimates (for strata defined by the value of exposure Z in the second digit of the subscript): SP0 =
1 – expit(α00*), SP1 = 1 – expit(α00* + α01*), SN0 = expit(α00* + α10*), and SN1 = expit(α00* + α10* + α01* +
α11*).
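A parallel sketch for Equation 4 shows how the interaction coefficients map onto stratum-specific SN and SP. The assumed exposure prevalence and the assumption that sensitivity is higher among the exposed are, again, illustrative only and do not come from the article's data.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(7)

# --- simulated validation data with misclassification differential by exposure Z (assumed values) ---
n, true_prev, SP = 2000, 0.15, 0.95
X = rng.binomial(1, true_prev, n)                       # true outcome
Z = rng.binomial(1, 0.4, n)                             # hypothetical binary exposure
# assume the diagnostic code is more sensitive among the exposed (SN 0.85 vs 0.70), same SP in both strata
p_w = np.where(X == 1, np.where(Z == 1, 0.85, 0.70), 1 - SP)
W = rng.binomial(1, p_w)

# --- Equation 4: logit(W) = a00 + a10*X + a01*Z + a11*X*Z ---
design = np.column_stack([np.ones(n), X, Z, X * Z])
a00, a10, a01, a11 = sm.Logit(W, design).fit(disp=0).params

SN0, SN1 = expit(a00 + a10), expit(a00 + a10 + a01 + a11)
SP0, SP1 = 1 - expit(a00), 1 - expit(a00 + a01)
print(f"SN0={SN0:.2f} SN1={SN1:.2f}  SP0={SP0:.2f} SP1={SP1:.2f}")
```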
Extension to more covariates beyond Z is trivial albeit tedious, placing ever-increasing demands on the
validation data to be informative of the strata-specific effects, while at the same time requiring sufficient
sample size. Others have described this problem and supplied a solution in the presence of validation data with
diagnosis used as predictor variable (Tang et al., 2015). The advantage of the presented approach is that the
equality across strata can be tested and a parsimonious model selected using standard regression techniques.
This can help focus efforts to improve the quality of data captured in the EHR in subpopulations where the
issue may be more acute. We present the logistic form of the validation model but any technique that predicts
probabilities should be suitable, for example, probit or log-binomial regressions.
It is also worth mentioning that there are other methods readers may be familiar with for calculating the
parameters needed for a bias analysis from the validation sub-study. For example, the classical 2 x 2 table can
be used to cross tabulate the imperfect and perfect binary health outcome indicators, W and X, respectively (see
Tables 2 and 3, for example). We have presented but one approach that conveys certain advantages: it is easier
to set up computationally and is more flexible, both in the operationalization of W and X and in adding strata
of Z.
If a validation study is unavailable, then one must turn to the literature or expert opinion to inform the
validation parameters; for nondifferential outcome misclassification these would include SN and SP (or PPV and
NPV), and for differential outcome misclassification, SN and SP estimated at each level of the
exposure (likewise for PPV and NPV). Proceeding with our approach would then require operationalizing these values as
distributions based on a logit transformation. Researchers who face such a situation, or who are newer to bias
analysis, are advised to start by applying the methods detailed in Lash et al. (2009).
3.1. Probabilistic Bias Analysis of Outcome Misclassification With and
Without the Presence of an Exposure
A probabilistic (quantitative) bias analysis seeks to assess the sensitivity of results due to systematic errors in a
study, while also capturing random errors, both in terms of the magnitude and directionality of estimates
(Gustafson & McCandless, 2010; Lash et al., 2009; MacLehose & Gustafson, 2012; Phillips & LaPole, 2003).
The following is an overview of our approach to identifying and quantifying outcome misclassification using
probabilistic bias analysis of a study aiming to estimate true prevalence of the outcome. A probabilistic bias
analysis of outcome misclassification on prevalence of X would proceed as follows:
1. Estimate coefficients and standard errors of β0 and β1 through application of Equation 1, detailed above, in
the validation model.
2. Calculate π = p(X = 1|W) = expit(β̇0 + β̇1 W), using the imperfect classifier, W, in the main study for
each person in the cohort who does not have X measured, where β̇0 and β̇1 are each sampled from a normal
distribution with the means and variances of β0* and β1*, respectively.
3. Simulate potential values of Ẋ from Bernoulli(π).
4. Repeat steps 2–3 many times to obtain a distribution of Ẋ values that reflect simulated true values that would
have been observed if there was no misclassification, informing what values Pr(X) may take given our data
and models. The superscript ‘dot’ stresses that these are simulated values of X, not actual true values. As
such, this is not a true misclassification adjustment, but rather a sensitivity analysis covering plausible
scenarios. Not all simulated values of Ẋ are equally plausible given data and models, but probabilistic bias
analysis does not take this into account (MacLehose & Gustafson, 2012).
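A minimal Python sketch of the four steps above, under the same kind of simulated data as in the earlier examples, might look like the following. The cohort sizes and misclassification parameters are assumptions for illustration, not values from the case study, and the summary of the simulated prevalences at the end is our own choice of presentation.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(2022)

# --- assumed data: a main cohort with only W, plus a validation subcohort with both W and X ---
n_main, n_val, true_prev, SN, SP = 10_000, 1_000, 0.15, 0.70, 0.95
X_val = rng.binomial(1, true_prev, n_val)
W_val = np.where(X_val == 1, rng.binomial(1, SN, n_val), rng.binomial(1, 1 - SP, n_val))
X_main = rng.binomial(1, true_prev, n_main)             # unknown to the analyst
W_main = np.where(X_main == 1, rng.binomial(1, SN, n_main), rng.binomial(1, 1 - SP, n_main))

# Step 1: validation model logit(X) = b0 + b1*W, keeping coefficients and their standard errors
fit = sm.Logit(X_val, sm.add_constant(W_val)).fit(disp=0)

adjusted_prev = []
for _ in range(2000):
    # Step 2: sample coefficients and compute pi = p(X=1 | W) for everyone in the main cohort
    b0_dot = rng.normal(fit.params[0], fit.bse[0])
    b1_dot = rng.normal(fit.params[1], fit.bse[1])
    pi = expit(b0_dot + b1_dot * W_main)
    # Step 3: simulate a plausible true-outcome vector X-dot
    X_dot = rng.binomial(1, pi)
    adjusted_prev.append(X_dot.mean())

# Step 4: distribution of simulated prevalences, versus the naive (observed) prevalence
print("observed prevalence:", W_main.mean())
print("simulation median and 95% interval:", np.percentile(adjusted_prev, [50, 2.5, 97.5]))
```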
This approach can readily be extended to account for an additional covariate, exposure Z, for purposes of
estimating a RR (or odds ratio), though it must be noted that conditioning of misclassification on more than
one covariate appears to be rarely described (though routinely considered in statistics). The extension of the
above algorithm to differential outcome misclassification on RR of X due to binary exposure Z, would proceed
through the following steps:
1. Estimate coefficients and standard errors of β00, β10, β01, and β11 through application of Equation 3,
detailed above, in the validation model.
2. Calculate π = p(X = 1|W, Z) = expit(β̇00 + β̇10 W + β̇01 Z + β̇11 W × Z) using the imperfect classifier,
W, and exposure, Z, in the main study for each person in the cohort without measurement of X, where
β̇00, β̇10, β̇01, and β̇11 are each sampled from a normal distribution with the means and variances of
β00*, β10*, β01*, and β11*, respectively.
3. Simulate potential values of Ẋ from Bernoulli(π).
4. Estimate ṘR relating Z to Ẋ in the main study that lacks X. The resulting ṘR reflects what RR can be due
to misclassification, given data and models. We estimate ṘR via Poisson regression with robust standard
errors (Zou, 2004), appropriate for a cohort design; for case-control sampling, one may also estimate the
odds ratio through logistic regression at this step.
5. To account for random errors in the estimation of RR, we sample log(R̈R) from a normal distribution with
parameters (log(ṘR), var(ṘR)).
6. Repeat steps 2–5 many times to obtain a distribution of R̈R that reflects possible values of what would have
been observed in the absence of misclassification, given our data and models. We again note that not all
simulated values are equally plausible given data and models, but that is not considered in a probabilistic
bias analysis.
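The RR version of the algorithm can be sketched in the same spirit. The example below fits the Equation 3 validation model on simulated data and then repeats steps 2 through 5, using a modified Poisson model with robust standard errors for the RR at each iteration; all generating parameters (exposure prevalence, true risks, SN, SP) are assumptions chosen for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(99)

# --- assumed generating model: true risk differs by exposure Z; the code's SN also differs by Z ---
def simulate(n):
    Z = rng.binomial(1, 0.4, n)
    X = rng.binomial(1, np.where(Z == 1, 0.20, 0.10))        # true RR = 2.0
    sn = np.where(Z == 1, 0.85, 0.70)                        # differential sensitivity
    W = rng.binomial(1, np.where(X == 1, sn, 0.05))          # SP = 0.95 in both strata
    return Z, X, W

Z_val, X_val, W_val = simulate(1_000)                        # validation subcohort (X known)
Z_main, _, W_main = simulate(10_000)                         # main cohort (X unknown)

# Step 1: validation model logit(X) = b00 + b10*W + b01*Z + b11*W*Z
V = np.column_stack([np.ones(len(W_val)), W_val, Z_val, W_val * Z_val])
vfit = sm.Logit(X_val, V).fit(disp=0)

rr_sims = []
design_main = np.column_stack([np.ones(len(Z_main)), Z_main])
for _ in range(500):
    # Step 2: sample coefficients and compute pi = p(X=1 | W, Z) in the main cohort
    b = rng.normal(vfit.params, vfit.bse)
    pi = expit(b[0] + b[1] * W_main + b[2] * Z_main + b[3] * W_main * Z_main)
    # Step 3: simulate X-dot
    X_dot = rng.binomial(1, pi)
    # Step 4: RR-dot via modified Poisson regression with robust (HC0) standard errors
    pfit = sm.GLM(X_dot, design_main, family=sm.families.Poisson()).fit(cov_type="HC0")
    log_rr, se_log_rr = pfit.params[1], pfit.bse[1]
    # Step 5: incorporate random error by sampling log(RR-ddot) around log(RR-dot)
    rr_sims.append(np.exp(rng.normal(log_rr, se_log_rr)))

# Step 6: summarize the distribution of simulated RRs against the naive estimate
naive = sm.GLM(W_main, design_main, family=sm.families.Poisson()).fit(cov_type="HC0")
print("naive RR:", np.exp(naive.params[1]))
print("bias-analysis median and 95% interval:", np.percentile(rr_sims, [50, 2.5, 97.5]))
```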
One approach to account for the situation where implausible estimates arise during the bias analysis simulation
would involve weighting the parameter estimates of interest (R̈R in our example) by the likelihoods of the models that
they are derived from, which is akin to likelihood weighting (Russell & Norvig, 2003, p. 514). We demonstrate
this in our case study in section 6, although we emphasize that this is a stopgap measure for a general problem
of probabilistic bias analysis as practiced in health research: lack of mechanism to account for ‘poor’
simulation realizations. The only solution that has been offered is to discard simulation realizations that are
incompatible with data, for example, leading to undefined effect estimates such as negative odds ratios (Lash et
al., 2009). However, discarding all undefined estimates does not offer a complete solution reflective of the
reality of complex data and models if it unreasonably treats all remaining simulation realizations as equally
likely. Gustafson & MacLehose (2012) propose bootstrapping the distribution of retained simulation realizations,
while Stayner et al. (2007) utilize weighting by partial likelihood. Yet all of this falls short of consideration of
the full likelihood, starting with the plausibility of the simulated misclassification parameters, and this would ultimately
lead to a fully Bayesian approach rather than probabilistic (Monte Carlo) bias (sensitivity) analysis. This would be an
appropriate next step in either refining our method or leading down a different path of adapting existing
Bayesian methods.
Further, there is a rich statistical literature on how to approach the general problem of bias analysis and we seek
here to merely illustrate the idea and implementation behind one of the simplest ones, acknowledging that it
does not adjust for bias, but rather provides an idea of its systematic impact while also capturing random errors
(Jurek et al., 2013; Lyles et al., 2011). For recent guidelines on meeting analytical challenges of error-in-
exposure, for example, when a diagnosis is used to predict a future event, the reader is referred to these articles
(Keogh et al., 2020; Shaw et al., 2020). All such approaches require information on SN and SP that can be
derived from modeling W as a function of X and Z, as described earlier. Other methods exist that involve
evaluation of likelihood functions associated with each ‘imputation’ (Edwards et al., 2013; Högg et al., 2017).
4. Case Study of a Misclassified Diagnostic Code in the EHR
Chronic hepatitis C virus infection (HCV) causes considerable morbidity and mortality in the United States
and, as of 2015, was estimated to affect 2–4 million people nationally (Ly et al., 2016; Polaris Observatory
HCV Collaborators, 2017). Groups particularly at risk for infection include the ‘baby boomer’ birth cohort
(1946–1964), people who inject drugs, institutionalized individuals, and those who are homeless,
undocumented, or incarcerated (Denniston et al., 2014). With the recent widespread introduction of direct
acting antivirals, the ability to treat and cure HCV is markedly improved over interferon-based regimens that
were less effective, had a worse side-effect profile, and required longer therapy (Manns et al., 2006). Further,
as restrictions surrounding HCV therapy continue to be lifted, including the ability of nonspecialists to
prescribe it and the removal of urine drug screen test requirements, combined with the 2020 recommendation for one-time screening among adults, many additional patients will become eligible for treatment (Breskin et al., 2019;
Marshal et al., 2018; US Preventive Services Task Force [USPSTF] et al., 2020). Thus, there is now a
justification to reengage patients to confirm and treat HCV infection.
As data scientists, we may be engaged in a variety of research aims pertinent to HCV, and at our disposal are
data abstracted from the EHR. For example, the health care center may wish to know the prevalence of HCV
among their patients for the purposes of allocating resources for testing and treatment. Such an analysis may
also be useful for the health department in order to ascertain community prevalence of HCV based upon the
catchment of the center. Or perhaps the health care center would like to know for a given patient, what is the
likelihood of that individual having HCV based on presence or absence of a corresponding diagnosis in the
EHR: the positive and negative predictive values of the ICD codes, respectively. Finally, perhaps the clinic
would like to know how likely is it that a certain exposure is associated with a diagnosis of HCV, for purposes
of intervening on the exposure.
As detailed in section 1, the use of an ICD code to ascertain accurate HCV status may be subject to
misclassification, including both false negatives—a missing code—or false positives—an inaccurate code.
First, patient self-report may have been the reason for documentation in the EHR. Results from the National
Health and Nutrition Examination Survey 2001–2008 indicate a general lack of awareness and suboptimal
knowledge of HCV infection in the United States (Denniston et al., 2012). Second, documentation may have
occurred due to a positive screening, as opposed to a positive confirmatory test. A positive screening test (i.e.,
reactive HCV antibody) indicates past or present infection; it does not prove active HCV, which requires
presence of viremia as detected by polymerase chain reaction or viral load assays. Relatedly, patients may have
spontaneously cleared the virus (Aisyah et al., 2018). Third, the diagnostic code may have been recorded to
rule out HCV contingent on further testing, and fourth, a diagnosis may have been recorded by the clinician in
free-text notes but never appeared as an ICD code. Taken together, there are multiple reasons why it may be
dubious to rely on the EHR ICD code alone to identify those with HCV.
We have several options with how to proceed. First, we may conduct the analysis naïvely, and use the data at
face value. Alternatively, we may recognize the limitations of the data and perform quantitative bias analysis to
determine the impact of the misclassification. This can be as straightforward as using simple algebra to correct
the observed measures or employing more sophisticated sensitivity analyses or simulations to describe
plausible ranges of the effect estimates (Funk & Landi, 2014; Gustafson, 2004).
5. Description of a Validation Approach for the HCV Diagnostic
Code
Our case study employed two data collection periods at an urban federally qualified health center (FQHC).
First, we assembled a cohort of adult patients ≥18 years of age seen between November 1, 2016, and October
31, 2018. This time period corresponded with the FQHC’s definition of “active” patients seen ≥1 time in the
past 2 years and predates a change in the HCV testing policy. During this time, the FQHC engaged in risk-based screening for HCV, based on either known or disclosed risk factors or symptomatology. Hereafter this
cohort is referred to as the ‘risk-based cohort.’ The apparent (observed) presence or absence of HCV was
determined by abstracting the following ICD-10-CM codes from the EHR: B18.2 (Chronic viral hepatitis C),
B19.20 (Unspecified viral hepatitis C without hepatic coma), B19.21 (Unspecified viral hepatitis C with
hepatic coma), and B19.2 (Unspecified viral hepatitis C). These codes were chosen a priori based on the coding
practice of the FQHC and were believed at the time to capture the preponderance of cases, albeit imperfectly.
In the second data collection period, for purposes of validating the diagnostic code, we assembled a cohort of
patients seen at the FQHC from January 1, 2019, through July 31, 2019, for whom there was no EHR-recorded
diagnosis of HCV as per ICD-10-CM codes listed above. In the 2 months prior to formation of this second
cohort, universal HCV screening was implemented for all adult patients ≥18 years of age