1-1 Discussion: Healthcare Administration Systems

Description

Based on your research, what are some potential solutions you recommend for the United States to resolve or improve this quality, access, or cost issue? In response to your peers, review their initial posts. Are there any possible solutions that they did not consider? Are there any other issues they did not consider? Articles attached:



IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 10, OCTOBER 2023
Cost-Sensitive Learning for Medical Insurance Fraud
Detection With Temporal Information
Haolun Shi, Mohammad A. Tayebi, Jian Pei, Fellow, IEEE, and Jiguo Cao
Abstract—Fraudulent activities within the U.S. healthcare system cost billions of dollars each year and harm the wellbeing of
many qualifying beneficiaries. The implementation of an effective
fraud detection method has become imperative to secure the welfare of the general public. In this article, we focus on the problem
of fraud detection using the current year’s Medicare claims data
from the perspective of utilizing temporal information from the
previous years. We group the data into temporal trajectories of
the key covariates and base our feature engineering around these
trajectories. For effective feature engineering on the temporal data,
we propose to use the functional principal component analysis
(FPCA) method for analyzing the temporal covariates’ trajectory
as well as the distributional FPCA for extracting features from
the empirical probability density curve of the covariates. Moreover, we introduce the framework of cost-sensitive learning for
analyzing the Medicare database to allow for asymmetrical losses
in the confusion matrix, such that the classification rule reflects
the realistic tradeoff between the fixed cost and the fraud cost.
The issue of class imbalance in the database is tackled through
the random undersampling scheme. Our results confirm that the
trained classifier has a reasonably good prediction performance
and a significant percentage of cost savings can be achieved by
taking into account the financial cost.
Index Terms—Centers for Medicare & Medicaid Services,
cost-sensitive learning, functional principal component analysis,
functional data analysis.
I. INTRODUCTION
AS TECHNOLOGY in medical research advances and
healthcare services continue to improve in the United
States, increasing costs follow as a result. Fair access to healthcare services has thus become a pressing issue that impacts the
general population in the United States. To help alleviate the
financial strain on people to purchase their medical services,
the US government created a national healthcare insurance
program named Medicare, which covers parts or even all of
the expenditures of medical procedures, prescription drugs,
and equipment. According to data released by the Centers for
Medicare & Medicaid Services (CMS), funding for Medicare accounts for 20% of the annual US healthcare budget with possible
expense recovery of $4–13 billion [1]. One underlying issue that
lurks within the Medicare system is the fraudulent activities that
waste billions of dollars each year at the cost of the wellbeing
of many qualifying beneficiaries. Fraud causes a tremendous
amount of financial strain on the annual US healthcare budget.
It is important to note that this loss implies not only a large
amount of money going into wrong people’s pockets but also
the unavailability of services to many who require a constant
supply of medical service. Fraud accounts for an estimated
spending of $700 billion out of the total budget of $2.7 trillion
in healthcare in 2013 [1]. Therefore, the implementation of an
effective healthcare delivery system and fraud detection method
has become imperative to secure the welfare of the general
public, especially the elderly population, which is in dire need
of affordable healthcare services.
Though a considerable amount of effort has been put into
reducing fraudulent activities, we have not seen a significant
relief on the financial strain. The primary fraud detection method
involves investigators searching through a great number of files
and records to discover suspicious activities [2]. Unfortunately,
this method has already become outdated and less effective, as a massive amount of data regarding healthcare transactions is generated each year. As an increasing volume of healthcare-related data becomes storable and accessible, more sophisticated
data manipulation tools and machine learning methods need to
be implemented in order to extract useful information from a
pool of data and improve the detection of healthcare fraudulent
activities. Many government programs and organizations have
declared the importance of finding an effective fraud detection
method; in particular, the CMS joined the cause of improving
fraud detection and made available online a series of datasets
named “Medicare Provider Utilization and Payment Data” [1].
These datasets released by CMS consist of primarily three
parts: Physician and Other Supplier (Part B), Part D Prescriber
(Part D), and Referring Durable Medical Equipment, Prosthetics, Orthotics, and Supplies (DMEPOS). These datasets comprise a large range of claims submitted by healthcare providers
to Medicare, thereby providing a complete and thorough representation of the cost-related activities in the Medicare program.
The Part B dataset covers the average cost of the procedures performed, the Part D dataset covers the prescribed medications, and the DMEPOS dataset covers the issued supplies.
The focus of this paper is on fraud detection using data of the
Fee-For-Service payment of Medicare that involves physicians
submitting claims for medical services directly to Medicare.
Using the Medicare claim data, we have considered several
strategies for feature engineering: first, we group the data into
temporal trajectories of the key covariates and create numerical summaries as the trajectory-specific features; second, we
introduce the methodology of functional principal component
analysis (FPCA), a useful tool for analysis of temporal data,
for effective feature extraction from the trajectories; third, we
conduct a distributional FPCA on the empirical probability
density curves of the drug-level features, and obtain predictive
features from it. Based on the engineered features, our modeling
is different from the previous studies on Medicare data. We
specifically adopt the framework of cost-sensitive learning to
reflect the tradeoff between the fixed cost associated with fraud
investigation and the amount of fraud. Moreover, one issue in
training the classifier using the data set is the class imbalance.
In our constructed data set, only 0.025% of the cases are fraudulent/positive. This creates challenges for the training of the
machine learning algorithm. We use the random undersampling
approach with the varying class ratio to handle the issue of class
rarity.
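To make the undersampling step concrete, here is a minimal Python sketch of random undersampling with a configurable class ratio. The DataFrame and column names are illustrative assumptions, not the authors' actual pipeline.

```python
import pandas as pd

def random_undersample(df: pd.DataFrame, label_col: str = "fraud",
                       ratio: float = 1.0, seed: int = 0) -> pd.DataFrame:
    """Keep every positive (fraud) row and randomly sample the negative class
    so that n_negative ~= ratio * n_positive.  ratio=1.0 gives a balanced set;
    larger ratios retain more of the majority class."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    neg_sampled = neg.sample(n=n_keep, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, neg_sampled]).sample(frac=1.0, random_state=seed)

# Hypothetical usage: train one classifier per candidate class ratio.
# for r in [1, 5, 10, 50]:
#     train_df = random_undersample(feature_table, ratio=r)
```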
We experiment with four machine learning algorithms to train
the classifier. Our results show that based on the undersampling
dataset, the trained classifier has a reasonably good predictive
performance. More importantly, under the cost-sensitive approach, a significant percentage of cost saving can be achieved by
taking into account the financial cost. In terms of the cost-saving
percentage, the best algorithm achieves the largest saving percentage of around 55%, i.e., when compared with the case where
no fraud investigation is conducted, using such a classifier may
save the cost by more than 50%. In contrast, the traditional non-cost-sensitive approach tends to have a much smaller cost-saving percentage. In fact, under certain algorithms and undersampling class ratios, the non-cost-sensitive approach could lead to an even higher cost than doing no fraud detection at all.
A major novel aspect that distinguishes our work from existing methods is the use of functional principal component
analysis to extract useful information from the data. Functional
principal component analysis is a highly efficient and effective
technique for detecting the primary direction of variations in
longitudinal trajectories and extracting predictive information from
them. It has been applied across various scientific fields such as
medicine, finance, genetics, behavioral science and ecology [4].
To the best of our knowledge, no prior work has ever applied the
functional principal component analysis to fraud detection in the
Medicare data set, and thus our work fills an important research
gap. Most conventional approaches only use standard numeric
functions (e.g., minimum, mean, and maximum) and one-hot
encoding for constructing numerical covariates from the raw
data. We utilize two classes of functional principal component
analysis to extract information from (a) the temporal trajectories
of a raw numerical feature and (b) the empirical distribution of
the numerical feature. These covariates may provide new insight into the data.
Another major novel aspect of our approach is cost-sensitive
learning, which to the best of our knowledge, has not been
previously applied to fraud detection in the Medicare data set.
By incorporating the cost measure, our method is able to guide
the fraud detection towards the suspicious case that would
lead to the highest monetary recovery of the fraud cost.
The cost-sensitive learning method is meaningful and pragmatic for healthcare fraud detection and has important policy
implications.
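As a rough illustration of what an example-dependent cost-sensitive decision rule can look like, the sketch below flags a case for investigation only when the expected recovery exceeds a fixed investigation cost. The function, the threshold value, and the `predict_proba`-style usage are placeholders; the paper's own cost matrix is introduced later (Table I).

```python
import numpy as np

def cost_sensitive_decision(p_fraud: np.ndarray,
                            amount_at_stake: np.ndarray,
                            investigation_cost: float) -> np.ndarray:
    """Flag a case only if the expected recovery (fraud probability times the
    amount that could be recovered) exceeds the fixed cost of investigating it.
    Placeholder logic illustrating cost-sensitive classification, not the
    paper's exact rule."""
    expected_recovery = p_fraud * amount_at_stake
    return expected_recovery > investigation_cost

# Hypothetical usage with any probabilistic classifier:
# p = model.predict_proba(X)[:, 1]
# flag = cost_sensitive_decision(p, claim_totals, investigation_cost=5_000.0)
```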
The rest of the paper is organized as follows. Section II
provides a discussion on existing works related to Medicare
data and cost-sensitive fraud detection methods. The content
and structure of the data set are described in the Section III.
Section IV explains the three strategies for creating predictive
features from the data and discusses the approaches for handling
class imbalance as well as the framework of cost-sensitive learning. Section V elaborates the performance metrics and presents
the analysis of the Medicare data set. Section VI discusses
the policy implication of the proposed methods and finally,
Section VII summarizes the paper with a short discussion.
II. RELATED WORKS
We review the existing studies on healthcare fraud detection
that have been conducted using the Medicare datasets released
by CMS. The common objective shared by all these studies is to
detect fraudulent activities using a machine learning approach.
Depending on the types of techniques used, we classify the
related works as follows.
Supervised Models define healthcare fraud detection as
a binary classification to distinguish fraudulent behavior from
non-fraudulent behavior. Bauder et al. [7] explored how abnormalities in physicians’ activities may point to possible fraudulent
activities using the Naive Bayes algorithm, such that physicians’ suspicious actions such as submitting claims data outside
their specialties can be detected. Bauder and Khoshgoftaar [7]
proposed to estimate the expected cost amount of different
types of medical services and then compute its discrepancy
to the actual amount paid to mark possible fraudulence. The
multivariate adaptive regression splines are identified to be the
best-performing model for estimating the expected cost amount.
Herland et al. [10] validated the performance of the model
constructed in the previous studies with Part B and LEIE datasets
and strived to improve the model performance through feature
selection, removal of certain specialties, and specialty grouping.
They concluded that the strategy of removing certain specialties
that involved several procedures could significantly improve the
model performance. Herland et al. [5] created a combined dataset
from Part B, Part D, and DEMPOS datasets, and used various
machine learning methods to detect fraud in Medicare. The
authors evaluated the performances of random forest, gradient
boosting and logistic regression on each of these three datasets
and also the dataset obtained by grouping all the parts. The
results presented in this paper show that the performance of
all classifiers improves significantly using the integrated dataset
and logistic regression outperforms all other studied models.
Existing solutions in this category only apply classic machine
learning applications to classify normal and fraudulent samples,
and what differentiates these works from each other is not about
the methodical part but how they preprocess and integrate the
data and extract learning features from it. The work presented
by Herland et al. [5] outperforms the other approaches because
of fusing information from different sources. The advantage of
the supervised learning models mainly lies in their accuracy (in terms of AUC).
Unsupervised Models generally aim to identify and highlight data points that deviate from the overall pattern of the
data. Khurjekar et al. [15] presented an unsupervised learning
approach based on a multivariate regression model. They set a
residual threshold and applied clustering to residuals that are
above the threshold. Sadiq et al. [16] introduced the patient rule
induction method, an unsupervised learning method to detect
fraudulence by marking anomalies indicated by higher modes in
the datasets. Bauder and Khoshgoftaar [7] presented an outlier
detection model which is based on Bayesian inference using
a sub-dataset derived from the 2012–2014 Part B Medicare
dataset specifically focusing on dermatology and optometry
claims. In their work, probabilistic programming is used to produce probability distributions and create credibility intervals
to evaluate the precision of outlier prediction. One of the major
challenges that unsupervised solutions need to address is the
class imbalance issue. Johnson and Khoshgoftaar [6] studied
different existing resampling techniques for imbalanced classes
using the CMS data. The authors concluded that maintaining sufficient representation of the majority class plays a more important role than reducing the level of class imbalance, and that downsampling the majority class to reach balanced proportions can degrade classification performance. While the unsupervised models can point to abnormalities in the data, their disadvantage lies in the relatively low accuracy in comparison with the supervised solutions.
Deep Learning Models have yielded outstanding results in different fields and have become an indispensable part of data-driven solutions. Deep learning models can potentially be used in both supervised and unsupervised ways to solve the fraud detection problem. While modern deep-learning-based solutions show promising results for different applications, fraud detection has not received enough attention from scholars. As the
only work in this domain, Johnson and Khoshgoftaar [6] applied
various deep learning models to the combined CMS dataset
for fraud detection with a focus on addressing class imbalance
issues. They evaluated the significance of identifying optimal
decision thresholds in the case of imbalanced training data.
The authors of this work noted improvement over the existing
methods used by Herland et al. [5]. Applications of deep learning
in healthcare fraud detection are in their infancy and offer many
interesting research directions to pursue.
For example, to address fraud detection as an anomaly detection problem, one can employ deep learning models in two
ways: to learn feature representation of normality, and to develop an end-to-end anomaly scoring approach, as discussed
by Pang et al. [29]. In the first approach, general-purpose deep
learning models such as autoencoders and generative adversarial
networks can be used to learn a representation of given data, and
by capturing the essential underlying data regularities, these
models are capable of detecting anomalies. Learning feature
representation can be optimized based on an anomaly measure,
such as the distance of anomalous samples from normal samples,
or it can be formulated as a one-class classification problem. In
the second approach, the goal is to learn an anomaly scoring
approach directly. These approaches aim at devising a novel
loss function to learn anomaly scores. Moreover, different deep
learning models such as Recurrent Neural Networks (RNNs),
Long Short-term Memory (LSTMs), and Gated Recurrent Units
(GRUs) can be used as a supervised solution to classify fraudulent and normal data samples.
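As one hypothetical illustration of that supervised formulation (not a model trained in this paper), a minimal PyTorch sketch of an LSTM classifier over yearly covariate trajectories could look as follows.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """Toy LSTM classifier mapping yearly covariate trajectories of shape
    (batch, n_years, n_features) to a fraud probability.  Illustrative only."""
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)                 # h_n: (1, batch, hidden)
        return torch.sigmoid(self.head(h_n[-1]))   # (batch, 1) fraud probability
```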
Other Perspectives: In addition to the previously studied models, researchers have explored the healthcare fraud detection problem from other perspectives. Liu et al. [30] used providers' and clients' geo-location information for healthcare fraud detection. The underlying hypothesis in this work was that Medicare clients prefer to use health service providers located within a relatively short distance, particularly when they are senior or disabled, so a long distance between the service provider and client locations may indicate a fraudulent activity. Chandola et
al. [12] elaborated on the challenges in analyzing the healthcare
claims datasets from Texas and identifying fraudulent physicians. They also discussed the potential to utilize text mining
and temporal analysis for detecting fraud from big healthcare
datasets. Ko et al. [11] applied linear regression to examine the
correlation between patient visits and utilization payment, with
a concentration on urology. Branting et al. [13] proposed graph-based methods for fraud detection. Two types of algorithms
were applied: one estimates the similarity between fraudulent
and non-fraudulent providers’ activities; the other estimates
the risk propagation from physicians according to geospatial
collocation. Another highlight of the work by Branting et al. [13]
is that they refined the fraud labels by filling in the missing NPI
from the National Plan & Provider Enumeration System registry
website.
III. DATA
We focus our study on the Part D dataset, which provides
information on the prescription drugs the physicians entered into
an electronic medical record system in a certain year. Five years
of data are available from 2013 to 2018 on the CMS website.
Each row in the data set corresponds to the information related to
a certain drug administered by a certain physician under a certain
specialty type, i.e., the three columns which together uniquely
define a row are the physician, the specialty type, and the brand
name of the drug. The unique identifier for each physician is the
NPI, and each physician may have more than one specialty type.
Under a certain combination of physician and specialty types,
multiple rows pertaining to the information of a certain drug
are available in the data set. In addition to a drug’s brand name
and generic name, the drug-related information in the data set
includes its total cost, the total claim count, the total
number of beneficiaries, total 30-day fill count, and total daily
supply under the physician and specialty in that given year. We
refer to these numeric features as “key covariates”, which are
used for constructing predictive features in the model.
For the binary labeling of fraud, we obtain the list of fraudulent
physicians and their NPIs from the List of Excluded Individuals
and Entities (LEIE) on the website of the Office of Inspector
General’s (OIG). The database is updated monthly to provide
a list of physicians whose exclusions are currently in effect (as
of March 2020). We use the NPI as the unique identifier in the
LEIE data to link and map back to the Medicare Part D database,
such that the fraudulent physicians can be identified.
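A minimal pandas sketch of this labeling step is shown below. The file names and column headers are assumptions for illustration; the actual CMS Part D and LEIE files use their own naming.

```python
import pandas as pd

# Illustrative file names and columns, not the real CMS / LEIE headers.
leie = pd.read_csv("LEIE.csv", dtype=str)
partd = pd.read_csv("partd_claims.csv", dtype={"npi": str})

# Collect the NPIs of excluded (fraudulent) providers, dropping missing entries.
excluded_npis = set(leie["NPI"].dropna())

# A physician/specialty row is labeled fraudulent if its NPI is on the exclusion list.
partd["fraud"] = partd["npi"].isin(excluded_npis).astype(int)
```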
IV. MODELING
A. Feature Engineering
1) Trajectory-Specific Feature of Key Covariates: We perform the analysis on the level of physician and specialty, i.e.,
given a physician and his/her specialty, our goal is to predict
whether the physician committed fraud, using the numerical
features related to all the drugs administered by the physician and
under that specialty type. Five years of Part D data are available
from 2013 to 2018. For a given year, we group the data by
physician’s NPI and specialty type and compute the group-level
minimum, maximum, mean, median, standard deviation, and
summation of all the key covariates. The 5-year trajectories
of these group-level numerical summary quantities can then
be constructed. Fig. 1 shows the yearly trajectories of a fraud
physician and a non-fraud physician. It is worth noting that
our method is based on the entire temporal trajectories of a
physician, whereas most existing fraud detection applications
related to Medicare data focus on a snapshot of the data, and do
not distinguish the identity of the physician [5], [6], [7]. In this
sense, our method offers a more sound and novel perspective to
the analysis of the Medicare database.
We create trajectory-specific features for each trajectory,
such as the trajectory mean, median, maximum, minimum, and
standard deviation. Moreover, to capture the trend or slope
information from the trajectory, a linear regression model is
fitted on the trajectory. The model uses the trajectory value as the
response and the year as the predictor and includes an intercept
and a slope coefficient term. The fitted slope coefficient is used
as a trajectory-specific feature.
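A small pandas/NumPy sketch of these trajectory-specific features (summary statistics plus a fitted slope per physician-specialty trajectory) is given below; the column names are illustrative, not the authors' exact schema.

```python
import numpy as np
import pandas as pd

def trajectory_features(yearly: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """`yearly` holds one row per (npi, specialty, year) with a group-level
    summary in `value_col`.  Returns per-trajectory statistics plus the OLS
    slope of the value against the year."""
    def slope(g: pd.DataFrame) -> float:
        if len(g) < 2:
            return 0.0
        return float(np.polyfit(g["year"], g[value_col], deg=1)[0])

    grouped = yearly.groupby(["npi", "specialty"])
    feats = grouped[value_col].agg(["mean", "median", "max", "min", "std"])
    feats["slope"] = grouped.apply(slope)   # trend of the 5-year trajectory
    return feats.reset_index()
```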
2) FPCA for Physician-Level Trajectories of Key Covariates:
Functional principal component analysis (FPCA) is a widely
used tool in statistics for analyzing temporal data [23]. It
achieves dimensionality reduction by summarizing the information in the temporal trajectories into a series of functional
principal component (FPC) scores. In our constructed data set,
the temporal trajectories of key quantities over the past 5 years
are used as the target for performing FPCA. The FPC scores
extracted from such an analysis are then used as the predictors
in the model.
We model the trajectories of a particular key quantity as
independent realizations from a stochastic process X(t) and
let Xi (t) denote the trajectory realization of the ith subject.
Let μ(t) = E(X(t)) and K(s, t) = Cov(X(s) − μ(s), X(t) −
μ(t)) denote the mean function and the covariance function,
respectively. Based on the Karhunen-Loève decomposition,
$X_i(t)$ can be expressed as

$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\,\phi_k(t), \qquad (1)$$

where $\phi_k(t)$ is the $k$th eigenfunction and $\xi_{ik}$ is the associated FPC score for the $i$th subject. The eigenfunctions should satisfy

$$\int_T \phi_k(t)\,\phi_j(t)\,dt = \delta_{kj}, \qquad (2)$$

where $\delta_{kj} = 1$ if $k = j$ and $0$ otherwise. The FPC score is defined as

$$\xi_{ik} = \int_T \big(X_i(t) - \mu(t)\big)\,\phi_k(t)\,dt. \qquad (3)$$

Fig. 1. Yearly trajectories of the maximum total claim count under a fraudulent and a non-fraudulent physician. Compared with the non-fraudulent cases, the trajectories of the fraudulent cases tend to have more extreme outliers with much higher values than the population.

The magnitude of $\xi_{ik}$ represents the degree of similarity between $X_i(t) - \mu(t)$ and the eigenfunction $\phi_k(t)$. The mean and variance of the distribution of $\xi_{ik}$ are $E(\xi_{ik}) = 0$ and $\mathrm{Var}(\xi_{ik}) = \lambda_k$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$.
To obtain the FPC estimates and FPC scores, we perform the
following procedures.
1) Do a local linear regression to obtain a smoothed estimate of $X_i(t)$, denoted as $\hat{X}_i(t)$.
2) Calculate $\hat{\mu}(t) = \frac{1}{n}\sum_{i=1}^{n} \hat{X}_i(t)$. The sample covariance function is given by

$$\hat{K}(t, t') = n^{-1}\sum_{i=1}^{n}\big(\hat{X}_i(t)-\hat{\mu}(t)\big)\big(\hat{X}_i(t')-\hat{\mu}(t')\big) = \sum_{j=1}^{\infty}\hat{\lambda}_j\,\hat{\phi}_j(t)\,\hat{\phi}_j(t'), \qquad (4)$$

where $\{\hat{\lambda}_j, j \ge 1\}$ are the estimated eigenvalues and $\{\hat{\phi}_j(\cdot), j \ge 1\}$ the estimated eigenfunctions; both are obtained by spectral decomposition of $\hat{K}(t, t')$.
3) Finally, the FPC scores are obtained as $\hat{\xi}_{ik} = \int_T \{\hat{X}_i(t) - \hat{\mu}(t)\}\,\hat{\phi}_k(t)\,dt$.
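For illustration, a discretized NumPy sketch of steps 1-3 is given below: the smoothed yearly trajectories are treated as vectors on a common grid, and the eigenfunctions are obtained by spectral decomposition of the sample covariance matrix. This is a simplification under those assumptions, not the authors' implementation.

```python
import numpy as np

def fpca_scores(X: np.ndarray, n_components: int = 3):
    """X: (n_subjects, n_timepoints) matrix of already-smoothed trajectories on
    a common yearly grid.  Returns the mean curve, the leading discretized
    eigenfunctions, and the FPC scores."""
    mu = X.mean(axis=0)                               # step 2: mean function
    Xc = X - mu
    K = Xc.T @ Xc / X.shape[0]                        # sample covariance on the grid
    eigvals, eigvecs = np.linalg.eigh(K)              # spectral decomposition
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    phi = eigvecs[:, order]                           # discretized eigenfunctions
    scores = Xc @ phi                                 # step 3: FPC scores (grid sum approximates the integral)
    return mu, phi, scores
```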
The obtained FPC scores can be used as covariates or predictors in the machine learning model. As an example, we consider the following functional logistic regression model:

$$\Pr(Y = 1 \mid X_i) = \Psi\left(\int_0^T \beta(t)\,X_i(t)\,dt\right), \qquad (5)$$

where $\Psi(x) = \exp(x)/(1 + \exp(x))$ is the logistic function.

Fig. 2. The first three FPCs of the yearly trajectories of the sum of the total claim count in the Medicare database.
Based on the basis $\{\phi_k(t) : 1 \le k < \infty\}$, we can expand $\beta(t)$ as

$$\beta(t) = \sum_{k=1}^{\infty} \phi_k(t)\,\beta_k, \qquad (6)$$

where $\beta_k = \int_0^T \beta(t)\,\phi_k(t)\,dt$ is the basis coefficient. Through functional principal component analysis, we rewrite

$$\begin{aligned}
\int_0^T X_i(t)\,\beta(t)\,dt &= \int_0^T \Big\{\mu(t) + \sum_{k=1}^{\infty}\xi_{ik}\,\phi_k(t)\Big\}\,\beta(t)\,dt \\
&= \int_0^T \mu(t)\,\beta(t)\,dt + \sum_{k=1}^{\infty}\xi_{ik}\int_0^T \phi_k(t)\,\beta(t)\,dt \\
&= \int_0^T \mu(t)\,\beta(t)\,dt + \sum_{k=1}^{\infty}\xi_{ik}\,\beta_k \\
&\approx \int_0^T \mu(t)\,\beta(t)\,dt + \sum_{k=1}^{K}\xi_{ik}\,\beta_k. \qquad (7)
\end{aligned}$$

Thus the functional logistic regression model can be rewritten as a usual logistic regression model using the FPC scores as the predictors:

$$\Pr(Y = 1 \mid X_i) = \Psi\Big(\sum_{k=1}^{K}\xi_{ik}\,\beta_k\Big). \qquad (8)$$

Fig. 3. Example of the yearly trajectories of the sum of the total claim count of a fraudulent case and a non-fraudulent case in the Medicare database.

Example 4.1: To illustrate the computation and interpretation of the FPCs and FPC scores, we consider the trajectories of the sum of the total claim count. Fig. 2 plots the first three FPCs, $\phi_1(\cdot)$ to $\phi_3(\cdot)$, in (1). The first FPC $\phi_1(\cdot)$ is flat and above zero, representing the main degree of variation around the mean function. The second FPC $\phi_2(\cdot)$ is downward sloping and crosses the zero axis once; negative before year 3 and positive after year 3, $\phi_2(\cdot)$ represents the degree of change in the trajectory after year 3. The third FPC $\phi_3(\cdot)$ crosses the zero axis twice, at years 2 and 4, i.e., it is negative in [2, 4] and positive in the other two intervals, which can be interpreted as the difference between values during [2, 4] and those in the other time intervals.

The computed FPC scores vary from subject to subject, and Fig. 3 shows two examples of the yearly trajectories of the features and their respective first and second FPC scores, one fraudulent case and one non-fraudulent case. As the first FPC score represents the main degree of variation, trajectories with a higher overall value tend to have a larger first FPC score. Since its overall level is higher, the fraudulent case has a larger first FPC score than the non-fraudulent case (13,071 versus 1,838). The second FPC score represents the degree of change, and trajectories with a sharper change tend to have a larger (absolute) second FPC score. The fraudulent case has a more evident upward trend while the non-fraudulent case is almost flat, and thus the absolute value of the second FPC score of the fraudulent case is larger than that of the non-fraudulent case (1,867 versus 39). It is worth noting that the sign of the FPC score does not matter, because we may invert the FPC functions to obtain oppositely signed FPC scores.

3) Distributional FPCA for Drug-Level Covariates: The drug-related information in the data set includes the drug's total cost, total claim count, total number of beneficiaries, total 30-day fill count, and total daily supply under the physician and specialty in that given year. For each combination of physician and specialty type, multiple rows related to the information of a specific drug are available in the data set. The key covariates under each combination of physician and specialty type thus have a unique probability distribution across the past 5 years. For example, Fig. 4 shows a histogram of the distribution of the value of the log total claim count under a specific physician.

Fig. 4. Histogram and empirical probability density curve of the log of total claim count under a physician.

It is of particular interest to derive useful predictive features from the shape of such a distribution. To achieve this goal, we resort to the distributional FPCA proposed by Petersen and Muller [24]. The idea is to apply FPCA to the probability density curves of the distribution of a specific key covariate under a physician. The challenges of conducting FPCA on probability density curves are: first, a probability density function $f$ has to satisfy the constraint $\int_{-\infty}^{\infty} f(t)\,dt = 1$; second, the probability density functions may have different supports. To address these problems, a transformation approach that maps the probability density curves to a Hilbert space of functions through a continuous and invertible transformation can be used. Typical transformations include the log quantile density and log hazard transformations. After obtaining the kernel density estimate of the probability density curve, we may perform the FPCA on the transformed Hilbert space and use the FPC scores as the predictor variables.

To summarize, the procedure for performing distributional FPCA for a certain drug-level key covariate is as follows.
1) For each physician, create a sub data set consisting of all the rows corresponding to the physician's NPI.
2) Compute the kernel density estimate of the key covariate in the sub data set, denoted as $f(t)$.
3) Map $f(t)$ to the quantile function $Q(t) = F^{-1}(t)$, where $F(t)$ is the cumulative distribution function; then take the derivative of $Q(t)$, resulting in the quantile density function $q(t)$; lastly, take the log of $q(t)$ as our transformed function, denoted as $g(t)$. The function $g(t)$ is supported on [0, 1] and is unconstrained.
4) Repeat the above steps to obtain the function $g$ for all the physicians. Finally, conduct the regular FPCA on all the functions $g$.

B. Cost-Sensitive Learning

TABLE I: Cost matrix for the medical insurance fraud problem.

Cost-sensitive learning models develop classification rules that are capable of reflecting the actual savings of detecting a fraudulent activity versus the actual cost of inspecting a suspicious activity. The detection of fraud inherently carries a financial tradeoff. If the potential saving in cost when the fraudulent activities are caught and stopped outweighs the cost of investigating the fraud, then it is beneficial to conduct the investigation. Otherwise, it might not be a worthwhile decision to start the investigation if the potential "gain" from a successful fraud intervention is too small. Thus, a framework of learning models called cost-sensitive learning is proposed to reflect the actual financial cost involved in binary classification problems. Various cost-sensitive learning methods have been proposed, primarily in the area of credit card fraud detection [17].

For the fraud detection problem, we define the confusion matrix of our binary classification model. A fraudulent case is defined as positive and a non-fraudulent case as negative, and TP and FN respectively stand for the numbers of true positives and false negatives. A true positive (TP) case is where the predicted positive case is truly fraudulent; a false positive (FP) case is where the predicted positive case is in fact legitimate. A true negative (TN) case is where the predicted negative case is legitimate; a false negative (FN) case is where the predicted legitimate case is actually fraudulent. Under the tradition
4 shows a histogram of the distribution of the value of log total claim count under a specific physician. It is of particular interest to derive useful predictive features from the shape of such a distribution. To achieve this goal, we resort to the distributional FPCA proposed by Petersen and Muller [24]. The idea is to conduct FPCA to the probability density curves of the distribution of a specific key covariate under a physician. The challenges of conducting FPCA on probability density curves are: first, ∞a probability density function f has to satisfy the constraint −∞ f (t)dt = 1; second, the probability density functions may have different support. To address these problems, a transformation approach that maps the probability density curves to a Hilbert space of functions through a continuous and invertible transformation can be used. The typical transformations include log quantile density and log hazard transformations. After obtaining the kernel density estimate of the probability density curve, we may perform the FPCA on the transformed Hilbert space and use the FPC scores as the predictor variables. To summarize, the procedure for performing distributional FPCA for a certain drug-level key covariate is as follows. 1) For each physician, create a sub data set consisting of all the rows corresponding to the physician’s NPI. 2) Compute the kernel density estimate of the key covariates in the sub data set, denoted as f (t). 3) Map f (t) to the quantile function Q(t) = F −1 (t), where F (t) is the cumulative distribution function; then, take the derivative of Q(t), resulting in the quantile density function q(t); lastly, take the log of q(t) as our transformed function, denoted as g(t). The function g(t) is supported on [0,1] and is unconstrained. 4) Repeat the above steps to obtain function g for all the physicians. Finally, conduct the regular FPCA on all the function g’s. B. Cost-Sensitive Learning Cost-sensitive learning models develop classification rules that are capable of reflecting the actual savings of detecting a fraudulent activity versus the actual cost of inspecting a suspicious activity. The detection of fraud inherently carries a financial tradeoff. If the potential saving in cost when the fraudulent activities are caught and stopped outweighs the cost of investigating the fraud, then it is beneficial to conduct the investigation. Otherwise, it might not be a worthwhile decision to start the investigation if the potential “gain” from a successful fraud intervention is too small. Thus, a framework of learning models called cost-sensitive learning is proposed to reflect the actual financial cost involved in binary classification problems. Various cost-sensitive learning methods are proposed, primarily in the area of credit card fraud detection [17]. For fraud detection problem, we define the confusion matrix of our binary classification model. A fraudulent case is defined as positive and non-fraudulent case as negative, and TP and FN respectively stand for the numbers of true positives and false negatives. The true positive (TP) case is where the predicted positive cases is truly fraudulent; the false positive (FP) case is where the predicted positive case is in fact legitimate. The true negative (TN) case is where the predicted negative case is legitimate; the false negative (FN) case is where the predicted legitimate case is actualy fraudulent. Under the tradition