Python Data Analysis

Description

Jana Center:

To learn more about Jana Center, please visit their website: https://jana-sa.org/

Attached are a sample paper and a dataset for further analysis (the analysis shall be done in Python).

The project objectives and questions below are one recommendation; as the analyst, you can change and adjust them as needed.

Project objectives:

To identify and analyze the patterns and factors influencing loan disbursement and repayment among individuals in different regions of Saudi Arabia. This analysis aims to understand how demographic and socio-economic variables such as age, nationality, marital status, academic qualification, family income, number of dependents, and the sector of the loan influence the loan amount disbursed, the repayment rate, and the timing of loan disbursement.

Specific Questions to Address:

What are the demographic and socio-economic characteristics of individuals who apply for and receive loans in different regions of Saudi Arabia?

How does the loan disbursement amount vary across different sectors (e.g., cooking, salon businesses, makeup services) and demographic groups?

What factors contribute to a 100% repayment rate, and how can these insights inform strategies to improve loan repayment rates across all sectors?

Is there a relationship between the loan period, the amount disbursed, and the repayment rate?

How do external factors, such as the time of year or economic conditions, impact loan disbursement and repayment patterns?
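As a starting point for the questions above, here is a minimal pandas sketch. It assumes the three sample records shown in the attachment preview further down; because that preview is flattened, the row alignment used here is a guess, and the column name `LoanPeriod` is a renamed version of the preview's "Loan Period" header. It illustrates the kind of grouping and correlation the questions call for, not a finished analysis.

```python
import pandas as pd

# Three illustrative records typed in from the attachment preview.
# ASSUMPTION: the preview is flattened, so the row alignment used here is a guess.
loans = pd.DataFrame(
    {
        "MainSector": ["Commercial", "Product", "Product"],
        "LoanPeriod": [30, 20, 16],
        "LoanAmount": [30_000, 20_000, 10_000],
        "RepaymentRate": [100, 100, 100],
    }
)

# How the disbursed amount varies across sectors.
amount_by_sector = loans.groupby("MainSector")["LoanAmount"].mean()

# Linear association between loan period and amount disbursed.
period_amount_corr = loans["LoanPeriod"].corr(loans["LoanAmount"])

# Share of loans with a 100% repayment rate.
fully_repaid_share = (loans["RepaymentRate"] == 100).mean()

print(amount_by_sector)
print(round(period_amount_corr, 3), fully_repaid_share)
```

With the full dataset, the same three lines of analysis apply unchanged once the file is loaded into `loans`; with only three rows, the numbers carry no statistical weight.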

PROJECT GENERAL GUIDELINES:

Single-spaced.
1.5-2 pages max, excluding references.
Font size 11.
Two-column page setup.
References should be listed in an acceptable format (for example APA style).
Use headings and sub-headings styles.
Organizes the report in a logical and coherent manner.
Demonstrates effective communication skills with clear and concise writing.
Uses appropriate visualizations and diagrams to support the analysis.
Presents the technical details in a way that is understandable to a non-technical audience.
Provides appropriate citations and references to support claims and findings.
Section1: Introduction & Problem Identification (2 pages)
Clearly defines the business problem or question to be addressed.
Describes the relevance and significance of the problem in the business context.
Provides a clear objective or goal for the project.
Identifies the target audience or stakeholders for the analysis.
Demonstrates a clear understanding of the data required to address the problem.
Section2: Background/Review of the Literature (2 pages)
Conducts a comprehensive review of relevant literature, theories, and frameworks related to the problem. (minimum 14)
Identifies and explains the key concepts and variables related to the problem.
Shows an understanding of existing research or similar projects in the field.
Analyzes and presents any existing data sources or datasets relevant to the problem.
Section3: Methodology (2 pages)
Clearly defines the methodology or approach used to address the problem.
Explains the rationale behind the chosen methodology (e.g., descriptive analytics, predictive analytics, prescriptive analytics).
Describes the data collection process and data preprocessing techniques used.
Discusses any assumptions or limitations of the chosen methodology.
Section4: Analysis and Findings
Performs appropriate data exploration and preprocessing techniques to gain insights from and prepare the data for analysis.
Applies relevant statistical or machine learning techniques to analyze the data.
Presents the analysis results in a clear and structured manner (e.g., data visualizations, summary statistics).
Provides a thorough interpretation and explanation of the analysis findings.
Evaluates the accuracy and reliability of the analysis results.
Section5: Discussion and Insights (2 pages)
Interprets the analysis results in the context of the business problem.
Identifies and discusses any patterns, trends, or relationships discovered.
Provides meaningful insights and recommendations based on the analysis findings using business language.
Relates the insights to the original problem statement and project objectives.
Considers any ethical, legal, or social implications of the analysis findings.
Section6: References
Accurately cites all the sources used in the project, following a consistent citation style (i.e., APA style).
Provides in-text citations where appropriate to support claims and findings.


Unformatted Attachment Preview

Dataset preview (three sample records; values are listed per column in the order they appear in the preview)
Branch: Mecca; Khamis Mushait; Dammam
Age: 29; 26; 29
Nationality: Saudi; Saudi; Saudi
MaritalStatus: Married; Single; Single
AcademicQualification: High School; Bachelor; Diploma
Major (two values shown): administrative; Health administrative
Job: administrative; Nursing Specialist; Health administrative
FamilyIncome: 10433; 15000; 8400
Dependents: 0; 1; 2
Area: Dammam – Eastern Province; Riyadh South – Riyadh; Makkah – West
City: Dammam; Riyadh; Makkah
Client Cycle: 1; 1; 1
Lending Product: Promising Projects-Development Bank; Promising Projects-Development Bank; Individual starter loan
Loan Period: 30; 20; 16
Entry Date: 2021-11-17; 2021-08-25; 2021-09-22
DisbursementEntryDate: 2021-11-17; 2021-08-25; 2021-09-22
DisbursementDate: 2021-11-17; 2021-08-25; 2021-09-22
AmountSpent: 30,000; 20,000; 10,000
LoanAmount: 30,000; 20,000; 10,000
TotalPaid: 30,000; 20,000; 10,000
TotalRemainingForPayment: 0; 0; 0
RepaymentRate: 100; 100; 100
ProjectStatus: Active; Active; Active
MainSector: Commercial; Product; Product
Subsector (five values shown): Sale of clothing, textiles and leather; Personal/Home Care Products; Apparel & Textiles; Sale of clothing, textiles and leather; Personal/Home Care Products
Predictive Analytics of Attrition Drivers and How to Mitigate It
Employee Flight Risk: Predictive Analytics of
Attrition Drivers and How to Mitigate It
Completed Research Paper
Abstract
Talent acquisition has become a major driver for organizations to stay competitive in the market. Hence,
attrition of talented employees is one of the most significant challenges organizations can encounter. The
objective of this paper is to employ several machine learning techniques to predict employee attrition using
HR data retrieved from IBM website as well as attempt to reveal the most contributing factors to attrition.
After cleaning and preparing the data, we conduct two experiments. In the first experiment, we apply seven
algorithms, namely, logistic regression, naïve Bayes, random forest, k-nearest neighbors, artificial neural
network, support vector machine, and XGBoost and find that logistic regression is considered the best
performing model; achieving a higher rate of accuracy and sensitivity. In the second experiment, we use
hyperparameter tuning and a voting classifier to improve the models’ performance. As a result, we note a
clear improvement in all models, and AdaBoost, which is used as an alternative to XGBoost, shows an
accuracy rate of 86% and sensitivity of 85%. Further analysis of the predictive model reveals that
monthly income and overtime compensation contribute significantly to employee attrition, suggesting
that organizations should pay more attention to those factors in order to increase employee retention.
Lastly, we communicate our study contributions to enrich both literature and industry.
Keywords
Employee attrition, predictive analytics, HR, hyperparameter tuning, voting classifier
Introduction
Talent acquisition is one of the most important hiring processes for any organization. For an organization
to be successful, it requires the retention of top performers and talent who are loyal, hard-working and
dedicated to achieving the organization’s objectives, missions, and vision. As explained by Bill Gates (1992),
“you take away our top 20 employees and we [Microsoft] become a mediocre company”. Attrition is thus
one of the most significant challenges organizations can encounter (Zhao et al., 2018). Attrition is defined
as the departure of employees from an organization for voluntary or involuntary reasons, including
resignation, termination, death, or retirement. Within the field of human resources (HR), attrition has
gained increased attention as the recruitment of talent and training are expensive endeavors and the
retention of top talent represents a competitive advantage (Fallucchi et al., 2020). It is much more cost
effective to retain employees than continuously hire, onboard, and train new employees.
Employee turnover has been identified as a pivotal factor curbing the growth of organizations (Zhao et al.,
2018). Within any organization, large or small, the human resources department plays an important role in
retaining employees. When an existing employee resigns, it is the responsibility of human resources to
intervene immediately to understand the factors that influence the employee’s decision and address them
accordingly. There are several factors that might contribute to an employee’s decision to resign, such as
developmental opportunities, stress and work life balance, compensation, work-environment,
communication, and supervision (Kossivi et al., 2016).
While the main focus of HR is the management of people within an organization, it also requires the
management of data (e.g. demographics, training records, performance reports, etc.). Traditionally, the HR
The 5th MENA Conference on Information Systems (MENACIS), Dhahran 2022
data has only been used for reporting purposes and generating basic employees’ statistics (Kumar, 2020).
This limited use is insufficient and does not help organizations make strategic decisions because it does not
provide insights. Thus, the HR department needs a better approach to support decision making that utilizes
analytics and in turn can improve employees’ performance, satisfaction, and motivation.
According to Gartner (2022), HR analytics (also known as people analytics) “is the collection and
application of talent data to improve critical talent and business outcomes”. HR analytics enable HR leaders
to develop data-driven insights to inform talent decisions, improve workforce processes, and ensure
employees have positive experiences. HR analytical tools can provide evidence-based insights into
fundamental questions in areas related to making better hiring decisions, reducing employee attrition, and
increasing employee engagement among others. Some of the major applications that HR analytics are
utilized for are retention, recruiting, employee performance, compensation, and training and development
(Kumar, 2020).
With technological advancements in machine learning, data has become a strategic asset rather than just
static information. Machine learning is a self-learning algorithm that uses data and models to make
predictions. Several researchers have studied machine learning approaches to improve the outcomes of HR
management (Al-Radaideh & Al Nagi, 2012; Chien & Chen, 2008; Li et al., 2011; Zhao et al., 2018). The
literature indicates that machine learning is important within the field of human resources because it can
enhance efficiency and enable data to be transformed into knowledge (Jain & Maitri, 2018; Mishra et al.,
2016; Fallucchi et al., 2020). Utilizing machine learning and its related models can help mitigate critical
issues (e.g. attrition) and optimize HR related activities (Jain & Maitri, 2018; Mishra et al., 2016; Fallucchi
et al., 2020; Zhao et al., 2018).
Attrition is a growing risk for many organizations as it has implications for the health of an organization
(e.g., finances, skills). What makes attrition particularly challenging is that it is difficult to predict, and it
introduces gaps within an organization’s skilled workforce (Zhao et al., 2018). Thus, it is important for
organizations to understand the underlying causes of attrition to adequately implement retention strategies
to prevent employee turnover and skills gaps. Understanding the factors related to attrition is an important
step in developing solutions to mitigate them. This leads to two research questions:
1. What is the best model to predict attrition?
2. What are the key factors associated with attrition?
As employee turnover is a costly challenge encountered by employers (Frye et al., 2020), the purpose of this
paper is to build a model that predicts attrition by identifying an employee’s risk of leaving a company. The
targeted context of study is the human resources management department, specifically, the workforce
analytics applied to minimize the risk of employee attrition. Identifying drivers of attrition proactively helps
organizations retain talent and high-performing employees. Building on existing literature (Al-Radaideh &
Al Nagi, 2012; Chien & Chen, 2008; Li et al., 2011; Zhao et al., 2018), this paper focuses on hyperparameter
optimization and a soft voting classifier to boost the models’ accuracy. Hence, we take into consideration
multiple evaluation measures to select the best model. Several machine learning algorithms are used to
adequately transform the data and build a model that predicts employee attrition.
Zhao et al. (2018) noted that existing papers utilizing machine learning methods related to predicting
employee turnover tend to be problem-specific and difficult to generalize. Hence, this study contributes
in two ways. First, it expands the current literature on machine learning and human resources, especially
on predicting employee attrition. This could add to the knowledge base in this area and further our
understanding of which factors matter most when investigating employee attrition. Second, this work
would allow organizations to make better strategic decisions, better identify retention processes that
mitigate attrition, and pay more attention to what matters most to employee attrition, namely the
factors that rank high in importance among employees.
The rest of this paper is organized as follows. The literature review section presents an overview of related
works. The methodology section explains the techniques used for collecting and processing the data and
designing the models. The analysis and findings section shows the experimental results as well as the
findings. Finally, the discussion and insights section provides a summary of findings, limitations, and
future work.
Literature Review
Attrition has become a pain point for many organizations. Attrition does not only affect the confidence and
security of organization employees but it also impacts the financial position of the organization (Jain &
Maitri, 2018). Literature on employee attrition has focused on two categories: voluntary (e.g., quitting) and
involuntary (e.g., termination). Involuntary attrition occurs when an employee is terminated because of
performance issues or misconduct, or is part of a seasonal layoff or overall reduction in force. Voluntary
attrition is when the employee leaves the organization of their own will (Darraji et al., 2021). This paper
focuses solely on voluntary attrition.
There are many reasons for employee attrition within organizations, some of which organizations can
control and some of which they cannot. Figure 1 shows the most common reasons for employees leaving
their organizations voluntarily (Ben Yahia et al., 2021).
Figure 1. Employee Attrition Reasons (diagram nodes: external environment, job satisfaction, job content, stress, compensation, co-workers)
Several studies have explored the use of machine learning to predict employee attrition. Yedida et al. (2018)
utilized k-nearest neighbors (KNN) algorithm, neural network, decision trees, and logistic regression to
predict whether an employee of a company will leave or not. They used employee performance, average
monthly hours worked, and number of years spent with the company, among others as features of the
prediction model and achieved the highest accuracy of 94.32% with the KNN model, giving it superiority
compared to other classifiers. Alduayj and Rajpoot (2018) conducted three experiments (on the original
class-imbalanced dataset, a synthetic over-sampled dataset, and an under-sampled dataset) to predict employee attrition.
They used several machine learning models, including random forests, k-nearest neighbors, and support
vector machines with different kernel functions. Again, KNN classifier achieved the highest performance,
with 93% accuracy rate.
Yigit and Shourabizadeh (2017) compared the performance of various classification techniques in
predicting the churn rate of employees using a fictional dataset prepared by IBM, which had 1470 records
and 35 features. In this study, support vector machine (SVM) algorithm performed better than other
algorithms with an accuracy rate of 88%. Frye et al. (2020) presented some modeling algorithms to predict
employees’ attrition in addition to the ethical consideration of using those models. Among the different
algorithms used in this study, logistic regression was capable of delivering the required results as applied
on different demographics of administrative occupations with 74% accuracy. Kakad et al. (2020) proposed
XGBoost model, which is very effective in terms of memory efficiency, high accuracy, and low running times
for predicting employee attrition, and achieved accuracy of almost 90%. This high accuracy indication for
employee turnover prediction can give assurance for decision makers to undertake preventive actions.
Recently, Fallucchi et al. (2020) predicted employee attrition using Gaussian Naïve Bayes classifier that
produced the best results on real data. They found that the main contributing variables to predict attrition
are monthly income, age, overtime, and distance from home. In the same vein, Qutub et al. (2021)
predicted employee attrition using machine learning models, and their findings
suggested that training multiple models on different subsets of features leads to a better
understanding of attrition prediction.
Within the literature, the focus tends to be on predicting employee attrition; however, it is also important
for an organization to not only predict as soon as possible an employee’s intention to leave but also to
interpret and explain why the employee has the intention to leave.
In contrast to prior research, this paper conducts experiments with different hyperparameter tunings, while
considering overfitting, on a human resources dataset to achieve higher performance in predicting employee
attrition.
Methodology
In this section, we show the intended methods to be used for collecting, processing, and analyzing the data.
As per Figure 2, each phase depicts the steps used to perform the analysis, while the last phase focuses on
evaluating the developed models.
Figure 2. Research Methodology (phases: Data Collection and Exploration, using the fictional data set created by IBM data scientists; Preprocessing: outliers, dummies, imbalanced data, feature selection, normalization; Experiments: logistic regression, naive Bayes, random forest, k-nearest neighbors, artificial neural network, support vector machine; Evaluation: accuracy, sensitivity, precision)
Data Collection and Exploration
We targeted structured data stored in tabular form. The data collected was a fictional dataset
created by IBM data scientists, pulled from a sample dataset on the IBM website. The data had 36 features
and 1,470 employee records, each labeled ‘Yes’ or ‘No’. ‘Yes’ means the employee left the company, and ‘No’
means the employee did not leave the company. Table 1 shows the data structure along with the corresponding
types.
#   Attribute name              Type
1   Age                         Numerical
2   Attrition                   Categorical
3   Business Travel             Categorical
4   Daily Rate                  Numerical
5   Department                  Categorical
6   Distance From Home          Numerical
7   Education                   Categorical
8   Education Field             Categorical
9   Employee Count              Numerical
10  Employee Number             Numerical
11  Environment Satisfaction    Categorical
12  Gender                      Categorical
13  Hourly Rate                 Categorical
14  Job Involvement             Numerical
15  Job Level                   Numerical
16  Job Role                    Numerical
17  Job Satisfaction            Categorical
18  Marital Status              Categorical
19  Monthly Income              Numerical
20  Monthly Rate                Numerical
21  Number of Companies Worked  Numerical
22  Over 18                     Categorical
23  Overtime                    Numerical
24  Percent Salary Hike         Numerical
25  Performance Rating          Numerical
27  Relationship Satisfaction   Numerical
28  Standard Hours              Numerical
29  Stock Option Level          Numerical
30  Total Working Years         Numerical
31  Training Times Last Year    Numerical
32  Work Life Balance           Numerical
33  Years At Company            Numerical
34  Years In Current Role       Numerical
35  Years Since Last Promotion  Numerical
36  Years With Current Manager  Numerical
Table 1. IBM Data dictionary
The dataset was analyzed using visual exploration to gather insights about its distribution. In this analysis,
different graphical methods are used depending on feature type, numerical or categorical. We used bar
charts or pie charts to graph categorical data, while we used histograms, boxplots, and scatterplots for
numerical data.
Figure 3 shows the distribution of attrition; 237 employees left the company while 1,233 are still
active employees. The graph also indicates imbalanced data. Imbalanced data is a major challenge when
training the models.
Figure 3. Attrition Distribution
As shown in Figure 4, the attrition rate is higher for male employees. Of the 237 employees who left the
company, 150 are male and 87 are female. This may indicate that male
employees tend to leave the company more than female employees.
Figure 4. Attrition/Gender Distribution
Figure 5 below shows the marital status distribution; the attrition rate for single status is higher than
married and divorced. There are 120 single employees, 84 married, and 33 divorced who left the company.
Figure 5. Attrition/Marital Status Distribution
In Figure 6, the attrition rate of employees is higher between 0 and 10 years at the company. Employee
attrition is lower when an employee has more years at the company, while the highest attrition
is in the second year, with 86 employees.
Figure 6. Attrition/Years at Company Distribution
Data Preprocessing
Data preprocessing is the first step in the data analysis process that takes raw data and transforms it into a
format that can be understood and analyzed by computers and machine learning (Fan et al. 2021). It
consists of the following sub-steps:
Outliers Treatment
Outliers are defined as records with extreme values. The z-score is one of the useful techniques to detect
outliers (Anusha et al., 2019). The formula below is used to calculate the z-score value:
$$Z(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$
A value is considered an outlier if its z-score is greater than 3 or less than -3. The boxplots in Figure 7
identify the attributes with outliers, which include “years at company”, “total working years”, “years in
current role”, “years since last promotion”, and “years with current manager”.
Figure 7. Boxplots Graph Demonstrates Attributes with Outliers
There are different approaches to treat the outlier such as replacing the outliers with either mean or median
value for numerical variables or dropping the observations with outliers in some cases. Hence, we deleted
the observations with outliers if they are due to data entry errors or data processing errors, otherwise we
used the imputation method as it prevents the data loss. As for the imputation, we used the attribute’s
median to replace the outlier values as summarized in Table 2.
Attribute name            Number of outliers  Median (in years)
YearsAtCompany            25                  5
TotalWorkingYears         16                  10
YearsInCurrentRole        13                  3
YearsSinceLastPromotion   42                  1
YearsWithCurrManager      14                  3
Table 2. Number of Outliers with the Corresponding Median
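The z-score rule and the median imputation described above can be sketched as follows. This is a minimal illustration, assuming pandas, a threshold of 3, and the population standard deviation; the paper does not state which standard deviation was used or whether the median was computed before or after removing the outliers. The function name and the toy data are hypothetical.

```python
import pandas as pd


def impute_outliers_with_median(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Replace values whose |z-score| exceeds the threshold with the column median."""
    # ASSUMPTION: population standard deviation (ddof=0); the source does not specify.
    z_scores = (values - values.mean()) / values.std(ddof=0)
    # Keep values inside the threshold, substitute the median elsewhere.
    return values.where(z_scores.abs() <= threshold, values.median())


# Toy tenure data: twenty ordinary values plus one extreme value.
years_at_company = pd.Series([1, 2, 3, 4, 5] * 4 + [40])
cleaned = impute_outliers_with_median(years_at_company)
print(cleaned.iloc[-1])
```

Only the extreme value is replaced; every other entry is returned unchanged, which is why imputation avoids the data loss that dropping rows would cause.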
Non-Usable Attributes Omission
We removed attributes with a single unique value since they are unusable by the model; this
removal also reduced the dimension of the dataset. The attributes dropped from the dataset for this
reason are “employee count”, “over 18”, and “standard hours”. Similarly, “employee number” is a
unique number assigned to each employee within the company and carries no business meaning, so it was
also dropped from the dataset.
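A minimal pandas sketch of this step is shown below. The function name, the CamelCase column spellings, and the toy values are assumptions made for illustration; the approach simply drops every column with at most one distinct value plus any identifier columns named by the caller.

```python
import pandas as pd


def drop_unusable_attributes(frame: pd.DataFrame, identifier_columns=()) -> pd.DataFrame:
    """Drop constant columns and caller-specified identifier columns."""
    constant = [name for name in frame.columns if frame[name].nunique(dropna=False) <= 1]
    identifiers = [name for name in identifier_columns if name not in constant]
    return frame.drop(columns=constant + identifiers)


# Hypothetical toy rows; column spellings are assumed, not taken from the source file.
toy = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "EmployeeCount": [1, 1, 1],
        "Over18": ["Y", "Y", "Y"],
        "StandardHours": [80, 80, 80],
        "EmployeeNumber": [1, 2, 4],
        "MonthlyIncome": [5993, 5130, 2090],
    }
)
reduced = drop_unusable_attributes(toy, identifier_columns=["EmployeeNumber"])
print(list(reduced.columns))
```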
Dummy Attributes’ Encoding
The dummy attribute represents categorical variables which most machine learning algorithms cannot
handle. In this step, dummy coding should be created to incorporate nominal or ordinal attributes into
machine learning algorithms. Hence, we used one-hot encoding technique to convert categorical variables
into a format that can be used by machine learning algorithms. The basic idea of one-hot encoding is to
create new variables that take on values 0 and 1 to represent the original categorical values (Baijayanta,
2019). The one-hot encoding was applied to the following categorical attributes: “business travel”,
“department”, “education”, “education field”, “environment satisfaction”, “gender”, “job involvement”, and
“job level”.
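One-hot encoding as described above can be sketched with `pandas.get_dummies`. The category labels and column spellings below are illustrative assumptions; the paper does not say which library call was used.

```python
import pandas as pd

# Hypothetical toy rows with two categorical attributes and one numerical attribute.
employees = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
        "Gender": ["Female", "Male", "Male"],
    }
)

# Each category becomes its own 0/1 column; numerical columns pass through unchanged.
encoded = pd.get_dummies(employees, columns=["BusinessTravel", "Gender"], dtype=int)
print(encoded)
```

Each original categorical column is replaced by one indicator column per category, so a row has exactly one 1 within each group of indicator columns.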
Data Balancing
As shown in Figure 3 above, the attrition distribution is imbalanced: only about 16% of the
records have attrition = ‘Yes’. The training data is therefore also imbalanced and requires
resampling, so we used a random oversampling approach that randomly duplicates records of the minority class
(attrition = ‘Yes’). The oversampling approach was applied to the training data; Figure 8 compares the
training data before and after oversampling.
Figure 8. Oversampling Training Data (Before Vs. After)
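Random oversampling of the minority class can be sketched in a few lines of pandas. The sketch assumes the goal is a 1:1 class ratio, which the paper does not state explicitly; the function name, seed, and toy data are hypothetical.

```python
import pandas as pd


def random_oversample(train: pd.DataFrame, target: str = "Attrition",
                      minority: str = "Yes", seed: int = 42) -> pd.DataFrame:
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    minority_rows = train[train[target] == minority]
    majority_count = len(train) - len(minority_rows)
    # ASSUMPTION: balance to a 1:1 ratio; never sample a negative number of rows.
    n_extra = max(majority_count - len(minority_rows), 0)
    extra = minority_rows.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([train, extra], ignore_index=True)


# Toy training set: 2 leavers and 8 stayers.
train = pd.DataFrame({"Attrition": ["Yes"] * 2 + ["No"] * 8, "Age": range(30, 40)})
balanced = random_oversample(train)
print(balanced["Attrition"].value_counts())
```

Applying this to the training split alone keeps duplicated rows out of the test set, so the evaluation still reflects the original class distribution.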
Feature Selection
The goal of this step is to select the features that are highly dependent on the response variable. For
categorical features, we used the chi-square test, while we used ANOVA (‘Analysis of Variance’) for numerical
features. Both tests employ a p-value threshold (≤ 0.05) to determine significance (Muttakin et al., 2021). In Table
3, the features in gray (education, gender, performance rating, and relationship satisfaction) are not
significant (p-value > 0.05) and so were not selected. For the ANOVA test, the features in gray (hourly rate,
monthly rate, num companies worked, percent salary hike, and years since last promotion) are not
significant (p-value > 0.05) and so were not selected (Table 4).
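The two tests can be sketched with SciPy as below. This is an illustration under assumptions: the helper names and the toy data are hypothetical, and the paper does not say which library was used. `chi2_contingency` tests a categorical feature against the attrition label, and `f_oneway` runs a one-way ANOVA on a numerical feature grouped by the label.

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway


def chi_square_p(frame: pd.DataFrame, feature: str, target: str = "Attrition") -> float:
    """p-value of the chi-square test of independence between feature and target."""
    table = pd.crosstab(frame[feature], frame[target])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


def anova_p(frame: pd.DataFrame, feature: str, target: str = "Attrition") -> float:
    """p-value of a one-way ANOVA comparing the feature across target classes."""
    groups = [group[feature].to_numpy() for _, group in frame.groupby(target)]
    return f_oneway(*groups).pvalue


# Hypothetical toy data: OverTime and MonthlyIncome depend on attrition, HourlyRate does not.
toy = pd.DataFrame(
    {
        "Attrition": ["Yes"] * 20 + ["No"] * 20,
        "OverTime": ["Yes"] * 18 + ["No"] * 2 + ["Yes"] * 2 + ["No"] * 18,
        "MonthlyIncome": [2000 + 50 * i for i in range(20)] + [6000 + 50 * i for i in range(20)],
        "HourlyRate": list(range(40, 60)) * 2,
    }
)

p_values = {
    "OverTime": chi_square_p(toy, "OverTime"),
    "MonthlyIncome": anova_p(toy, "MonthlyIncome"),
    "HourlyRate": anova_p(toy, "HourlyRate"),
}
# Keep only the features that meet the p-value threshold used in the text.
selected = [name for name, p in p_values.items() if p <= 0.05]
print(selected)
```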
Attribute name            p-value   Attribute name            p-value
BusinessTravel            < 0.001   JobRole                   < 0.001
Department                0.005     JobSatisfaction           < 0.001
Education                 0.545     MaritalStatus             < 0.001
EducationField            0.007     OverTime                  < 0.001
EnvironmentSatisfaction   < 0.001   PerformanceRating         0.990
Gender                    0.290     RelationshipSatisfaction  0.155
JobInvolvement            < 0.001   StockOptionLevel          < 0.001
JobLevel                  < 0.001
Table 3. Chi-Square result for categorical features
Attribute name            p-value   Attribute name            p-value
Age                       < 0.001   PercentSalaryHike         0.605
DailyRate                 0.029     TotalWorkingYears         < 0.001
DistanceFromHome          0.003     TrainingTimesLastYear     0.023
HourlyRate                0.793     WorkLifeBalance           0.014
JobInvolvement            < 0.001   YearsAtCompany            < 0.001
MonthlyIncome             < 0.001   YearsInCurrentRole        < 0.001
MonthlyRate               0.561     YearsSinceLastPromotion   0.291
NumCompaniesWorked        0.095     YearsWithCurrManager      < 0.001
Table 4. ANOVA result for numerical features
Machine Learning Algorithms
In our study, we applied seven different machine learning algorithms. Before showing their applications, we
provide a brief description of each.
Logistic Regression
Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more independent variables. The binary variable can be represented as “0” and “1”. It is
named for the function used at the core of the method, the logistic function. Logistic regression is a
regression model that fits the values to the logistic function. It is useful when the dependent variable is
categorical (Sarker, 2021).
Naïve Bayes
Naïve Bayes assumes that the presence of a particular feature in a class is unrelated to the presence of any
other feature. The naïve Bayes model is easy to build and particularly useful for very large data sets. It can
be used for binary as well as multi-class classifications.
Along with its simplicity, naïve Bayes can outperform even highly sophisticated classification methods
(Sarker, 2021).
Random Forest
Random forest is used for both classification and regression problems in machine learning. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to enhance the performance of the model. Instead of depending on one decision tree, the
random forest takes the prediction from each tree, and the prediction with the majority of votes becomes
the final output (Sarker, 2021).
K-Nearest Neighbors (KNN)
KNN is a lazy learning algorithm: when training data is supplied to it, the algorithm does not train
itself. KNN classifies data points based on their similarity to their neighbors. Classification using KNN
involves determining neighboring data points and then deciding the class based on the classes of the
neighbors. Usually, the Euclidean distance is used as the distance metric (Sarker, 2021).
Artificial Neural Network (ANN)
ANN consists of multiple layers of nodes, each fully connected to the next. The algorithm uses
backpropagation for training the model. The input is transformed using a learned non-linear
transformation, which projects the input data into a space where it becomes linearly separable. This
intermediate layer is called a hidden layer (Sarker, 2021).
Support Vector Machine (SVM)
SVM is a model used for classification and regression problems. It can solve linear and non-linear
problems. The idea of SVM is simple; the algorithm creates a line or a hyperplane, which separates the data
into classes. When unknown data is given as input, it predicts which class it belongs to.
The margin between the hyperplane and the support vectors is made as large as possible to reduce the
classification error (Sarker, 2021).
Extreme Gradient Boosting (XGBoost)
Gradient boosting is an ensemble learning algorithm that generates a final model based on a series of
individual models. Extreme Gradient Boosting is a form of gradient boosting that takes more detailed
approximations into account when determining the best model. It reduces over-fitting and improves model
generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well
(Sarker, 2021).
Analysis and Findings
In this section, we demonstrate the results of two experiments performed on the dataset. In each
experiment, machine learning algorithms are used to train the models, which are then evaluated; in the
second experiment, we introduce the hyperparameter tuning technique to optimize the models.
Evaluation Metrics
Model evaluation shows us how the model performs (Vujović, 2021). We employed four metrics to evaluate
the prediction models, defined as follows:
● Accuracy measures how often the classifier predicts correctly.
● Sensitivity is a measure of how well a machine learning model can detect positive classes.
● Precision quantifies the number of positive class predictions that actually belong to the positive class.
● F1 score gives a combined idea about the precision and recall (sensitivity) metrics. It is maximum when
precision is equal to recall.
Due to the complexity of predicting employee attrition, the accuracy score alone is not enough to choose
the best model. Thus, besides precision, we use F1 and sensitivity scores for better evaluation to
determine the best model.
Experiment 1
In this experiment, the data was partitioned into two sets, training (70%) and testing (30%). We applied
several machine learning algorithms (Table 5). All the algorithms were implemented using their default
parameters.
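The four metrics can be computed directly from confusion-matrix counts, as in the sketch below. The counts used in the example are not reported in the paper: they are back-calculated under the assumption of a 441-row test set (30% of 1,470 records) so that they reproduce the logistic regression row of Table 5. The function name is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, sensitivity (recall), precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # share of actual positives that were detected
    precision = tp / (tp + fp)     # share of predicted positives that were correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# ASSUMPTION: counts inferred from Table 5, not reported by the authors.
logistic_regression = classification_metrics(tp=44, fp=97, fn=26, tn=274)
print({name: round(value, 4) for name, value in logistic_regression.items()})
```

The example shows why accuracy alone can mislead on imbalanced data: 72% of predictions are correct, yet fewer than one in three predicted leavers actually left.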
Algorithm name             Accuracy  Sensitivity  Precision  F1 Score
Logistic Regression        0.7211    0.6286       0.3121     0.4171
Naïve Bayes                0.5624    0.6857       0.2192     0.3322
Random Forest              0.7551    0.5429       0.3333     0.4130
K-Nearest Neighbors        0.7120    0.5000       0.2756     0.3553
Artificial Neural Network  0.7687    0.3571       0.3049     0.3289
Support Vector Machine     0.7279    0.5857       0.3106     0.4059
XGBoost                    0.7868    0.4429       0.3605     0.3974
Table 5. Experiment 1 model results

The receiver operating characteristic (ROC) curve shows the general 'predictiveness' of a classifier. The area under the curve (AUC) equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. The closer the curve is to the top-left corner, the better the classifier (Dey, 2021). Figure 9 shows the ROC curves for all the developed models.

Figure 9. ROC Curves

According to Table 5, the accuracy of the models lies between 71% and 79%, except for naïve Bayes at 56%. Nevertheless, all models show poor precision (at most 0.36) and a poor F1 score (at most 0.42). Overall, the logistic regression metrics, supported by the ROC curve, are reasonably better, since its accuracy and sensitivity are good and there is no overfitting (accuracy is 78% on the training set). In the next experiment, we apply hyperparameter tuning to improve the models and introduce a voting classifier.

Experiment 2 (hyperparameter tuning and voting classifier)
In this experiment, we follow the same steps as in Experiment 1 but employ hyperparameter tuning. Hyperparameter tuning (or hyperparameter optimization) is the process of determining the combination of hyperparameters that maximizes model performance. It works by running multiple trials in a single training process. There are different hyperparameter tuning methods, for instance, grid search and random search.
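To make the two search strategies named above concrete, the Python sketch below runs both over a small search space. The hyperparameter names, the candidate values, and the scoring function are all hypothetical: `toy_score` stands in for the validation score that a real run would obtain by training and evaluating a model, and nothing here reproduces the paper's actual setup.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 5, 10],
}


def toy_score(params):
    """Stand-in for a validation score; highest (0.0) at 200 / 8 / 2."""
    return -(
        abs(params["n_estimators"] - 200) / 400
        + abs(params["max_depth"] - 8) / 16
        + abs(params["min_samples_leaf"] - 2) / 10
    )


def grid_search(space, score):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(space)
    candidates = [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]
    return max(candidates, key=score), len(candidates)


def random_search(space, score, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=score), len(candidates)


if __name__ == "__main__":
    best_grid, n_grid = grid_search(SEARCH_SPACE, toy_score)
    best_rand, n_rand = random_search(SEARCH_SPACE, toy_score, n_trials=10)
    print("grid search:  ", n_grid, "trials, best", best_grid)
    print("random search:", n_rand, "trials, best", best_rand)
```

The trade-off in this sketch is the trial budget: grid search evaluates all 64 combinations, while random search evaluates only the requested number of samples and may miss the best combination.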
In grid search, we define the combinations to try and train the model on each of them, whereas in random search, the combinations are selected randomly (Yang & Shami, 2020). We use random search in this experiment to limit human intervention and the bias it can introduce.

Another technique to improve performance is the voting classifier. It is a machine learning model that trains an ensemble of several models and predicts the output class by aggregating the findings of each classifier passed into it. The voting type can be hard, where each classifier votes with its predicted class and the majority class wins, or soft, where the predicted probabilities of each class are combined (Leon et al., 2017). Soft voting was used here since it combines the probabilities of each prediction in each model and picks the prediction with the highest total probability.

To build the voting model, three algorithms were selected: logistic regression, naïve Bayes, and random forest. The selected algorithms work differently and are likely to make different errors. We first applied hyperparameter tuning to improve performance and then applied the voting classifier, which appears to outperform all the individual classifiers, with 81% accuracy, an 85% sensitivity rate, and no overfitting issue (Figure 10). This model significantly enhances performance; however, it requires high computational power. Therefore, building, training, and deploying such models is more costly (Leon et al., 2017). Figure 10 depicts the voting model as compared to the other selected algorithms.

[Diagram: random forest, logistic regression, and naïve Bayes feed a soft voting classifier. Accuracy: logistic regression 72%, random forest 76%, naïve Bayes 56%, soft voting 81%.]
Figure 10.
Soft Voting Model Accuracy

We summarize the model results after applying hyperparameter tuning in Table 6 below. The table shows a noticeable improvement in all the measures compared to the previous experiment. Likewise, overfitting was dramatically reduced, except in the XGBoost model, which yields a 100% accuracy rate on the training set but 90% on the testing set. The AdaBoost algorithm was used as an alternative to XGBoost since the dataset was not complex and its dimensionality was low (Leon et al., 2017). Despite the limited hyper