Description
All files needed for the assignment including instructions have been uploaded below.
Scenario
Your team lead at PwC just briefed you on a new project she would
like your help with. You will be assisting with analyzing employee
engagement data from LoanTronic.
…
New Message
To: LoanTronic Consulting Group
Subject: Master Dataset for Employee Engagement
Good morning,
We have a new client, LoanTronic, who would like our help
analyzing employee engagement data. LoanTronic has supplied us
with three data files to assist us in our analysis. These files contain a
list of all employees along with some demographic data.
• questions.csv
• Risk Column(s): to be confirmed
• responses.csv
• Risk Column(s): to be confirmed
• roster.csv
For our convenience, the client also provided us with the data
dictionary for the files.
questions.csv
Attribute | Field Name | Type | Categorical | Description
Question Number | id | int | False | Question number
Measurement | measurement | str | True | Attribute measured by the question. Acceptable values are: engagement; leadership; enablement; alignment; development
Question | question | str | False | The question as posed to the user. Note that [Company] gets replaced with the client name
responses.csv
Attribute | Field Name | Type | Categorical | Description
Employee ID | employee_id | int | False | Employee ID is a unique identifier given to each employee at the time of onboarding
Question ID | question_id | int | False | Question number from the questions.csv file
Answer | answer | int | True | Response on a Likert Scale: 1 – strongly disagree; 2 – disagree; 3 – neutral; 4 – agree; 5 – strongly agree
Score | score | float | True | Likert Scale response converted into a percentage: 1 – strongly disagree – 0.20; 2 – disagree – 0.40; 3 – neutral – 0.60; 4 – agree – 0.80; 5 – strongly agree – 1.00
roster.csv
Attribute | Field Name | Type | Categorical | Description
Employee ID | employee_id | int | False | Employee ID is a unique identifier given to each employee at the time of onboarding
Title | title | str | False | Position title
Last Name | last_name | str | False | Employee last name (anonymized)
First Name | first_name | str | False | Employee first name (anonymized)
Manager ID | manager_ID | int | False | Employee ID of the manager
Function | function | str | True | Functional reporting structure. Legitimate values are: Engineering; Finance & Administration; Loan Operations; Marketing; Product; Technology; Compliance
Department | department | str | True | Department. Produce the list of departments
Location | location | str | True | Location where the employee is based. Legitimate values are: Atlanta, GA; Austin, TX; Chicago, IL; Mountain View, CA; New York, NY
Age | age | str | True | Age bracket of the employee. Legitimate values are: 25 – 34; 35 – 44; 45 – 54; 55 – 64; 65+
Sex | sex | str | True | Sex of the employee. Legitimate values are: M (male); F (female)
Ethnicity | ethnicity | str | True | Ethnicity of the employee. Legitimate values are: african_american; asian; hispanic; mixed_race; white; other
Employment Status | employment_status | str | True | Employment Status. Legitimate values are: Full time; Part time
Tenure | Tenure | str | True | Time elapsed since the start of employment. Legitimate values are: less than 6 months; 6 months to less than 1 year; 1 to less than 2 years; 2 to less than 4 years; 4 to less than 6 years
The roster file is the product of the client’s Human Resources team.
We will use this file in more detail in our future work. Unfortunately,
every HR team member has a different idea of how to enter data, so
the categorical variables are very likely to have problems that will
need to be fixed. Our goal is to create a master dataset we can use to
assess employee engagement along five key measurements needed
for our final report: engagement, leadership, enablement, alignment,
and development. In order to do this, you will need to load each file,
inspect the content, identify issues, and fix them. Then you will need
to compute the summary statistics for the data and create a
visualization to show the distribution of the measurement scores.
Please let me know if you have any questions.
Thank you,
Tye Brandert
Key Tasks
Let’s get a feel for the employee survey data from LoanTronic. The
client shared three files with us, each of which may contain data problems.
questions.csv
• Risk Column(s): to be confirmed
responses.csv
• Risk Column(s): to be confirmed
roster.csv
• Risk Column(s): location, ethnicity
Please load the first two files, inspect the content, identify issues,
and fix them. We will also need to convert any categorical variables
into a pandas categorical dtype. For your convenience, the data
dictionary for the first two files is included in the back of this packet.
We’ll examine the roster file next week, but if you are super
motivated, feel free to check out the dataset and get a head start on
thinking about the results.
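The conversion to a pandas categorical dtype mentioned above could look like the following minimal sketch. It assumes the allowed labels are exactly those listed in the data dictionary for questions.csv; the three sample rows are made up so that the example runs without the client files.

```python
import pandas as pd

# Allowed labels taken from the data dictionary for questions.csv.
MEASUREMENTS = ["engagement", "leadership", "enablement", "alignment", "development"]

# Hypothetical sample rows standing in for the real file.
questions = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "measurement": ["engagement", "leadership", "alignment"],
    }
)

# Declaring the categories up front turns any label outside the list into NaN,
# which is easy to spot afterwards with isna().
measurement_dtype = pd.CategoricalDtype(categories=MEASUREMENTS, ordered=False)
questions["measurement"] = questions["measurement"].astype(measurement_dtype)

print(questions.dtypes)
print(questions["measurement"].cat.categories.tolist())
```

Because labels outside the declared categories become NaN, it is probably safest to run this conversion only after the labels have been cleaned.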
Question #1: Inspect & Clean Questions
Use pandas to load and inspect the questions.csv data and ensure
the questions make sense and that the data is clean.
(a) How many “measurement” categories are there?
(b) Do you notice anything strange about the category labels?
(c) Clean up the measurement categories if necessary.
(d) Save the clean version of the file under the name:
questions_clean.csv
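One way to approach steps (a) to (d) is sketched below. The in-memory CSV and its question wording are hypothetical stand-ins for questions.csv, and the whitespace and capitalisation problems are only assumed examples of what the real file might contain. The output is written to a temporary directory so the sketch has no side effects; in the notebook the target would simply be questions_clean.csv.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Allowed labels taken from the data dictionary.
MEASUREMENTS = ["engagement", "leadership", "enablement", "alignment", "development"]

# Hypothetical stand-in for the real file; in the notebook this would be
# pd.read_csv("questions.csv").
raw = io.StringIO(
    "id,measurement,question\n"
    "1,engagement,I am proud to work at [Company].\n"
    "2, Leadership ,I trust the leaders of [Company].\n"
    "3,ENABLEMENT,I have the tools I need.\n"
    "4,alignment,I understand the goals of [Company].\n"
)
questions = pd.read_csv(raw)

# (a) and (b): count the distinct labels and look at them exactly as entered.
print(questions["measurement"].nunique())
print(questions["measurement"].value_counts(dropna=False))

# (c): normalise whitespace and case, then flag anything still unexpected.
questions["measurement"] = questions["measurement"].str.strip().str.lower()
unexpected = set(questions["measurement"].dropna()) - set(MEASUREMENTS)
print("unexpected labels:", unexpected)

# Convert to a categorical dtype once the labels are clean.
questions["measurement"] = questions["measurement"].astype(
    pd.CategoricalDtype(categories=MEASUREMENTS)
)

# (d): save the cleaned file (temporary directory used here on purpose).
OUTPUT_PATH = Path(tempfile.mkdtemp()) / "questions_clean.csv"
questions.to_csv(OUTPUT_PATH, index=False)
```

If the set of unexpected labels is not empty after normalisation, the remaining labels likely need a manual mapping to the five allowed values.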
Question #2: Inspect & Clean responses.csv
Use pandas to load and inspect the responses.csv data and ensure
the responses fit the five-point Likert scale.
Label | Description
1 | strongly disagree
2 | disagree
3 | neutral
4 | agree
5 | strongly agree
Since this file was generated by automated software, you do not
expect any significant issues, but better safe than sorry. Let’s
double check to make sure.
(a) Are any responses outside the expected range?
– If so, what should you do about it?
– Apply the fix you described above if necessary.
(b) Do you observe any partial survey completions?
– If so, what should you do about it?
– Apply the fix you described above if necessary.
(c) Do you notice anything unexpected with any of the responses?
– If so, what should you do about it?
– Apply the fix you described above if necessary.
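The three checks above can be sketched as follows. The sample rows, the survey length `N_QUESTIONS`, and the chosen fixes (dropping invalid answers, repeated rows, and incomplete surveys) are assumptions for illustration, not the required answer; the right fix depends on what the real data shows.

```python
import pandas as pd

# Hypothetical stand-in for responses.csv; in the notebook this would be
# pd.read_csv("responses.csv").
responses = pd.DataFrame(
    {
        "employee_id": [101, 101, 101, 102, 102, 103, 103, 103, 103],
        "question_id": [1, 2, 3, 1, 2, 1, 2, 3, 3],
        "answer": [4, 5, 3, 2, 7, 1, 4, 5, 5],
    }
)
N_QUESTIONS = 3  # assumption: number of questions in this toy survey

# (a) Answers outside the 1 to 5 Likert range (missing answers are flagged too).
out_of_range = ~responses["answer"].between(1, 5)
print(responses[out_of_range])

# (b) Employees who answered fewer distinct questions than the survey contains.
answered = responses.groupby("employee_id")["question_id"].nunique()
partial = answered[answered < N_QUESTIONS]
print(partial)

# (c) Repeated (employee, question) pairs, one example of something unexpected.
duplicated = responses.duplicated(subset=["employee_id", "question_id"], keep="first")
print(responses[duplicated])

# One possible fix: drop invalid answers and repeated rows, then keep only
# employees whose remaining answers cover the whole survey.
clean = responses[~out_of_range & ~duplicated]
per_employee = clean.groupby("employee_id")["question_id"].nunique()
complete_ids = per_employee[per_employee == N_QUESTIONS].index
clean = clean[clean["employee_id"].isin(complete_ids)].reset_index(drop=True)
print(clean)
```

Whether to drop incomplete surveys or keep the valid answers they contain is a judgment call that should be stated explicitly in the notebook.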
(d) Let’s assume an equal distance between each label so that we
can translate individual answers to a percentage. This results in the
following scores. Add a column called “score” to your dataframe
and place the computed score for each answer in it:
Label | Description | Score
1 | strongly disagree | 0.20
2 | disagree | 0.40
3 | neutral | 0.60
4 | agree | 0.80
5 | strongly agree | 1.00
(e) Save a clean version of the file once you have fixed issues. Use
the name: responses_clean.csv
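With equal spacing between the five labels, the score in the table is simply the answer divided by 5. The sketch below shows that computation alongside an explicit lookup that reproduces the table. The sample rows are hypothetical, and the file is written to a temporary directory only to keep the example free of side effects; in the notebook the target would be responses_clean.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned responses; the real frame comes from the earlier steps.
responses = pd.DataFrame(
    {
        "employee_id": [101, 101, 102],
        "question_id": [1, 2, 1],
        "answer": [1, 3, 5],
    }
)

# Equal distance between labels: score = answer / 5.
responses["score"] = responses["answer"] / 5

# The same result via an explicit lookup; an answer missing from the
# dictionary would map to NaN, which makes bad labels visible.
SCORE_MAP = {1: 0.20, 2: 0.40, 3: 0.60, 4: 0.80, 5: 1.00}
mapped = responses["answer"].map(SCORE_MAP)

print(responses)

# (e) Save the cleaned file (temporary directory used here on purpose).
OUTPUT_PATH = Path(tempfile.mkdtemp()) / "responses_clean.csv"
responses.to_csv(OUTPUT_PATH, index=False)
```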
What to Submit
• An annotated Jupyter notebook in HTML format
• Submit a final dataset called: questions_clean.csv
• Submit a final dataset called: responses_clean.csv
• Naming Convention for File Attachments: Please name your
dataset as “the file name_initials of your name_the last four
digits of the student id”
Example: filename_jd_0123.csv
Where “jd” is initials for student “John Doe” and “0123” are the last
four digits of John’s student id.
• Follow these instructions to learn how to submit multiple files
in Canvas.
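The naming convention for file attachments described above can be applied with a small helper. The function name `submission_name` and the sample student id are hypothetical; the pattern follows the filename_jd_0123.csv example.

```python
from pathlib import Path


def submission_name(filename: str, initials: str, student_id: str) -> str:
    """Build '<file name>_<initials>_<last four digits of id><extension>'."""
    path = Path(filename)
    return f"{path.stem}_{initials.lower()}_{student_id[-4:]}{path.suffix}"


# Hypothetical student "John Doe" with an id ending in 0123.
print(submission_name("questions_clean.csv", "JD", "9870123"))
# prints "questions_clean_jd_0123.csv"
```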
Data Dictionary
A data dictionary provides the basic information about each field in
the dataset.
questions.csv
Attribute | Field Name | Type | Categorical | Description
Question Number | id | int | False | Question number
Measurement | measurement | str | True | Attribute measured by the question. Acceptable values are: engagement; leadership; enablement; alignment; development
Question | question | str | False | The question as posed to the user. Note that [Company] gets replaced with the client name
responses.csv
Attribute | Field Name | Type | Categorical | Description
Employee ID | employee_id | int | False | Employee ID is a unique identifier given to each employee at the time of onboarding
Question ID | question_id | int | False | Question number from the questions.csv file
Answer | answer | int | True | Response on a Likert Scale: 1 – strongly disagree; 2 – disagree; 3 – neutral; 4 – agree; 5 – strongly agree
Project 1: Data Cleaning & Wrangling Rubric
Criteria and Ratings
Inspect and Clean the Roster File [C13O3|C13C11|C13C12]
• Standard (18 to >14.4 pts): Identifies and fixes any issues with categorical data in the roster file
• Approaching Standard (14.4 to >12.6 pts): Identifies and fixes any issues with categorical data in the roster file, but with gaps in accuracy or logic
• Below Standard (12.6 to >0 pts): Does not identify and/or fix any issues with categorical data in the roster file
Add the Measurement Category to the Responses [C13O3|C13C11|C13C12]
• Standard (18 to >14.4 pts): Merges the measurement category from the questions file to the responses file
• Approaching Standard (14.4 to >12.6 pts): Merges the measurement category from the questions file to the responses file, but with gaps in accuracy
• Below Standard (12.6 to >0 pts): Does not merge the measurement category from the questions file to the responses file
Build the Final Dataset [C13O3|C13C11]
• Standard (18 to >14.4 pts): Produces a clean, accurate dataset with all of the required information
• Approaching Standard (14.4 to >12.6 pts): Produces a dataset with all of the required information, but with gaps in accuracy or logic
• Below Standard (12.6 to >0 pts): Does not produce a clean, accurate dataset with all of the required information
Summary Statistics [C13O1|C13O3|C13O4|C13C1|C13C11|C13C19]
• Standard (18 to >14.4 pts): Computes the requested summary statistics, produces a Pareto chart based on the results, and responds to the questions regarding the results
• Approaching Standard (14.4 to >12.6 pts): Computes the requested summary statistics, produces a Pareto chart based on the results, and responds to the questions regarding the results, but with gaps in accuracy, clarity, detail, logic, or support
• Below Standard (12.6 to >0 pts): Does not compute the requested summary statistics, produce a Pareto chart based on the results, or respond to the questions regarding the results
Mechanics
• Standard (5 to >4.0 pts): Grammar, spelling, and punctuation adhere to professional standards
• Approaching Standard (4 to >3.5 pts): Grammar, spelling, and punctuation negatively impact the professionalism of the submission
• Below Standard (3.5 to >0 pts): Grammar, spelling, and punctuation impede the understanding of the ideas/concepts
Presentation
• Standard (5 to >4.0 pts): Concisely articulates information, ideas, and concepts in an organized, convincing fashion that aligns with target audience needs
• Approaching Standard (4 to >3.5 pts): Articulation of information, ideas, and concepts is drawn-out, unclear, disorganized, and/or does not consider target audience
• Below Standard (3.5 to >0 pts): Articulation of information, ideas, and concepts is disorganized, unprofessional, AND does not consider target audience
Total Points: 100