Description
I just need the Python portion of the assignment (Part 2: Questions 4, 5, and 6) completed because I am having trouble uploading the CSV file into Jupyter Notebook to read it. Question 4 uses the measurement CSV file, and Questions 5 and 6 use the penguins CSV file. I have also provided an example from a friend who did the same assignment, with different questions, in a previous year.
DS1000B – Assignment #3
Due: Mar 17, 2023 @ 11:55pm
Notes:
• Submissions must be done via Gradescope. You must carefully assign pages to their corresponding questions. You will receive a grade of zero in each case below:
   a. Submission not in PDF format.
   b. Questions with no pages assigned to them.
• Please submit a single PDF file. Here is a recommended way to achieve this:
   a. If you write your derivation on paper, you can scan the pages into a PDF file (if they are images, paste the images into a Word document, then save it as a PDF file).
   b. Write your Python code (e.g. in Jupyter Notebook), then save it as a PDF file.
   c. Combine all the PDF files above into one PDF file.
• If you have difficulty formatting your submission, please see the “Lab1-preparation” file, or attend TA office hours as soon as possible.
• Each student must submit their own work. Scholastic offences are taken seriously, and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at the following Web site:
   http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf
Grade Breakdown:
Part 1: Written Answer
   Question 1        20
   Question 2        10
   Question 3         7
   Total Points =    37
Part 2: Python
   Question 4         9
   Question 5        10
   Question 6         9
   Total Points =    28
Total Points:        65
Part 1 – Written Answer (Be sure to show all your work)
Question 1 [20 Points]
A study was conducted to see the number of pets who were overweight. The results are in the following
table:
                        Pet group
Overweight status       Dogs      Hamsters      Cats
Yes                     35        40            25
No                      75        30            10
a) [5 points] Calculate the marginal distribution for pets and the marginal distribution for overweight
status.
b) [3 points] Calculate the conditional distributions of overweight status given pet group.
c) [3 points] Calculate the conditional distributions of pet group given overweight status.
d) [5 points] Draw a bidimensional bar graph (by hand) to visualize the conditional distribution of
overweight status given pet group. Put the pet group on the x-axis.
e) [2 points] Calculate the relative risk of being overweight for cats versus dogs. Interpret.
f) [2 points] Calculate the odds ratio of being overweight for cats versus hamsters. Interpret.
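The arithmetic behind parts a) to f) can be checked with a short sketch like the one below. It assumes the flattened table reads as rows = overweight status (Yes, No) and columns = pet group (Dogs, Hamsters, Cats); if the original layout is different, the counts must be rearranged. The helper names relative_risk and odds_ratio are illustrative, not part of the assignment.

```python
import pandas as pd

# ASSUMPTION: rows are overweight status, columns are pet group.
# Rearrange the counts if the original table is laid out differently.
counts = pd.DataFrame(
    {"Dogs": [35, 75], "Hamsters": [40, 30], "Cats": [25, 10]},
    index=["Yes", "No"],
)

total = counts.to_numpy().sum()

# Marginal distributions: column totals and row totals divided by the grand total.
marginal_pet = counts.sum(axis=0) / total
marginal_status = counts.sum(axis=1) / total

# Conditional distribution of status given pet group: each column sums to 1.
status_given_pet = counts / counts.sum(axis=0)

# Conditional distribution of pet group given status: each row sums to 1.
pet_given_status = counts.div(counts.sum(axis=1), axis=0)


def relative_risk(table, group_a, group_b, event="Yes"):
    """Risk of `event` in group_a divided by the risk in group_b."""
    risk = table.loc[event] / table.sum(axis=0)
    return risk[group_a] / risk[group_b]


def odds_ratio(table, group_a, group_b, event="Yes", non_event="No"):
    """Odds of `event` in group_a divided by the odds in group_b."""
    odds = table.loc[event] / table.loc[non_event]
    return odds[group_a] / odds[group_b]


print(marginal_pet)
print(marginal_status)
print(status_given_pet)
print(pet_given_status)
print(relative_risk(counts, "Cats", "Dogs"))
print(odds_ratio(counts, "Cats", "Hamsters"))
```

The written answer still needs the work shown by hand; the sketch is only a way to verify the numbers.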
Question 2 [10 Points]
What type of sampling technique was used in each of the following scenarios?
a) [2 points] Hoping to learn what requests students have for improvements to campus, the survey
coordinator groups students by faculty and randomly selects 15 students from each group.
b) [2 points] The campaign director for a mayoral candidate selects one block from each of the city’s
election districts. Staff members go there and interview all the residents they can find.
c) [2 points] Researchers waited outside a new bar in the city. They stopped every 5th person who came
out of the bar and asked whether he or she thought drinking and driving was a serious problem.
d) [2 points] A company packaging snack food maintains quality control by randomly selecting 100 cases
from the total production and weighing the bags.
In the following scenario, what is the issue with this sampling technique? (no explanation required)
e) [2 points] A doctor asks his patients one day to estimate the percentage of people with valid health
cards.
Question 3 [7 Points]
You want to estimate how students at Western do on midterms versus final exams. To save time,
you decide to survey 10 students in a first-year class and ask them what their grade was on their
midterm and final exams. Assume that there are 24 students in this statistics class, and each one is
labelled 01 through 24.
a) [1 point] What is the population for this survey?
b) [1 point] What is the sample?
c) [2 points] Starting on line 111 of Table B, who are the 10 students you would select for the
survey?
d) [3 points] Since you only surveyed students from a first-year class, can you identify a
confounding variable in this situation? How could you change your study to address this
issue?
Part 2 – Python (All numbers and graphs need to be produced using Python)
Question 4 [9 points]
The file trees.csv provides measurements of the diameter (in inches) and volume of timber (in
cubic feet) from 31 black cherry trees. The first column of the data contains the ID for each tree,
which is a number from 1 to 31.
a) [2 points] Make a scatterplot placing diameter on the x-axis and volume on the y-axis.
Explain in words: what kind of pattern (i.e. direction, form and strength) does your plot
show?
b) [2 points] Find the Pearson correlation coefficient r between diameter and volume. Explain
in words: what does it tell us?
c) [3 points] The tree corresponding to ID = 21 has a diameter of 14 and a volume of 34.5.
Replace the volume of this tree by 100 and recalculate the correlation r from part b).
d) [2 points] Explain in words: Did the correlation computed in part c) between diameter and
volume change in comparison to the one found in part b)? Why or why not? (It is optional to
draw another scatterplot for the modified data.)
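A minimal sketch of the workflow in Question 4 follows. The column names "ID", "Diameter" and "Volume" are assumptions, since the header of trees.csv is not shown here, and the inline rows are made-up placeholders (only the ID = 21 row is quoted from part c). With the real file, pass its path to load_trees instead; a relative path such as "trees.csv" is resolved against the notebook's working directory, so the file usually needs to sit in the same folder as the notebook.

```python
import io

import pandas as pd

# Placeholder rows standing in for trees.csv. The values are invented for
# illustration, except the ID = 21 row, which is quoted in part c).
# ASSUMPTION: the columns are named "ID", "Diameter" and "Volume".
SAMPLE_CSV = """ID,Diameter,Volume
1,8.3,10.0
2,10.5,18.0
3,11.0,20.0
4,12.9,30.0
21,14.0,34.5
5,16.0,40.0
6,18.0,52.0
"""


def load_trees(source):
    """Read the data from a file path or a file-like object."""
    return pd.read_csv(source)


def correlation(df):
    """Pearson correlation between diameter and volume."""
    return df["Diameter"].corr(df["Volume"])  # Series.corr defaults to Pearson


def with_replaced_volume(df, tree_id, new_volume):
    """Return a copy in which the volume of one tree is overwritten."""
    out = df.copy()
    out.loc[out["ID"] == tree_id, "Volume"] = new_volume
    return out


def scatter(df):
    """Scatterplot with diameter on the x-axis and volume on the y-axis."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for the plot

    fig, ax = plt.subplots()
    ax.scatter(df["Diameter"], df["Volume"])
    ax.set_xlabel("Diameter (inches)")
    ax.set_ylabel("Volume (cubic feet)")
    return ax


# With the real file: trees = load_trees("trees.csv")
trees = load_trees(io.StringIO(SAMPLE_CSV))
r_original = correlation(trees)

modified = with_replaced_volume(trees, tree_id=21, new_volume=100)
r_modified = correlation(modified)

print("r before replacement:", r_original)
print("r after replacement: ", r_modified)
```

Working on a copy keeps the original data available for comparison in part d). Calling scatter(trees) in a notebook cell draws the plot for part a).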
Questions 5 and 6 are both based on the file penguins.csv.
Question 5 [10 points]
The file penguins.csv contains measurements for penguins foraging near Palmer Station in
Antarctica. The dataset includes bill (or beak) length (in mm) and bill depth (in mm) for 333
penguins from three species (Adelie, Chinstrap and Gentoo). In the following context, treat bill
length as the explanatory variable and bill depth as response variable.
a) [3 points] Perform a linear regression for predicting penguin bill depth from bill length. Print
out the slope and intercept. Interpret the intercept in the context of the problem.
b) [2 points] Compute the coefficient of determination (R²) for the regression above. Interpret
it.
c) [2 points] Obtain the least-squares regression line (by printing out the slope and intercept)
for predicting a penguin’s bill depth from bill length for only Adelie penguins. Interpret the
slope in the context of the problem.
d) [3 points] Repeat the steps (regression and interpretation of slope) in part c) for the other
two species.
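One possible way to organize the regressions in Question 5 is sketched below, using numpy.polyfit for the least-squares line and the squared Pearson correlation for R² (the two agree for simple linear regression). The column names "species", "bill_length_mm" and "bill_depth_mm" are assumptions about penguins.csv, and the small data frame is synthetic, invented only so the sketch runs; it does not contain real penguin measurements.

```python
import numpy as np
import pandas as pd


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus R^2."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r ** 2  # R^2 = r^2 for simple linear regression


# SYNTHETIC placeholder data, not real measurements.
# ASSUMPTION: penguins.csv uses these column names.
# With the real file: penguins = pd.read_csv("penguins.csv")
penguins = pd.DataFrame(
    {
        "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
        "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
        "bill_depth_mm": [17.0, 18.0, 19.0, 20.0, 13.0, 14.0, 15.0, 16.0],
    }
)

# Parts a) and b): one regression over all penguins.
overall = fit_line(penguins["bill_length_mm"], penguins["bill_depth_mm"])
print("all penguins: slope, intercept, R^2 =", overall)

# Parts c) and d): one regression per species.
by_species = {
    name: fit_line(group["bill_length_mm"], group["bill_depth_mm"])
    for name, group in penguins.groupby("species")
}
for name, (slope, intercept, r_squared) in by_species.items():
    print(name, "slope =", slope, "intercept =", intercept)
```

Grouping with groupby avoids repeating the filtering code for each species; filtering one species at a time with a boolean mask would work equally well.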
Question 6 [9 points]
Using the same dataset as Question 5, perform the following analysis. We will explore
Simpson’s paradox in the context of regression. Place the bill length on the x-axis and bill depth on
the y-axis by default.
a) [2 points] Use the sns.lmplot() function to make a scatterplot of the data with the regression
line.
b) [3 points] Use the sns.lmplot() function to make a scatterplot with three fitted regression
lines, one for each penguin species. Use three different colors to distinguish the cloud of
points and regression lines for each of the three species.
c) [4 points] Simpson’s paradox is a phenomenon in statistics that may lead to misleading
conclusions if not considered. The idea of the paradox is that the direction of association
(positive or negative) between two variables can be reversed when accounting for a third
variable (groups) in the dataset. Explain in words: do the results above provide an example
of Simpson’s paradox? Why? (Hint: No need to give the reasons behind the paradox. You only
need to explain why it is or is not an example of Simpson’s paradox.)
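The plots in parts a) and b) and the sign comparison behind part c) could be sketched as follows. sns.lmplot accepts data, x, y and hue arguments; the column names "length", "depth" and "group" and the tiny data frame are hypothetical stand-ins, so substitute the actual columns of penguins.csv. The helper direction_reverses is an illustrative name, not a seaborn function.

```python
import numpy as np
import pandas as pd


def plot_overall(df, x, y):
    """Part a): scatterplot with a single regression line."""
    import seaborn as sns  # imported lazily; only needed for plotting

    return sns.lmplot(data=df, x=x, y=y)


def plot_by_group(df, x, y, group):
    """Part b): one color and one regression line per group."""
    import seaborn as sns

    return sns.lmplot(data=df, x=x, y=y, hue=group)


def slope(x, y):
    """Slope of the least-squares line of y on x."""
    return np.polyfit(x, y, deg=1)[0]


def direction_reverses(df, x, y, group):
    """True if every within-group slope has the opposite sign of the overall slope."""
    overall = np.sign(slope(df[x], df[y]))
    within = [np.sign(slope(g[x], g[y])) for _, g in df.groupby(group)]
    return bool(overall != 0 and all(s == -overall for s in within))


# SYNTHETIC placeholder data with hypothetical column names.
demo = pd.DataFrame(
    {
        "group": ["a"] * 4 + ["b"] * 4,
        "length": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
        "depth": [17.0, 18.0, 19.0, 20.0, 13.0, 14.0, 15.0, 16.0],
    }
)

print(direction_reverses(demo, "length", "depth", "group"))
```

In a notebook, plot_overall(demo, "length", "depth") and plot_by_group(demo, "length", "depth", "group") draw the two figures. The numeric check only compares slope signs; the written explanation for part c) still has to be argued from the plots.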