Regression Project (Due January 2nd at 11:59pm)
You should work with one or two partners on this project. Please turn in only one
copy per group, with every group member's name clearly appearing on the
submission.
In this project you will be responsible for finding a dataset and estimating a linear regression model for
the quantitative columns in the dataset. Look online for a dataset containing at least 100 observations
and at least four columns that can be studied using a Linear Regression Model. It is recommended that
you find a dataset in Excel or .csv format for easy importation into RStudio.
A fairly large dataset repository is linked below; however, you are not required to use it:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
Submission Instructions: Please turn in this project on or prior to the day of the Final Exam.
Directions: Treat the dataset as a sample that is taken from some larger population. Choose one of the
quantitative columns as the response variable and at least three remaining columns as predictor variables.
Note that if any of the predictor variable columns you choose are qualitative, you will need to consult the
notes discussing “Qualitative Predictor Variables” on how to interpret them.
Based on your initial understanding of the dataset (without doing any computations in RStudio), come up
with an initial guess as to what the true values of the coefficients in front of the predictor variables might
be. Throughout the project you will be responsible for testing whether your guesses are correct (whether
your guesses are close will not affect your grade).
You should work with a partner on this project! All statements should be typed and any work done in ink
should be carried out neatly throughout the document. It is recommended that all computations
performed on the dataset be done in RStudio (you should not use any other software).
Template
Introduction: Write out a brief statement (no more than two paragraphs) regarding what relationship you
are studying using the regression model, the sample you are considering, where the data was found, the
number of observations in your sample, as well as your initial guess for the model coefficients (mentioned
above). You should also include a description of what you did in the study and what you found (at the
highest level of confidence you considered).
Preliminary Analysis: Compute important descriptive statistics such as mean, variance and correlation for
the quantitative variables in your dataset. Mention what they are and what they represent (either using
mathematical notation or without using mathematical notation). Think about how to visualize your
response variable using the ggplot() function and several (at least 4) different geometries. One of these
geometries can be scatterplots with respect to one or several of your predictor variables.
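As a rough illustration of the Preliminary Analysis step, the sketch below computes the mean, variance and correlation of each quantitative column and builds four ggplot() geometries for the response. It assumes the built-in iris data as a stand-in for your own dataset, with Sepal.Length as the response; the object names (dat, p_hist, and so on) are placeholders, not part of the assignment.

```r
library(ggplot2)

# Assumption: iris stands in for the project dataset, and Sepal.Length
# plays the role of the response variable.
dat <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Descriptive statistics for every quantitative column
col_means <- sapply(dat, mean)  # sample means
col_vars  <- sapply(dat, var)   # sample variances (divisor n - 1)
cor_mat   <- cor(dat)           # pairwise Pearson correlations

print(round(col_means, 3))
print(round(col_vars, 3))
print(round(cor_mat, 3))

# Four different geometries for visualizing the response variable
p_hist    <- ggplot(dat, aes(x = Sepal.Length)) + geom_histogram(bins = 20)
p_density <- ggplot(dat, aes(x = Sepal.Length)) + geom_density()
p_box     <- ggplot(dat, aes(x = "", y = Sepal.Length)) + geom_boxplot()
p_scatter <- ggplot(dat, aes(x = Petal.Length, y = Sepal.Length)) + geom_point()

print(p_hist)
print(p_density)
print(p_box)
print(p_scatter)
```

To adapt this, replace iris with your imported data frame (for example, one read with read.csv()) and substitute your own column names.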
NEXT PAGE
Inference: Using your dataset as a sample
1) Estimate the coefficients in the Regression Model and state the estimated Regression Model.
2) Compute an estimate for the variance of the residuals using the dataset.
3) Perform residual analysis to determine if a multiple linear regression model makes sense to study
the relationship between the response and predictor variables. If there are issues, simply state
what issues you see, but continue using the data as though there are no issues.
4) See if there is evidence of collinearity. If there is, solve this issue with one of the prescribed
methods we discussed. Use the final predictor variables for the remainder of this project. If this
results in only one usable predictor variable, find another predictor variable to use in the dataset
or choose another dataset.
5) Compute a 95% Confidence Interval for each coefficient in the regression model. Explain what
each CI is telling us.
6) Compute a 99% Confidence Interval for each coefficient in the regression model. Explain what
each CI is telling us.
7) Conduct a hypothesis test at the 99% confidence level (significance level 0.01) for each coefficient to
test whether the true coefficient is equal or not equal to your initial guess (mentioned in the introduction).
8) Conduct a hypothesis test at the 99% confidence level for each predictor variable to test whether the
predictor variable has an impact on the response variable.
9) Conduct a hypothesis test at the 95% confidence level (significance level 0.05) to test whether any of the
predictor variables has an impact on the response variable.
10) Split the dataset into a “Train” and “Test” set where the “Train” dataset contains approximately
90% of the observations. Estimate the model using the “Train” set and obtain the MSE of the
model using the “Test” set.
11) Obtain the MSE of the regression model using n-fold cross-validation with the predictor variables
used in (10). Comment on whether this MSE is larger than the one in (10).
12) Select a predictor variable to remove from the model (preferably based on your answers to (8)).
Compute the MSE based on n-fold cross validation using the model that uses the remaining
predictor variables. Use your findings to comment on whether you should use predictor variables
in (11) or (12).
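The Inference steps can be sketched in base R roughly as follows. This is only one possible outline, under several assumptions: iris again stands in for your dataset, the values in guess are made-up initial guesses, variance inflation factors are used as one common collinearity check (your course notes may prescribe a different method), and “n-fold” is read as one fold per observation (leave-one-out). The helper loocv_mse() is a hypothetical name introduced here.

```r
# Assumption: iris stands in for the project dataset.
dat  <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
form <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
fit  <- lm(form, data = dat)

# (1) Estimated coefficients
beta_hat <- coef(fit)

# (2) Residual variance estimate: SSE / residual degrees of freedom
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)

# (3) Residual analysis: residuals vs. fitted values, and a normal Q-Q plot
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))

# (4) Variance inflation factors: diagonal of the inverse correlation
#     matrix of the predictors (large values suggest collinearity)
vif <- diag(solve(cor(dat[, c("Sepal.Width", "Petal.Length", "Petal.Width")])))

# (5), (6) Confidence intervals for each coefficient
ci95 <- confint(fit, level = 0.95)
ci99 <- confint(fit, level = 0.99)

# (7) Two-sided t-tests of H0: beta_j = guess_j at the 99% level.
#     The guesses below are hypothetical values for illustration only.
guess  <- c(Sepal.Width = 0.5, Petal.Length = 0.5, Petal.Width = 0)
se     <- sqrt(diag(vcov(fit)))[names(guess)]
t_stat <- (beta_hat[names(guess)] - guess) / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
reject_99 <- p_val < 0.01

# (8) Tests of H0: beta_j = 0 are the rows of the coefficient table
coef_table <- summary(fit)$coefficients

# (9) Overall F-test of H0: all slope coefficients are zero
f_stat    <- summary(fit)$fstatistic
p_overall <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))

# (10) 90/10 train/test split and test-set MSE
set.seed(1)  # arbitrary seed so the split is reproducible
n         <- nrow(dat)
train_idx <- sample(n, size = round(0.9 * n))
fit_train <- lm(form, data = dat[train_idx, ])
test      <- dat[-train_idx, ]
mse_test  <- mean((test$Sepal.Length - predict(fit_train, newdata = test))^2)

# (11), (12) Leave-one-out cross-validated MSE for a given model formula
loocv_mse <- function(formula, data) {
  resp <- all.vars(formula)[1]  # name of the response column
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit_i <- lm(formula, data = data[-i, ])
    data[[resp]][i] - predict(fit_i, newdata = data[i, , drop = FALSE])
  })
  mean(errs^2)
}
mse_full    <- loocv_mse(form, dat)
mse_reduced <- loocv_mse(Sepal.Length ~ Petal.Length + Petal.Width, dat)
```

Comparing mse_full with mse_reduced corresponds to the comparison asked for in (12); which predictor to drop there is your own choice, and Sepal.Width was removed here only as an example.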
Conclusion: Summarize the results of your hypothesis tests and your MSE results, and state which predictor
variables should be used in a model to predict the response variable.