AD699: Data Mining for Business Analytics
Individual Assignment #3
Things to Keep in Mind:
I. If you run into a syntax error while preparing this assignment, do not panic. Show the code you ran, and show the error you encountered. Explain the purpose of the step, and state what you were trying to do. If the error is preventing you from running any follow-on steps, again, focus on the explanation — show that you understand the purpose of the step, rather than just giving up.
II. Use your resources. Whether it’s the recitation sessions with the course TAs, office hours with Professor Page, e-mail, the video library, your classmates, the web, etc., there are many places to look for help or to clarify any questions that you may have. As the AD699 slogan says, “Get After It!”
To submit this assignment, you will upload two files into Blackboard. One file will be the R script that
you used, and the other will be your write-up, submitted in the form of a PDF.
Your PDF should clearly demonstrate your code and your results for all steps. For any part of the
prompt that asks you a question, or asks you to describe something, you should include a written
answer in your write-up.
You may use any reporting format that clearly demonstrates your code, results, and interpretation
statements. If you do not already use R Markdown or R Notebook, you may wish to explore these
options.
Main Topic:
Classification
Tasks:

K-Nearest Neighbors: You Like This Song…But Will George Like It?
To answer this question, we will use the spotify dataset, which can be downloaded from our
class Blackboard site, as well as the spot23.csv dataset, which can also be found on our class
Blackboard page.
A description of both datasets can be found on our class Blackboard page, in the same
folder where you found this assignment prompt.
1. Read the dataset spot23 into your environment. The dataset includes the 953 most streamed songs on Spotify in the first half of 2023. In this dataset, each row represents one particular song.
Scroll through this dataset, using any method that you prefer, and select any song. If you find a song that you know and like, then select it. If you don’t see any songs that you know, you can just pick any title here (or, even better, you can look up some songs, listen to them, and see if you like any).
a. What song did you pick?
b. In a sentence or two, why did you pick this song?
c. What values does your song have for the following categories:
danceability:
energy:
speechiness:
acousticness:
liveness:
valence:
2. Extract the row that contains info for your song from spot23.csv. Verify that your song is now stored as its own dataframe (if it’s not currently a dataframe, convert it into one now).
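A minimal sketch of steps 1 and 2 might look like the following (the file path and the track_name column name are assumptions; run names(spot23) to check the actual column names in your copy):

    # read the 2023 dataset; each row is one song
    spot23 <- read.csv("spot23.csv")

    # pull out one song by its title (track_name is an assumed column name)
    my_song <- spot23[spot23$track_name == "Some Song Title", ]

    # verify the song is stored as its own one-row data frame
    class(my_song)
    nrow(my_song)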
3. Now, read spotify.csv dataset into your environment. Call the str() function on your
dataset and show the results.
a. What type of variable is target? If target is not currently a factor, convert it into a
factor. It will be our response variable in this model. Target tells us whether
George, the person who uploaded this dataset, liked the song. “1” means that
George liked it, and “0” means that he did not.
b. What unique values does the target variable have? For each of these outcome
values, find out how many records in the dataset have that value, and state it
here.
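A sketch for step 3, assuming the file name spotify.csv:

    spotify <- read.csv("spotify.csv")
    str(spotify)

    # if target is not already a factor, convert it for classification
    spotify$target <- as.factor(spotify$target)

    # unique outcome values and the number of records with each
    table(spotify$target)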
4. Are there any NAs in this dataset? Show the code that you used to find this out. If there
are any NA values in any particular column, replace them with the median value for
that column.
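One way to check for and patch missing values (some_column is a placeholder name, not a column known to exist in the data):

    # count NAs in every column at once
    colSums(is.na(spotify))

    # replace NAs in a column with that column's median (placeholder name)
    spotify$some_column[is.na(spotify$some_column)] <-
      median(spotify$some_column, na.rm = TRUE)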
5. Now it’s time for just a bit of data engineering.
a. Currently, the values for danceability, energy, speechiness, valence,
acousticness, and liveness are stored as whole numbers that represent
percentage values. Change this by converting those values into decimals
(for instance, if you have a value of 74, it needs to become 0.74).
b. Change the names of your column variables for danceability, energy,
speechiness, valence, acousticness, and liveness so that they match the
names in spotify. The names must become a perfect match, so watch out
for things like capitalization, punctuation, formatting, etc.
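A possible sketch for both parts of step 5 (the _. column suffixes are assumptions; confirm the real names with names(my_song)):

    # the six features are whole-number percentages in spot23; rescale to 0-1
    pct_cols <- c("danceability_.", "energy_.", "speechiness_.",
                  "acousticness_.", "liveness_.", "valence_.")
    my_song[pct_cols] <- my_song[pct_cols] / 100

    # rename the columns so they exactly match the spotify dataset
    names(my_song)[match(pct_cols, names(my_song))] <-
      c("danceability", "energy", "speechiness",
        "acousticness", "liveness", "valence")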
6. Using your assigned seed value (from Assignment 2), partition the spotify dataset into
training (60%) and validation (40%) sets.
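One common partitioning pattern:

    set.seed(699)   # placeholder -- use your assigned seed value here

    # 60/40 random split of the row indices
    train_rows <- sample(nrow(spotify), round(0.6 * nrow(spotify)))
    train <- spotify[train_rows, ]
    valid <- spotify[-train_rows, ]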
7. Next, we will do some variable selection.
a. Using your training set, perform a series of two-sample t-tests to compare the
numeric values for songs that George liked vs ones that he did not like. Focus
only on these six variables: danceability, energy, speechiness, valence,
acousticness, and liveness.
b. Which variables show a significant difference? For any variables for which there
is not a significant difference between the ‘like’ and ‘dislike’ values, remove
them entirely from your data.
c. In a sentence or two, why might it make sense to remove variables from a k-nn
model when those variables’ values are very similar for both outcome classes?
For this two-sample t-test, there’s no need to first check the data for normality; given the sample
size and the distribution of the values, it will be robust to non-normality.
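A sketch of the t-test step; each of the six variables gets the same call:

    # compare 'like' vs 'dislike' means for one variable; repeat for all six
    t.test(danceability ~ target, data = train)

    # drop any variable whose test shows no significant difference, e.g.
    # (liveness here is purely illustrative):
    # train$liveness <- NULL
    # valid$liveness <- NULL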
8. Normalize your data using the preProcess() function from the caret package. Use
Table 7.2 from the book as a guide for this.
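Following the Table 7.2 pattern, the normalization parameters come from the training set only and are then applied to the training set, the validation set, and your song. The predictor list below assumes all six variables survived step 7; adjust it to match yours:

    library(caret)

    predictors <- c("danceability", "energy", "speechiness",
                    "acousticness", "liveness", "valence")

    # learn centering/scaling values from the training set only
    norm_values <- preProcess(train[, predictors],
                              method = c("center", "scale"))
    train_norm <- predict(norm_values, train)
    valid_norm <- predict(norm_values, valid)
    song_norm  <- predict(norm_values, my_song)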
9. Using the knn() function from the FNN package, and using a k-value of 7, generate a
predicted classification for your song — Will George like it or not? What outcome did
the model predict? Also, what were your song’s 7 nearest neighbors? List their titles,
artists, and outcome classes. Be sure to show their outcome classes in your write-up.
As you perform this step, be sure that the variable names for your song are formatted the same way
(spelling, capitalization, etc.) as the ones in the spotify dataset.
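A sketch of the k = 7 prediction (song_title and artist are assumed column names in spotify):

    library(FNN)

    song_pred <- knn(train = train_norm[, predictors],
                     test  = song_norm[, predictors],
                     cl    = train_norm$target, k = 7)
    song_pred                       # the predicted class for your song

    # knn() attaches the indices of the nearest training rows
    nn_index <- attr(song_pred, "nn.index")
    train_norm[nn_index, c("song_title", "artist", "target")]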
10. Use your validation set to help you determine an optimal k-value. Use Table 7.3 from
the textbook as a guide here.
11. Using either the base graphics package or ggplot, make a scatterplot with the various k
values that you used in the previous step on your x-axis, and the accuracy metrics on
the y-axis.
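For steps 10 and 11 together, one common pattern (after Table 7.3) is to loop over candidate k values, score each against the validation set, and then plot the results:

    library(ggplot2)

    k_values <- 1:20
    accuracy <- sapply(k_values, function(k) {
      pred <- knn(train = train_norm[, predictors],
                  test  = valid_norm[, predictors],
                  cl    = train_norm$target, k = k)
      mean(pred == valid_norm$target)   # accuracy on the validation set
    })

    # scatterplot: k on the x-axis, accuracy on the y-axis
    ggplot(data.frame(k = k_values, accuracy = accuracy),
           aes(x = k, y = accuracy)) +
      geom_point()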
12. Re-run your knn() function with the optimal k-value that you found previously. What
result did you obtain? Was it different from the result you saw when you first ran the
k-nn function? Also, what were the outcome classes for each of your song’s k-nearest
neighbors? Be sure to show their outcome classes in your write-up.
13. In a couple of sentences, point out some limitations associated with making a model
such as this one. In particular, focus on what we’re doing here – we are using numeric
attributes to predict whether someone will like a particular song.

Naive Bayes:
Again in this section, we will be performing classification. For this model, we will predict
whether a member of a Canadian gym (Fitness Zone) will attend a class after having signed up.
1. Bring the fitness_zone dataset into your local environment.
a. Which of your variables here are numeric, and which are categorical? Note:
relying on the str() function alone is not an advisable way to answer this.
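A starting point for this check (the file name is an assumption based on the prompt):

    fitness <- read.csv("fitness_zone.csv", stringsAsFactors = FALSE)

    # str() shows storage types, but a column stored as a number can still
    # be categorical in meaning (an ID, a code), so inspect values as well
    str(fitness)
    head(fitness)
    sapply(fitness, function(x) length(unique(x)))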
2. Exploring the dataset and preparing the variables
a. Generate a table showing missingness by column for the entire dataset (this is
very similar to something you did on Assignment 1).
i. Where do you see missing values here in this dataset?
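A one-line missingness table, similar to Assignment 1:

    # number of NA values in each column
    colSums(is.na(fitness))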
3. For any word-based variables in the dataset, convert them to factors now.
4. The response variable in our model will be attended.
a. What are the two outcome classes for attended? How prevalent is each of them
in the dataset?
b. Convert attended into a factor.
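A sketch for steps 3 and 4:

    # convert every character (word-based) column to a factor
    char_cols <- sapply(fitness, is.character)
    fitness[char_cols] <- lapply(fitness[char_cols], as.factor)

    # outcome classes for attended, and the prevalence of each
    table(fitness$attended)
    prop.table(table(fitness$attended))

    # make attended a factor so it can serve as the response variable
    fitness$attended <- as.factor(fitness$attended)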
5. Why won’t booking_id be a useful predictor in this model?
6. For the numeric variables in your data, bin them into factors. Bin them using equal
frequency binning. Be sure to give a label to each bin. Select a bin number of your
choice. If you cannot place the numerical variables into bins with perfectly equal
frequency, just do the best you can with them.
a. Show the results of this process, using the table() function.
b. What is the difference between equal width binning and equal frequency
binning? Why might equal frequency binning be preferable in some scenarios?
c. After performing the binning, convert any NA values for the variable you
identified in an earlier step into their own level.
i. Why might NAs sometimes be interesting as a predictor class, in and of themselves?
ii. Using the table() function, show each of the levels for the variable that originally contained some NA values.
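One possible binning pattern, shown for a hypothetical numeric column months_as_member with three bins (if the quantile cut points repeat, reduce the bin count or de-duplicate the breaks):

    # equal-frequency bins: cut points at the quantiles
    breaks <- quantile(fitness$months_as_member,
                       probs = seq(0, 1, length.out = 4), na.rm = TRUE)
    fitness$months_bin <- cut(fitness$months_as_member, breaks = breaks,
                              labels = c("low", "medium", "high"),
                              include.lowest = TRUE)
    table(fitness$months_bin)

    # promote the NAs to their own named level
    fitness$months_bin <- addNA(fitness$months_bin)
    levels(fitness$months_bin)[is.na(levels(fitness$months_bin))] <- "missing"
    table(fitness$months_bin)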
7. Using your seed value (the same one from Assignment #2), partition your data into training (60%) and validation (40%) sets.
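The same partitioning pattern as in the k-NN section:

    set.seed(699)   # placeholder -- use your assigned seed value

    train_rows <- sample(nrow(fitness), round(0.6 * nrow(fitness)))
    train_fz <- fitness[train_rows, ]
    valid_fz <- fitness[-train_rows, ]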
8. Let’s take a look at the variables from the dataset, and explore the way that they might
impact attended. Using your training set data only, make a proportional barplot for
each one of your prospective input variables. Each barplot should show one of your
input variables as a category on the x-axis, with attended as the fill variable. You should
build proportional barplots (you can achieve this by adding position = "fill" inside your
geom layer).
a. Based on the barplots that you see here, select any variable(s) that seem like they
will not have a strong amount of predictive power in a naive Bayes model. Drop
any such variable.
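One barplot of this kind, shown for a hypothetical input variable day_of_week:

    library(ggplot2)

    # proportional barplot: each bar sums to 1, with attended as the fill
    ggplot(train_fz, aes(x = day_of_week, fill = attended)) +
      geom_bar(position = "fill")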
9. Build a naive bayes model, with the response variable attended. Use all of the other
remaining variables in your training set as inputs. Show your model results.
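A sketch using naiveBayes() from the e1071 package (the package the textbook's examples rely on):

    library(e1071)

    # attended as the response; all remaining training-set columns as inputs
    # (booking_id and any variables dropped in step 8 should already be gone)
    nb_model <- naiveBayes(attended ~ ., data = train_fz)
    nb_model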
10. Show a confusion matrix that compares the performance of your model against the
training data, and another that shows its performance against the validation data (just
use the accuracy metric for this analysis). How did your training set’s performance
compare with your validation set’s performance?
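One way to produce both confusion matrices:

    library(caret)

    # performance against the training data
    train_pred <- predict(nb_model, newdata = train_fz)
    confusionMatrix(train_pred, train_fz$attended)

    # performance against the validation data
    valid_pred <- predict(nb_model, newdata = valid_fz)
    confusionMatrix(valid_pred, valid_fz$attended)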
11. In classification, what is the naive rule? If you had used the naive rule as an approach
to classification, how would you have classified all the records in your training set?
(Note: Although their names are very similar, the naive rule for classification is very
different from a naive Bayes approach to classification).
a. How did your model’s accuracy compare with the naive rule accuracy, in
percentage terms? (Answer in percentage difference, not in percentage points
difference).
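To make the percentage-difference idea concrete with made-up numbers: a model accuracy of 0.80 against a naive-rule accuracy of 0.70 is a gap of 10 percentage points, but a percentage difference of about 14.3%:

    # illustrative values only -- substitute your own accuracies
    model_acc <- 0.80
    naive_acc <- 0.70
    (model_acc - naive_acc) / naive_acc * 100   # percentage difference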
12. Next, take a subset of the 100 records in your validation set that your model predicted to be most likely to miss their fitness class (Table 8.6 in the textbook will be a very good thing to look at in order to build this).
a. Among those 100 records, how many of the people actually missed their class?
How does the accuracy for these predictions compare to the overall model?
b. How could the Fitness Zone use this information? In a few sentences, indicate
what it would mean for the gym to be able to identify this particular subset of
records, and the way they could act on this information.
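A sketch of the subsetting step (the class label "0" for a missed class is an assumption; check the column names of the probability matrix in your own output):

    # posterior probabilities for every validation record (Table 8.6 pattern)
    valid_prob <- predict(nb_model, newdata = valid_fz, type = "raw")

    # order by the probability of missing the class, highest first
    miss_order <- order(valid_prob[, "0"], decreasing = TRUE)
    top100 <- valid_fz[miss_order[1:100], ]

    # how many of these 100 actually missed their class?
    table(top100$attended)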
13. Pick any ONE record from your training set. It can be any row in the training set – it
doesn’t matter.
a. Did the person attend the class that they booked?
b. Use the predict() function to see what your model predicted for this person.
What did it predict? (Table 8.6 in the textbook might be helpful here)
c. Now, use the predict() function again but with a slight modification, in order to
have it generate the probability that your person would attend their class. What
was the probability?
d. As a last step, demonstrate the way that the probability associated with this
prediction was generated. Do this the way we did the flight delay examples in
class — use R to do the math, but don’t use any special functions or packages.
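A sketch for parts (a) through (c); for part (d), the same probability can be rebuilt by multiplying the prior P(class) by each conditional P(predictor value | class) shown in the model output, doing so for both classes, and dividing each product by the sum of the two products:

    # any one training record will do
    one_rec <- train_fz[1, ]
    one_rec$attended                              # did they actually attend?

    # the model's predicted class, then its predicted probabilities
    predict(nb_model, newdata = one_rec)
    predict(nb_model, newdata = one_rec, type = "raw")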
