AD654: Marketing Analytics
Boston University
Assignment #2
Assignment 2: Market Segmentation & Conjoint Analysis
For this assignment, you will need two files: lobster_fans.csv and treasure_hunt.csv, each of
which can be found on our course Blackboard page.
For Parts I & II of this assignment, you will upload two files into Blackboard: The .ipynb file that
you create in Jupyter Notebook, and a PDF. There’s no need to create a ZIP or an RAR.
Lobster Land management prefers a PDF plus an .ipynb, so that the submission can
be directly read in Blackboard. Do not worry if there’s an issue with the way the PDF
renders; if something is missing, your Prof or TA will look into the ipynb.
For any question that asks you to perform some particular task, you just need to show your
input and output in Jupyter Notebook or Colab. Tasks will always be written in regular,
non-italicized font.
For any question that asks you to include interpretation, write your answer in a Markdown cell
in Jupyter Notebook. Any homework question that needs interpretation will be written in
italicized font. Do not simply write your answer in a code cell as a comment, but use a
Markdown cell instead.
Remember to be resourceful! There are many helpful resources available to you, including the
video library, the lecture notes on Blackboard, recitation sessions with the course TAs, the
office hours sessions, and the web.
Part I: Segmentation (5 points)
I.
As we roll through the rest of winter and into spring, Lobster Land is thinking about its
off-season marketing approach. To do so, Lobster Land wishes to employ your analytical skills.
Lobster Land has gathered some data from a sample of 654 visitors from last season. That data
is stored in a file called lobster_fans.csv. Now, Lobster Land seeks your help – can you identify
some interesting / meaningful segments from this data, and suggest specific ways to reach out to
them via email?
guestID
Each visitor in the dataset has a unique ID number from 1 to 654.
homestate
The home state of the visitor.
visits_2023
The total number of visits to Lobster Land from this person between Memorial
Day and Labor Day, 2023.
social_pres
This is an estimate of the person’s total social media usage, on a percentile
scale. A person with a value of 99 uses social media more than 99% of the
population in general, whereas a person with a value of 1 almost never uses
social media.
avg_duration
This is the average time, in minutes, that the person spent inside Lobster Land
during their 2023 visits to the park.
avg_rides_dry
This is the average number of non-water rides taken by the person, per visit,
during the 2023 season.
avg_rides_water
This is the average number of water rides, and water-adjacent rides, taken by
the person during the 2023 season (note: a “water-adjacent” ride is one in
which the rider may pass over or through water but is not likely to get wet).
total_merch
This is the passholder’s total spending on merchandise sold at the park in 2023.
referral codes
This is a count variable. During the 2023 season, visitors received a unique QR
code for referrals – if they could get a new visitor to visit Lobster Land, after
purchasing a ticket and using that code, both the new visitor and the referrer
received a 50% admissions coupon for use during that season.
total_snack_shack
This is the person’s total Snack Shack spending across the 2023 season.
total_gold_zone
This is the person’s total Gold Zone spending across the 2023 season.
A. Drop the guestID variable.
a. Why will guestID not be relevant in a clustering model? In your answer,
do not just write “it will confuse the model.” Instead, take the time to
explain this with a sentence or two, using a bit of math and your
understanding of Euclidean distance.
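As a rough numeric illustration of the Euclidean distance point, the sketch below compares three made-up guests. The values are hypothetical and are not taken from lobster_fans.csv; the helper name euclidean is also just for illustration.

```python
import numpy as np

# Hypothetical guests, columns: [guestID, visits_2023, avg_duration]
guest_a = np.array([1.0, 10.0, 240.0])
guest_b = np.array([2.0, 1.0, 60.0])      # adjacent ID, very different behavior
guest_c = np.array([650.0, 10.0, 240.0])  # distant ID, same behavior as guest_a


def euclidean(x, y):
    """Square root of the summed squared differences across all columns."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


# With guestID included, the arbitrary gap between ID numbers drives the distance.
d_ab_with_id = euclidean(guest_a, guest_b)
d_ac_with_id = euclidean(guest_a, guest_c)

# With guestID dropped, the distance reflects behavior only.
d_ab_no_id = euclidean(guest_a[1:], guest_b[1:])
d_ac_no_id = euclidean(guest_a[1:], guest_c[1:])

print(d_ab_with_id, d_ac_with_id)
print(d_ab_no_id, d_ac_no_id)
```

In this toy case, guests a and c behave identically, yet they look far apart as long as the ID column is part of the calculation.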
B. Call the describe() function on your dataset.
a. How does this function help you to gain an overall sense of the columns
and values in this (or any other) dataset? Why is this valuable for any
analyst who will use a dataset to build a model?
C. Missing values/impossible values
a. Does this dataset contain any missing values? If so, how many? Which
columns have missing values?
b. What about impossible values? Do you see any impossible values here?
If so, handle them in any way that you see fit. Why did you take this
approach?
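One possible way to check for missing and impossible values is sketched below on a tiny made-up dataframe. The rows and the range rules (no negative spending, percentiles between 0 and 100) are assumptions for illustration; the real file may call for different checks.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few lobster_fans.csv columns.
df = pd.DataFrame({
    "visits_2023": [3, 5, np.nan, 2],
    "social_pres": [42, 99, 17, 140],         # 140 falls outside a percentile scale
    "total_merch": [25.0, -10.0, 0.0, 60.5],  # negative spending is not possible
})

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Flag rows that break the assumed range rules.
impossible = (df["total_merch"] < 0) | (~df["social_pres"].between(0, 100))
n_impossible = int(impossible.sum())

print(missing_per_column)
print(df[impossible])
```

Whether flagged rows are dropped, corrected, or imputed is a judgment call that the written answer should justify.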
D. Data scaling.
a. Do your variables need to be standardized? Why or why not?
b. If your data requires standardization, use Python to convert your values
into z-scores, and store the normalized data in a new dataframe. If not,
proceed to the next step without changing the variables.
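A minimal sketch of a z-score conversion with pandas follows, assuming the numeric columns have already been isolated. The sample values are made up.

```python
import pandas as pd

# Hypothetical values on very different scales (visit counts vs. minutes).
df = pd.DataFrame({
    "visits_2023": [1, 4, 7, 10],
    "avg_duration": [60.0, 180.0, 240.0, 300.0],
})

# z = (x - mean) / standard deviation, computed column by column.
# ddof=0 uses the population standard deviation, which matches
# scikit-learn's StandardScaler; pandas defaults to ddof=1.
df_z = (df - df.mean()) / df.std(ddof=0)

print(df_z.round(3))
```

After this step each column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation simply because of its units.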
E. Variable selection. Select any 5 variables from the potential set of inputs in order to build
your k-means clustering model.
a. Why did you choose this set of 5 variables? (Note: this can be
subjective. You don’t need to do any rigorous data analysis here). One
sentence per variable, or a single paragraph that explains how they
connect as a theme, will be fine here.
F. Elbow chart.
a. Build an elbow chart to help give you a sense of how you might build
your model.
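The numbers behind an elbow chart could be produced along these lines. The data here is synthetic (three artificial blobs), and the range of k values and the random seed are arbitrary choices rather than anything the assignment prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-column data with three well-separated groups.
rng = np.random.default_rng(654)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for each candidate k and record the within-cluster
# sum of squares, which scikit-learn exposes as inertia_.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=654).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting inertias against ks (for example with matplotlib's plt.plot) gives the elbow chart; the "elbow" is the point after which adding clusters reduces inertia only slightly.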
G. How many clusters will you use for your k-means model? (Remember, as noted in several
places throughout the course material, there is no “right” answer to this question. You may
wish to answer this immediately after seeing your elbow plot, or after doing some more
experimentation).
H. Build a k-means model with your desired number of clusters.
I. Generate and show summary statistics about each of your clusters.
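Steps H and I might look something like the sketch below, again on synthetic data. The two columns, the choice of two clusters, and the summary functions are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in: light visitors / low spenders vs. frequent / high spenders.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "visits_2023": np.concatenate([rng.normal(2, 0.5, 30), rng.normal(12, 0.5, 30)]),
    "total_merch": np.concatenate([rng.normal(20, 5, 30), rng.normal(200, 5, 30)]),
})

# Cluster on z-scores so that neither column dominates the distance.
df_z = (df - df.mean()) / df.std(ddof=0)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df_z)

# Attach the labels to the original-unit data so the summaries stay readable.
df["cluster"] = kmeans.labels_
cluster_summary = df.groupby("cluster").agg(["mean", "median", "count"])
cluster_sizes = df["cluster"].value_counts().sort_index()

print(cluster_summary.round(1))
print(cluster_sizes)
```

Summarizing the original variables, rather than the z-scores, tends to make the clusters easier to describe to management.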
J. Build any four simple visualizations to help management better understand your clusters (a
simple visualization could be a histogram, a barplot, a scatterplot, etc. – it should show original
variables from the dataset). You may wish to facet your visualizations by cluster.
For each one of your visualizations, include 2-3 sentences of description/
explanation. What does it show about your model?
K. Give a descriptive name to each one of your clusters, along with a few sentences of
explanation for the name that you chose. As you describe each segment, write a bit about the
types of visitors likely to belong to each group.
L. Finally, how can Lobster Land use this model to target the groups that you have identified
during the coming winter “off-season” period in its email campaign? Include at least a couple of
sentences for each group in your model. Also, in your answer, identify one group that you feel
is most worthy of outreach/engagement efforts, and say why. Also, identify one cluster that is
least worthy of outreach/engagement efforts, and say why. In your answer, be more creative
than just saying “send more discount coupons” to the groups.
Part II: Conjoint Analysis with a Linear Model: (4 points)
Lobster Land had a very successful 2023 season! However, park management is always
thinking about new ways to use the park to increase visitor engagement (and revenue!).
One idea recently suggested to Lobster Land was to try an Augmented Reality (AR)
Treasure Hunt.
Although Lobster Land is only open from Memorial Day to Labor Day each year, there
are some beautiful days in Maine before Memorial Day, and after Labor Day. Lobster
Land could open its doors on some of those days – they would allow visitors to
use the park for the AR Treasure Hunt, but would not actually operate any of the
regular rides on these days.
To gather more information before moving ahead, the park conducted some survey
research. They asked a general sample of the population near Portland, Maine about
their AR treasure hunt preferences. Each survey respondent saw a random sample of 5
possible options, or bundles, and was asked to rate those bundles from 1-10. By giving
this survey to many thousands of people, Lobster Land was able to generate this dataset.
The treasure_hunt.csv dataset contains 1944 rows, one for each of the unique
feature combinations that the park tested. It also contains average ratings for each
combination.
Park management needs your help! Of course, the park could just rank the
combinations to quickly see which combination was most popular overall among
respondents, but they are hoping that you can do some conjoint analysis to help
them gain deeper, more meaningful insights about people’s preferences regarding
particular features and options.
This dataset contains the following variables:
bundleID
This is a unique integer value from 1 to 1944 that identifies each separate bundle.
narrative
Respondents had three choices here. A simple narrative means that it’s a straightforward
treasure hunt, with a basic storyline. A moderate narrative means that there is a detailed
storyline, with some twists and character backstories. A complex narrative means that
there is an intricate plot involving multiple characters, significant plot twists, and deeper
lore.
duration
Respondents had three choices here: 30 minutes, 60 minutes, or 90 minutes.
theme
Respondents had three choices here: Pirate, Jungle Adventure, or Space Odyssey.
reward_type
The treasure hunt prizes can come in one of two forms – digital or physical. The key
difference is a physical prize is something that could be touched and/or carried home. A
physical prize could be a stuffed animal, a paper coupon to something in the park, a
souvenir book, etc. A digital prize could be something like a digital ticket or coupon, or a
treasure hunt badge.
space_integration
Respondents had three options here. “Low” means that the treasure hunt is primarily
focused on the digital screen, with minimal reference to the surrounding environment.
“Medium” means that clues and tasks require interaction with specific physical landmarks
or locations within the park. The treasure hunt designers describe the “high” option as a
“seamless blend of digital and physical elements, where AR elements interact dynamically
with the environment.”
collaboration
Respondents had two options here – either a solo treasure hunt design, or a collaborative,
team-based design.
customization
Respondents saw two options here – standard and dynamic. A standard environment is
the same for each player, and with each gameplay opportunity. A dynamic design can
incorporate the player’s expressed preferences and previous gameplay history.
participant role
Respondents saw three options here: explorer, detective, and hero. For the “explorer”
option, participants search for hidden treasures across the park. For the “detective”
option, participants solve mysteries or crimes using AR clues. For the “hero” option,
participants embark on a quest to save the park or a character, facing challenges and
villains.
avg_rating
This is the average rating that the bundle received, on a score from 0 to 10.
A. Read the dataset treasure_hunt.csv into your local environment in Jupyter
Notebook or Colab.
B. Based on the descriptions shown above, which of your variables are numeric,
and which are categorical? (The standard you should use when answering this is
that something that is both represented by a number, and for which that number
has valid mathematical meaning, is numeric).
C. After first removing the bundleID variable, use the pandas get_dummies()
function in order to prepare the remaining variables for use in a linear model. Inside
this function, include this argument: drop_first = True. Doing this will save us from
the multicollinearity problem that would make our model unreliable. Be sure to
dummify ALL of your input variables, even the numeric ones.
a. Why should the numeric input variables based on this survey data be
dummified?
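The dummy-coding step might be sketched as follows. The four rows are made up, and how treasure_hunt.csv actually stores each level (for example, duration as the integer 30 or as text) is an assumption here.

```python
import pandas as pd

# Hypothetical rows using levels named in the variable descriptions.
bundles = pd.DataFrame({
    "bundleID": [1, 2, 3, 4],
    "duration": [30, 60, 90, 30],
    "theme": ["Pirate", "Jungle Adventure", "Space Odyssey", "Pirate"],
    "avg_rating": [6.1, 5.4, 7.2, 6.0],
})

# Keep only the input variables (no ID, no outcome).
inputs = bundles.drop(columns=["bundleID", "avg_rating"])

# Passing columns= forces get_dummies to encode the numeric duration column too;
# by default it only converts object, string, or category columns.
dummies = pd.get_dummies(inputs, columns=list(inputs.columns), drop_first=True)

print(dummies.columns.tolist())
```

With drop_first=True the first level of each variable (here 30 minutes and Jungle Adventure) disappears from the columns and becomes the baseline that the remaining levels are compared against.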
D. Build a linear model with your data, using the average rating as the outcome
variable, and with all of your other variables as inputs.
E. Display the coefficient values of your model inputs.
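Steps D and E could be approached with scikit-learn roughly as shown below. The design is a tiny synthetic one with made-up effects (these are not survey results), chosen so that the fitted coefficients can be checked by eye.

```python
import itertools

import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic full-factorial design with invented effects for each level.
durations = [30, 60, 90]
rewards = ["digital", "physical"]
effect = {30: 0.0, 60: 1.0, 90: 0.5, "digital": 0.0, "physical": 0.8}

rows = [
    {"duration": d, "reward_type": r, "avg_rating": 5.0 + effect[d] + effect[r]}
    for d, r in itertools.product(durations, rewards)
]
data = pd.DataFrame(rows)

# Dummy-code every input, dropping the first level of each as the baseline.
X = pd.get_dummies(
    data.drop(columns="avg_rating"),
    columns=["duration", "reward_type"],
    drop_first=True,
    dtype=float,
)
y = data["avg_rating"]

model = LinearRegression().fit(X, y)

# Pair each coefficient with its column name and sort from largest to smallest.
coefficients = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)

print(coefficients.round(3))
print(round(float(model.intercept_), 3))
```

Each coefficient is read relative to the dropped baseline level of its variable, so a positive value means that level was rated higher, on average, than the baseline with the other features held fixed.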
F. Write a few paragraphs for Lobster Land management about what your model is
showing you.
It would be good here to include some detail about which features seemed to be
most popular, or least popular, among respondents. However, a truly thoughtful
answer to this question will go beyond simply listing the coefficients in order of
popularity. What OTHER insights can you draw from this? Is there anything else
that you think Lobster Land should consider before simply implementing the ‘most
popular’ options? Remember, Lobster Land hired you as a consultant — don’t be
afraid to show some creativity here.
“I can’t answer this because I’m not sure what this variable means” = NOT the way to go here.
You can also reach out to Prof. Page or to any of the course TAs with any questions about anything in
this dataset or prompt.
You can use either statsmodels or scikit-learn to build the model. If you use statsmodels, you may see
high p-values for individual levels of categorical variables – but keep all the variables you used at Step D.
Keep your analytical focus here on the coefficients. Every categorical variable in this model adds
significance. We are focusing here on interpretation of coefficient values – not on overall model fit or
predictive power.
Part III: Wildcard: Marketing & Segments (1 point)
A. Find ANY advertisement…ANYWHERE. As you walk around in your daily life, you
might look for an ad on the side of the T, on a bus stop, on a poster, etc. Alternatively,
you could use an advertisement that you encounter while browsing the web.
a. Take a picture of the ad that you see (if it’s in the ‘real world’). Or, if the ad you
select is online, take a screenshot from your phone or your laptop to capture
this advertisement.
b. Write ONE thoughtful paragraph that addresses the issue of segmentation.
What consumer segment is your ad targeting? What makes you think this?
What types of consumers are in the segment? Are you part of the segment? Or,
alternatively, is your ad an undifferentiated (mass market) ad? Finally, what is
your opinion of this ad – is it effective?
You can embed your image, along with your paragraph write-up, in a Markdown cell in Jupyter
Notebook. Alternatively, you could upload your image and paragraph in a separate file, such as
a Word doc. The ad can be in any language – but if it’s not in English, please translate.