MATH1223-01

Description

Please use the data you originally collected for part 1. You will add these new parts to report part 2, 3, 4, 5, and 6.

1. For this project, you must find some published or existing data. Possible sources include: almanacs, magazines and journal articles, textbooks, web resources, athletic teams, newspapers, professors with experimental data, campus organizations, electronic data repositories, etc. Your dataset must have at least 25 cases, two categorical variables and two quantitative variables. It is also recommended that you are interested in the material included in the dataset.
2. Use the techniques of the text to repeat your hypothesis test.
   (a) Repeat your hypothesis test on the quantitative variable utilizing the appropriate formulas for your situation. Compute 95% confidence interval and compare to results from bootstrapping.
3. Use the two-way table from part 2 to complete the following:
   (a) Create at least 2 conditional probabilities from your two-way table. Interpret their meanings and explain how they were computed. Include the following formula for conditional probability



INTRO TO PROBABILITY AND STATS
Dr. Nicholas Jacobs
Are movies becoming more violent? Does MPAA rating influence the death count in a film?
What about genre? The following analysis focuses on a data set that examines death counts in
films (found: Data_Sets_For_Stats/CuratedDataSets/filmdeathcounts.csv at master ·
nurfnick/Data_Sets_For_Stats · GitHub). The data set is a listing of 545 films containing
the name of the film, year, body count, MPAA rating (G, PG, PG-13, etc.), genre,
directors, length (in minutes), and the movie rating from IMDB. This is a statistical sample,
rather than a population, as it is not a list of every movie that contains a death. With such a
large sample, statistically significant conclusions should be possible. The sample is also large
enough that sample means can be treated as approximately normal, even though individual
variables such as body count need not be normally distributed.
In this analysis, the first variable used will be the body count, which provides quantitative
data about the number of deaths in the given film. The rating of each film, another quantitative
variable, will also be examined to see whether more or fewer deaths go along with a higher user
rating from IMDB, an online movie database. The MPAA rating is also of interest. This is the
motion picture rating that tells viewers what ages the movie is appropriate for. It is a categorical
variable, with the modern ratings being G (general audiences), PG, PG-13, R, X, or Unrated.
The final variable of interest is the genre of the film. There are several different genre variables
in the data set. The first is a list of genres, which contains up to four different genres for any
given film. The next is Genre1, which lists only the first genre from that list. This
primarily favors whichever genre comes first alphabetically, not the one that best describes the film.
So, the variable used to investigate genre will be Genre_Sci-Fi, a categorical variable that is
TRUE or FALSE. This variable gives data about the number of
science fiction films included in the data set.
Using this data set, the analysis will look at which types of movies have the most deaths, whether
certain ratings contain more violence, and whether movies with high death counts are more popular
based on the IMDB rating. Based on a quick look through the data set, it appears most of the movies
with deaths are rated R, so that will probably be the most frequent MPAA rating. Data for
five randomly chosen movies are included below as a sample of the data set.
Film             Body Count  MPAA Rating  Genre1  IMDB Rating
300              600         R            Action  7.7
Die Hard         20          R            Action  8.3
Dr. Strangelove  33          PG           Comedy  8.5
Scarface         44          R            Crime   8.3
The Signal       78          R            Horror  6.1
Sample of data set containing 5 different films.
The frequency and relative frequency of the MPAA ratings in the data set are presented
below. Included are the modern ratings of G, PG, PG-13, R, and Unrated. The ratings
based on older systems each contain fewer than 10 data points, so they are not included in the
frequency table. The most frequent MPAA rating in this data set is R, with 338 movies or 62% of the
data.
MPAA Rating  Frequency  Relative Frequency
G            3          0.006
PG           35         0.064
PG-13        118        0.217
R            338        0.620
Unrated      28         0.051
Frequency and relative frequency of MPAA Rating.
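
The relative frequencies above can be checked with a few lines of Python. This is a minimal sketch, not the method used in the report: the counts are copied from the frequency table, the denominator of 545 is the full data set size given in the introduction, and the names `rating_counts` and `relative_frequencies` are hypothetical.

```python
from collections import Counter

# Counts copied from the frequency table. Ratings from older systems are
# omitted there, so these five rows sum to 522 rather than 545.
TOTAL_FILMS = 545
rating_counts = Counter({"G": 3, "PG": 35, "PG-13": 118, "R": 338, "Unrated": 28})


def relative_frequencies(counts, total):
    """Return each category's share of `total`, rounded to three decimals."""
    return {category: round(n / total, 3) for category, n in counts.items()}


rel_freq = relative_frequencies(rating_counts, TOTAL_FILMS)
for rating, share in rel_freq.items():
    print(f"{rating:<8}{rating_counts[rating]:>5}{share:>8.3f}")
```

Dividing by 545 rather than by the 522 films shown reproduces the tabulated values, which suggests the relative frequencies were computed against the whole data set.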
The other categorical variable is whether the film's genre is science fiction or not. The
frequency and relative frequency for the variable are found in the table below. Most of the films
in the data set are not science fiction, with 446 being FALSE, representing 82% of the data.
Genre Sci-Fi  Frequency  Relative Frequency
TRUE          99         0.182
FALSE         446        0.818
Frequency and relative frequency of the genre of films being science fiction.
A two-way table comparing the MPAA ratings and whether the films were science fiction
is presented below with the frequency and relative frequency (given as percentages). Since
several of the categories for the MPAA ratings were not included, the two-way table only
includes data for 522 movies. The most common category of movie is rated R for both science
fiction movies and movies in other genres. 55% of the films in the table are rated R and are not
science fiction. Since the data set contains only movies with deaths, it makes sense that not many
movies for children would contain any deaths, let alone multiple deaths. It also stands to reason
that movies with deaths in them are more violent and therefore have higher MPAA ratings.
MPAA Rating  Sci-Fi TRUE  Sci-Fi FALSE  Total
G            1            2             3
PG           10           25            35
PG-13        34           84            118
R            51           287           338
Unrated      2            26            28
Total        98           424           522
Two-way table of frequency of the MPAA rating and the genre being science fiction (true or
false).
MPAA Rating  Sci-Fi TRUE  Sci-Fi FALSE  Total
G            0.19         0.38          0.57
PG           1.92         4.79          6.70
PG-13        6.51         16.09         22.61
R            9.77         54.98         64.75
Unrated      0.38         4.98          5.36
Total        18.77        81.23         100.00
Two-way table of relative frequency (as a percentage) of the MPAA rating and the genre being
science fiction (true or false)
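
The assignment asks for conditional probabilities from this two-way table. The sketch below shows one way to compute them using the standard definition P(A | B) = P(A and B) / P(B), which reduces to a cell count divided by a row or column total. The counts are copied from the frequency table. The function names are hypothetical, and the particular probabilities chosen are only examples, not results reported in the analysis.

```python
# Two-way counts copied from the table: rating -> (Sci-Fi TRUE, Sci-Fi FALSE).
counts = {
    "G": (1, 2),
    "PG": (10, 25),
    "PG-13": (34, 84),
    "R": (51, 287),
    "Unrated": (2, 26),
}
grand_total = sum(t + f for t, f in counts.values())


def joint_percent(rating, scifi):
    """Percentage of all tabulated films in one cell of the table."""
    cell = counts[rating][0 if scifi else 1]
    return round(100 * cell / grand_total, 2)


def p_scifi_given_rating(rating):
    """P(Sci-Fi | rating): the Sci-Fi cell divided by the rating's row total."""
    t, f = counts[rating]
    return t / (t + f)


def p_rating_given_scifi(rating):
    """P(rating | Sci-Fi): the Sci-Fi cell divided by the Sci-Fi column total."""
    scifi_total = sum(t for t, _ in counts.values())
    return counts[rating][0] / scifi_total


print(joint_percent("R", False))
print(round(p_scifi_given_rating("R"), 3))
print(round(p_rating_given_scifi("R"), 3))
```

The two conditional probabilities answer different questions: the first is the share of R-rated films that are science fiction (51 of 338), and the second is the share of science fiction films that are rated R (51 of 98).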
One of the quantitative variables in the dataset is the body count. The summary statistics,
histogram, and boxplot are shown below, respectively:
Summary Statistics: Body Count
Mean   Standard deviation  Min  Q1  Median  Q3  Max
72.12  92.63               1    15  44      93  836
As the histogram shows, the body count data has a right-skewed distribution, which is consistent
with the mean (72.12) being greater than the median (44). Moreover, the boxplot shows multiple
outliers above the upper whisker, which indicates the presence of unusually
high values.
Another quantitative variable is the IMDB rating. The summary statistics, histogram, and
boxplot are shown below, respectively:
Summary Statistics: IMDB_Rating
Mean  Standard deviation  Min  Q1   Median  Q3   Max
6.84  1.11                2    6.2  6.9     7.6  9.3
Based on the histogram, the IMDB rating data has a left-skewed distribution, which is consistent
with the mean (6.84) being slightly less than the median (6.9). The boxplot shows multiple
outliers below the lower whisker, which indicates that there are extremely low values in the IMDB
rating data.
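
The outlier claims for both variables can be checked against the quartiles in the summary tables. The sketch below assumes the common 1.5 × IQR boxplot rule. The report does not say which rule its boxplots used, so that is an assumption, and `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences using the k * IQR rule (k=1.5 assumed)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles, minimum, and maximum copied from the summary statistics tables.
body_low, body_high = tukey_fences(15, 93)
imdb_low, imdb_high = tukey_fences(6.2, 7.6)

body_has_high_outliers = 836 > body_high  # maximum body count vs. upper fence
imdb_has_low_outliers = 2 < imdb_low      # minimum IMDB rating vs. lower fence

print(body_low, body_high, body_has_high_outliers)
print(round(imdb_low, 2), round(imdb_high, 2), imdb_has_low_outliers)
```

Under that rule the body count fences fall at -102 and 210, so only high outliers are possible, while the IMDB rating fences fall near 4.1 and 9.7, so the minimum rating of 2 lies well below the lower fence.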
We are interested in comparing the mean body counts of two categories, such as different
MPAA ratings. As noted earlier, most of the movies with deaths appear to be rated R. So,
let us compare the mean body count for PG-13-rated movies to R-rated movies to determine whether
the mean body count for PG-13-rated movies is lower than that of R-rated movies. The null and
alternative hypotheses are as follows:
$H_0: \mu_1 = \mu_2$
$H_a: \mu_1 < \mu_2$
where $\mu_1$ is the mean body count for PG-13-rated movies and $\mu_2$ is the mean body count for R-rated movies.

We are also interested in the Sci-Fi genre. Let us compare the proportion of movies with a Sci-Fi genre to the proportion without a Sci-Fi genre to determine whether there is a significant difference between them. The null and alternative hypotheses are as follows:
$H_0: p_1 = p_2$
$H_a: p_1 < p_2$
where $p_1$ is the proportion of movies with a Sci-Fi genre and $p_2$ is the proportion of movies without a Sci-Fi genre.

In my dataset, I can see the ratings G, PG, PG-13, R, and all the others. I count how often each rating appears and compute its relative frequency to determine how common each rating is.

1. Counting items (frequency):
- In my data set, I show how frequently each of these movie ratings appears.
- Null hypothesis (G0): The average number of times each rating appears is the same.
- Alternative hypothesis (G1): The average number of times each rating appears varies.
2. Sorting the MPAA ratings:
- I list the ratings from "G" to "X" along with their frequencies.
- Null hypothesis (G0): The rating frequency is equal across the rating types.
- Alternative hypothesis (G1): Some ratings appear more or less often than others.

These hypotheses allow me to continue with statistical tests.

Bootstrap
(a) Bootstrapping for Quantitative Variable:
Standard Error: Standard error (SE) is a statistic that reveals how accurately sample data represents the whole population (Kenton W. 2024).
- Bootstrap Mean Frequency: 95.6 (mean of bootstrapped frequencies)
- Standard Error: 2.4 (standard deviation of bootstrapped frequencies)
95% Confidence Interval:
- Confidence Interval: (91.2, 100.0)
Decision:
- Fail to reject null hypothesis.
[Histogram of the bootstrap distribution for the quantitative variable]

(b) Bootstrapping for Categorical Variable:
Standard Error:
- Bootstrap Mean Relative Frequency: 0.151 (mean of bootstrapped relative frequencies)
- Standard Error: 0.056 (standard deviation of bootstrapped relative frequencies)
95% Confidence Interval:
- Confidence Interval: (0.090, 0.211)
Decision:
- Reject null hypothesis.

[Histogram of the bootstrap distribution for the categorical variable]

Categorical Variable - Conventional Methods:
Standard Error:
- Standard Error: 0.058
- 95% Confidence Interval: (0.037, 0.117)
Decision:
- Reject null hypothesis.

Comparison:
- My bootstrapped and conventional methods yield similar results, supporting the rejection of the null hypothesis for the categorical variable.
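
The comparison between bootstrapped and conventional intervals can be illustrated with a short sketch. It is not a reproduction of the figures reported above, since the report does not state which sample or resampling settings produced them. As an assumed illustrative input it uses the 98 Sci-Fi films out of the 522 in the two-way table. The function names, the 2,000 resamples, the seed, and the use of a percentile interval are all assumptions.

```python
import random
import statistics
from math import sqrt


def bootstrap_proportion(sample, n_resamples=2000, seed=0):
    """Percentile bootstrap for a proportion: returns (SE, (lower, upper))."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the proportion of 1s each time.
    props = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples))
    se = statistics.stdev(props)
    lower = props[int(0.025 * n_resamples)]
    upper = props[int(0.975 * n_resamples) - 1]
    return se, (lower, upper)


def normal_proportion(successes, n, z=1.96):
    """Conventional normal-approximation SE and 95% interval for a proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return se, (p - z * se, p + z * se)


# Assumed illustrative input: 1 = Sci-Fi, 0 = not Sci-Fi (98 of 522 films).
sample = [1] * 98 + [0] * 424
boot_se, boot_ci = bootstrap_proportion(sample)
norm_se, norm_ci = normal_proportion(98, 522)

print("bootstrap:   ", round(boot_se, 4), [round(x, 3) for x in boot_ci])
print("conventional:", round(norm_se, 4), [round(x, 3) for x in norm_ci])
```

With a sample of this size the two standard errors should agree closely, which is the kind of agreement the comparison above describes.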