Chapter 4 – Regression Models
Regression is an approach for modeling the relationship between a quantitative dependent variable Y and one or more explanatory variables (or independent variables), represented by X(s). The case of one explanatory variable is called simple linear regression.
The main purposes of regression analysis are to understand the relationship between/among variables and to predict one variable based on the other(s).
Learning Objectives for this Chapter
At the completion of the Spring 2023 semester, students will be able to:
4.1 – Identify variables, visualize them using a scatter diagram, and use them in a regression model
4.2 – Develop simple linear regression equations from collected data and interpret the slope and intercept
4.3 – Compute the coefficient of determination and the coefficient of correlation and interpret their meanings
4.4 – List assumptions used in regression and use a residual plot to identify problems
4.5 – Test the model for significance
4.6 – Use Excel to do a regression analysis
4.7 – Develop a multiple regression model using Excel and use it for prediction
We are going to use the example below to go through learning objectives 4.1 to 4.6.
A cafeteria at a local college would like to come up with a regression model that would predict what a student would spend for lunch based on what they spent for breakfast. They collected data from randomly selected students, and the results are shown below:
X (money spent on breakfast)    Y (money spent on lunch)
5                               12
6                               11
7                                9
7                                8
9                                4
10                               3
12                               2
4.1 Scatter diagram
[Scatter diagram of Y (money spent on lunch), 0 to 14, plotted against X (money spent on breakfast), 0 to 13.]
Speculation: There seems to be a negative linear relationship between what a given student spends for breakfast and
lunch. For this scatter plot, $ breakfast is our input, independent, or explanatory variable, whereas $ lunch is our
output, dependent, or response variable.
4.2 – Developing a simple linear model
The simple linear regression model is Y = β0 + β1X + ε ➔ This is a general model
Where
Y is the dependent variable
X is the independent variable
β0 is the intercept (Y value when X = 0)
β1 is the slope of the regression line
ε is some random error
The simple regression model estimated from a sample of data:
Ŷ = b0 + b1x ➔ where b0 and b1 are the estimated values of the intercept and slope, chosen so that the errors are as small as possible.
Note that errors still exist and can be computed as e = (actual value Y) - (predicted value Ŷ).
Here,
Ŷ is the predicted value of Y
b0 is the estimate of β0, based on a sample
b1 is the estimate of β1, based on a sample
How to compute these values?
We need to first compute the following:
b1 = ∑(X - X̄)(Y - Ȳ) / ∑(X - X̄)^2
b0 = Ȳ - b1X̄
The best way to do this is to develop a table (you can use Excel):
X      Y      (X - X̄)^2        (X - X̄)(Y - Ȳ)
5      12     (5-8)^2 = 9       (5-8)(12-7) = -15
6      11     (6-8)^2 = 4       (6-8)(11-7) = -8
7      9      (7-8)^2 = 1       (7-8)(9-7) = -2
7      8      (7-8)^2 = 1       (7-8)(8-7) = -1
9      4      (9-8)^2 = 1       (9-8)(4-7) = -3
10     3      (10-8)^2 = 4      (10-8)(3-7) = -8
12     2      (12-8)^2 = 16     (12-8)(2-7) = -20
Sum:   56     49     36                  -57

X̄ = 56/7 = 8
Ȳ = 49/7 = 7

b1 = -57/36 ➔ -1.58
b0 = 7 - (-1.58)(8) ➔ 19.67
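The hand computation above can be checked with a short script (a minimal sketch in Python; Excel's SLOPE and INTERCEPT functions give the same values):

```python
# Least-squares slope and intercept for the breakfast/lunch data,
# following the table computation above.
x = [5, 6, 7, 7, 9, 10, 12]   # money spent on breakfast
y = [12, 11, 9, 8, 4, 3, 2]   # money spent on lunch

n = len(x)
x_bar = sum(x) / n            # 56/7 = 8
y_bar = sum(y) / n            # 49/7 = 7

sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 36
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # -57

b1 = sxy / sxx           # slope: -57/36 ≈ -1.58
b0 = y_bar - b1 * x_bar  # intercept: ≈ 19.67

print(round(b1, 2), round(b0, 2))
```

Running this prints -1.58 19.67, matching the values in the table.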
So, the simple regression equation is ŷ = 19.67 - 1.58x.
4.3 – Measuring the Fit of the Regression Model
To know whether the model we developed is good enough to be used for prediction, we start by computing the coefficient of determination (R2) and the coefficient of correlation (r).
To do that, we must first compute the following:
• Sum of Squares Total, or SST: this measures the total variability of Y about the mean.
SST = ∑(Y - Ȳ)^2
• Sum of Squares Error, or SSE: this measures the variability of Y about the regression line.
SSE = ∑e^2 ➔ ∑(Y - Ŷ)^2
• Sum of Squares Regression, or SSR: this indicates how much of the total variability of Y can be explained by the regression model.
SSR = ∑(Ŷ - Ȳ)^2
Important relationship: since SST = SSR + SSE, SSR ➔ SST - SSE.
X̄ = 56/7 = 8
Ȳ = 49/7 = 7

X     Y     (Y - Ȳ)^2    Ŷ = b0 + b1X (here ŷ = 19.67 - 1.58x)    (Y - Ŷ)^2    (Ŷ - Ȳ)^2
5     12    25           11.77                                     0.0529       22.7529
6     11    16           10.19                                     0.6561       10.1761
7     9     4            8.61                                      0.1521       2.5921
7     8     1            8.61                                      0.3721       2.5921
9     4     9            5.45                                      2.1025       2.4025
10    3     16           3.87                                      0.7569       9.7969
12    2     25           0.71                                      1.6641       39.5641
Sums: 49    96 (SST)                                               5.7567 (SSE) 89.8767 (SSR)
Coefficient of determination
The coefficient of determination (represented by R2) gives the proportion of the variation in the dependent variable (Y) that is predictable from the regression with the independent variable (X).
R2 = SSR/SST ➔ this is the same as 1 - SSE/SST
For our question, R2 = 89.8767/96 ➔ 0.936, or about 94%.
Interpretation: about 94% of the variation in Y (money spent on lunch) can be explained by the regression with X (money spent on breakfast). The remaining 6% is due to other factors (due to error).
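The three sums of squares can be verified the same way (a sketch in Python; it uses the rounded coefficients 19.67 and -1.58 from above, which is why SSE + SSR comes out very slightly different from SST):

```python
# Sums of squares and R^2 for the breakfast/lunch data, using the
# chapter's rounded coefficients b0 = 19.67 and b1 = -1.58.
x = [5, 6, 7, 7, 9, 10, 12]
y = [12, 11, 9, 8, 4, 3, 2]
b0, b1 = 19.67, -1.58

y_bar = sum(y) / len(y)              # 7
y_hat = [b0 + b1 * xi for xi in x]   # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)               # 96
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # ≈ 5.7567
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # ≈ 89.8767

r_squared = ssr / sst                # ≈ 0.936, about 94%
print(round(sse, 4), round(ssr, 4), round(r_squared, 3))
```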
Coefficient of correlation
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables.
r = ±√R2 ➔ important: r has the same sign as the slope (b1) of the line of regression.
Speculating on r based on the scatter diagram
The value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive linear correlations and negative linear correlations, respectively.
Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as values for x increase, values for y also increase.
Negative Correlation: If x and y have a strong negative linear correlation, r is close
to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values
indicate a relationship between x and y such that as values for x increase, values
for y decrease.
No Correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is no linear relationship between the two variables (the relationship may be random or nonlinear).
Note that r is a dimensionless quantity; that is, it does not depend on the units
employed.
Perfect Correlation: A perfect correlation of r = ± 1 happens only when the data points all lie exactly on a straight line. If
r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.
Strength of the correlation:
Negative: -1 (strong) ... -0.7 (moderate) ... -0.5 (weak) ... 0
Positive: 0 (weak) ... 0.5 (moderate) ... 0.7 (strong) ... +1
In this case, r = -√0.936 = -0.967.
This is negative because the slope of the line of regression is also negative.
Interpretation: there is a strong negative linear relationship between the amount of money spent on breakfast and
the amount of money spent on lunch.
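The sign rule for r translates directly to code (a sketch; `math.copysign` gives √R2 the sign of the slope):

```python
import math

# r has magnitude sqrt(R^2) and carries the sign of the slope b1.
r_squared = 0.936   # coefficient of determination from above
b1 = -1.58          # slope of the regression line

r = math.copysign(math.sqrt(r_squared), b1)
print(round(r, 3))  # -0.967: a strong negative linear relationship
```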
4.4 – Assumption of the Regression Model
We stated earlier that the linear regression model comes with errors because we are not dealing with a perfectly aligned set of points. In other terms, SSE is not always equal to 0, and R2 is not always 100%. Therefore, we have to make some assumptions about the errors in the regression model so that we can test it for significance. We must make the following assumptions about the errors:
• The errors are independent.
• The errors are normally distributed.
• The errors have a mean of zero.
• The errors have constant variance (regardless of the values of X).
When assumptions are met, a plot of errors against the independent variable should appear to be random
In our example, we are going to plot X against Residual (Y - Ŷ ) and check for randomness
X     Ŷ = b0 + b1X (here ŷ = 19.67 - 1.58x)    Y     Residual (Y - Ŷ)
5     11.77                                     12    0.23
6     10.19                                     11    0.81
7     8.61                                      9     0.39
7     8.61                                      8     -0.61
9     5.45                                      4     -1.45
10    3.87                                      3     -0.87
12    0.71                                      2     1.29
Using Excel:
[Residual plot: Residual (Y - Ŷ) on the vertical axis, from -2 to 1.5, against X from 0 to 14.]
We can see that the scatter plot appears to be random. You can use Figures 4.4A, 4.4B, and 4.4C on page 118 to check for likelihood of randomness; we want the residual plot to look like Figure 4.4A.
The next step is to estimate the variance.
While the errors are assumed to have constant variance (σ2), it can only be estimated once a sample is collected. The Mean Squared Error (MSE, or s2) is a good estimate of the population variance σ2.
s2 = MSE = SSE/(n - k - 1), where n is the number of observations (pairs of points), and k is the number of independent variables.
For our example, s2 = 5.7567/(7-1-1) = 1.15
From the sample variance, we can estimate the standard deviation by taking the square root of s2.
Here, s = √1.15 = 1.07. This is also called the standard error of the estimate, or the standard deviation of the regression.
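These two estimates are quick to reproduce (a sketch; n and k are taken from the example above):

```python
import math

# Estimate the error variance and standard deviation of the regression.
sse = 5.7567   # sum of squares error from the chapter's table
n, k = 7, 1    # number of observations, number of independent variables

s_squared = sse / (n - k - 1)   # MSE ≈ 1.15
s = math.sqrt(s_squared)        # standard error of the estimate ≈ 1.07
print(round(s_squared, 2), round(s, 2))
```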
4.5 – Testing the Model for significance
Steps for Hypothesis Testing
1. Determine the Null Hypothesis (H0) and the Alternative Hypothesis (H1).
This is always
H0: β1 = 0 ➔ The correlation is 0 (the correlation is not significant)
Ha: β1 ≠ 0 ➔ The correlation is not 0 (the correlation is significant)
2. Select the level of significance (the probability of rejecting H0 when it is true). This is either 0.05 or 0.01.
3. Compute the calculated value of F. For our course, we will read that value on the regression
summary output.
4. Reject H0 if F calculated is greater than F critical (On F table)… and interpret the finding.
How to read the F table…
• Select a level of significance, either 0.05 or 0.01.
• Locate df1, the degrees of freedom of the numerator (entry column on the F table). df1 is the number of independent variables, k.
• Locate df2, the degrees of freedom of the denominator (entry row on the F table). The value of df2 is given by n - k - 1 (sample size - number of independent variables - 1).
• The critical value of F, or F-critical (df1, df2), is the number located at the junction of the entry column (df1) and the entry row (df2).
For our example
Step 1:
H0: β1 = 0 ➔ The correlation is 0 (the correlation is not significant)
Ha: β1 ≠ 0 ➔ The correlation is not 0 (the correlation is significant)
Step 2: We are going to use α = 0.05 to test our hypothesis.
Step 3 Calculate the value of F statistic. Fcalculated = MSR/MSE
MSR = SSR/k➔ 89.8767/1 = 89.8767
MSE = 1.15
Fcalculated = 89.8767/1.15 = 78.1536
Step 4: Decision: Reject H0 if the test statistic is greater than Fcritical (from the F table)
df1 = k = 1 ➔ first column
df2 = n-k-1 = 7-1-1 = 5➔ 5th row
We are going to go to the F Table in appendix D, look for α = 0.05 (the first F distribution table) and go to first column
and fifth row ➔ F0.05, 1, 5 = 6.61.
Here, since the Fcalculated of 78.1536 is greater than the Fcritical of 6.61, we reject H0. Therefore, the regression model is significant. That is, predictions generated by the linear model ŷ = 19.67 - 1.58X will be reliable.
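The whole test can be written out as a small check (a sketch; the critical value 6.61 is read off the F table rather than computed, and because MSE is kept unrounded here, F comes out near 78.06 instead of the chapter's 78.15):

```python
# F test for the significance of the regression model.
ssr, sse = 89.8767, 5.7567   # sums of squares from the chapter's table
n, k = 7, 1                  # observations, independent variables

msr = ssr / k                # mean square regression
mse = sse / (n - k - 1)      # mean square error, ≈ 1.1513
f_calculated = msr / mse     # ≈ 78.06

f_critical = 6.61            # F(0.05; df1 = 1, df2 = 5), from the F table
reject_h0 = f_calculated > f_critical
print("Reject H0" if reject_h0 else "Fail to reject H0")
```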
Try this
Additional examples
Given the following pairs of points:

X    4    5    6    8    10
Y    13   16   8    3    2

a. Draw a scatter diagram and speculate on the linear relationship between x and y
b. Find the equation of the regression line
c. Compute the coefficient of determination and tell us what that means
d. Find the coefficient of correlation and determine the strength of the relationship between x and y
e. Is the linear relationship significant? Use alpha of 0.05 to test this hypothesis for significance
(For 25 points) Given the following pairs of points.
a. Draw a scatter diagram and in a complete sentence speculate on
the linear relationship between the two variables.
X    13   14   14   17   9    10   9    18
Y    16   13   12   10   10   7    8    22
b. Find the coefficient of correlation and in a complete sentence determine the strength of the
relationship between the two variables. (Show all work – Not just from Excel)
c. Compute the coefficient of determination and in a complete sentence tell us what that means.
d. Find the equation of the regression line. (Show all work – Not just from Excel)
e. Is the regression between X and Y significant at 0.05? Why? (You can use Excel here.)
f. Use the linear regression equation to compute the value of Y when X is 14.