Regression Model: step by step

Chapter 4 – Regression Models
Regression is an approach for modeling the relationship between a quantitative dependent variable Y and one or more explanatory variables (or independent variables), represented by X. The case of one explanatory variable is called simple linear regression.
The main purposes of regression analysis are to understand the relationship between (or among) variables and to predict one variable based on the other(s).
Learning Objectives for this Chapter
At the completion of the Spring 2023 semester, students will be able to:
4.1 – Identify variables, visualize them using a scatter diagram, and use them in a regression model
4.2 – Develop a simple linear regression equation from collected data and interpret the slope and intercept
4.3 – Compute the coefficient of determination and the coefficient of correlation and interpret their meanings
4.4 – List the assumptions used in regression and use a residual plot to identify problems
4.5 – Test the model for significance
4.6 – Use Excel to do a regression analysis
4.7 – Develop a multiple regression model using Excel and use it for prediction
We are going to use the example below to go through learning objectives 4.1 to 4.6.
A cafeteria at a local college would like to come up with a regression model that would predict what a student will spend for lunch based on what they spent for breakfast. Data were collected from randomly selected students, and the results are shown below:
X (money spent on breakfast)    Y (money spent on lunch)
5                               12
6                               11
7                               9
7                               8
9                               4
10                              3
12                              2
4.1 – Scatter diagram
[Scatter diagram: Y (money spent on lunch), 0 to 14, on the vertical axis against X (money spent on breakfast), 0 to 13, on the horizontal axis]
Speculation: There seems to be a negative linear relationship between what a given student spends for breakfast and what they spend for lunch. For this scatter plot, $ breakfast is our input, independent, or explanatory variable, whereas $ lunch is our output, dependent, or response variable.
4.2 – Developing a simple linear model
The simple linear regression model is Y = β0 + β1X + ε ➔ This is a general model
Where
Y is the dependent variable
X is the independent variable
β0 is the intercept (Y value when X = 0)
β1 is the slope of the regression line
ε is some random error
The simple regression model estimated from a data sample:
Ŷ = b0 + b1X ➔ where b0 and b1 are estimated values of the intercept and slope, chosen so that the total error is minimized.
Note that errors still exist and can be computed as e = (actual value Y) - (predicted value Ŷ).
Here,
Ŷ is the predicted value of Y
b0 is the estimate of β0, based on a sample
b1 is the estimate of β1, based on a sample
How do we compute these values?
We need to first compute the following sums. The best way to do this is to develop a table (you can use Excel):
X      Y      (X-X̄)²             (X-X̄)(Y-Ȳ)
5      12     (5-8)² = 9          (5-8)(12-7) = -15
6      11     (6-8)² = 4          (6-8)(11-7) = -8
7      9      (7-8)² = 1          (7-8)(9-7) = -2
7      8      (7-8)² = 1          (7-8)(8-7) = -1
9      4      (9-8)² = 1          (9-8)(4-7) = -3
10     3      (10-8)² = 4         (10-8)(3-7) = -8
12     2      (12-8)² = 16        (12-8)(2-7) = -20
Sum    56     49            36                  -57

X̄ = 56/7 = 8
Ȳ = 49/7 = 7
b1 = ∑(X-X̄)(Y-Ȳ) / ∑(X-X̄)² = -57/36 ➔ -1.58
b0 = Ȳ - b1X̄ = 7 - (-1.58)(8) ➔ 19.67
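The table's computations can be sketched in a few lines of Python (an illustration, not part of the chapter):

```python
# Least-squares slope and intercept for the breakfast/lunch data,
# computed by hand exactly as in the table above.
x = [5, 6, 7, 7, 9, 10, 12]
y = [12, 11, 9, 8, 4, 3, 2]

n = len(x)
x_bar = sum(x) / n            # 56/7 = 8
y_bar = sum(y) / n            # 49/7 = 7

sxx = sum((xi - x_bar) ** 2 for xi in x)                        # sum of (X-X̄)^2 = 36
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of (X-X̄)(Y-Ȳ) = -57

b1 = sxy / sxx                # slope: -57/36 ≈ -1.58
b0 = y_bar - b1 * x_bar       # intercept: 7 - (-1.58)(8) ≈ 19.67
```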
So, the simple regression equation is Ŷ = 19.67 - 1.58X.
4.3 – Measuring the Fit of the Regression Model
To know whether the model developed is good enough to be used for prediction, we must start by computing the coefficient of determination (R²) and the coefficient of correlation (r).
To do that, we must first compute the following:
• Sum of Squares Total (SST): measures the total variability of Y about the mean.
  SST = ∑(Y-Ȳ)²
• Sum of Squares Error (SSE): measures the variability of Y about the regression line.
  SSE = ∑e² = ∑(Y-Ŷ)²
• Sum of Squares Regression (SSR): indicates how much of the total variability of Y can be explained by the regression model.
  SSR = ∑(Ŷ-Ȳ)²
Important relationship: since SST = SSR + SSE, we have SSR = SST - SSE.
With X̄ = 56/7 = 8 and Ȳ = 49/7 = 7:

X      Y      (Y-Ȳ)²     Ŷ = 19.67 - 1.58X    (Y-Ŷ)²       (Ŷ-Ȳ)²
5      12     25          11.77                0.0529       22.7529
6      11     16          10.19                0.6561       10.1761
7      9      4           8.61                 0.1521       2.5921
7      8      1           8.61                 0.3721       2.5921
9      4      9           5.45                 2.1025       2.4025
10     3      16          3.87                 0.7569       9.7969
12     2      25          0.71                 1.6641       39.5641
Sum    56     49     96 (SST)                  5.7567 (SSE) 89.8767 (SSR)
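The sums of squares in the table can be reproduced with a short Python sketch (an illustration, not part of the chapter), using the rounded fitted line ŷ = 19.67 - 1.58x:

```python
x = [5, 6, 7, 7, 9, 10, 12]
y = [12, 11, 9, 8, 4, 3, 2]
y_bar = sum(y) / len(y)       # 49/7 = 7

# Predicted values from the fitted line with the rounded coefficients
y_hat = [19.67 - 1.58 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # 96
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # ≈ 5.7567
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # ≈ 89.8767
# Note: with the rounded coefficients, SSR + SSE ≈ 95.63 rather than
# exactly SST = 96; the small gap is rounding error in b0 and b1.
```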
Coefficient of determination
The coefficient of determination (represented by R²) gives the proportion of the variation in the dependent variable (Y) that is predictable from the regression on the independent variable (X).
R² = SSR/SST ➔ this is also the same as 1 - SSE/SST
For our question, R² = 89.8767/96 ➔ 0.936, or about 94%.
Interpretation: About 94% of the variation in Y (money spent on lunch) can be explained by the regression on X (money spent on breakfast). The remaining 6% is due to other factors (that is, due to error).
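Both R² and the correlation coefficient r of the next subsection can be computed in a Python sketch (an illustration, not part of the chapter), using the values from the table above:

```python
import math

# Values from the table above (chapter example)
sst = 96.0
sse = 5.7567
ssr = 89.8767
b1 = -1.58                    # slope from section 4.2

r_squared = ssr / sst         # ≈ 0.936, i.e. about 94% of the variation explained
# r = ±√R², taking the same sign as the slope b1:
r = math.copysign(math.sqrt(r_squared), b1)   # ≈ -0.97
```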
Coefficient of correlation
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables.
r = ±√R² ➔ important: r has the same sign as the slope (b1) of the regression line.
Speculating on r based on the scatter diagram
The value of r is such that -1 ≤ r ≤ +1. The + and - signs correspond to positive and negative linear correlations, respectively.
• Positive correlation: if X and Y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between X and Y such that as values of X increase, values of Y also increase.
• Negative correlation: if X and Y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between X and Y such that as values of X increase, values of Y decrease.
• No correlation: if there is no linear correlation or only a weak linear correlation, r is close to 0. A value near zero means there is no linear relationship between the two variables (a nonlinear relationship may still exist).
Note that r is a dimensionless quantity; that is, it does not depend on the units employed.
• Perfect correlation: a perfect correlation of r = ±1 happens only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive; if r = -1, the slope of this line is negative.
A rough scale for the strength of the correlation:
Negative: -1 strong, -0.7 moderate, -0.5 weak
Positive: +0.5 weak, +0.7 moderate, +1 strong
In this case, r = -√0.936 ≈ -0.967. This is negative because the slope of the regression line is also negative.
Interpretation: there is a strong negative linear relationship between the amount of money spent on breakfast and the amount of money spent on lunch.
4.4 – Assumptions of the Regression Model
We stated earlier that the linear regression model comes with errors, because we are not dealing with a perfectly aligned set of points. In other terms, SSE is not always equal to 0, and R² is not always 100%. Therefore, we have to make some assumptions about the errors in the regression model so that we can test it for significance.
We must make the following assumptions about the errors:
• The errors are independent.
• The errors are normally distributed.
• The errors have a mean of zero.
• The errors have constant variance (regardless of the value of X).
When these assumptions are met, a plot of the errors against the independent variable should appear to be random.
In our example, we are going to plot X against the residual (Y - Ŷ) and check for randomness.

X      Y      Ŷ = 19.67 - 1.58X    Residual (Y - Ŷ)
5      12     11.77                0.23
6      11     10.19                0.81
7      9      8.61                 0.39
7      8      8.61                 -0.61
9      4      5.45                 -1.45
10     3      3.87                 -0.87
12     2      0.71                 1.29

[Residual plot, made in Excel: residuals (Y - Ŷ), roughly -2 to 1.5, plotted against X from 0 to 14]

We can see that the scatter plot appears to be random. You can use figures 4.4A, 4.4B, and 4.4C on page 118 to check for likelihood of randomness; we want the residual plot to look like figure 4.4A.
The next step is to estimate the variance. While the errors are assumed to have constant variance (σ²), it can only be estimated once a sample is collected. The mean squared error (MSE, or s²) is a good estimate of the population variance σ².
s² = MSE = SSE/(n - k - 1), where n is the number of observations (pairs of points) and k is the number of independent variables.
For our example, s² = 5.7567/(7 - 1 - 1) ≈ 1.15.
We can estimate the standard deviation by taking the square root of s². Here, s = √1.15 ≈ 1.07. This is also called the standard error of the estimate, or the standard deviation of the regression.
4.5 – Testing the Model for Significance
Steps for hypothesis testing:
1. Determine the null hypothesis (H0) and the alternative hypothesis (Ha). This is always:
H0: β1 = 0 ➔ the correlation is 0 (the relationship is not significant)
Ha: β1 ≠ 0 ➔ the correlation is not 0 (the relationship is significant)
2. Select the level of significance (the probability of rejecting H0 when it is true). This is either 0.05 or 0.01.
3. Compute the calculated value of F. For our course, we will read that value on the regression summary output.
4.
Reject H0 if the calculated F is greater than the critical F (from the F table), and interpret the finding.
How to read the F table:
• Select a level of significance, either 0.05 or 0.01.
• Locate df1, the degrees of freedom of the numerator (entry column on the F table). df1 is the number of independent variables, k.
• Locate df2, the degrees of freedom of the denominator (entry row on the F table). df2 = n - k - 1 (sample size - number of independent variables - 1).
• The critical value F(df1, df2) is the number located at the junction of the entry column (df1) and the entry row (df2).
For our example:
Step 1
H0: β1 = 0 ➔ the correlation is 0 (the relationship is not significant)
Ha: β1 ≠ 0 ➔ the correlation is not 0 (the relationship is significant)
Step 2
We are going to use α = 0.05 to test our hypothesis.
Step 3
Calculate the value of the F statistic: Fcalculated = MSR/MSE
MSR = SSR/k ➔ 89.8767/1 = 89.8767
MSE = 1.15
Fcalculated = 89.8767/1.15 = 78.1536
Step 4
Decision: reject H0 if the test statistic is greater than F critical (from the F table).
df1 = k = 1 ➔ first column
df2 = n - k - 1 = 7 - 1 - 1 = 5 ➔ fifth row
We go to the F table in appendix D, look for α = 0.05 (the first F distribution table), and read the first column, fifth row ➔ F(0.05, 1, 5) = 6.61.
Since the Fcalculated of 78.1536 is greater than the Fcritical of 6.61, we reject H0. Therefore, the regression model is significant. That is, predictions generated by the linear model Ŷ = 19.67 - 1.58X will be reliable.
Try this – additional examples
Given the following pairs of points:

X    4     5     6     8     10
Y    13    16    8     3     2

a. Draw a scatter diagram and speculate on the linear relationship between X and Y.
b. Find the equation of the regression line.
c. Compute the coefficient of determination and tell us what that means.
d. Find the coefficient of correlation and determine the strength of the relationship between X and Y.
e. Is the linear relationship significant?
Use α = 0.05 to test this hypothesis for significance.

(For 25 points) Given the following pairs of points:

X    13    14    14    17    9     10    9     18
Y    16    13    12    10    10    7     8     22

a. Draw a scatter diagram and, in a complete sentence, speculate on the linear relationship between the two variables.
b. Find the coefficient of correlation and, in a complete sentence, determine the strength of the relationship between the two variables. (Show all work, not just from Excel.)
c. Compute the coefficient of determination and, in a complete sentence, tell us what that means.
d. Find the equation of the regression line. (Show all work, not just from Excel.)
e. Is the regression between X and Y significant at 0.05? Why? (You can use Excel here.)
f. Use the linear regression equation to compute the value of Y when X is 14.
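To check answers to exercises like these, the chapter's whole procedure can be sketched end to end in Python (a sketch, not part of the chapter; the function name and output format are illustrative). Run on the chapter's breakfast/lunch example, it reproduces the hand results, with small differences in SSE, R², and F because it keeps full precision for b0 and b1 instead of rounding them to two decimals first:

```python
import math

def simple_regression(x, y):
    """Simple linear regression by hand, following the chapter's formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ssr = sst - sse                       # SST = SSR + SSE
    r2 = ssr / sst
    r = math.copysign(math.sqrt(r2), b1)  # r has the sign of the slope
    k = 1                                 # one independent variable
    mse = sse / (n - k - 1)               # s^2
    f = (ssr / k) / mse                   # Fcalculated = MSR/MSE
    return {"b0": b0, "b1": b1, "r2": r2, "r": r, "s": math.sqrt(mse), "F": f}

# Chapter example: compare with the hand computations in sections 4.2-4.5
stats = simple_regression([5, 6, 7, 7, 9, 10, 12], [12, 11, 9, 8, 4, 3, 2])
```

The returned F statistic still has to be compared against the critical value from the F table for the chosen α and degrees of freedom (6.61 for α = 0.05, df1 = 1, df2 = 5 in the chapter example).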