More on Linear Regression
By the end of this section, you will be able to
(i) carry out a linear regression with Excel,
(ii) carry out a linear regression with GDC,
(iii) report the relevant information in a document, and
(iv) provide an interpretation of the results.
Linear regression can be easily carried out with Excel programme. We will simply enter our data X and Y into two columns in Excel. Here we look at an example where Kelvin carried out an experiment with certain solutions. He recorded the temperature and yield from the chemical reaction.
In Excel, we can do linear regression
with analysis by employing "The Analysis ToolPak."
To access The Analysis ToolPak, click Data Analysis in the Analysis group on the Data tab. If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.
Load the Analysis ToolPak
1.Click the Microsoft Office Button and then click Excel Options.
2.Click Add-ins, and then in the Manage box, select Excel Add-ins.
4.In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.
Tip: If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.
If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.
The Analysis ToolPak, click Data Analysis in the Analysis group on the Data
tab and select Regression (see Excel Screen 1 below). Then you select the
appropriate column as your x and y as in Excel Screen 2 below. Later select
the box "Label" if the first row of the above table with headings
is also selected.
Excel Screen 1.
Excel Screen 2.
the Excel output.
The coefficient of correlation is given as "Multiple R" in this output (r ≈ 0.928).
is just the square of coefficient of correlation. In this example, r2
If r2 = 1 then our data fit the regression line perfectly and the line has positive slope. If r2 = -1 then our data fit the regression line perfectly and the line has negative slope. A no fit is suggested by r2 = 0.
The value r2 tell us how the regressor (X variable) explains the variation in the dependent variable (Y). Thus, r2 tells us how best these data fit a linear line.
Adjusted R square
is used when you have two or more
regressors (independent or x variables).
value (a) is -3.50 and the coefficient for temperature (b) is 0.04.
That is Y = -3.50 + 0.04X.
We then need to consider whether -3.50 and 0.04 are significant or not. We can determine this by considering their P-values. The reported P-value in EXCEL is actually P-value = P(|calculated t stat|<t) that is the area underneath the curve at both ends of the tail. The P-value for intercept is 0.029. In another word, there is only 2.9 % chance of wrongly rejected the null hypothesis that the coefficient is zero when indeed this is the case (the null hypothesis is actually true). In statistics, we often take a value to be significant if this p-value is less than 0.05. That is, 5 % risk of wrongly rejecting the null hypothesis when the null hypothesis is true is normally considered as acceptable. A risk-taker could decide to raise this to 10% but very rarely statisticians will consider anything above 10% as acceptable. Looking at the above p-values, we see that the coefficient of intercept is not significant at 5% level that it is not significantly different from zero. The coefficient of temperature is however significant at 5% level.
Read more about T-test for coefficient.
The significance F in the ANOVA section
tells us whether or not the model, in this case Y = a + β X is significant?
For the model to be significant, we need to reject the hypothesis that β=0 . Statisticians will usually reject the null hypothesis that β=0 when the p-value of the F statistics is smaller than 0.05. Statisticians will sometimes consider a model of moderate significant if the p-value is less than 0.10. A p-value greater than 0.10 will almost always suggest that the linear model Y = a + β X is not acceptable. This F test is often used to test the null hypothesis that the true slope coefficients β2 = β3 = ... = βk in a linear model Y = β2+ β2X2 + β3X3 + ...+βkXk are simultaneously zero. The alternative hypothesis is the true slope coefficients are not simultaneously zero.
In our case, this linear model is significant.
A diagram such as this.
Report the equation
Report r ≈ 0.928 with p < 0.005 from the table for r. Thus, our correlation is positive and significant.
F and r values are usually reported to two decimal places. Having too many significant figures many not be meaningful.
If your sample size is small (not more than 16) then coefficients should be reported to not more than 2 significant figures. If the sample size is greater than 100, coefficients can be reported up to 3 significant figures.
If any of the p-values have “E-6” or less than we should report that as p < 0.00005.
According to the regression results above, an increase of the temperature by one degree will increase the yield by 0.040 gramme. The coefficient of 0.040 has a p-value of 0.001 that is the coefficient is statistical significant at 0.01 % level. In another word, there is only 0.01 % chance of rejecting the hypothesis that this coefficient is no different from zero when it is actually true. Or we can say we accept a 0.01% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a two-tailed t-test with Ho: βo = 0 and the alternative is βo ≠ 0 where βo is the population estimate of the coefficient.
The coefficient of intercept has a p-value of 0.029 which means that probability of rejecting the hypothesis that it is zero when the hypothesis is actually true is only 2.9%. Or we can say we accept a 2.9% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a two-tailed t-test with Ho: βo = 0 and the alternative is βo ≠ 0 where βo is the population estimate of the coefficient. This significance is actually bad news to our model as it tells us that there are relevant factors but unknown to us and thus not included in the model. In another word, the model here with only one factor called temperature is not adequate in modeling the experiment. If time permits, a good researcher should try to figure out what other relevant factors other than temperature will influence the yield of this chemical substance. If we have a coefficient of intercept that is not statistical significance then we have included all relevant factors into our model. Thus, a good researcher should create a model in which the coefficient of the intercept is statistically insignificant, say even at 10% significance level.
Refer to the note entitled T-test 2 for coefficient for more about the above testing.
block your X and Y data in the Excel spreadsheet as displayed below.
Go to the top of the ruler and select the icon that looks like bar chart. This is the "Chart Wizard."
will then obtain this screen. Select "XY (Scatter)"
because we want to plot a scatter diagram. In the Sub-type, select
the highlighted option as portrayed here.
Create your graph.
After the graph is created, go to the top ruler and select "Chart-Add trendline."
will then see this screen. We will select "Linear" as our
scatter diagram has linear shape.
that, click the "Option" button in the above screen. In
order to obtain a nice line of best fit, we will go to "Forecast"
and increase both "Forward" and "Backward" from
zero to 1 unit. Then, we will select "Display equation on
chart" and "Display R-squared value on chart."
Click OK and you will obtain a diagram similar to the one above. You can select the equation and R-square and move them to an area that is legible.
1. A group of students were surveyed. They were asked their IB predicted grade in Mathematics and their SAT score in Mathematics.
(i) Use Excel to obtain the regression
line Y on X.
2. Mr. Chau is a factory manager. His factory produces LCD screens. He recorded the following information for 10 consecutive months.
(i) Find the regression line in
the form of Y=a+bX. Also comment the significant of these coefficients.
Y =218 + 76 X