More on Linear Regression 
AIMS:
By the end of this section, you will be able to
(i) carry out a linear regression with Excel,
(ii) carry out a linear regression with GDC,
(iii) report the relevant information in a document, and
(iv) provide an interpretation of the results.
Linear regression can be easily carried out with Excel programme. We will simply enter our data X and Y into two columns in Excel. Here we look at an example where Kelvin carried out an experiment with certain solutions. He recorded the temperature and yield from the chemical reaction.

In Excel, we can do linear regression
with analysis by employing "The Analysis ToolPak."
To access The Analysis ToolPak, click Data Analysis in the Analysis group
on the Data tab. If the Data Analysis command is not available, you need
to load the Analysis ToolPak addin program.
Load the Analysis ToolPak
1.Click the Microsoft Office Button and then click Excel Options.
2.Click Addins, and then in the Manage box, select Excel Addins.
3.Click Go.
4.In the AddIns available box, select the Analysis ToolPak check box, and
then click OK.
Tip: If Analysis ToolPak is not listed in the AddIns available box, click
Browse to locate it.
If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.
To access
The Analysis ToolPak, click Data Analysis in the Analysis group on the Data
tab and select Regression (see Excel Screen 1 below). Then you select the
appropriate column as your x and y as in Excel Screen 2 below. Later select
the box "Label" if the first row of the above table with headings
is also selected.
Excel Screen 1. 
Excel Screen 2. 
Here is
the Excel output.
The coefficient of correlation is given as "Multiple R" in this output (r ≈ 0.928).
R square
is just the square of coefficient of correlation. In this example, r^{2
}≈ 0.862.
If r^{2 }= 1 then our data fit the regression line perfectly and
the line has positive slope. If r^{2 }= 1 then our data fit the
regression line perfectly and the line has negative slope. A no fit is suggested
by r^{2 }= 0.
The value r^{2} tell
us how the regressor (X variable) explains the variation in the dependent
variable (Y). Thus,
r^{2}
tells us how best these data fit a linear
line.
Adjusted R square
is used when you have two or more
regressors (independent or x variables). 
The intercept
value (a) is 3.50 and the coefficient for temperature (b) is 0.04.
That is Y = 3.50 + 0.04X.
We then need to consider whether 3.50 and 0.04 are significant or not. We can determine this by considering their Pvalues. The reported Pvalue in EXCEL is actually Pvalue = P(calculated t stat<t) that is the area underneath the curve at both ends of the tail. The Pvalue for intercept is 0.029. In another word, there is only 2.9 % chance of wrongly rejected the null hypothesis that the coefficient is zero when indeed this is the case (the null hypothesis is actually true). In statistics, we often take a value to be significant if this pvalue is less than 0.05. That is, 5 % risk of wrongly rejecting the null hypothesis when the null hypothesis is true is normally considered as acceptable. A risktaker could decide to raise this to 10% but very rarely statisticians will consider anything above 10% as acceptable. Looking at the above pvalues, we see that the coefficient of intercept is not significant at 5% level that it is not significantly different from zero. The coefficient of temperature is however significant at 5% level.
The significance F in the ANOVA section
tells us whether or not the model, in this case Y = a + β X is significant?
For the model to be significant, we need to reject the hypothesis
that β=0
. Statisticians
will usually reject the null hypothesis that β=0
when the pvalue of the F statistics is smaller than 0.05.
Statisticians will sometimes consider
a model of moderate significant if the pvalue is less than 0.10. A pvalue
greater than 0.10 will almost always suggest that the linear model Y = a
+ β X is not acceptable. This F test is often used to test the null
hypothesis that the true slope coefficients β_{2} = β_{3}
= ... = β_{k }in a linear model Y = β_{2}+ β_{2}X_{2}
+ β_{3}X_{3} + ...+β_{k}X_{k}
are simultaneously zero. The alternative hypothesis is
the true slope coefficients are not simultaneously zero.
In our case, this linear model is significant.


A diagram such as this.
Report the equation
Y = 
3.50 
+ 
0.040X 

p≈0.029 

p≈ 0.001 
Report r ≈ 0.928 with p < 0.005 from the table for r. Thus, our correlation is positive and significant.
F and r values are usually reported to two decimal places. Having too many significant figures many not be meaningful.
If your sample size is small (not more than 16) then coefficients should be reported to not more than 2 significant figures. If the sample size is greater than 100, coefficients can be reported up to 3 significant figures.
If any of the pvalues have “E6” or less than we should report that as p < 0.00005.
According to the regression results above, an increase of the temperature by one degree will increase the yield by 0.040 gramme. The coefficient of 0.040 has a pvalue of 0.001 that is the coefficient is statistical significant at 0.01 % level. In another word, there is only 0.01 % chance of rejecting the hypothesis that this coefficient is no different from zero when it is actually true. Or we can say we accept a 0.01% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a twotailed ttest with H_{o}: β_{o} = 0 and the alternative is β_{o} ≠ 0 where β_{o} is the population estimate of the coefficient.
The coefficient of intercept has a pvalue of 0.029 which means that probability of rejecting the hypothesis that it is zero when the hypothesis is actually true is only 2.9%. Or we can say we accept a 2.9% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a twotailed ttest with H_{o}: β_{o} = 0 and the alternative is β_{o} ≠ 0 where β_{o} is the population estimate of the coefficient. This significance is actually bad news to our model as it tells us that there are relevant factors but unknown to us and thus not included in the model. In another word, the model here with only one factor called temperature is not adequate in modeling the experiment. If time permits, a good researcher should try to figure out what other relevant factors other than temperature will influence the yield of this chemical substance. If we have a coefficient of intercept that is not statistical significance then we have included all relevant factors into our model. Thus, a good researcher should create a model in which the coefficient of the intercept is statistically insignificant, say even at 10% significance level.
Refer to the note entitled Ttest 2 for coefficient for more about the above testing.
Simply
block your X and Y data in the Excel spreadsheet as displayed below.
Go to the top of the ruler and select the icon that looks like bar chart. This is the "Chart Wizard."
You
will then obtain this screen. Select "XY (Scatter)"
because we want to plot a scatter diagram. In the Subtype, select
the highlighted option as portrayed here.
Create your graph.
After the graph is created, go to the top ruler and select "ChartAdd trendline."
You
will then see this screen. We will select "Linear" as our
scatter diagram has linear shape.
After
that, click the "Option" button in the above screen. In
order to obtain a nice line of best fit, we will go to "Forecast"
and increase both "Forward" and "Backward" from
zero to 1 unit. Then, we will select "Display equation on
chart" and "Display Rsquared value on chart."
Click OK and you will obtain a diagram similar to the one above. You can select the equation and Rsquare and move them to an area that is legible.
Exercises: 1. A group of students were surveyed. They were asked their IB predicted grade in Mathematics and their SAT score in Mathematics.
(i) Use Excel to obtain the regression
line Y on X. 2. Mr. Chau is a factory manager. His factory produces LCD screens. He recorded the following information for 10 consecutive months.
(i) Find the regression line in
the form of Y=a+bX. Also comment the significant of these coefficients.
Answers:1.(i)
Y =218 + 76 X 