Linear Regression 
AIM: In this section we learn how to calculate the line of best fit mathematically. This line is also called the regression line.
Recall that a straight line can be expressed in the form
y = mx + c
where m is the gradient (slope) of the straight line and (0,c) is the y-intercept.
When calculating a linear regression, statisticians like to write the above equation slightly differently. They prefer to use b for m and a for c.
Thus,
y = a + bx
where (0,a) is the y-intercept and b is the gradient of the line.
Hence, to find the line of best fit, we need to find the values of a and b.
The gradient b is calculated as follows:
b = [∑xy − (1/n)(∑x)(∑y)] / [∑x^{2} − (1/n)(∑x)^{2}]
The value of a is found by employing the fact that the line of best fit always passes through (x̄, ȳ), where x̄ is the mean of x and ȳ is the mean of y.
ȳ = a + bx̄
a = ȳ − bx̄
and the value of b is obtained from the formula above.
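The two formulas above can be sketched in Python as follows. This is only an illustrative sketch: the function name regression_y_on_x and the four sample points are assumptions made for the example and do not come from the text.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the line of best fit y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = [sum(xy) - (1/n) sum(x) sum(y)] / [sum(x^2) - (1/n) (sum(x))^2]
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # a = mean(y) - b * mean(x), because the line passes through the mean point
    a = sum_y / n - b * sum_x / n
    return a, b


# Made-up points lying exactly on y = 1 + 2x (not data from the text).
a, b = regression_y_on_x([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because the sample points lie exactly on a straight line, the function recovers that line's intercept and gradient.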
A more compact way to express the equation of the regression line of y on x is
y − ȳ = (S_{xy}/S_{xx})(x − x̄)
where S_{xy} is the covariance of x and y, S_{xx} is the variance of x and S_{x} is the standard deviation of x. Similarly, S_{yy} is the variance of y and S_{y} is the standard deviation of y.
Thus, b can be written as b = S_{xy}/S_{xx}. Because the factor 1/n appears in both S_{xy} and S_{xx}, it cancels in the ratio, so this agrees with the formula for b given earlier.
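As a rough illustration of b = S_{xy}/S_{xx}, the sketch below computes the covariance and the variance as averages of deviations from the means. The function name slope_from_covariance and the sample points are assumptions chosen for the example, not part of the text.

```python
def slope_from_covariance(xs, ys):
    """Return b = S_xy / S_xx using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: average product of the deviations of x and y.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Variance of x: average squared deviation of x.
    s_xx = sum((x - mean_x) ** 2 for x in xs) / n
    return s_xy / s_xx


# Made-up points lying exactly on y = 1 + 2x (not data from the text).
print(slope_from_covariance([0, 1, 2, 3], [1, 3, 5, 7]))  # 2.0
```

Here S_{xy} = 2.5 and S_{xx} = 1.25, so the gradient is 2, the same value the sums formula gives for these points.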
Let us look at our previous example (Excel file with Macro).
X (height, cm)    Y (weight, kg)
178               53
174               50
180               58
182               60
190               70
195               85
165               45
168               48
173               51
175               60
∑x = 1780         ∑y = 580
Hence x̄ = 1780/10 = 178 and ȳ = 580/10 = 58. The red spot in the diagram is the point (178, 58).
From the Excel example above or a GDC, we find that n = 10, ∑xy = 104181 and ∑x^{2} = 317612.

b = [104181 − (1/10)(1780)(580)] / [317612 − (1/10)(1780)^{2}]
b = 941/772 ≈ 1.22 (3 s.f.)
The value of b is known as the regression coefficient of y on x,
in our case, weight on height.
Now, we can calculate
a = 58 − (941/772)(178)
a ≈ −159 (3 s.f.)
The line of best fit for this example (see diagram) is y = −159 + 1.22x
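The arithmetic of this worked example can be checked with a short Python sketch. Exact fractions are used so that 941/772 appears without rounding. The variable names are assumptions for illustration; the data are the heights and weights from the table.

```python
from fractions import Fraction

heights = [178, 174, 180, 182, 190, 195, 165, 168, 173, 175]  # x, in cm
weights = [53, 50, 58, 60, 70, 85, 45, 48, 51, 60]            # y, in kg

n = len(heights)
sum_x = sum(heights)                                    # 1780
sum_y = sum(weights)                                    # 580
sum_xy = sum(x * y for x, y in zip(heights, weights))   # 104181
sum_x2 = sum(x * x for x in heights)                    # 317612

numerator = sum_xy - Fraction(sum_x * sum_y, n)     # 941
denominator = sum_x2 - Fraction(sum_x ** 2, n)      # 772
b = numerator / denominator

mean_x = Fraction(sum_x, n)   # 178
mean_y = Fraction(sum_y, n)   # 58
a = mean_y - b * mean_x

print(b, round(float(b), 2))  # 941/772 1.22
print(round(float(a)))        # -159

# The line of best fit passes through the mean point (178, 58).
print(a + b * mean_x == mean_y)  # True
```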
Note: The red spot at (x̄, ȳ), the mean of x and the mean of y, always lies on the line of best fit.
[IB likes to test whether or not students know that the line of best fit contains the point (x̄, ȳ).]
According to Crawshaw and Chambers (2001), the regression line above gives us the average value of y for a given value of x. In Excel, this regression line is known as a "trend line." Knowing this mathematical equation, we can use it to predict or estimate missing values. Thus, we can use the equation for interpolation (estimating inside the range of the sample).
(a) So, if we know that Tommy's
height is 179 cm then we can predict his weight using the above equation.
y = −159 + 1.22(179)
y ≈ 59.4 kg (3 s.f.)
If Tommy's height is 179 cm then according to the best fit line his weight is
59.4 kg accurate to three significant figures.
(b) Similarly, if we know that Elizabeth's
weight is 55 kg then we can predict her height.
55 = −159 + 1.22x
x = (55+159)/1.22
x ≈ 175 cm (3 s.f.)
If Elizabeth's weight is 55 kg then according to the best fit line her height
is 175 cm accurate to three significant figures.
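Estimates (a) and (b) can be reproduced with the sketch below. It assumes the rounded coefficients a = −159 and b = 1.22 used in the text; the helper round_sf is a hypothetical name introduced only for rounding to significant figures.

```python
def round_sf(value, sig=3):
    """Round value to sig significant figures."""
    return float(f"{value:.{sig}g}")


a, b = -159, 1.22  # rounded coefficients of y = a + bx from the text

tommy_weight = a + b * 179        # (a) weight predicted from a height of 179 cm
elizabeth_height = (55 - a) / b   # (b) the same line solved for x when y = 55 kg

print(round_sf(tommy_weight))      # 59.4
print(round_sf(elizabeth_height))  # 175.0
```

Using the unrounded coefficients would change the third significant figure slightly, so the rounded values are kept here to match the working in the text.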
We must take care not to estimate values too far outside the range of our sample. That is, extrapolation (estimating outside the range of the sample) will not always give reliable results, especially when the given x or y is far away from the sample values.
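One simple way to flag extrapolation is to compare a value with the smallest and largest values in the sample, as in this sketch. The function name is_interpolation and the test height of 210 cm are assumptions for illustration only.

```python
heights = [178, 174, 180, 182, 190, 195, 165, 168, 173, 175]  # sample x values, in cm


def is_interpolation(x, sample):
    """Return True if x lies inside the range of the sample."""
    return min(sample) <= x <= max(sample)


print(is_interpolation(179, heights))  # True: inside 165 cm to 195 cm
print(is_interpolation(210, heights))  # False: an estimate here is extrapolation
```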
The above technique assumes that x is the independent variable. That is, x is a variable that we can control when we run an experiment. In a chemistry experiment, for example, the concentration of a particular solution can be controlled when observing its rate of reaction with another solution.
Crawshaw and Chambers (2001) say that in a situation where neither x nor y is controllable, we may want to estimate x for a given value of y by using the regression of x on y and NOT method (b) above.
The regression of x on y is x = c + dy where
d = [∑xy − (1/n)(∑x)(∑y)] / [∑y^{2} − (1/n)(∑y)^{2}]
and c = x̄ − dȳ.
For the data in the table above, the regression of x on y is x = 136 + 0.731y (to 3 s.f.).
Exercises:
1. Calculate and confirm that the regression of x on y using the data in the above table is x = 136 + 0.731y (to 3 s.f.).
2. Use your regression of x on y to estimate the height of Elizabeth if you know that her weight is 55 kg. Answer = 176 cm.
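Answers to both exercises can be checked with a sketch like the one below, which applies the formulas for d and c to the table data. The variable names are assumptions for illustration.

```python
heights = [178, 174, 180, 182, 190, 195, 165, 168, 173, 175]  # x, in cm
weights = [53, 50, 58, 60, 70, 85, 45, 48, 51, 60]            # y, in kg

n = len(heights)
sum_x = sum(heights)
sum_y = sum(weights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_y2 = sum(y * y for y in weights)

# Regression of x on y: x = c + d*y
d = (sum_xy - sum_x * sum_y / n) / (sum_y2 - sum_y ** 2 / n)
c = sum_x / n - d * sum_y / n

print(f"{c:.3g}", f"{d:.3g}")  # 136 0.731

# Exercise 2: estimated height for a weight of 55 kg.
print(f"{c + d * 55:.3g}")     # 176
```

Method (b) gave 175 cm for the same weight, so the two regression lines give slightly different estimates.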
Crawshaw, J. and J. Chambers. A Concise Course in Advanced Level Statistics with worked examples. Cheltenham: Nelson Thornes, 2001.