Linear Regression

AIM: In this section we learn how to calculate the line of best fit mathematically. This line is also called the regression line.

  1. Recall that a straight line can be expressed in the form of
    y = mx + c
    where m is the gradient (slope) of the straight line and (0, c) is the y-intercept.

  2. In linear regression, statisticians prefer to write this equation slightly differently, using b for m and a for c. Thus,
    y = a + bx
    where (0,a) is the y-intercept and b the gradient of the line.

  3. Hence, to find the line of best fit, we need to find the values of a and b.

  4. The gradient b is calculated as follows:

    $$b = \frac{\sum xy - \frac{1}{n}(\sum x)(\sum y)}{\sum x^2 - \frac{1}{n}(\sum x)^2}$$

  5. The value of a is found from the fact that the line of best fit always passes through the point of means $(\bar{x}, \bar{y})$:
    $\bar{y} = a + b\bar{x}$
    $a = \bar{y} - b\bar{x}$, where the value of b is obtained from point 4 above.

  6. A more compact way to express the equation of the regression line of y on x is
    $y - \bar{y} = \frac{S_{xy}}{S_{xx}}(x - \bar{x})$
    where
    $S_{xy} = \frac{1}{n}\sum xy - \bar{x}\bar{y}$ and $S_{xx} = \frac{1}{n}\sum x^2 - \bar{x}^2$.
    Sxy is the covariance, Sxx is the variance of x and Sx is the standard deviation of x. Similarly, Syy is the variance of y and Sy is the standard deviation of y.

  7. Thus, b can be written as b = Sxy/Sxx.

  8. Let us look at our previous example (Excel file with Macro).

    X: height (cm)    Y: weight (kg)
    178               53
    174               50
    180               58
    182               60
    190               70
    195               85
    165               45
    168               48
    173               51
    175               60
    ∑x = 1780         ∑y = 580

    Hence $\bar{x} = 1780/10 = 178$ and $\bar{y} = 580/10 = 58$. The red spot in the diagram is the point (178, 58).

  9. From the Excel example above or a GDC, we find that n = 10, ∑xy = 104181 and ∑x² = 317612.

    $$b = \frac{104181 - \frac{1}{10}(1780)(580)}{317612 - \frac{1}{10}(1780)^2}$$

    b = 941/772 ≈ 1.22 (3 s.f.)
    The value of b is known as the regression coefficient of y on x, in our case, weight on height.

  10. Now we can calculate
    a = 58 - (941/772)(178)
    a ≈ -159 (3 s.f.)

  11. The line of best fit for this example (see diagram) is y = -159 + 1.22x.
    Note: The red spot at $(\bar{x}, \bar{y})$ always lies on the line of best fit.
    [IB likes to test whether students know that the line of best fit contains the point $(\bar{x}, \bar{y})$.]
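
The calculation in points 4 to 11 can also be checked with a short program. The following is a minimal Python sketch, not part of the original Excel example; the names heights, weights and regression_y_on_x are chosen here purely for illustration.

```python
# Heights (cm) and weights (kg) copied from the table in point 8.
heights = [178, 174, 180, 182, 190, 195, 165, 168, 173, 175]
weights = [53, 50, 58, 60, 70, 85, 45, 48, 51, 60]


def regression_y_on_x(xs, ys):
    """Return (a, b) for the line of best fit y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Gradient: [sum(xy) - (1/n) sum(x) sum(y)] / [sum(x^2) - (1/n) (sum x)^2]
    b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Intercept: the line passes through (mean of x, mean of y).
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = regression_y_on_x(heights, weights)
print(f"b = {b:.3g}, a = {a:.3g}")  # prints: b = 1.22, a = -159
```

The printed values agree with the hand calculation above to three significant figures.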

What is the use of this line of best fit equation?

According to Crawshaw and Chambers (2001), the regression line above gives us the average value of y for a given value of x. In Excel, this regression line is known as a "trend line." Knowing the equation, we can use it to predict or estimate missing values, that is, for interpolation (estimating inside the range of the sample).

(a) So, if we know that Tommy's height is 179 cm then we can predict his weight using the above equation.
y = -159 + 1.22(179)
y ≈ 59.4 kg (3 s.f.)
If Tommy's height is 179 cm then according to the best fit line his weight is 59.4 kg accurate to three significant figures.

(b) Similarly, if we know that Elizabeth's weight is 55 kg then we can predict her height.
55 = -159 + 1.22(x)
x = (55+159)/1.22
x ≈ 175 cm (3 s.f.)
If Elizabeth's weight is 55 kg then according to the best fit line her height is 175 cm accurate to three significant figures.
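
Examples (a) and (b) can be reproduced with a few lines of Python. This is only a sketch that assumes the rounded coefficients a = -159 and b = 1.22 from this example; the function names predict_weight and height_for_weight are made up for illustration.

```python
# Rounded coefficients of the line of best fit y = A + B*x from this example.
A = -159   # intercept (3 s.f.)
B = 1.22   # gradient (3 s.f.)


def predict_weight(height_cm):
    """Estimate weight (kg) from height (cm) using y = A + B*x."""
    return A + B * height_cm


def height_for_weight(weight_kg):
    """Rearrange y = A + B*x for x, as in method (b)."""
    return (weight_kg - A) / B


tommy_weight = predict_weight(179)
elizabeth_height = height_for_weight(55)
print(f"{tommy_weight:.3g} kg, {elizabeth_height:.3g} cm")  # prints: 59.4 kg, 175 cm
```

Both inputs lie inside the range of the sample (heights 165 to 195 cm, weights 45 to 85 kg), so these are interpolations.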

Is there anything that I should be aware of when using this technique?

  1. We must take care not to estimate values too far outside the range of the sample. That is, extrapolation will not always give reliable results, especially when the given x or y is far from the sample values.

  2. The above technique assumes that x is the independent variable, that is, a variable that we can control when we run an experiment. In a chemistry experiment, for example, the concentration of a particular solution can be controlled while observing its rate of reaction with another solution. Crawshaw and Chambers (2001) say that when neither x nor y is controllable, we may want to estimate x for a given value of y by using the regression of x on y and NOT method (b) above.
    The regression of x on y is x = c + dy, where

    $$d = \frac{\sum xy - \frac{1}{n}(\sum x)(\sum y)}{\sum y^2 - \frac{1}{n}(\sum y)^2}$$

    and $c = \bar{x} - d\bar{y}$.
    For the data above, the regression of x on y is x = 136 + 0.731y (to 3 s.f.).
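
The regression of x on y can be computed in the same way as y on x, with ∑y² taking the place of ∑x² in the denominator. The Python sketch below is illustrative only; the function name regression_x_on_y is an assumption and does not come from the Excel example.

```python
# Heights x (cm) and weights y (kg) from the table in the worked example.
heights = [178, 174, 180, 182, 190, 195, 165, 168, 173, 175]
weights = [53, 50, 58, 60, 70, 85, 45, 48, 51, 60]


def regression_x_on_y(xs, ys):
    """Return (c, d) for the regression line of x on y, x = c + d*y."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_yy = sum(y * y for y in ys)
    # d: same numerator as b, but divided by sum(y^2) - (1/n) (sum y)^2.
    d = (sum_xy - sum_x * sum_y / n) / (sum_yy - sum_y ** 2 / n)
    # c: this line also passes through (mean of x, mean of y).
    c = sum_x / n - d * sum_y / n
    return c, d


c, d = regression_x_on_y(heights, weights)
print(f"x = {c:.3g} + {d:.3g}y")  # prints: x = 136 + 0.731y
```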

Exercises:

1. Calculate and confirm that the regression of x on y, using the data in the above table, is x = 136 + 0.731y (to 3 s.f.).

2. Use your regression of x on y to estimate Elizabeth's height if you know that her weight is 55 kg.

Answer = 176 cm.

Reference:

Crawshaw, J. and J. Chambers. A Concise Course in Advanced Level Statistics with worked examples. Cheltenham: Nelson Thornes, 2001.