More on Linear Regression

AIMS: By the end of this section, you will be able to
(i) carry out a linear regression with Excel,
(ii) carry out a linear regression with GDC,
(iii) report the relevant information in a document, and
(iv) provide an interpretation of the results.

  1. Linear regression can be easily carried out with Excel programme. We will simply enter our data X and Y into two columns in Excel. Here we look at an example where Kelvin carried out an experiment with certain solutions. He recorded the temperature and yield from the chemical reaction.


    Temperature in Celsius (X)


    Yield in gramme (Y)

    150

    2.1

    160

    3.6

    170

    3.1

    180

    3.5

    190

    3.6

    200

    4.1

    210

    5.5

    240

    5.9

     

  2. In Excel, we can do linear regression with analysis by employing "The Analysis ToolPak."
    To access The Analysis ToolPak, click Data Analysis in the Analysis group on the Data tab. If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.

    Load the Analysis ToolPak
    1.Click the Microsoft Office Button and then click Excel Options.
    2.Click Add-ins, and then in the Manage box, select Excel Add-ins.
    3.Click Go.
    4.In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.
    Tip: If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.

    If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.

    To access The Analysis ToolPak, click Data Analysis in the Analysis group on the Data tab and select Regression (see Excel Screen 1 below). Then you select the appropriate column as your x and y as in Excel Screen 2 below. Later select the box "Label" if the first row of the above table with headings is also selected.

    Excel Screen 1.
    Excel Screen 2.

  3. Here is the Excel output.

  4. The coefficient of correlation is given as "Multiple R" in this output (r ≈ 0.928).

  5. R square is just the square of coefficient of correlation. In this example, r2 0.862.
    If r2 = 1 then our data fit the regression line perfectly and the line has positive slope. If r2 = -1 then our data fit the regression line perfectly and the line has negative slope. A no fit is suggested by r2 = 0.
    The value r2
    tell us how the regressor (X variable) explains the variation in the dependent variable (Y). Thus, r2 tells us how best these data fit a linear line.

  6.  

    Adjusted R square is used when you have two or more regressors (independent or x variables).
    Adjusted r2 = r2 - (1-r2)(k-1)/(n-k)
    where k is the number of independent variables or Xs, and n is the number of observations (also given in the output above).
    In this case, we only have one X variable so we will not need to use adjusted r
    2.

  7. The intercept value (a) is -3.50 and the coefficient for temperature (b) is 0.04.
    That is Y = -3.50 + 0.04X.

  8. We then need to consider whether -3.50 and 0.04 are significant or not. We can determine this by considering their P-values. The reported P-value in EXCEL is actually P-value = P(|calculated t stat|<t) that is the area underneath the curve at both ends of the tail. The P-value for intercept is 0.029. In another word, there is only 2.9 % chance of wrongly rejected the null hypothesis that the coefficient is zero when indeed this is the case (the null hypothesis is actually true). In statistics, we often take a value to be significant if this p-value is less than 0.05. That is, 5 % risk of wrongly rejecting the null hypothesis when the null hypothesis is true is normally considered as acceptable. A risk-taker could decide to raise this to 10% but very rarely statisticians will consider anything above 10% as acceptable. Looking at the above p-values, we see that the coefficient of intercept is not significant at 5% level that it is not significantly different from zero. The coefficient of temperature is however significant at 5% level.

    Read more about T-test for coefficient.

  9. The significance F in the ANOVA section tells us whether or not the model, in this case Y = a + β X is significant?
    For the model to be significant, we need to reject the hypothesis that
    β=0 . Statisticians will usually reject the null hypothesis that β=0 when the p-value of the F statistics is smaller than 0.05. Statisticians will sometimes consider a model of moderate significant if the p-value is less than 0.10. A p-value greater than 0.10 will almost always suggest that the linear model Y = a + β X is not acceptable. This F test is often used to test the null hypothesis that the true slope coefficients β2 = β3 = ... = βk in a linear model Y = β2+ β2X2 + β3X3 + ...+βkXk are simultaneously zero. The alternative hypothesis is the true slope coefficients are not simultaneously zero.

    In our case, this linear model is significant.

    UP

How do I do simple linear regression with GDC.

  1. Press [STAT] to obtain Screen 1 as shown here.

  2. Select 1:Edit and press [ENTER].

  3. If there are old data in list L1 and L2 then we need to clear these by the following:

    (i) Press [STAT] to return to Screen 1.
    (ii) Use the arrow to move down to 4:ClrList and press [ENTER]
    (iii) Type in [2nd][1] for List 1, [,] [2nd][2] for List 2 and [ENTER].
    (iv) Repeat step 1 and step 2 above.

  4. Enter the values of X into L1 and the values of Y into L2 as shown in Screen 2.

  5. Repeat step 1 above and use arrow to select CALC. Move down to 4:LinReg(ax+b) which stands for linear regression and press [Enter].

  6. When the prompt LinReg(ax+b) appears, press [2nd][1] [,][2nd][2][ENTER] as in the second line of Screen 7. Make sure the list contains the x values is entered first follow by the list that contains the y values. The results are displayed in Screen 4. But this does not show the values for r and r2.

  7. Press [2nd][0] for Catalog to obtain Screen 5. Press [Alpha] and a symbol A will appear as in Screen 5. Now press [x -1] to jump to the catalog list that starts with D. Move down the screen until we find “DiagnosticOn” as in Screen 6 and press [ENTER].

  8. DiagnosticOn will now show up on your screen. Press [ENTER] to execute it. Repeat step 5 above as in Screen 7.

    The results are displayed in Screen 8.


Screen 8 is sufficient for most problems that a Math Studies student will encounter.
If you are writing a research paper then I suggest that you use EXCEL rather than TI because TI calculator does not report the p-value of each coefficient.

Note: In step five, always enter the list containing the x values first follow by comma and finally the list that contains the y values into LinReg(ax+b).


    UP


How do I report my results?

  1. A diagram such as this.

  2. Report the equation

    Y =

    -3.50

    +

    0.040X

     

    p≈0.029

     

    p≈ 0.001

  3. Report r ≈ 0.928 with p < 0.005 from the table for r. Thus, our correlation is positive and significant.

  4. F and r values are usually reported to two decimal places. Having too many significant figures many not be meaningful.

  5. If your sample size is small (not more than 16) then coefficients should be reported to not more than 2 significant figures. If the sample size is greater than 100, coefficients can be reported up to 3 significant figures.

  6. If any of the p-values have “E-6” or less than we should report that as p < 0.00005.

    UP

What do these numbers mean?

  1. According to the regression results above, an increase of the temperature by one degree will increase the yield by 0.040 gramme. The coefficient of 0.040 has a p-value of 0.001 that is the coefficient is statistical significant at 0.01 % level. In another word, there is only 0.01 % chance of rejecting the hypothesis that this coefficient is no different from zero when it is actually true. Or we can say we accept a 0.01% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a two-tailed t-test with Ho: βo = 0 and the alternative is βo ≠ 0 where βo is the population estimate of the coefficient.

  2. The coefficient of intercept has a p-value of 0.029 which means that probability of rejecting the hypothesis that it is zero when the hypothesis is actually true is only 2.9%. Or we can say we accept a 2.9% chance that the coefficient is produced by random chance and is not different from zero. We are in fact considering a two-tailed t-test with Ho: βo = 0 and the alternative is βo ≠ 0 where βo is the population estimate of the coefficient. This significance is actually bad news to our model as it tells us that there are relevant factors but unknown to us and thus not included in the model. In another word, the model here with only one factor called temperature is not adequate in modeling the experiment. If time permits, a good researcher should try to figure out what other relevant factors other than temperature will influence the yield of this chemical substance. If we have a coefficient of intercept that is not statistical significance then we have included all relevant factors into our model. Thus, a good researcher should create a model in which the coefficient of the intercept is statistically insignificant, say even at 10% significance level.

  3. Refer to the note entitled T-test 2 for coefficient for more about the above testing.

    UP

How do I produce a scatter diagram with its line of best fit?

  1. Simply block your X and Y data in the Excel spreadsheet as displayed below.

  2. Go to the top of the ruler and select the icon that looks like bar chart. This is the "Chart Wizard."

  3. You will then obtain this screen. Select "XY (Scatter)" because we want to plot a scatter diagram. In the Sub-type, select the highlighted option as portrayed here.

  4. Create your graph.

  5. After the graph is created, go to the top ruler and select "Chart-Add trendline."

  6. You will then see this screen. We will select "Linear" as our scatter diagram has linear shape.

  7. After that, click the "Option" button in the above screen. In order to obtain a nice line of best fit, we will go to "Forecast" and increase both "Forward" and "Backward" from zero to 1 unit. Then, we will select "Display equation on chart" and "Display R-squared value on chart."

  8. Click OK and you will obtain a diagram similar to the one above. You can select the equation and R-square and move them to an area that is legible.

    UP

Exercises:

1. A group of students were surveyed. They were asked their IB predicted grade in Mathematics and their SAT score in Mathematics.

IB predicted grade (X)

7

6

5

5

6

6

4

7

7

SAT score in maths (Y)

800

650

600

630

560

630

560

800

760

(i) Use Excel to obtain the regression line Y on X.
(ii) Also comment on the significant of this linear model and its coefficients.

2. Mr. Chau is a factory manager. His factory produces LCD screens. He recorded the following information for 10 consecutive months.

Number of LCD screens (X) in thousands

18

36

45

22

69

72

13

33

59

79

Total production cost (Y) in thousand dollars

37

54

63

42

84

91

33

49

79

98

(i) Find the regression line in the form of Y=a+bX. Also comment the significant of these coefficients.
(ii) Give a practical interpretation of the these coefficients.

 

Answers:1.(i) Y =218 + 76 X
(ii) p=0.0999 p=0.00556 F statistics p=0.00556.
2. (i)Y=19+0.98X (in thousands) or Y =19000 + 980X 
p<0.00005 p<0.00005
The p-value of F is also less than 0.00005. (ii) The intercept represents fixed cost. That is, the factory still has to pay $19000 per month even if it is not producing a single LCD screen. The gradient 980 means that each LCD screen adds $980 to the total production cost. Since this is a linear function, the marginal cost is also $980.

UP