STAB22 section 2.4

2.82 Use the least-squares regression line given in Example 2.19. For an individual with NEA increase 143, the predicted fat gain is 3.505 − (0.00344)(143) = 3.013 kg. The actual fat gain, 3.2 kg, is bigger than the predicted one (above the line), so the residual is positive (despite what the question says): 3.2 − 3.013 = 0.187 kg.

2.84 Type the data values into a Minitab worksheet. You can do (a) and (b) in one go with a fitted line plot. Price should be your response because it is the outcome: a bigger coffee costs more. In fact, you can do most of (d) and (e) from the fitted line plot too: after you select Regression and Fitted Line Plot, click on Graphs and click on the bottom box, “residuals vs. the variables”. Double-click on Size to get a plot of residuals against size. Click OK. Also, click on Storage and check the box next to Residuals. The residuals will be stored in the worksheet so you can look at them. My fitted line plot is in Figure 1 and my residual plot is in Figure 2.

Figure 1: Fitted line plot for price vs. size

Figure 2: Plot of residuals vs. size

The relationship between size and price appears to be increasing but not linear. In Exercise 2.3 I speculated that there is a fixed cost of serving you, so that a bigger drink is more expensive only partly because of the cost of the coffee. Per ounce, a larger drink is actually less expensive. So describing this relationship with a line will not tell the whole story.

The regression line is ŷ = 2.257 + 0.08036x, where x is size and y is price. The slope is positive, so the trend is indeed upwards.

The fitted line plot suggests that the residual for the 16-oz coffee is positive (above the line) and the other two are negative (below). The residuals themselves are under RESI1 in the Minitab worksheet: -0.07, 0.11, -0.04, which do indeed sum to 0.

My residual plot is in Figure 2; it’s hard to say much with only three data points, but there appears to be a down-up-down curve in the residual plot, suggesting that the relationship is actually a curve. (With only three points, we could get this pattern just by chance; if we had, say, 10 size-price values and the pattern persisted, we could be more confident. Maybe one day Starbucks will introduce a 20-ounce “mezzo-Venti” and we’ll be able to tell then.) But with many things you buy, the more you buy, the smaller the cost per unit.

2.86 You can draw a scatterplot and mark the line on it by hand, or you can use Minitab’s own facility for drawing regression lines on scatterplots, called the Fitted Line Plot. Select Stat, Regression, Fitted Line Plot; select your response, fuel usage, and your explanatory variable, speed, and then click OK. My plot is in Figure 3. The relationship is curved, so trying to fit a line through it does a very poor job!

Figure 3: Fitted line plot for fuel usage data

Add the given residuals up to get −0.01, which is zero to within rounding error.

You can either type the given residuals into a column and plot them against speed, or you can have Minitab calculate and plot the residuals. There’s nothing new in the first way (just a bit of typing). To do the second way involves a new step. First, select Stat, Regression and Regression, and select the response variable (fuel usage) and explanatory “predictor” variable (speed). Before you click OK, click on Graphs. Look for Residuals vs. the Variables and click on the text box underneath; select the explanatory variable Speed. You’ll get a plot like Figure 4 of residuals against speed with the zero line shown. The fact that the residuals show this curved pattern is a warning sign that the actual relationship is a curve, not a straight line (indeed, any time a residual plot shows a pattern, you have something to worry about). This is hardly a surprise here, but when the curve is harder to see, it will show up on the residual plot more noticeably than it will show up elsewhere.

Figure 4: Residual plot for 2.62

2.87 Again, we can do (a) and (b) at once by using a fitted line graph (or you can do them separately by drawing the scatterplot first and then figuring out where the regression line goes and drawing it on your graph). My plot is shown in Figure 5. While weight certainly increases with age, it doesn’t seem to do so in a straight-line way: growth is rapid at the beginning, then it levels off, and then it grows faster again. The straight line doesn’t capture this pattern at all. Though we can fit a line here, it doesn’t make much sense to do so.

Figure 5: Fitted line plot for weight vs. age

The residuals are given in the data set on the disk (and in the text), so you can just plot them against age and draw in the zero line. Or you can use the Graphs option of the Regression command, just as in 2.86. Select Stat, Regression, Regression; select Weight as your response and Age as your predictor; click Graphs, and select Age into the bottom box to plot residuals against Age. You can verify that the fitted intercept and slope (in the Session window) are correct. My residual plot is in Figure 6. The clear curved pattern in the residual plot indicates that we should have fitted a curve in the first place (the curve is clearer in the residual plot than on the original scatterplot). Here we see that fitting a “good enough” line is not good enough!

The residuals as printed in the text add up to 0.01, which is zero to within rounding error. (Adding up the residuals calculated by Minitab will give you an answer closer to 0 because Minitab keeps more than two decimal places.)

2.88 Put in 28 for x to get ŷ = 0.965 − 0.00045(28) = 0.9524. The residual uses the observed y-value, 0.99 − 0.9524 = 0.0376, close enough.

To plot the residuals, first get the data for Exercise 2.80 into Minitab.
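Before making the plot, the predict-then-subtract arithmetic used here and in 2.82 can be double-checked with a few lines of code. This is only a sketch in Python (the course itself uses Minitab); the helper names `predict` and `residual` are made up for illustration, and the intercepts, slopes and observed values are the ones quoted in these solutions.

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted line y-hat = intercept + slope * x."""
    return intercept + slope * x


def residual(observed, predicted):
    """Residual = observed minus predicted (positive means above the line)."""
    return observed - predicted


# 2.82: fat gain predicted from an NEA increase of 143
fat_hat = predict(3.505, -0.00344, 143)
fat_resid = residual(3.2, fat_hat)

# 2.88: fitted line evaluated at x = 28, observed value 0.99
fen_hat = predict(0.965, -0.00045, 28)
fen_resid = residual(0.99, fen_hat)

# 2.84: the three stored coffee residuals should add up to about zero
coffee_resid_sum = sum([-0.07, 0.11, -0.04])

print(f"2.82: predicted {fat_hat:.3f}, residual {fat_resid:.3f}")
print(f"2.88: predicted {fen_hat:.4f}, residual {fen_resid:.4f}")
print(f"2.84: residual sum {coffee_resid_sum:.2f}")
```

A positive residual means the observed point sits above the fitted line, which is the situation in both 2.82 and 2.88.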
You can then either type the residual values into an empty column and plot them against days stored, or you can run the regression in Minitab and ask for a residual plot. I did the latter: Stat, Regression, Regression; select fenthion as response and days as explanatory, then click on Graphs and ask for a plot of residuals against days (select days into the bottom box). My residual plot is in Figure 7. There is a suggestion of a curved pattern, with the residuals in the middle tending to be negative and the ones at the ends tending to be more positive, but the pattern is very weak: for four of the five storage day values, two of the four residuals are positive and two negative, and for the middle value, one residual is positive and three are negative, which is not exactly an uneven split! So there is a very weak curve visible, and I do not think there is enough evidence to say that a curve should be fitted instead of a line. (The line, even though the slope is very close to 0, fits well: the R-squared is 86.7%.)

Figure 6: Residual plot for weight-age data

Figure 7: Residual plot for predicting fenthion from days stored

2.89 The work in 2.87 says that if you start from a known age, you can predict the mean weight for children of that age very accurately. But individual children will vary in weight quite a bit. For example, the mean weight of 4-month-old children is 6.3 kg, but individual children might be 6.0, 6.1, 6.5, 6.6 kg: the mean weight is 6.3, but none of the individual weights are close to that. If the regression predicts the mean weight accurately, it doesn’t have to predict individual weights equally accurately, and so r could be a lot smaller.

2.91 Mean SAT score depends on two things: year (i.e. time), and high school grade. Things are further complicated by the fact that high school grade depends on time as well (that’s what “grade inflation” is). The statement “mean SAT score has increased over time” doesn’t account for grade at all, and if grade makes a difference, we could be misled. Think about a B student 4 years ago; if that same student graduated now, he or she might be an A student because of grade inflation, but his/her SAT score would not change very much. So the 2004 A students are better (in terms of SAT scores) than 2008 A students, because an A “is not worth as much as it used to be”.

The statistical moral is that any variables that might affect the response should be included in the analysis; leaving them out can be confusing or even misleading.

2.94 You can do this question by getting a fitted line plot (to do (a) and (b)) and a plot of residuals against amount (for (c)). My plots are in Figures 8 and 9.

Figure 8: Fitted line plot for gas chromatography data

Figure 9: Residual plot for gas chromatography data

The fitted line plot appears to show that the data lie close to the straight line, but because the “response” values vary so much, it’s hard to see detail. The residual plot, however, shows deviations from a straight line much more clearly, because the residuals are on a much more “human” scale (they are all between −15 and 15). The residuals for amounts 0 and 20 are all positive and those for amounts 1 and 5 are all negative. That is, there is a curved pattern, up-down-up, and so the relationship is a curve rather than a straight line (this is very hard to tell from the original scatterplot). Also, as you go across the picture from left to right, the residuals appear to get more spread out. (If you go on to STAB27, you’ll learn that a transformation of the response values is called for, and that using, say, the logarithms of the response values might work better.)

2.96 My scatterplot is shown in Figure 10, with the outlier marked with an X. (To draw this, find the outlier in the data set, which is the last, 12th, row of values, and create a new column with only an X in that row, by typing an X into the cell in the 12th row and 3rd column. Then set up a Scatterplot in the usual way, but before you click OK, click on Labels, select Data Labels, Use Labels from Column, and select the column with the X in it.)

Figure 10: Scatterplot of silicon-isotope data

Ignoring the outlier, the trend is a moderate-to-strong negative association.

To calculate the correlation and regression line with and without the outlier, first calculate both using all the points, then replace the value 337 in row 12 by * (for “missing”) and re-do the calculations. My results are shown in Figures 11 and 12. I also used the Storage option on Regression to save both sets of Fits (fitted values), for my plot in (c).

Correlations (Pearson)
Correlation of isotope and silicon = -0.339, P-Value = 0.281

Regression Analysis
The regression equation is
silicon = - 493 - 33.8 isotope

Figure 11: Correlation and regression with the outlier

Correlations (Pearson)
Correlation of isotope and silicon = -0.787, P-Value = 0.004

Regression Analysis
The regression equation is
silicon = - 1372 - 75.5 isotope

Figure 12: Correlation and regression without the outlier

The correlation changed from −0.339 to −0.787, a pretty substantial change considering that we only removed one observation. This shows that the correlation can be strongly influenced by points not on the trend. Similarly, the intercept and slope of the regression line both change substantially (in particular, the slope becomes more negative).

Since I saved both sets of Fitted Values, now I can superimpose the fitted lines on the plot. This can be done by drawing three plots and superimposing them on one graph. Call for the three plots by putting, under Y and X in the dialog box, first Silicon and Isotope, second Fits1 (the fits from the 1st regression) and Isotope, third Fits2 (the fits from the 2nd regression) and Isotope. To get these all on one graph, click on Multiple Graphs, select Multiple Variables, and click on Overlaid on the Same Graph. There are probably more elegant ways of doing it, but this one worked. I got some points (in black) that are the scatterplot, some points (in red) that are the regression line with the outlier included, and some points (in green) that are the regression line with the outlier excluded. As you see in Figure 13, the second regression line has a much more negative slope.

Figure 13: Scatterplot of silicon-isotope data with two regression lines

If you don’t want to go to all this trouble, you can figure out where the regression lines go by hand, e.g. by finding the value of Silicon when Isotope is −21.5 and −19.5 for each line, and then joining the pairs of points by lines drawn by hand.

Finally, think about why we constructed this graph in the first place. Including or excluding the outlier makes a substantial difference to where the regression line goes; the red and green lines on the graph are quite different.

2.97 In 2.45, we found that the Insight lies on the trend, and including it makes the correlation higher. The scatterplot, with fitted line superimposed, is shown in Figure 14. (This is a Fitted Line plot.) The regression line is y = 6.78 + 1.01x, where x is city mileage and y is highway mileage.

Figure 14: Fitted line plot for gas mileage data

You can make the residual plot in two stages by storing the residuals from the regression and then plotting them, or you can do it in one go by using the Graphs option in Regression. Since we actually want to look at the residuals this time, the first way is the way to go. Select Stat, Regression and Regression, and select the response and explanatory variables. Click on Storage, then click the box next to Residuals. Click OK twice. You’ll see a new column of data in the worksheet (bottom half of the screen) called RESI1, containing the residuals. You can scan down this column to find the biggest positive value, 1.99 in row 25, and the most negative value, −2.89 in row 12. These are the BMW 330CI and the Lamborghini Murcielago. My residual plot is in Figure 15. The strange near-horizontal lines are because the gas mileages were only measured to the nearest whole number.

Figure 15: Residual plot for gas mileage data

The Insight isn’t an outlier for two reasons: it is more or less on the line, so its residual should be close to zero, and because the Insight’s x and y values are so far from the other cars, it is influential, pulling the line closer to itself than the line would otherwise go. You can try fitting the line with and without the Honda Insight to see what the effect is.

2.98 This question is small enough to do by hand. Or, if you want to use Minitab (like me), looking carefully at (a) and (b) reveals that a fitted line plot will again do the job. My fitted line plot is in Figure 16, from which you see that the observed values are very close to the line (and r² = 99.8% is very high). So a straight line does a very good job of describing the data. The regression equation (from the plot) is y = 1.77 + 0.08x, where x is speed and y is stride rate. (8.03E−2 is 8.03 × 10^−2, or 0.0803.)

Figure 16: Fitted line plot for stride rate data

For (c), we need to look at the residuals, so we can obtain them first. Select Stat, Regression and Regression; select Stride Rate as the response and Speed as explanatory. Before you click OK, click on Storage, select Fits and Residuals (by clicking on the boxes next to the words), then click OK twice. The residuals appear as an extra column in the worksheet, marked RESI1, as do the fitted values, FITS1. They are, to the accuracy shown, 0.011, −0.001, −0.001, −0.011, −0.009, 0.003, 0.009, which add up to zero to within rounding (actually 0.001). You can see that the residuals are very small compared to the stride rate values. Then simply make a plot of the residuals against speed, as shown in Figure 17. The residuals form a curved pattern (down and then up), which shows that fitting a curved relationship would do (even) better at predicting stride rate from speed. (But the fit of the line is so good that we may be happy with that.) The scatterplot doesn’t show any observations to be influential, because none of the values are a long way from the rest. (Taking out one point wouldn’t change the line very much.)

Figure 17: Residual plot for stride rate data

2.103 Think about businesses first: while having more education is associated with a higher salary, most people employed by business have only 3–5 years of education (a bachelor’s degree) and a relatively high salary. Imagine 3 people: they might have 3 years of education and a salary of $69,000, 4 years and $70,000, 5 years and $71,000. Within this group, more education means a higher salary.

Now turn to academia: economists there tend to have a PhD, but earn less money than industry-employed economists do. To make up some numbers: 6 years of education and $48,000; 7 years and $50,000; 8 years and $52,000. (Your numbers may differ; it’s the pattern that’s important.)

My scatterplot is shown in Figure 18. Within each group, more education means more income, but overall the higher salaries go to those with less education. (The correlation in my picture is −0.813.) Salary depends on both education and location of employment, so to consider the influence of only one of those on salary is missing the whole picture.

Figure 18: Income-education scatterplot

2.104 Note that the values of x are the same for the first three data sets, so this set of x-values appears only once in the data set on the disk.

The four correlations are all 0.816, and all four regressions are ŷ = 3 + 0.5x. (b) can be answered by drawing fitted line plots in the four cases. See Figures 19, 20, 21 and 22. Data set 1 looks very reasonable to fit with a regression line. Data set 2 is a clear curve. Data set 3 has an outlier in the y-direction (the second-last point), and in data set 4, the whole strength of the relationship is formed by the influential point (x-outlier) on the right. So only for data set 1 does it make sense to use a regression line, and doing so in the other three cases would be misleading.

Figure 19: Data set 1

Figure 20: Data set 2

Figure 21: Data set 3

Figure 22: Data set 4

If you think about it, it is really quite a feat to dream up four data sets with the same correlations, intercepts and slopes, and yet for which a regression line is appropriate to such different degrees.
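Going back to the made-up salary numbers in 2.103, the claim that the correlation is positive within each group but negative overall can be checked numerically. The sketch below is in Python rather than Minitab, and `pearson_r` is an illustrative helper written from the usual definition of the correlation, not anything from the course software. It uses only the six education/salary pairs quoted in 2.103.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up numbers from 2.103: years of education and salary in dollars
business_years = [3, 4, 5]
business_salary = [69000, 70000, 71000]
academic_years = [6, 7, 8]
academic_salary = [48000, 50000, 52000]

r_business = pearson_r(business_years, business_salary)
r_academic = pearson_r(academic_years, academic_salary)
r_overall = pearson_r(business_years + academic_years,
                      business_salary + academic_salary)

print(f"business only: r = {r_business:.3f}")
print(f"academia only: r = {r_academic:.3f}")
print(f"both together: r = {r_overall:.3f}")
```

Each group on its own lies exactly on an upward line, so both within-group correlations are 1, while pooling the two groups gives a correlation of about −0.813: ignoring where people work reverses the apparent effect of education.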