OUTLIERS, HIGH LEVERAGE POINTS, AND INFLUENTIAL OBSERVATIONS (2022)

Next, we discuss the role of three types of observations that may or may not exert undue influence on the regression results: (1) outliers, (2) high leverage points, and (3) influential observations. An outlier is an observation that has a very large standardized residual in absolute value. Consider the scatter plot of nutritional rating against sugars in Figure 2.3. The two observations with the largest absolute residuals are identified as All-Bran Extra Fiber and 100% Bran. Note that the vertical distance away from the regression line (indicated by the vertical arrows) is greater for these two observations than for any other cereals, indicating the largest residuals. For example, the nutritional rating for All-Bran Extra Fiber (93.7) is much higher than predicted (59.44) based on its sugar content alone (0 grams). Similarly, the nutritional rating for 100% Bran (68.4) is much higher than would have been estimated (44.93) based on its sugar content alone (6 grams).

Residuals may have different variances, so it is preferable to use the standardized residuals to identify outliers. Standardized residuals are residuals divided by their standard error, so that they are all on the same scale. Let $s_{i,\text{resid}}$ denote the standard error of the ith residual. Then

$$s_{i,\text{resid}} = s\sqrt{1 - h_i}$$

where $s$ is the standard error of the estimate and $h_i$ refers to the leverage of the ith observation (see below). The standardized residual is

$$\text{residual}_{i,\text{standardized}} = \frac{y_i - \hat{y}_i}{s_{i,\text{resid}}}$$

A rough rule of thumb is to flag observations whose standardized residuals exceed 2 in absolute value as being outliers. For example, note from Table 2.7 that Minitab identifies observations 1 and 4 as outliers based on their large standardized residuals; these are All-Bran Extra Fiber and 100% Bran. In general, if the residual is positive, we may say that the y-value observed is higher than the regression estimated given the x-value. If the residual is negative, we may say that the y-value observed is lower than the regression estimated given the x-value.

Figure 2.3 Identifying the outliers in regression of nutritional rating on sugars. (Plot annotation: Outlier: All-Bran Extra Fiber, sugars: 0.)
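The standardized residual and the rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions made here, and the example numbers (s = 1.71741, leverage 0.0909, observed 20 versus fitted 16.364) are the ones reported in Table 2.9 of this section.

```python
import math

def standardized_residual(residual, s, leverage):
    """Residual divided by its standard error s * sqrt(1 - h_i)."""
    return residual / (s * math.sqrt(1.0 - leverage))

def is_outlier(std_resid, cutoff=2.0):
    """Rough rule of thumb: flag |standardized residual| > cutoff."""
    return abs(std_resid) > cutoff

# Values reported in Table 2.9 for the hiker with time = 5, distance = 20.
r = standardized_residual(residual=20 - 16.364, s=1.71741, leverage=0.0909)
print(round(r, 2), is_outlier(r))
```

The cutoff is left as a parameter because 2 is described only as a rough rule of thumb.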

A high leverage point is an observation that is extreme in the predictor space. In other words, a high leverage point takes on extreme values for the x-variable(s), without reference to the y-variable. That is, leverage takes into account only the x-variables and ignores the y-variable. The term leverage is derived from the physics concept of the lever, which Archimedes asserted could move the Earth itself if only it were long enough. The leverage $h_i$ for the ith observation may be denoted as follows:

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$

For a given data set, the quantities $1/n$ and $\sum (x_i - \bar{x})^2$ may be considered to be constants, so that the leverage for the ith observation depends solely on $(x_i - \bar{x})^2$, the squared distance between the value of the predictor and the mean value of the predictor.

The farther an observation lies from the mean of the observations in the x-space, the greater its leverage. The lower bound on leverage values is 1/n, and the upper bound is 1.0. An observation with leverage greater than about 2(m + 1)/n or 3(m + 1)/n may be considered to have high leverage (where m indicates the number of predictors).
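As a minimal sketch of the leverage formula and the two cutoffs, the Python below computes the leverage of every observation for a single predictor. The function names and the list of hiking times are assumptions chosen for illustration (ten ordinary values plus one extreme value); they are not taken from a table in this section.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2), one predictor."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xj - mean) ** 2 for xj in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]

def high_leverage_cutoffs(n, m=1):
    """Rule-of-thumb cutoffs 2(m + 1)/n and 3(m + 1)/n."""
    return 2 * (m + 1) / n, 3 * (m + 1) / n

# Hypothetical predictor values: ten typical hiking times plus one extreme value.
times = [2, 2, 3, 4, 4, 5, 6, 7, 8, 9, 16]
h = leverages(times)
low_cut, high_cut = high_leverage_cutoffs(len(times))
for t, h_i in zip(times, h):
    flag = "high leverage" if h_i > low_cut else ""
    print(t, round(h_i, 4), flag)
```

With one predictor the leverages always sum to m + 1 = 2, which is why the cutoffs are expressed as multiples of (m + 1)/n, the average leverage.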

For example, in the orienteering example, suppose that there was a new observation, a real hard-core orienteering competitor, who hiked for 16 hours and traveled 39 kilometers. Figure 2.4 shows the scatter plot, updated with this eleventh hiker.

Note from Figure 2.4 that the time traveled by the new hiker (16 hours) is extreme in the x-space, as indicated by the horizontal arrows. This is sufficient to identify this observation as a high leverage point without reference to how many kilometers he or she actually traveled. Examine Table 2.8, which shows the updated regression results for the 11 hikers. Note that Minitab points out correctly that this is an unusual observation. It is unusual because it is a high leverage point. However, Minitab is not, strictly speaking, correct to call it an observation with large influence. To see what we mean by this, let’s next discuss what it means to be an influential observation.

Figure 2.4 Scatter plot of distance versus time, with new competitor who hiked for 16 hours. (Axes: time, 0 to 16, horizontal; distance, 10 to 40, vertical.)

SPHJWDD006-02 JWDD006-Larose November 23, 2005 14:50 Char Count= 0

CHAPTER 2 REGRESSION MODELING

TABLE 2.8 Updated Regression Results Including the 16-Hour Hiker

The regression equation is
distance = 5.73 + 2.06 time

Predictor     Coef  SE Coef      T      P
Constant    5.7251   0.6513   8.79  0.000
time       2.06098  0.09128  22.58  0.000

S = 1.16901   R-Sq = 98.3%   R-Sq(adj) = 98.1%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  696.61  696.61  509.74  0.000
Residual Error   9   12.30    1.37
Total           10  708.91

Unusual Observations

Obs  time  distance     Fit  SE Fit  Residual  St Resid
 11  16.0    39.000  38.701   0.979     0.299    0.47 X

X denotes an observation whose X value gives it large influence.

The hard-core orienteering competitor is a high-leverage point. (Courtesy: Chantal Larose).

In the context of history, what does it mean to be an influential person? A person is influential if his or her presence or absence changes the history of the world significantly. In the context of Bedford Falls (It’s a Wonderful Life), George Bailey discovers that he really was influential when an angel shows him how different (and poorer) the world would have been had he never been born. Similarly, in regression, an observation is influential if the regression parameters alter significantly based on the presence or absence of the observation in the data set.

TABLE 2.9 Regression Results Including the Person Who Hiked 20 Kilometers in 5 Hours

The regression equation is
distance = 6.36 + 2.00 time

Predictor    Coef  SE Coef     T      P
Constant    6.364    1.278  4.98  0.001
time       2.0000   0.2337  8.56  0.000

S = 1.71741   R-Sq = 89.1%   R-Sq(adj) = 87.8%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  216.00  216.00  73.23  0.000
Residual Error   9   26.55    2.95
Total           10  242.55

Unusual Observations

Obs  time  distance     Fit  SE Fit  Residual  St Resid
 11  5.00    20.000  16.364   0.518     3.636    2.22R

R denotes an observation with a large standardized residual.

An outlier may or may not be influential. Similarly, a high leverage point may or may not be influential. Usually, influential observations combine the characteristics of a large residual and high leverage. It is possible for an observation to be not quite flagged as an outlier and not quite flagged as a high leverage point, but still be influential through the combination of the two characteristics.

First, let’s consider an example of an observation that is an outlier but is not influential. Suppose that we replace our eleventh observation (no more hard-core guy) with someone who hiked 20 kilometers in 5 hours. Examine Table 2.9, which presents the regression results for these 11 hikers. Note from Table 2.9 that the new observation is flagged as an outlier (unusual observation with large standardized residual). This is because the distance traveled (20 kilometers) is higher than the regression predicted (16.364 kilometers) given the time (5 hours). Now would we consider this observation to be influential? Overall, probably not. Compare Tables 2.9 and 2.6 to assess the effect the presence of this new observation has on the regression coefficients. The y-intercept changes from $b_0 = 6.00$ to $b_0 = 6.36$, but the slope does not change at all, remaining at $b_1 = 2.00$ regardless of the presence of the new hiker.
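The with-and-without comparison can be sketched as follows. The ten original hikers are not listed in this section, so the `time` and `dist` values below are an assumption: they were chosen because they reproduce the reported coefficients ($b_0 = 6.00$, $b_1 = 2.00$ without the new hiker). The helper name `fit_line` is likewise illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# ASSUMPTION: illustrative data for the ten original hikers (not given in this
# section), consistent with the reported fit distance = 6.00 + 2.00 time.
time = [2, 2, 3, 4, 4, 5, 6, 7, 8, 9]
dist = [10, 11, 12, 13, 14, 15, 20, 18, 22, 25]

b0_without, b1_without = fit_line(time, dist)
b0_with, b1_with = fit_line(time + [5], dist + [20])  # add the 5-hour, 20-km hiker
print("without:", round(b0_without, 3), round(b1_without, 3))
print("with:   ", round(b0_with, 3), round(b1_with, 3))
```

Refitting with and without a single observation is the most direct check of influence; the diagnostics discussed next summarize the same idea without requiring a refit for each observation.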

Figure 2.5 shows the relatively mild effect that this outlier has on the estimated regression line, shifting it vertically a small amount without affecting the slope at all. Although it is an outlier, this observation is not influential because it has very low leverage, being situated exactly on the mean of the x-values, so that it has the minimum possible leverage for a data set of size n = 11. We can calculate the leverage for this observation (x = 5, y = 20) as follows. Since $\bar{x} = 5$, we have

$$\sum (x_i - \bar{x})^2 = (2-5)^2 + (2-5)^2 + (3-5)^2 + \cdots + (9-5)^2 + (5-5)^2 = 54$$

Figure 2.5 The mild outlier shifts the regression line only slightly.

Then

$$h_{(5,20)} = \frac{1}{11} + \frac{(5-5)^2}{54} = 0.0909$$

Now that we have the leverage for this observation, we may also find the standardized residual, as follows. First, we have the standard error of the residual:

$$s_{(5,20),\text{resid}} = 1.71741\sqrt{1 - 0.0909} = 1.6375$$

so that the standardized residual is

$$\text{residual}_{(5,20),\text{standardized}} = \frac{y_i - \hat{y}_i}{s_{(5,20),\text{resid}}} = \frac{20 - 16.364}{1.6375} = 2.22$$

as shown in Table 2.9.

Cook’s distance measures the level of influence of an observation by taking into account both the size of the residual and the amount of leverage for that observation. Cook’s distance takes the following form for the ith observation:

$$D_i = \frac{(y_i - \hat{y}_i)^2}{(m+1)s^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

where $y_i - \hat{y}_i$ represents the ith residual, m the number of predictors, s the standard error of the estimate, and $h_i$ the leverage of the ith observation. The left-hand ratio in the formula for Cook’s distance contains an element representing the residual, and the right-hand ratio contains functions of the leverage. Thus, Cook’s distance combines the two concepts of outlier and leverage into a single measure of influence. The value of the Cook’s distance measure for the hiker who traveled 20 kilometers in 5 hours is as follows:

$$D_i = \frac{(20 - 16.364)^2}{(1+1)\,1.71741^2} \cdot \frac{0.0909}{(1 - 0.0909)^2} = 0.2465$$
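The formula for Cook’s distance translates directly into code. The sketch below keeps the residual ratio and the leverage ratio as separate factors to mirror the formula; the function name is an assumption, and the inputs are the values reported in Table 2.9 for the hiker who traveled 20 kilometers in 5 hours (with the leverage taken as exactly 1/11).

```python
def cooks_distance(residual, s, leverage, m=1):
    """D_i = residual^2 / ((m + 1) * s^2) * h_i / (1 - h_i)^2."""
    residual_part = residual ** 2 / ((m + 1) * s ** 2)   # left-hand ratio
    leverage_part = leverage / (1.0 - leverage) ** 2     # right-hand ratio
    return residual_part * leverage_part

# Inputs from Table 2.9: observed 20, fitted 16.364, s = 1.71741, n = 11.
d = cooks_distance(residual=20 - 16.364, s=1.71741, leverage=1 / 11)
print(round(d, 4))
```

Because the leverage factor grows quickly as $h_i$ approaches 1, even a small residual can produce a noticeable Cook’s distance at a high leverage point.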

A rough rule of thumb is to regard an observation as influential if its Cook’s distance exceeds 1.0. More accurately, one may also compare the Cook’s distance against the percentiles of the F-distribution with (m, n − m) degrees of freedom. If the observed value lies within the first quartile of this distribution (lower than the 25th percentile), the observation has little influence on the regression; however, if the Cook’s distance is greater than the median of this distribution, the observation is influential. For this observation, the Cook’s distance of 0.2465 lies within the 37th percentile of the $F_{1,10}$ distribution, indicating that while the influence of the observation is not negligible, neither is the observation particularly influential.
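The percentile comparison can be sketched with the standard library alone for the case used here, where the numerator degrees of freedom equal 1. The sketch relies on the fact that an F variable with (1, ν) degrees of freedom is the square of a t variable with ν degrees of freedom, and integrates the t density numerically; this shortcut does not apply when m > 1, where a statistics library would be the more natural choice. All function names are assumptions, and the "in between" label is added here only because the text does not name the range between the 25th and 50th percentiles.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def f_cdf_one_numerator_df(x, df2, steps=2000):
    """P(F <= x) for F with (1, df2) degrees of freedom.

    Uses F(1, df2) = T(df2)^2, so P(F <= x) = P(|T| <= sqrt(x)), and
    integrates the t density with Simpson's rule (steps must be even).
    """
    upper = math.sqrt(x)
    h = upper / steps
    total = t_pdf(0.0, df2) + t_pdf(upper, df2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df2)
    return 2 * total * h / 3  # factor 2 covers both tails of |T|

def influence_label(percentile):
    """Quartile and median guideline described in the text."""
    if percentile < 0.25:
        return "little influence"
    if percentile > 0.50:
        return "influential"
    return "in between"

p = f_cdf_one_numerator_df(0.2465, df2=10)
print(round(p, 2), influence_label(p))
```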

What about the hard-core hiker we encountered earlier? Was that observation influential? Recall that this hiker traveled 39 kilometers in 16 hours, providing the eleventh observation in the results reported in Table 2.8. First, let’s find the leverage.

We have n = 11 and m = 1, so that observations having $h_i > 2(m+1)/n = 0.36$ or $h_i > 3(m+1)/n = 0.55$ may be considered to have high leverage. This observation has $h_i = 0.7007$, which indicates that this durable hiker does indeed have high leverage, as mentioned with reference to Figure 2.4. This figure seems to indicate that this hiker (x = 16, y = 39) is not, however, an outlier, since the observation lies near the regression line. The standardized residual supports this, having a value of 0.46801.

The reader will be asked to verify these values for leverage and standardized residual in the exercises.

Finally, the Cook’s distance for this observation is 0.2564, which is about the same as our previous example, indicating that the observation is not particularly influential, although not completely without influence on the regression coefficients.

Figure 2.6 shows the slight change in the regression with (solid line) and without (dashed line) this observation. So we have seen that an observation that is an outlier with low leverage, or an observation that is a high leverage point with a small residual, may not be particularly influential. We next illustrate how a data point that has a moderately high residual and moderately high leverage may indeed be influential.

Figure 2.6 Slight change in the regression line when the hard-core hiker is added. (Axes: time, 0 to 16, horizontal; distance, 10 to 40, vertical.)

TABLE 2.10 Regression Results from a New Observation with Time = 10, Distance = 23

The regression equation is
distance = 6.70 + 1.82 time

Predictor    Coef  SE Coef      T      P
Constant   6.6967   0.9718   6.89  0.000
time       1.8223   0.1604  11.36  0.000

S = 1.40469   R-Sq = 93.5%   R-Sq(adj) = 92.8%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  254.79  254.79  129.13  0.000
Residual Error   9   17.76    1.97
Total           10  272.55

Suppose that our eleventh hiker had instead hiked for 10 hours and traveled 23 kilometers. The regression analysis for the 11 hikers is given in Table 2.10. Note that Minitab does not identify the new observation as either an outlier or a high leverage point. This is because, as the reader is asked to verify in the exercises, the leverage of this new hiker is $h_i = 0.36019$ and the standardized residual equals −1.70831.

However, despite lacking either a particularly large leverage or a large residual, this observation is nevertheless influential, as measured by its Cook’s distance of $D_i = 0.821457$, which is in line with the 62nd percentile of the $F_{1,10}$ distribution. The influence of this observation stems from the combination of its moderately large residual and its moderately large leverage.

Figure 2.7 Moderate residual plus moderate leverage = influential observation. (Axes: time, 1 to 10, horizontal; distance, 10 to 26, vertical.)
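The three candidate eleventh hikers discussed in this section can be compared side by side with one short script. It is a sketch under stated assumptions: the ten original hikers are not listed in this section, so the `time` and `dist` values are illustrative data chosen to be consistent with the regression outputs reported above, and the function name `diagnostics` is hypothetical.

```python
import math

def diagnostics(x, y, i, m=1):
    """Leverage, standardized residual, and Cook's distance for observation i."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]
    # Standard error of the estimate, with n - m - 1 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - m - 1))
    h = 1.0 / n + (x[i] - x_bar) ** 2 / sxx
    std_resid = residuals[i] / (s * math.sqrt(1.0 - h))
    cook = residuals[i] ** 2 / ((m + 1) * s ** 2) * h / (1.0 - h) ** 2
    return h, std_resid, cook

# ASSUMPTION: illustrative data for the ten original hikers (not given in this section).
time = [2, 2, 3, 4, 4, 5, 6, 7, 8, 9]
dist = [10, 11, 12, 13, 14, 15, 20, 18, 22, 25]

candidates = {"5 h, 20 km": (5, 20), "16 h, 39 km": (16, 39), "10 h, 23 km": (10, 23)}
results = {}
for label, (t, d) in candidates.items():
    # Append the candidate as the eleventh observation (index 10).
    results[label] = diagnostics(time + [t], dist + [d], i=len(time))
    h, r, cook = results[label]
    print(f"{label}: h = {h:.4f}, std resid = {r:.2f}, Cook's D = {cook:.4f}")
```

Under these assumptions the script reproduces the pattern described in the text: a large residual with minimal leverage, a very high leverage with a small residual, and a moderate value of each that yields the largest Cook’s distance of the three.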

FAQs

What is the difference between outliers, influential points, and high leverage points?

In short: An outlier is a data point whose response y does not follow the general trend of the rest of the data. A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. An influential point is one whose presence or absence substantially changes the fitted regression.

Can an outlier be an influential observation?

An outlier can be either influential or non-influential. If the outlier is an influential observation, then it has a big impact on the correlation coefficient, r, and on the least squares regression line. When there is a lot of data, an outlier tends not to be influential.

Does an outlier have high leverage?

Not necessarily. Outliers and high leverage observations are distinct, and each can affect a regression analysis differently. An observation can also be both an outlier and a high leverage point, so it is important to know how to detect each.

How do you tell if an outlier is an influential point?

An outlier is influential if it greatly affects the fitted regression line, for example its slope. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

What is the difference between an outlier and an influential observation?

An outlier is a point with a large residual. An influential point is a point that has a large impact on the regression. These are not the same thing: a point can be an outlier without being influential.

What is the difference between a leverage point and an influence point?

A leverage point is an observation that has an unusual predictor value (very different from the bulk of the observations). An influence point is an observation whose removal from the data set would cause a large change in the estimated regression model coefficients.

How do you identify influential observations?

If the predictions are the same with or without the observation in question, then the observation has no influence on the regression model. If the predictions differ greatly when the observation is not included in the analysis, then the observation is influential.

What are high leverage points in regression?

Simply put, high leverage points in linear regression are those with extremely unusual independent variable values in either direction from the mean (large or small). Such points are noteworthy because they have the potential to exert considerable “pull”, or leverage, on the model's best-fit line.

How do you identify influential points?

To check for influential points, three possible methods are scatter plots, partial plots, and Cook's distances. Simple scatter plots display the values of each independent variable plotted against the dependent variable.

How do you know if you have high leverage?

Compare each observation's leverage with the average leverage, which is the number of estimated parameters divided by the sample size. An observation whose leverage is more than two to three times this average is considered a high leverage point.


Should you always remove the outliers and the high leverage points from your dataset?

It's bad practice to remove data points simply to produce a better fitting model or statistically significant results. If the extreme value is a legitimate observation that is a natural part of the population you're studying, you should leave it in the dataset.

What is an influential point, and how should influential points be treated?

An influential point is an observation whose presence or absence has a large effect on the regression analysis. If the data have one or more influential points, perform the regression analysis with and without these points and comment on the differences.



Should you remove influential observations?

Not automatically. If an observation is influential, the coefficients change noticeably when it is removed. It is therefore helpful to identify the most influential observations and examine them before deciding how to treat them.


Does an influential point have to have leverage?

Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point.

Do high leverage points affect slope?

Not necessarily. A point has high leverage when it lies far from the rest of the data horizontally. If it also follows the linear trend of the rest of the data, it does not affect the estimate of the slope.

What is an influential observation in a linear regression setting?

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation. In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.

What do outliers on a scatter plot indicate?

On a scatter plot, the outliers are the points that lie farthest from the regression line. Note that outliers for a scatter plot are very different from outliers for a boxplot.

What are outliers in regression analysis?

In regression analysis, an outlier is an observation for which the residual is large in magnitude compared to other observations in the data set. The detection of outliers and influential points is an important step of the regression analysis.

What does a high leverage point mean?

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables.







Detect and solve issues of outliers, leverage and influential observations with Python.

As a result, removing a leverage value from the dataset will have an impact on the OLS line.. As a result, just a few observations with high leverage may result in questionable model fit.. Let’s look at the Boston Housing dataset and see if we can find outliers, leverage values and influential observations.. In the next block, I wanted to show how to obtain studentized residuals, Cook’s Distances, DFFITS and leverage values one by one.. Now that we identified outliers, we need to see which observations can be considered to have leverage values.. Now that we identified some outliers and leverage values, let’s bring them together to identify observations with significant influence.. Indeed, when an observation is both an outlier and has high leverage, it will surely impact the regression line as a result of influencing regression coefficients.. Many people use three times the mean of Cook’s D as a cutoff for an observation deemed influential.. DFFITS is also designed to identify influential observations with a cutoff value of 2*sqrt(k/n).. Unlike Cook’s Distances, DFFITS can be both positive and negative, but a value close to 0 is desired as these values would have no influence on the OLS regression line.. DFITS can be either positive or negative, with numbers close to zero corresponding to the points with small or zero influence.. Cook’s Distances and DFFITS are general measures of influence, while DFBETAS are variable specific.. First, I removed the observations that were deemed influential and fitted an OLS model.. The model fit improved even further and the R squared value increased compared to the initial fit and the fit based on variables without influential observations.. The conclusion is that model fit can be improved by identifying and removing outliers, observations with high leverage and influential observations.

A previous article discusses how to interpret regression diagnostic plots that are produced by SAS regression procedures such as PROC REG.

The first uses a DATA step and a formula to identify influential observations.. These are not the only regression diagnostic plots that can identify influential observations:. The following DATA step adds a quadratic effect to the Sashelp.Thick data and also adds a variable that is used in a subsequent section to merge the data with the Cook's D and leverage statistics.. The traditional way is to use the OUTPUT statement in PROC REG to output the. statistics, then identify the observations by using the same cutoff values that are shown in the diagnostic plots.. For example, the following DATA step lists the observations whose Cook's D statistic exceeds the cutoff value 4/ n ≈ 0.053.. Many SAS programmers use ODS OUTPUT to save a table to a SAS data set, but the same technique enables you to save the data underlying any ODS graph.. Notice that the CookOut data set includes a variable named Observation, which you can use to merge the CookOut data and the original data.. From the structure of the CookOut data set, you can infer that the influential observations are those for which the CooksDLabel variable is nonmissing (excepting the fake observations at the top of the data).. Therefore, the following DATA step merges the output data sets and the original data.. In the same DATA step, you can create other useful variables, such as a binary variable that indicates which observations have a large Cook's D statistic:. Merge with the original data by using the Observation variable.. For the RSOut data set, the indicator variable is named RsByLevIndex, which has the value 1 for ordinary observations and the value 2, 3, or 4 for influential observations.. However, an alternative technique is to take advantage of the fact that SAS can create graphs that label the outliers and influential observations.. To visualize the noteworthy observations, you can merge the original data and the statistics, indicator variables, and label variables.

Identifying Influential Data Points and Improving Linear Regression Models Using the Statsmodels Package

These data points are known as influential points. We noted in the previous paragraph that influential data points affect the predictive power of linear regression models. An observation can have a high residual and high leverage and may or may not be an influential point. Now that we have identified the observations that have high residuals (the outliers), we can apply a criterion to determine which observations have high leverage. Some sources would agree that influential data points are both outliers and have high leverage. We can implement this for this exercise, but in reality, even just qualifying as an outlier or as a high leverage point may be enough for an observation to be an influential point. This plot will help us see which specific variables contribute significant influence to observations. Now that we have information on the possible influential data points, let us remove them and try to improve the predictive capacity and fit of our models. Model adjusted for influential points identified by DFFITS (index 3). Let us compare our three models and check whether our adjustments improved the predictive capacity of the model. For the dataset above, removing influential points using DFFITS resulted in the best fit among the models that we have generated. However, if the purpose of the model is to identify those extreme, influential instances, for example loan defaults, removing these points will prevent the model from learning which features lead to them.

With experimental data, you commonly have to deal with "outliers", that is, data points that behave differently than the rest of the data for some reason. These outliers can influence the analysis and thus the interpretation of the data. In this blog post, we will look at these outliers and …

A point can be none, one, or both of these. In this tutorial, we will use a data set based on an example in Field, Miles, and Field (2012). Now that we already have some suspicion about this particular data point, let's see if this point has (a) leverage, (b) discrepancy, and (c) influence. Luckily, you don't have to calculate all hat-values by hand, as R provides a convenient hatvalues function that can be called on any linear model. What would influence plots look like for these data?
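To make the hat-values concrete, here is a minimal Python sketch of what a function like hatvalues returns in the special case of one predictor plus an intercept. The data are made up, and the 2p/n cutoff is an assumed rule of thumb that the text above does not state.

```python
def hat_values(x):
    """Leverage h_i of each point in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # h_i grows with the squared distance of x_i from the mean of x
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last one lies far from the rest
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
h = hat_values(x)

p = 2                    # fitted parameters: intercept and slope
cutoff = 2 * p / len(x)  # assumed rule of thumb, not taken from the text
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
print(high_leverage)
```

Leverage depends only on the predictor values, so no response variable is needed. The hat-values always sum to the number of fitted parameters, which is why cutoffs are usually expressed as a multiple of p/n.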

Model Analyst at The USAA.

Only when an observation has high leverage and is an outlier in terms of its Y-value will it strongly influence the regression line. Outliers that fall horizontally away from the center of the cloud are called leverage points. High leverage points that actually influence the slope of the regression line are called influential points. In order to determine if a point is influential, visualize the regression line with and without the point. A good leverage point is a point that is unusually large or small among the X values but is not a regression outlier. Said another way, a bad leverage point is a regression outlier that has an X value that is an outlier among X values as well (it is relatively far removed from the regression line). As you can observe in the plot above, Nevada (28th observation) and Rhode Island (39th observation) are the states detected as potential outliers. Unusual observations typically have large residuals, but not necessarily so: high leverage observations can have small residuals because they pull the line towards them. Cook's distance, leverages, and Mahalanobis distance can be effective for finding influential cases when a single outlier exists, but can fail if there are two or more outliers.
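The "with and without the point" check can also be done numerically rather than visually. The sketch below is a minimal Python illustration on made-up data. The helper name fit_line and the choice of suspect point are assumptions for this example.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: nine points near y = 2x and one point far to the right
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0]

suspect = 9  # zero-based index of the point under suspicion
b0_all, b1_all = fit_line(x, y)
b0_wo, b1_wo = fit_line(x[:suspect] + x[suspect + 1:],
                        y[:suspect] + y[suspect + 1:])

slope_change = b1_wo - b1_all
print(f"slope with point: {b1_all:.3f}, without: {b1_wo:.3f}")
```

A large value of `slope_change` relative to the slope itself suggests the point is influential. What counts as large is a judgment call that depends on the application.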

INFERENCE IN REGRESSION - OUTLIERS, HIGH LEVERAGE POINTS, AND INFLUENTIAL OBSERVATIONS

However, are we sure that there is no linear relationship between the variables? The prediction interval for a random value of the response variable given a particular value of the predictor. On the other hand, if β1 takes on any conceivable value other than zero, a linear relationship of some kind exists between the response and the predictor. Much of our regression inference in this chapter is based on this key idea: that the linear relationship between x and y depends on the value of β1. x, so regression inference about the slope β1 is based on this sampling distribution of b1. where s is the standard error of the estimate, reported in the regression results. The regression equation is Rating = 59.4 - 2.42 Sugars. Under “SE Coef” is found the value of s_b1, the standard error of the slope.

Some suspicious points are often wrongly reduced to outliers. Other types along with their detection tools deserve attention.

High Leverage: With this high x-value, the red point might be in the long tail of the distribution. High Leverage: As it is the main influential point, a variation in the observed response strongly affects the predicted response => YES. An influential point appears to be both an outlier and a high leverage point. For example, high leverage points are not simply extreme x-values but might be a singular combination of x-values that makes these points far from other observations in the subspace of the explanatory variables. Outliers are, by definition, high-residual points. Studentized Residuals vs Predicted Observations with a threshold of 2 or 3 (in absolute value) to label outliers. (Moreover, we should see a uniform distribution because, theoretically, the covariance between the residual vector and the predicted response vector is null.) Studentized Residuals vs Observation Id, also with a threshold of 2 or 3. A straightforward idea could be to remove each point, refit the model, and analyze discrepancies between the “full” model and the model without the removed point. (β_(i) is the OLS estimate without the i-th observation.) Equality (2) shows directly the link between influential points, high leverage points, and outliers. It confirms what we previously said: if an observation is a high leverage outlier, it is certainly an influential point too.
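The remove-and-refit idea can be sketched in a few lines of Python. The example below uses made-up data and one predictor. It compares the brute-force change in each fitted value with the standard closed form h_i e_i / (1 - h_i), which depends only on the residual e_i and the leverage h_i. This closed form is a well-known identity and is not necessarily the equality (2) referred to above, which is not reproduced in this excerpt.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def refit_changes(x, y):
    """Brute force: drop each point, refit, and return yhat_i - yhat_i(i)."""
    b0, b1 = fit_line(x, y)
    out = []
    for i in range(len(x)):
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append((b0 + b1 * x[i]) - (c0 + c1 * x[i]))
    return out

def closed_form_changes(x, y):
    """Same quantity from a single fit: h_i * e_i / (1 - h_i)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    out = []
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage
        e = yi - (b0 + b1 * xi)              # ordinary residual
        out.append(h * e / (1 - h))
    return out

# Hypothetical data: nine points near y = 2x and one point far to the right
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0]

brute = refit_changes(x, y)
closed = closed_form_changes(x, y)
print(max(abs(a - b) for a, b in zip(brute, closed)))
```

The two lists agree up to floating-point error. This is why a point needs both a nonzero residual and appreciable leverage to move its own fitted value.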

CiteSeerX - Scientific documents that cite the following paper: Unmasking outliers and leverage points: a confirmation.

...which are based on very robust parameter estimation have been developed for the purpose of identifying masked multiple outliers in regression models with independent errors (see, e.g., Atkinson 1986; Fung 1993; Hawkins and McLachlan 1997). Similar techniques are not available in the case of spatially autocorrelated observations, however. By Marie Ng, Rand R. Wilcox. Numerous multivariate robust measures of location have been proposed and many have been found to be unsatisfactory in terms of their small-sample efficiency. In cardiac imaging applications, researchers used the segmentation and the contour after the segmentation to evaluate the image quality [29,30]. ...l observations. Single-perturbation diagnostics can suffer from masking effects (see Riani and Atkinson (2001)). To avoid the masking effect, Rousseeuw and van Zomeren (1990) proposed to compute distance, based on robust estimates of location and covariance, to detect outliers in a multivariate point cloud. However, Fung (1993) pointed out that the high-breakdown robust estimation method, and the least median of squares and minimum volume ellipsoid methods proposed by Rousseeuw and van Zomeren (1990) tend to declare too many observations as extreme. Outliers are observations that deviate from the factor model, not from the center of the data cloud. Any analysis working directly with the normal distribution ML may not identify truly outlying cases due to a masking effect (Fung 1993; Rousseeuw and van Zomeren 1990), as illustrated in the previous section. ...mallest volume or smallest covariance determinant that encompasses at least half of the data, and use the corresponding means and covariance matrix to detect outliers.

In Section 3.4 we discussed

Recall from Section 3.4.2 that semistudentized residuals
\begin{align*}
e_{i}^{*} & =\frac{e_{i}}{\sqrt{MSE}}\qquad\qquad\qquad(3.3)
\end{align*}
can be used to determine if an observation is an outlier. With the hat matrix
\begin{align*}
{\bf H} & ={\bf X}\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}^{\prime}\qquad(4.24)
\end{align*}
we can now present the true variance of $e_{i}$:
\begin{align*}
Var\left[e_{i}\right] & =\sigma^{2}\left(1-h_{ii}\right)\qquad(5.3)
\end{align*}
where $h_{ii}$ is the $i$th diagonal element of ${\bf H}$. Thus the studentized residual is
\begin{align*}
r_{i} & =\frac{e_{i}}{\sqrt{MSE\left(1-h_{ii}\right)}}\qquad(5.5)
\end{align*}
A rule of thumb is that any $r_{i}$ greater than 3 in absolute value should be considered an outlier with respect to $Y$.

The estimated variance of the deleted residual $d_{i}$ is
\begin{align*}
s^{2}\left[d_{i}\right] & =\frac{MSE_{(i)}}{1-h_{ii}}\qquad(5.8)
\end{align*}
where $MSE_{(i)}$ is the $MSE$ of the fit with $Y_{i}$ removed.

We can identify outliers with respect to the predictor variables with the help of the hat matrix ${\bf H}$. In multiple regression, we still do not want to extrapolate but now we cannot just think in terms of the range of each individual predictor variable. The variable thigh had a range of values from about 42 to about 58.5. This is done by using the vector of $X$ values you want to predict at, ${\bf X}_{new}$, and then including it in the hat matrix as
\begin{align*}
h_{new,new} & ={\bf X}_{new}^{\prime}\left({\bf X}^{\prime}{\bf X}\right)^{-1}{\bf X}_{new}\qquad(5.13)
\end{align*}
If the value of $h_{new,new}$ is much larger than the leverage values in the data set, then it indicates extrapolation.

Once we have identified observations that are outlying with respect to $Y$ or with respect to predictor variables, or both, we now want to know just how much each one affects the fitted regression model. We will determine which are influential by seeing what happens to the fitted values when that observation is deleted, as we did with the deleted residuals above. We can see how much influence an observation has on a single fitted value $\hat{Y}_{i}$ by examining
\begin{align*}
DFFITS_{i} & =\frac{\hat{Y}_{i}-\hat{Y}_{i\left(i\right)}}{\sqrt{MSE_{(i)}h_{ii}}}\qquad(5.14)
\end{align*}
As before, a new regression fit is not needed to obtain $DFFITS_{i}$. So an influential observation can have (1) a large residual $e_{i}$ and a moderate leverage $h_{ii}$, (2) a large leverage $h_{ii}$ and a moderate residual $e_{i}$, or (3) both a large residual and a large leverage. Regardless, let's just fit the model to all three variables for this example.
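The studentized residual (5.5) and $DFFITS_{i}$ (5.14) can both be computed from a single fit. The following is a minimal Python sketch, assuming one predictor and made-up data rather than the data set used in this section. It relies on the standard identities $MSE_{(i)}=\left(SSE-e_{i}^{2}/(1-h_{ii})\right)/(n-p-1)$ and $\hat{Y}_{i}-\hat{Y}_{i(i)}=h_{ii}e_{i}/(1-h_{ii})$, which are not stated in the text above.

```python
import math

def studentized_and_dffits(x, y):
    """Studentized residuals r_i (5.5) and DFFITS_i (5.14) without refitting."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei * ei for ei in e)
    mse = sse / (n - p)
    r, dffits = [], []
    for xi, ei in zip(x, e):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage h_ii
        r.append(ei / math.sqrt(mse * (1 - h)))
        # MSE of the fit with observation i deleted (standard identity)
        mse_del = (sse - ei * ei / (1 - h)) / (n - p - 1)
        # Numerator of (5.14) is h * e_i / (1 - h)
        dffits.append(h * ei / (1 - h) / math.sqrt(mse_del * h))
    return r, dffits

# Hypothetical data: nine points near y = 2x and one point far to the right
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0]

r, dffits = studentized_and_dffits(x, y)
for i, (ri, di) in enumerate(zip(r, dffits)):
    print(i, round(ri, 2), round(di, 2))
```

In this made-up data the last point pulls the line toward itself, so its studentized residual stays below the rule-of-thumb value of 3 while its $DFFITS$ is by far the largest in absolute value. This illustrates case (2) above, a large leverage with only a moderate residual.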

Videos

1. Outlier, High Leverage and Influential Observation
(Bæ Ibr)
2. What's the difference between an outlier and a leverage point in regression?
(Phil Chan)
3. Multiple Linear Regression: Outlier, Leverage, and Influential Points
(BIOS 6611)
4. 5 2 4 Outliers and Influential Points
(R Backman)
5. Outliers and Influential Points | Statistics and Probability | Chegg Tutors
(Chegg)
6. SAS - Finding Outliers, Influence, and Leverage Points
(Krohn - Education)

Article information

Author: Frankie Dare

Last Updated: 08/08/2022
