Identifying Heteroscedasticity in Data Analysis – Part 1

  1. Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
        1. In this case, our regression line is: Diabetes = β0 + β1 * Obesity + ε.
            1. β0 is the intercept, representing the predicted value of diabetes when obesity is zero.
            2. β1 is the slope, representing the change in diabetes for a one-unit change in obesity.
            3. ε is the error term (residual), representing the difference between the actual and predicted values of diabetes.
  2. The linear model for these two features, diabetes and obesity, explains about 14% of the total variation in the data (R² ≈ 0.14); a Python fitting sketch follows this list.
  3. As we can see from the above plot, the data points are clustered at the right end, which suggests the presence of heteroscedasticity. This can be confirmed with a residual plot, shown below.
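The following is a minimal sketch, in Python with statsmodels, of how a model like this could be fit. The obesity and diabetes arrays are synthetic stand-ins (the actual dataset used in this post is not included here), with the noise variance deliberately tied to obesity so the example exhibits heteroscedasticity.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
obesity = rng.uniform(15, 45, size=200)      # hypothetical predictor values
noise = rng.normal(0, 0.5 * obesity)         # error variance grows with obesity
diabetes = 2.0 + 0.8 * obesity + noise       # hypothetical response

X = sm.add_constant(obesity)                 # adds the intercept column (beta_0)
model = sm.OLS(diabetes, X).fit()

print(model.params)                          # [beta_0, beta_1]
print(f"R-squared: {model.rsquared:.2f}")    # share of total variation explained
```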

  1. The most important thing to look for in a residual plot is how the spread of the residuals changes across the predicted values (a plotting sketch follows this list).
  2. As we can see, the spread or variance of the residuals is not constant across the range of predicted values.
  3. The residuals cluster together as the predicted values increase, and we can also see outliers: individual points that deviate significantly from the rest.
  4. Heteroscedasticity is problematic because it violates one of the assumptions of linear regression (constant error variance): the usual standard errors become unreliable, reducing the precision of your coefficient estimates and the validity of confidence intervals and hypothesis tests.
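Here is a sketch of the residual plot described above, reusing the fitted model from the earlier snippet: predicted values on the x-axis, residuals on the y-axis. A fan- or cluster-shaped spread, rather than a constant band around zero, is the visual signature of heteroscedasticity.

```python
import matplotlib.pyplot as plt

# Reuses `model` from the fitting sketch above.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")   # reference line at zero residual
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. predicted values")
plt.show()
```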

To check for heteroscedasticity more formally, we can use diagnostic tools and statistical tests such as residual plots, the Breusch-Pagan test, the White test, or the Goldfeld-Quandt test. I will cover these in my next post.
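As a quick preview before that post, this is one way the Breusch-Pagan test could be run with statsmodels, again reusing the fitted model from the sketch above; a small p-value is evidence against constant error variance.

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# The test regresses the squared residuals on the model's explanatory variables.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan LM statistic: {lm_stat:.2f}, p-value: {lm_pvalue:.4f}")
```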

Thank you !!!
