Diagnostic-tests

Introduction

When analysing multiple linear regression models, statisticians rely on the Gauss-Markov theorem, under which the OLS estimators of the regression are the best linear unbiased estimators (BLUE). The theorem rests on 5 assumptions: homoskedasticity, linearity, exogeneity, random sampling and non-collinearity.

AFR provides:

2 tests for detecting heteroscedasticity:

Breusch-Pagan Test
Goldfeld-Quandt Test

3 tests for detecting multicollinearity and autocorrelation:

VIF test
Durbin Watson Test
Breusch-Godfrey Test

4 tests for assessing normality:

Shapiro-Wilk test
Kolmogorov-Smirnov test
Cramer-von Mises test
Anderson-Darling test

Heteroskedasticity

One of the assumptions made about the errors in OLS regression is that they all have the same (unknown) variance. This is known as constant variance or homoskedasticity, and it is one of the 5 Gauss-Markov assumptions. When this assumption is violated, the problem is known as heteroskedasticity. It is tested with the Breusch-Pagan and Goldfeld-Quandt tests.

Breusch-Pagan Test

The Breusch-Pagan test was introduced by Trevor Breusch and Adrian Pagan in 1979. It tests for heteroskedasticity in a linear regression model by checking whether the variance of the errors depends on the values of the independent variables. The original test assumes that the error terms are normally distributed. The null hypothesis states that the error variances are constant.

library(AFR)
model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
bp(model)
#> Homoskedasticity presents. Please use other tests additionally.In case of opposite results study the case further.
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  
#> BP = 7.1366, p-value = 0.1288
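
The mechanics of the studentized statistic can be sketched in base R. This is only an illustration under stated assumptions: it uses the built-in longley data set as a stand-in for macroKZ, and the names fit, aux and bp_stat are hypothetical. It is not necessarily how bp() is implemented.

```r
# Stand-in data: the built-in longley data set, not macroKZ.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)

# Auxiliary regression: squared residuals on the original regressors.
aux_data <- longley
aux_data$e2 <- residuals(fit)^2
aux <- lm(e2 ~ GNP + Unemployed + Armed.Forces, data = aux_data)

# LM statistic: n * R^2 of the auxiliary regression. Under the null of
# constant variance it is approximately chi-squared, with degrees of
# freedom equal to the number of regressors (here 3).
n <- nobs(fit)
k <- length(coef(aux)) - 1
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = k, lower.tail = FALSE)
c(BP = bp_stat, p.value = bp_pval)
```

A small p-value would suggest that the squared residuals vary systematically with the regressors, that is, heteroskedasticity.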

Goldfeld-Quandt Test

The Goldfeld-Quandt test is used in regression analysis to test for heteroskedasticity. It compares the error variances of two subgroups of the sample: one with high values and one with low values. The null hypothesis is that the error variances are constant (homoskedasticity). If the variances of the two subgroups differ significantly, the test rejects this null hypothesis.

model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
gq(model)
#> Heteroskedasticity presents. Please use others tests additionally.In case of opposite results study the case further.
#> 
#>  Goldfeld-Quandt test
#> 
#> data:  
#> GQ = 2.4832, p-value = 0.04853
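
The comparison of two subgroups can be sketched in base R as follows. The assumptions are loud ones: longley stands in for macroKZ, the sample is ordered by GNP and split in the middle with no observations dropped, and all object names are hypothetical. The split used by gq() may differ.

```r
# Stand-in data, ordered by a variable suspected of driving the variance.
dat <- longley[order(longley$GNP), ]
n <- nrow(dat)
half <- floor(n / 2)
low <- dat[seq_len(half), ]
high <- dat[(half + 1):n, ]

# Fit the same model separately on each subgroup.
f <- Employed ~ GNP + Unemployed + Armed.Forces
fit_low <- lm(f, data = low)
fit_high <- lm(f, data = high)

# GQ statistic: ratio of the two residual variance estimates.
# deviance() returns the residual sum of squares for an lm fit.
gq_stat <- (deviance(fit_high) / df.residual(fit_high)) /
  (deviance(fit_low) / df.residual(fit_low))

# One-sided p-value from the F distribution (variance larger in "high").
gq_pval <- pf(gq_stat, df.residual(fit_high), df.residual(fit_low),
              lower.tail = FALSE)
c(GQ = gq_stat, p.value = gq_pval)
```

Under homoskedasticity the ratio should be close to 1, so a large value is evidence against constant variance.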

Non-collinearity

Multiple regression assumes that the independent variables are not highly correlated with each other. This assumption is checked using Variance Inflation Factor (VIF) values. This section also covers the Durbin-Watson and Breusch-Godfrey tests, which check a separate property: whether the residuals are autocorrelated.

VIF Test

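The VIF of a regressor is defined as VIF = 1/T, where T = 1 - R^2 is the tolerance and R^2 is the coefficient of determination from regressing that regressor on all the other regressors. As a rule of thumb, VIF > 5 indicates that multicollinearity may be present, and VIF > 10 is usually taken as a sign of serious multicollinearity among the variables.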

model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
vif_reg(model)
#>       imp       exp      poil    eurkzt 
#>  2.210820 11.889882 12.805468  1.839555 
#> This value 11.889882363203 exceeds acceptable threshold
#> This value 12.8054676116014 exceeds acceptable threshold
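
The definition of the VIF can be reproduced by hand in base R. The sketch below assumes the built-in longley data set as a stand-in for macroKZ; predictors and vif_manual are hypothetical names, and this is not necessarily how vif_reg() computes its values.

```r
# Stand-in regressors from the built-in longley data set.
predictors <- c("GNP", "Unemployed", "Armed.Forces")

# For each regressor: regress it on the others and apply
# VIF = 1 / (1 - R^2), where 1 - R^2 is the tolerance.
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = longley)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual
```

Each value is at least 1, and it grows as the regressor becomes better explained by the remaining regressors.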

Durbin-Watson Test

The Durbin-Watson (DW) statistic tests for first-order autocorrelation in the residuals of a statistical model or regression analysis. It always takes a value between 0 and 4. A value close to 2 indicates that no autocorrelation is detected in the sample. Values below 2 point to positive autocorrelation, and values above 2 point to negative autocorrelation.

model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
dwtest(model)
#> 
#>  Durbin-Watson test
#> 
#> data:  model
#> DW = 2.2522, p-value = 0.6468
#> alternative hypothesis: true autocorrelation is greater than 0
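
The statistic itself is simple enough to compute directly from the residuals. The sketch below assumes the built-in annual longley data set as a stand-in for macroKZ and uses hypothetical names (e, dw_stat, rho_hat); it computes only the statistic, not the p-value.

```r
# Stand-in data: longley is annual, so the row order is a time order.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)
e <- residuals(fit)

# DW = sum of squared successive differences / sum of squared residuals.
dw_stat <- sum(diff(e)^2) / sum(e^2)

# First-order autocorrelation estimate; DW is roughly 2 * (1 - rho_hat),
# which is why values near 2 correspond to no autocorrelation.
rho_hat <- sum(e[-1] * e[-length(e)]) / sum(e^2)
c(DW = dw_stat, approx = 2 * (1 - rho_hat))
```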

Breusch-Godfrey Test

Alternatively, the Breusch-Godfrey test can be used to check for autocorrelation. It tests for serial correlation that has not been included in the proposed model structure. If such correlation is present, conclusions drawn from other tests may be incorrect and the estimates of the model parameters may be sub-optimal. The null hypothesis states that there is no autocorrelation.

model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
bg(model)
#> Residuals are not autocorrelated
#> 
#>  Breusch-Godfrey test for serial correlation of order up to 1
#> 
#> data:  
#> LM test = 1.8218, p-value = 0.1771
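
An order-1 version of the test can be sketched in base R through an auxiliary regression. The assumptions here are that longley stands in for macroKZ, that the missing first lagged residual is set to 0, and that the names aux_data, e_lag1 and bg_stat are hypothetical. The exact computation inside bg() may differ.

```r
# Stand-in data: the built-in annual longley data set.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)

# Residuals and their first lag (the presample value is set to 0).
aux_data <- longley
aux_data$e <- residuals(fit)
aux_data$e_lag1 <- c(0, head(aux_data$e, -1))

# Auxiliary regression: residuals on the regressors and the lagged residual.
aux <- lm(e ~ GNP + Unemployed + Armed.Forces + e_lag1, data = aux_data)

# LM statistic: n * R^2, approximately chi-squared with 1 degree of
# freedom (one lag) under the null of no autocorrelation.
n <- nobs(fit)
bg_stat <- n * summary(aux)$r.squared
bg_pval <- pchisq(bg_stat, df = 1, lower.tail = FALSE)
c(LM = bg_stat, p.value = bg_pval)
```

Unlike the Durbin-Watson statistic, this construction extends to higher orders by adding further lags of the residuals to the auxiliary regression.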

Normality

Normality refers to a specific statistical distribution called a normal distribution, or sometimes the Gaussian distribution or bell-shaped curve. The normal distribution is a symmetrical continuous distribution defined by the mean and standard deviation of the data.

In the AFR package, 4 normality tests are combined in a single function, ols_test_normality, from the olsrr package. The tests are applied to the residuals of the fitted model.

model <- lm(real_gdp ~ imp + exp + poil + eurkzt, data = macroKZ)
ols_test_normality(model)
#> -----------------------------------------------
#>        Test             Statistic       pvalue  
#> -----------------------------------------------
#> Shapiro-Wilk              0.9607         0.1886 
#> Kolmogorov-Smirnov        0.1388         0.4037 
#> Cramer-von Mises          3.7692         0.0000 
#> Anderson-Darling          0.5918         0.1167 
#> -----------------------------------------------

Shapiro-Wilk statistic

The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. If the p-value is greater than the chosen alpha level, the null hypothesis (that the data came from a normally distributed population) cannot be rejected.
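
The decision rule can be illustrated with shapiro.test() from base R applied to regression residuals. This sketch assumes the built-in longley data set as a stand-in for macroKZ and an alpha level of 0.05; sw and reject_normality are hypothetical names.

```r
# Stand-in data: the built-in longley data set, not macroKZ.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)

# Shapiro-Wilk test on the residuals of the fitted model.
sw <- shapiro.test(residuals(fit))

# Decision rule at an assumed alpha level of 0.05.
alpha <- 0.05
reject_normality <- sw$p.value < alpha
c(W = unname(sw$statistic), p.value = sw$p.value)
```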

Kolmogorov-Smirnov statistic

The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case). In the output above, the p-value (0.4037) is greater than 0.05, so we do not reject the null hypothesis of normality.
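
The one-sample distance can be computed by hand and compared with ks.test() from base R. The sketch assumes the built-in longley data set as a stand-in for macroKZ and a normal reference distribution whose mean and standard deviation are estimated from the residuals. With estimated parameters the reported p-value is only approximate. All object names are hypothetical.

```r
# Stand-in data: residuals from a model on the built-in longley data set.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)
x <- sort(residuals(fit))
n <- length(x)

# Reference CDF evaluated at the ordered residuals.
F_ref <- pnorm(x, mean = mean(x), sd = sd(x))

# D = largest gap between the empirical step function and the reference
# CDF, checked just before and just after each jump.
i <- seq_len(n)
d_manual <- max(i / n - F_ref, F_ref - (i - 1) / n)

# The same statistic from the built-in test.
ks <- ks.test(x, "pnorm", mean(x), sd(x))
c(manual = d_manual, ks.test = unname(ks$statistic))
```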

Cramer-von Mises test

As an alternative to the Kolmogorov-Smirnov test, the Cramer-von Mises statistic measures the mean squared difference between the empirical and hypothesized cumulative distribution functions. It is also used as part of other algorithms, such as minimum distance estimation. The Cramér–von Mises test is distribution-free if the hypothesized distribution is continuous and the sample has no ties. Otherwise, the statistic does not follow its usual asymptotic distribution.
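
The statistic can be sketched in base R using the standard computing formula W2 = 1/(12n) + sum((F(x_i) - (2i - 1)/(2n))^2) over the ordered sample. The assumptions are that longley stands in for macroKZ and that the hypothesized distribution is a normal with mean and standard deviation estimated from the residuals. Only the statistic is computed, since a p-value requires tables adjusted for estimated parameters. The names F_hyp and w2 are hypothetical.

```r
# Stand-in data: residuals from a model on the built-in longley data set.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)
x <- sort(residuals(fit))
n <- length(x)

# Hypothesized normal CDF at the ordered residuals.
F_hyp <- pnorm(x, mean = mean(x), sd = sd(x))

# Cramer-von Mises statistic: squared distances between the hypothesized
# CDF and the midpoints of the empirical CDF steps.
i <- seq_len(n)
w2 <- 1 / (12 * n) + sum((F_hyp - (2 * i - 1) / (2 * n))^2)
w2
```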

Anderson-Darling test

The Anderson-Darling test is used to test whether a sample of data comes from a population with a specific distribution. Here the null hypothesis is that the data follow a normal distribution, and the alternative hypothesis is that they do not. The decision whether or not to reject the null hypothesis is based on the p-value.
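
The statistic can be sketched in base R using the standard computing formula A2 = -n - (1/n) * sum((2i - 1) * (log F(x_i) + log(1 - F(x_(n+1-i))))) over the ordered sample. As before, longley stands in for macroKZ, the normal parameters are estimated from the residuals, no p-value is computed, and the names F_hyp and a2 are hypothetical.

```r
# Stand-in data: residuals from a model on the built-in longley data set.
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)
x <- sort(residuals(fit))
n <- length(x)

# Hypothesized normal CDF at the ordered residuals.
F_hyp <- pnorm(x, mean = mean(x), sd = sd(x))

# Anderson-Darling statistic; rev(F_hyp)[i] equals F at x_(n+1-i).
# Compared with Cramer-von Mises, it puts more weight on the tails.
i <- seq_len(n)
a2 <- -n - mean((2 * i - 1) * (log(F_hyp) + log(1 - rev(F_hyp))))
a2
```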

For additional information, please refer to:

Wooldridge, Jeffrey M. 2012. Introductory Econometrics: A Modern Approach, Fifth Edition.

Hyndman, Rob J and George Athanasopoulos. 2018. Forecasting: Principles and Practice, 2nd Edition.