Statistics for everyone: ch.12 Linear regression

easier R than SPSS with Rcmdr : Contents

ch.12 Linear regression

Data is ‘swiss’.

We look at infant mortality rates in 47 Swiss provinces, together with other variables that may be related to infant mortality. (The data set has column names across the top and row names down the left side.)
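For readers who prefer the R console to the Rcmdr menus, the data can be inspected with a few lines like the following. This is only a sketch; it assumes the built-in `swiss` data set from R's `datasets` package, which is the data named above.

```r
# Load the built-in 'swiss' data set (ships with base R).
data(swiss)

# Number of rows (provinces) and columns (variables).
dim(swiss)

# Column names across the top, row names (province names) on the left.
head(swiss)
```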

If the dependent variable is a continuous variable, you can use linear regression.

A variable that may affect or explain the dependent variable is called an explanatory variable. Let’s choose just one for now, so the simplest model has two variables: one dependent and one explanatory. How does Agriculture affect infant mortality?

First, we show the diagnostic plots.

The first, ‘Residuals vs Fitted’, plots the values fitted by the regression on the x-axis and the residuals on the y-axis. It is mainly used to check linearity and the equal variance (homoscedasticity) of the residuals.

The second, ‘Normal Q-Q’, shows the normality of the residuals. The closer the points are to a straight line, the closer the residuals are to a normal distribution.

The third, ‘Scale-Location’, plots the fitted values on the x-axis and the square root of the absolute standardized residuals on the y-axis. Like the first figure, it is used to check equal variance.

The fourth, ‘Residuals vs Leverage’, plots the leverage on the x-axis and the standardized residuals on the y-axis. The red dashed lines are contours of Cook’s distance, and a point lying beyond them has a large influence on the fit. A point that has a large impact on the regression analysis is called an ‘Influential Point’.

Linear regression relies on several assumptions, and the diagnostic plots are how we check whether those assumptions hold.
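The four diagnostic plots described above can also be drawn from the console. The sketch below assumes the model is `Infant.Mortality ~ Agriculture` on the `swiss` data, and uses base R's `lm()` and `plot()`; the object name `fit1` is only illustrative, and the Rcmdr menu may generate slightly different commands.

```r
# Simple linear regression: infant mortality explained by Agriculture.
fit1 <- lm(Infant.Mortality ~ Agriculture, data = swiss)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
op <- par(mfrow = c(2, 2))
plot(fit1)
par(op)  # restore the previous plot layout

# Numeric companion to the fourth plot: the three largest Cook's distances.
cd <- cooks.distance(fit1)
head(sort(cd, decreasing = TRUE), 3)
```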

When there are only two variables, a scatterplot with a fitted straight line is suitable. The two variables do not seem to have much to do with each other.

The slope of this straight line is the ‘Estimate’. The slope is -0.0078, which is almost zero, and the p value is also large. So both the numbers and the graph show that Agriculture and Infant.Mortality have little relationship. However, when there are multiple variables, it is not easy to see the relationships from a graph.
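The scatterplot and the slope discussed above can be reproduced roughly as follows. This is a sketch using base R functions; the object names `fit1` and `coefs` are illustrative.

```r
# Fit the two-variable model.
fit1 <- lm(Infant.Mortality ~ Agriculture, data = swiss)

# Scatterplot with the fitted straight line on top.
plot(Infant.Mortality ~ Agriculture, data = swiss)
abline(fit1)

# Coefficient table: the 'Estimate' column holds the intercept and the slope,
# and 'Pr(>|t|)' holds the p values.
coefs <- coef(summary(fit1))
coefs["Agriculture", c("Estimate", "Pr(>|t|)")]
```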

Now let’s put all the remaining variables in as explanatory variables.

The same diagnostic plots appear, and they can be interpreted in the same way.

Since the p value on the bottom line is < 0.05, the regression as a whole is judged to be statistically significant. Among the variables, Fertility, which is marked with **, is statistically significant. An increase of 1 in Fertility is associated with an increase of 0.15097102 in Infant.Mortality, with the other variables held fixed.

As in the title of the diagnostic plots earlier, the top of this result shows a formula of the following form:

‘Result variable ~ explanator1 + explanator2 + explanator3 + ......’

Formulas like this will come up often, so it’s good to get used to them.

Each Estimate is a slope, which shows how much the result variable changes when that explanatory variable increases by 1. Catholic is close to zero, around 0.000067, and Fertility is on the large side at 0.15.
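In R code, the formula form above becomes the first argument of `lm()`. The following sketch fits the model with all five remaining variables written out explicitly; the object names `full` and `est` are illustrative.

```r
# Result variable ~ explanatory1 + explanatory2 + ...
full <- lm(Infant.Mortality ~ Fertility + Agriculture + Examination +
             Education + Catholic, data = swiss)

# Full output: formula at the top, coefficient table with significance stars,
# and the overall F test with its p value on the bottom line.
summary(full)

# Just the slopes (and the intercept).
est <- coef(full)
est
```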

Let’s check and run ‘Stepwise selection~’ at the bottom.

It first fits a model with all the variables, then uses AIC to decide which single variable to remove, and fits the model again.

It continues removing variables one at a time.

In the end, only two variables remain, Education and Fertility.
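A similar backward selection by AIC can be sketched with base R's `step()`. This is an assumption about how to reproduce the result from the console; the Rcmdr menu may call a different function internally, so the printed steps may not look identical.

```r
# Start from the model with all explanatory variables.
full <- lm(Infant.Mortality ~ Fertility + Agriculture + Examination +
             Education + Catholic, data = swiss)

# Backward elimination: at each step, drop the variable whose removal
# lowers AIC the most, and stop when no removal lowers AIC further.
sel_aic <- step(full, direction = "backward")

# Formula of the model that remains at the end.
formula(sel_aic)
```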

Akaike’s Information Criterion (AIC) is named after the Japanese statistician Hirotugu Akaike. It adds a penalty for every extra variable in the model. The lower the AIC, the better the model. The Bayesian information criterion (BIC) is a close relative of AIC that gives a larger penalty when you add a variable. AICc is a variant of AIC that corrects for a small number of samples. Mallows’s Cp, created by Colin Mallows, is another closely related criterion. We will use the two most representative ones here, AIC and BIC.
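To make the penalties concrete, AIC and BIC can be computed by hand from the log-likelihood and compared with R's built-in `AIC()` and `BIC()`. The model used here (Fertility and Education) is only an example, and the object names are illustrative.

```r
fit <- lm(Infant.Mortality ~ Fertility + Education, data = swiss)

ll <- logLik(fit)       # maximized log-likelihood
k  <- attr(ll, "df")    # number of estimated parameters (coefficients + residual variance)
n  <- nobs(fit)         # number of observations

# AIC charges 2 per parameter; BIC charges log(n) per parameter.
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

c(AIC = AIC(fit), AIC_manual = aic_manual)
c(BIC = BIC(fit), BIC_manual = bic_manual)
```

With 47 observations, log(n) is larger than 2, which is why BIC tends to keep fewer variables than AIC.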

Now we use BIC to remove variables one by one in the same way. Only ‘Fertility’ is left.
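With base R's `step()`, BIC can be used by changing the penalty argument `k` from its default of 2 to log(n). As before, this is a sketch and not necessarily the exact command the Rcmdr menu runs.

```r
full <- lm(Infant.Mortality ~ Fertility + Agriculture + Examination +
             Education + Catholic, data = swiss)

# k = log(n) turns the criterion into BIC.
# Note: the printed column is still labelled "AIC" even though BIC is used.
sel_bic <- step(full, direction = "backward", k = log(nrow(swiss)))

# Variables that remain in the final model.
attr(terms(sel_bic), "term.labels")
```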

The third method removes variables by p value: at each step it removes the variable with the largest p value.

In the end, only ‘Fertility’ remains.
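Removal by p value can be sketched as a small loop. This is a hand-written illustration, not the code behind the Rcmdr menu; the cut-off `alpha = 0.05` and the use of `drop1()` with an F test are assumptions made for the example.

```r
fit <- lm(Infant.Mortality ~ Fertility + Agriculture + Examination +
            Education + Catholic, data = swiss)
alpha <- 0.05  # assumed cut-off for keeping a variable

repeat {
  labels <- attr(terms(fit), "term.labels")
  if (length(labels) == 0) break          # nothing left to remove

  # p value for dropping each variable from the current model.
  tab <- drop1(fit, test = "F")
  p <- tab[labels, "Pr(>F)"]

  if (max(p) < alpha) break               # every remaining variable is significant

  # Remove the variable with the largest p value and refit.
  worst <- labels[which.max(p)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

# Variables that survive the elimination.
attr(terms(fit), "term.labels")
```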

We can use whichever of the three methods I have introduced is appropriate, or we can also include variables that we think are clinically important, to decide on the most appropriate final model.
