Statistics for everyone: ch.24 Principal Component Analysis (PCA)

easier R than SPSS with Rcmdr : Contents

Let’s practice with the States data set from the carData package.

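The data set holds education statistics for each U.S. state. Here’s a rough look at the variables:

● SATV: Average verbal SAT score of graduating high school students

● SATM: Average math SAT score of graduating high school students

● percent: Percentage of graduating high school students who took the SAT

● dollars: State spending on public education per student (in $1000s)

● pay: Average teacher’s salary (in $1000s)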
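If you prefer to look at the data from the script window first, something like the following should work. It is only a sketch: it assumes the carData package is already installed, and the choice of the five columns simply mirrors the list above.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")

# List every column in the data set together with its type
str(States)

# The five variables described above (this selection is an assumption)
vars <- c("SATV", "SATM", "percent", "dollars", "pay")

# Basic descriptive statistics for those variables
summary(States[, vars])
```

Note that the column names are case-sensitive: percent, dollars and pay are written in lowercase in the data set.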

Let’s select ‘Principal Component Analysis’ from the ‘Original menu’.

Select all variables.

Let’s check every box under ‘Options’ as well. The last option, ‘Add principal~’, leads to the following:

Here you decide how many components to add.

As a result, PC1 and PC2 are added to the data set as new variables.
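The menu steps above roughly correspond to the script below. This is a sketch, not the exact script Rcmdr generates: it assumes the five variables described earlier were the ones selected (add pop to the formula if you selected it too), and it uses princomp() with cor = TRUE, so the analysis is based on the correlation matrix.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")

# PCA on the correlation matrix; the variable selection is an assumption
.PC <- princomp(~ SATV + SATM + percent + dollars + pay,
                cor = TRUE, data = States)

print(unclass(loadings(.PC)))  # component loadings
print(.PC$sdev^2)              # component variances
print(summary(.PC))            # importance of components

# Add the scores of the first two components to the data set
States$PC1 <- .PC$scores[, 1]
States$PC2 <- .PC$scores[, 2]
head(States[, c("PC1", "PC2")])
```

Because the analysis uses the correlation matrix, the component variances add up to the number of variables (5 in this sketch).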

The output shows the characteristics of each component. The second part of the output above, the ‘Component variances’, is graphed as follows:

If you look at the ‘screeplot’ above, the variances of components 2 and 3 are much smaller than that of the 1st component. Scree is the rubble that piles up at the foot of a mountainside, so the idea is to keep the components on the steep slope and drop those on the flat ground. If the flat part starts at Comp.2, only Comp.1 is kept as a principal component. If you judge that the flat part starts at Comp.4, then the components up to Comp.3 are kept.
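The scree plot can also be drawn directly from the script window. The sketch below assumes the same five variables as in the list at the top of this chapter; screeplot() is the base R function for princomp results, and type = "lines" is just an alternative display that makes the "slope versus flat ground" reading easier.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")

# Variable selection is an assumption; adjust it to match your own dialog choices
.PC <- princomp(~ SATV + SATM + percent + dollars + pay,
                cor = TRUE, data = States)

# Default scree plot: one bar per component variance
screeplot(.PC)

# Same information as a line chart, which shows the "elbow" more clearly
screeplot(.PC, type = "lines")
```

Where the flat part begins is a judgment call, so two readers may reasonably keep a different number of components.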

If you look at the ‘Proportion of Variance’ row under ‘Importance of components’, you can see that Comp.1 is the largest at 0.6516807, followed by Comp.2 at 0.1560831 and Comp.3 at 0.1485106.

In other words, the 1st principal component accounts for about 65% of the total variance.
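The ‘Proportion of Variance’ is simply each component variance divided by the sum of all component variances. Here is a small sketch of that calculation. It assumes the five variables listed at the top of this chapter, so the numbers it prints may not match the figures quoted above if a different set of variables was selected.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")

# Variable selection is an assumption
.PC <- princomp(~ SATV + SATM + percent + dollars + pay,
                cor = TRUE, data = States)

comp_var <- .PC$sdev^2             # component variances
prop <- comp_var / sum(comp_var)   # proportion of variance
cum <- cumsum(prop)                # cumulative proportion

round(rbind(Proportion = prop, Cumulative = cum), 4)

# Adding up the three proportions quoted in the text
reported <- c(0.6516807, 0.1560831, 0.1485106)
sum(reported)  # 0.9562744
```

So, going by the figures quoted in the text, the first three components together cover about 95.6% of the total variance.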

Meanwhile, to draw a biplot, copy and paste the relevant part of the script verbatim, and type biplot(.PC) on the line below it. Select these two lines and click ‘Submit’.

Alternatively, you can shorten it to a single line like this:

You can then see visually how closely each variable runs parallel to Comp.1 or to Comp.2.
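For reference, the two ways of drawing the biplot might look like this in the script window. This is a sketch that assumes the five variables listed at the top of this chapter; the formula in your own script will reflect whatever you selected in the dialog.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")

# Two-line version: fit the PCA first, then draw the biplot
.PC <- princomp(~ SATV + SATM + percent + dollars + pay,
                cor = TRUE, data = States)
biplot(.PC)

# One-line version: fit and draw in a single call
biplot(princomp(~ SATV + SATM + percent + dollars + pay,
                cor = TRUE, data = States))
```

In the plot, each arrow is a variable. An arrow that lies almost along the horizontal axis is closely tied to Comp.1, and one that lies almost along the vertical axis is closely tied to Comp.2.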

If you standardize all the original scores with the standardization menu and then run the principal component analysis again, you will get the same result. In other words, this principal component analysis standardizes the variables automatically before the analysis (it is based on the correlation matrix), which is the usual practice.
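You can check this from the script window as well. The sketch below assumes the five variables listed at the top of this chapter and uses scale() as a stand-in for the standardization menu. It compares the component standard deviations and the loadings of the two analyses. Loadings are compared in absolute value because the sign of a component is arbitrary.

```r
# Assumes carData is installed: install.packages("carData")
data(States, package = "carData")
vars <- c("SATV", "SATM", "percent", "dollars", "pay")  # selection is an assumption

# PCA on the original scores, based on the correlation matrix
pc_raw <- princomp(States[, vars], cor = TRUE)

# Standardize every variable first (mean 0, sd 1), then run the same PCA
Z <- as.data.frame(scale(States[, vars]))
pc_std <- princomp(Z, cor = TRUE)

# Both comparisons should agree up to rounding error
all.equal(pc_raw$sdev, pc_std$sdev)
all.equal(abs(unclass(loadings(pc_raw))),
          abs(unclass(loadings(pc_std))))
```

The reason is that standardizing the variables does not change their correlations, so both analyses start from the same correlation matrix.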
