Statistics for everyone: ch.26 Cluster analysis

easier R than SPSS with Rcmdr : Contents

ch.26 Cluster analysis

Before we start clustering, let’s reload the States data from the carData package to restore the data to its original state.

Let’s choose ‘k-means cluster analysis’.

In this example, select all the variables.

In Options, use the settings shown above and click ‘OK’.

Since we chose ‘Assign ~’, we have a new variable at the far right of the data. It contains the values 1 and 2; in other words, the cases are divided into 2 clusters.

In the output, cluster$size shows that cluster 1 contains 8 cases and cluster 2 contains 43.

The center coordinates of each cluster are shown in cluster$centers.
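The k-means steps above can be sketched in plain R. This is a minimal sketch, not the script Rcmdr itself generates: it assumes the carData package is installed, swaps in base R's kmeans() for the menu dialog, and uses a hypothetical seed and nstart value. Because k-means starts from random centers, the cluster numbering (which group is called 1 and which 2) may differ from the menu result.

```r
# Assumption: the carData package is installed.
data(States, package = "carData")   # reload a fresh copy of the data

# Keep only the numeric columns (region is a factor).
num_vars <- States[, sapply(States, is.numeric)]

set.seed(1)                         # hypothetical seed, for repeatability
km <- kmeans(num_vars, centers = 2, nstart = 10)

km$size      # number of cases in each cluster
km$centers   # one row per cluster, one column per variable

# Counterpart of the 'Assign ~' option: store the labels as a new column.
States$KMeans <- factor(km$cluster)
head(States)
```

The variables are not rescaled here, so a variable with large values can dominate the distances between cases.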

Let’s simplify the task and redo it to understand a little more about the cluster centers.

Now select only 2 variables and click ‘OK’.

The new classification is stored in the variable KMeans, again with the values 1 and 2.

In the scatterplot dialog, select the same 2 variables again and click ‘Plot by groups’.

Then specify the KMeans variable. The resulting plot is shown below.

The scatterplot is drawn in 2 colors, one per cluster.

The calculated results show that there are 27 blue points and 24 pink points.

If you glance at the coordinates of each calculated center, you can see that it sits in the middle of its group of points. As the name k-means suggests, the x and y coordinates of a center are the averages of the x and y coordinates of the points in that cluster.

With 3 variables, the points are grouped in the same way in 3-D xyz space.
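The claim that each center is the average of its points can be checked directly. The sketch below assumes the carData package is installed and, because the 2 variables are not named in the text, uses SATM and percent as a hypothetical pair. It computes the group means by hand and compares them with the centers that kmeans() reports.

```r
# Assumption: carData is installed; SATM and percent are a hypothetical pair.
data(States, package = "carData")
xy <- States[, c("SATM", "percent")]

set.seed(1)                                  # hypothetical seed
km <- kmeans(xy, centers = 2, nstart = 10)

# Average each variable within each cluster.
group_means <- aggregate(xy, by = list(cluster = km$cluster), FUN = mean)

km$centers     # centers reported by k-means
group_means    # means computed by hand; the values should match

# Scatterplot in 2 colors, with the centers marked by crosses.
plot(xy, col = c("blue", "hotpink")[km$cluster], pch = 19)
points(km$centers, pch = 4, cex = 2, lwd = 2)
```

The cluster sizes from this sketch need not match the 27 and 24 quoted above, since the variables used there may be different.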

After exporting the data to a csv file, open it in Excel. The row names will be in column A, so the variable names in row 1 are off by one column. Therefore, you must shift row 1 one cell to the right.

Save the file, keeping only the 3 variables you chose (SATM, percent, dollars) and KMeans.
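The same file can also be prepared without Excel. The following is a sketch under a few assumptions: carData is installed, base R's kmeans() stands in for the menu, and the file name States3d.csv is hypothetical. Writing the file with row.names = FALSE leaves the row names out, so there is no extra column to fix afterwards.

```r
# Assumption: carData is installed; the output file name is hypothetical.
data(States, package = "carData")
xyz <- States[, c("SATM", "percent", "dollars")]

set.seed(1)                                   # hypothetical seed
xyz$KMeans <- kmeans(xyz, centers = 2, nstart = 10)$cluster

# row.names = FALSE keeps the state abbreviations out of the file.
out_file <- file.path(tempdir(), "States3d.csv")
write.csv(xyz, out_file, row.names = FALSE)

names(read.csv(out_file))   # "SATM" "percent" "dollars" "KMeans"
```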

If you upload this file to https://tinyurl.com/3D-Scatter-plot3, which draws a 3D scatterplot, you can see what the clusters look like. You can also estimate the location of each center from its 3 coordinates.

However, if you have 4 or more variables, the clusters are calculated mathematically in the same way, but you can no longer draw a picture of them.

Next, let’s do ‘Hierarchical cluster analysis’. The meaning of ‘hierarchical’ becomes clearer once you run the analysis.

Select a variable.

There are a variety of options, and their effects show up in the dendrogram. Because there are so many options, we start with the defaults and change them later if necessary.

The dendrogram shows a lot of information. ‘Dendro’ comes from the Greek word for tree, so it is easy to understand if you think of the branches of a tree or the tributaries of a river.

If you want to divide the whole into 2 groups, cut the tree as on the left, and if you want to divide it into 3 groups, cut it as on the right. As with branches, the further down you go, that is, the more the branches spread out, the more finely the cases are divided. Cases that sit side by side at the very ends of the branches are almost joined together, which makes them harder to tell apart.

If you look at this picture of the branches, it’s easy to see why it’s called ‘hierarchical’. The classification is nested, as if the world were divided into many nations, with many states under each nation.
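The idea of cutting one tree into 2 or 3 groups can be sketched in plain R. The assumptions here are that carData is installed, that SATM, percent and dollars are the selected variables (the text does not say which were chosen), and that Euclidean distance with the "ward.D" method is close to the defaults of the dialog. The last table shows the nesting: each of the 3 groups falls entirely inside one of the 2 groups.

```r
# Assumptions: carData is installed; variables and method are hypothetical.
data(States, package = "carData")
vars <- States[, c("SATM", "percent", "dollars")]

d  <- dist(vars)                     # Euclidean distance by default
hc <- hclust(d, method = "ward.D")   # assumed clustering method

plot(hc, main = "Cluster dendrogram for States", xlab = "", sub = "")

# Cutting higher up the tree gives fewer, larger groups.
g2 <- cutree(hc, k = 2)
g3 <- cutree(hc, k = 3)
table(g2)
table(g3)

# Each row (a group of the 3-group cut) has cases in only one column.
table(three = g3, two = g2)
```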

You can also try a different way to measure distance.

This produces a different dendrogram.

A different clustering method also draws a different dendrogram. That is, each combination of clustering method and distance measure produces its own dendrogram.
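To see how the method and the distance measure change the tree, the combinations can be drawn side by side. This is only a sketch: it assumes carData is installed, reuses the hypothetical variable selection SATM, percent and dollars, and picks three combinations for illustration rather than the exact options offered in the dialog.

```r
# Assumptions: carData is installed; variables and combinations are illustrative.
data(States, package = "carData")
vars <- States[, c("SATM", "percent", "dollars")]

hc_ward     <- hclust(dist(vars, method = "euclidean"), method = "ward.D")
hc_complete <- hclust(dist(vars, method = "euclidean"), method = "complete")
hc_manh     <- hclust(dist(vars, method = "manhattan"), method = "complete")

# Draw the three dendrograms next to each other.
op <- par(mfrow = c(1, 3))
plot(hc_ward,     main = "Ward, Euclidean",     xlab = "", sub = "", cex = 0.6)
plot(hc_complete, main = "Complete, Euclidean", xlab = "", sub = "", cex = 0.6)
plot(hc_manh,     main = "Complete, Manhattan", xlab = "", sub = "", cex = 0.6)
par(op)

# Compare the 3-group solutions of two methods case by case.
table(ward = cutree(hc_ward, k = 3), complete = cutree(hc_complete, k = 3))
```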

To summarize the results, select the menu above.

This will bring up more detailed options.

It shows a little more information.

As a result of this menu, one more column, holding the cluster number of each case, is added to the dataset. You can download this dataset and use it for future work or analysis.
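A rough equivalent of summarizing the clusters and adding them to the dataset is shown below. It is a sketch with assumptions: carData is installed, the variables, the "ward.D" method and the choice of 3 clusters are hypothetical, and hclus.label is a made-up column name.

```r
# Assumptions: carData is installed; variables, method, k and the
# column name hclus.label are hypothetical.
data(States, package = "carData")
vars <- States[, c("SATM", "percent", "dollars")]
hc   <- hclust(dist(vars), method = "ward.D")

# Add the cluster number of each case as a new column.
States$hclus.label <- factor(cutree(hc, k = 3))

table(States$hclus.label)   # cluster sizes

# Mean of each variable within each cluster (the cluster centroids).
aggregate(vars, by = list(cluster = States$hclus.label), FUN = mean)
```

The new column can then be used like any other factor, for example as the grouping variable in a plot or a table.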
