**Case Study – k-Means**

The *k*-Means clustering algorithm partitions the data and evaluates the partitions by minimizing the *within-cluster sum of squares (WCSS)*. The inputs to the algorithm are a dataset of *n* data objects and the number of clusters *k*, specified by the user. The output is a set of *k* clusters with high intra-cluster similarity and low inter-cluster similarity.
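The alternating assign-and-update loop at the heart of *k*-Means can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the scikit-learn implementation used later; the deterministic first-*k*-points initialization is a simplifying assumption (real implementations use smarter seeding such as k-means++):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal k-Means sketch: alternate nearest-centroid assignment
    and centroid update until the centroids stop moving."""
    # initialise with the first k points -- a simple deterministic choice;
    # production implementations use smarter seeding such as k-means++
    centers = X[:k].astype(float)
    for _ in range(n_iter):
        # distance of every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest centroid per point
        # move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    wcss = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centers, wcss

# toy example: two well-separated blobs
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0],
              [5.1, 5.0], [0.0, 0.1], [5.0, 5.1]])
labels, centers, wcss = kmeans(X, 2)
```

On this toy data the loop converges in two iterations and recovers the two blobs exactly.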

**Dataset Description**

To implement the *k*-Means algorithm, we will use the Iris dataset from the UCI Machine Learning Repository. The dataset contains 3 classes of 50 instances each, where each class refers to a species of Iris plant. It has four input features: sepal length, sepal width, petal length, and petal width (all measured in cm). The fifth column is the class label, which holds the species information. Since *k*-Means is an unsupervised clustering algorithm, the class label column is not needed for this experiment.

**k-Means Implementation**

In this section, we will use the Iris dataset to demonstrate the *k*-Means clustering algorithm based on the four input features: sepal length, sepal width, petal length, and petal width. We will follow a step-by-step approach (with the code and output of each intermediate step). The Python statements are run in an IPython console.

**Step 1: Import the libraries**

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
```

**Step 2: Load the dataset and display statistics**

```python
df = pd.read_csv('iris.csv')
print(df.shape)       # shape
print(df.head(10))    # first 10 records
print(df.describe())  # descriptions
```

Output:

```
(150, 5)
   sepal length  sepal width  ...  petal width        class
0           5.1          3.5  ...          0.2  Iris-setosa
1           4.9          3.0  ...          0.2  Iris-setosa
2           4.7          3.2  ...          0.2  Iris-setosa
3           4.6          3.1  ...          0.2  Iris-setosa
4           5.0          3.6  ...          0.2  Iris-setosa
5           5.4          3.9  ...          0.4  Iris-setosa
6           4.6          3.4  ...          0.3  Iris-setosa
7           5.0          3.4  ...          0.2  Iris-setosa
8           4.4          2.9  ...          0.2  Iris-setosa
9           4.9          3.1  ...          0.1  Iris-setosa

[10 rows x 5 columns]

       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
```

**Step 3: Select all four features of the dataset**

```python
data = df.iloc[:, [0, 1, 2, 3]].values
```

**Step 4: Develop k-Means model with k=4**

To start with, we arbitrarily set k to 4; later, we will determine the optimal value of k. For the time being, we implement *k*-Means clustering using k=4: we instantiate the KMeans class and assign it to the variable model4:

```python
model4 = KMeans(n_clusters=4)
result4 = model4.fit_predict(data)
print(result4)
```

Output:

```
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 0 2 0 2 0 2 0 0 0 0 2 0 2 0 0 2 0 2 0 2 2
 2 2 2 2 2 0 0 0 0 2 0 2 2 2 0 0 0 2 0 0 0 0 0 2 0 0 3 2 3 3 3 3 0 3 3 3 2
 2 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 2
 3 2]
```
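A quick sanity check at this point is to count how many observations each cluster received. With the `result4` array from Step 4 this is one line with NumPy; a stand-in array is used below so the snippet runs on its own:

```python
import numpy as np

# stand-in for the `result4` label array produced in Step 4
result4 = np.array([1, 1, 0, 2, 3, 2, 1, 0])

# count how many points were assigned to each cluster id
ids, counts = np.unique(result4, return_counts=True)
print(dict(zip(ids, counts)))
```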

**Step 5: Display the cluster centers**

```python
centers = model4.cluster_centers_
print(centers)
```

Output:

```
[[5.53214286 2.63571429 3.96071429 1.22857143]
 [5.006      3.418      1.464      0.244     ]
 [6.2525     2.855      4.815      1.625     ]
 [6.9125     3.1        5.846875   2.13125   ]]
```

**Step 6: Find optimal value of k using Elbow method**

Now, we will employ the Elbow method to find the optimal number of clusters in the dataset. To implement this method, we plot a graph of the number of clusters against the corresponding WCSS value.

To get the values for the graph, we train multiple models with different numbers of clusters and store the value of the `inertia_` attribute (the WCSS) each time. Because the resulting curve usually ends up shaped like an elbow, the method gets its peculiar name:

```python
wcss = []
for i in range(1, 11):
    model = KMeans(n_clusters=i).fit(data)  # fit once per value of k
    wcss.append(model.inertia_)

# plot the graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```

Output:

The output graph of the Elbow method is shown below. Note that the elbow shape forms at approximately k=3.

As the elbow-like bend occurs at k=3 in the graph, the optimal number of clusters is 3, which matches the three Iris species in the dataset. So, we will now implement *k*-Means again using k=3.
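Reading the elbow off a plot is somewhat subjective. One rough programmatic heuristic is to pick the k where the decrease in WCSS slows down most sharply, i.e. where the second difference of the WCSS curve is largest. The WCSS values below are illustrative stand-ins chosen to show the idea; the heuristic is sensitive to the steepness of the first drop, so it is only a rough aid alongside the plot:

```python
import numpy as np

# illustrative WCSS values for k = 1..6 (sharp bend at k = 3)
wcss = np.array([500.0, 400.0, 150.0, 140.0, 132.0, 126.0])

# the second difference measures how abruptly the curve bends at each interior k
second_diff = np.diff(wcss, 2)             # defined for k = 2..5
elbow_k = int(np.argmax(second_diff)) + 2  # +2 shifts the index back to k
print(elbow_k)
```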

**Step 7: Develop k-Means model with k=3**

```python
model3 = KMeans(n_clusters=3)
result3 = model3.fit_predict(data)
print(result3)
```

Output:

```
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]
```

**Step 8: Display the cluster centers for k=3**

```python
centers = model3.cluster_centers_
print(centers)
```

Output:

```
[[5.006      3.418      1.464      0.244     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]
 [6.85       3.07368421 5.74210526 2.07105263]]
```

**Step 9: Visualize clusters using graphic plot for k=3**

Finally, we will visualize the 3 clusters that are formed with the optimal value of k. Note that the scatter plot below uses only the first two features (sepal length and sepal width), with each point colored by its cluster assignment. We can see the presence of the 3 clusters in the image below, with each cluster represented by a different color.

```python
plt.scatter(data[:, 0], data[:, 1], c=result3, cmap='rainbow')
plt.show()
```

Output:
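The plot above projects the clusters onto the sepal measurements only. The petal measurements generally separate the Iris species more cleanly, so it is also worth plotting `data[:, 2]` against `data[:, 3]` and overlaying the centers from `model3.cluster_centers_`. A self-contained sketch with stand-in two-column values (in the notebook you would pass the real arrays and select the petal columns):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np

# stand-in values for the petal columns of `data`, `result3`,
# and `model3.cluster_centers_`
data = np.array([[1.4, 0.2], [1.5, 0.3], [4.5, 1.5],
                 [4.7, 1.4], [5.8, 2.1], [6.0, 2.3]])
result3 = np.array([0, 0, 1, 1, 2, 2])
centers = np.array([[1.45, 0.25], [4.6, 1.45], [5.9, 2.2]])

# petal length vs petal width, colored by cluster, with centers marked
plt.scatter(data[:, 0], data[:, 1], c=result3, cmap='rainbow')
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='x', s=100)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.savefig('clusters.png')
```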

I hope you now feel confident about solving clustering problems in the ML domain. It is advisable to use graphic plots regularly to make the output more comprehensible.