Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let’s understand this with an example. Suppose, you are the head of a rental store and wish to understand preferences of your costumers to scale up your business. Is it possible for you to look at details of each costumer and devise a unique business strategy for each one of them? Definitely not. But, what you can do is to cluster all of your costumers into say 10 groups based on their purchasing habits and use a separate strategy for costumers in each of these 10 groups. And this is what we call clustering.
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
K-means algorithm is an iterative algorithm that tries to partition the datasets into K pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
- Choose your value of K
- Randomly select K data points to represent the cluster centroids
- Assign all other data points to its nearest cluster centroids
- Reposition the cluster centroid until it is the average of the points in the cluster
- Repeat steps 3 & 4 until there are no changes in each cluster
Since the algorithm requires the user to specify the number of clusters K to look for, and it doesn’t learn it from the data, it is one of the most difficult aspects of using this algorithm. It’s difficult to say if any given value of K is incorrect. Often this value is determined through having extensive domain knowledge, and experience to know an ideal value of K. If this isn’t the case for your current needs then the elbow method is commonly used in the industry to identify the ideal value of K.
The elbow method uses the sum of squared distance (SSE) to choose an ideal value of k based on the distance between the data points and their assigned clusters. We would choose a value of k where the SSE begins to flatten out and we see an inflection point. When visualized this graph would look somewhat like an elbow, hence the name of the method.
The graph above shows that
k = 4 is probably a good choice for the number of clusters. There are situations when the graph does not look like an elbow, this makes things very difficult to choose the value of k.
- Scales to large data
- Always converges
- Often (not always) faster than other common clustering methods like hierarchical clustering
- Clustering imbalanced data
- Manual choice of K
- Dependent on initial assumptions
- Scaling with high dimensions
Use cases of K means in Security Domain
1.Call record detail analysis
A call detail record (cdr) is the information captured by telecom companies during the call, sms, and internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. We can cluster customer activities for 24 hours by using the unsupervised k-means clustering algorithm. It is used to understand segments of customers with respect to their usage by hours.
2. Automatic clustering of it alerts
Large enterprise it infrastructure technology components such as network, storage, or database generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. Clustering of data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.
3. Rideshare data analysis
the publicly available uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.
4. Crime document classification
Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarity in document groups.