K-means clustering algorithm and its real use case in the security domain
What is K-means Clustering?
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
Let’s now take an example to understand how K-Means actually works:
Suppose we have several points and want to apply K-means to group them into clusters. Here’s how we can do it.
Step 1: Choose the number of clusters k
The first step in k-means is to pick the number of clusters, k.
Step 2: Select k random points from the data as centroids
Next, we randomly select a centroid for each cluster. Let’s say we want 2 clusters, so k is equal to 2 here. We then randomly select two of the points as centroids:
Here, the blue and orange boxes represent the centroids of these clusters.
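As a minimal sketch of this step in Python: the function name `init_centroids`, the `seed` parameter and the sample points are illustrative assumptions, not part of the original example.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random to serve as the initial centroids."""
    rng = random.Random(seed)  # a seed makes the random choice repeatable
    return rng.sample(points, k)

# Made-up 2-D points, used only for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (6.5, 6.0), (7.0, 7.5)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Because the choice is random, a different seed can give different starting centroids and, in some cases, different final clusters.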
Step 3: Assign all the points to the closest cluster centroid
Once we have initialized the centroids, we assign each point to the closest cluster centroid:
Here, the points closer to the blue centroid are assigned to the blue cluster, whereas the points closer to the orange centroid are assigned to the orange cluster.
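The assignment step could look roughly like the following sketch. It assumes Euclidean distance, and the name `assign_points`, the points and the two starting centroids are made up for illustration.

```python
import math

def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (6.5, 6.0), (7.0, 7.5)]
centroids = [(1.0, 1.0), (7.0, 7.5)]  # hypothetical "blue" (index 0) and "orange" (index 1) centroids
labels = assign_points(points, centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```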
Step 4: Recompute the centroids of newly formed clusters
Because the initial centroids were chosen at random, they are unlikely to sit at the centers of their clusters, so we repeat the process with new centroids. To choose them, we compute the center of gravity (the mean) of the points assigned to each cluster and use it as that cluster's new centroid, as below:
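A sketch of the recomputation, taking the center of gravity to be the coordinate-wise mean of the points currently assigned to each cluster. The name `update_centroids` and the choice to raise an error for an empty cluster are assumptions made for this example.

```python
def update_centroids(points, labels, k):
    """Return the mean (center of gravity) of the points assigned to each of the k clusters."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch.
            raise ValueError(f"cluster {j} has no points")
        # zip(*members) groups the coordinates axis by axis.
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (6.5, 6.0), (7.0, 7.5)]
labels = [0, 0, 0, 1, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)
```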
Step 5: Reassign each data point to the nearest new centroid.
For this, we draw the median line between the two new centroids, that is, the line of points equidistant from both. The median line will look like the image below:
In the above image, one yellow point is on the left side of the line and two blue points are to the right of it, so these three points will be reassigned to the new centroids.
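One way to read the median line, as an interpretation rather than something the original figures confirm, is as the perpendicular bisector of the segment joining the two centroids. Under that assumption, checking which side of the line a point falls on gives the same answer as asking which centroid is nearer. The function name and sample values below are hypothetical.

```python
import math

def side_of_median_line(p, a, b):
    """Signed position of point p relative to the median line of centroids a and b.
    Negative: p is on a's side (closer to a). Positive: p is on b's side.
    Zero: p lies exactly on the line."""
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    # Project (p - midpoint) onto the direction pointing from a to b.
    return sum((pi - mi) * (bi - ai) for pi, mi, ai, bi in zip(p, midpoint, a, b))

a, b = (1.5, 1.5), (6.5, 6.0)  # hypothetical centroids
for p in [(1.0, 1.0), (3.0, 4.0), (6.0, 7.0)]:
    side = "a" if side_of_median_line(p, a, b) < 0 else "b"
    nearest = "a" if math.dist(p, a) < math.dist(p, b) else "b"
    print(p, "side of line:", side, "nearest centroid:", nearest)
```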
Step 6: Since reassignment has taken place, we go back to step 4 and find the new centroids, or K-points.
We repeat the process of finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the image below:
With the new centroids in place, we again draw the median line and reassign the data points. The image will be:
In the above image, there are no dissimilar data points on either side of the line, which means no point changes cluster and our model is formed. Consider the image below:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:
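Putting steps 3 to 6 together, the whole loop can be sketched as below. This is a plain-Python illustration under stated assumptions, not a production implementation: the `kmeans` function, the `max_iter` safeguard, the rule that an empty cluster keeps its old centroid, and the sample data are all assumptions introduced here.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment (step 3) and centroid update (step 4) until
    no point changes cluster. Returns (centroids, labels)."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no reassignment: the clusters are final
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (6.5, 6.0), (7.0, 7.5)]
# Deliberately poor start: both initial centroids lie in the same group.
centroids, labels = kmeans(points, [(1.0, 1.0), (1.5, 2.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Even with both starting centroids in the same group, the second pass moves three points to the other cluster and the third pass changes nothing, so the loop stops.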
How to choose the number of clusters, K, in K-means Clustering?
The performance of the K-means clustering algorithm depends on how well the clusters it forms fit the data, and that in turn depends on choosing a suitable number of clusters, which is not an easy task. There are several ways to find the optimal number of clusters; here we discuss one of the most widely used, the Elbow method:
The Elbow method is one of the most popular ways to find the optimal number of clusters. It uses the WCSS value. WCSS stands for Within Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:
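$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \mathrm{distance}(P_i, C_3)^2$$
In the above formula of WCSS,
$\sum_{P_i \in \text{Cluster}_1} \mathrm{distance}(P_i, C_1)^2$ is the sum of the squared distances between each data point $P_i$ in cluster 1 and that cluster's centroid $C_1$; the other two terms are defined in the same way for clusters 2 and 3.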
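To measure the distance between a data point and a centroid, we can use a metric such as Euclidean distance or Manhattan distance. Euclidean distance is the usual choice for K-means, because the mean of a cluster is the point that minimizes the sum of squared Euclidean distances to its members.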
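The WCSS calculation can be sketched in a few lines of Python. The sketch assumes Euclidean distance, and the helper name `wcss` and the sample clusters are illustrative only.

```python
import math

def wcss(points, labels, centroids):
    """Sum, over all points, of the squared Euclidean distance between the
    point and the centroid of the cluster it is assigned to."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (6.5, 6.0), (7.0, 7.5)]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(1.5, 1.5), (6.5, 20.5 / 3)]  # means of the two groups above
print(round(wcss(points, labels, centroids), 3))  # 2.667
```

The function works for any number of clusters: each point contributes one squared distance, which is the same as adding up one sum per cluster as in the formula.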
To find the optimal number of clusters, the elbow method follows the steps below:
- Run K-means clustering on the given dataset for different values of K (for example, from 1 to 10).
- For each value of K, calculate the WCSS value.
- Plot the calculated WCSS values against the number of clusters K.
- Look for the point where the curve bends sharply, like the elbow of an arm; the K at that point is considered the best value of K.
Because the sharp bend in the graph looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the image below:
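The elbow procedure above can be sketched as follows. Everything here is an assumption made for illustration: the helper names `kmeans_wcss` and `elbow_curve`, the use of several random restarts per K (keeping the lowest WCSS), and the synthetic data, which is generated as three well-separated groups so that the bend should appear near K = 3. The sketch prints the WCSS values instead of plotting them.

```python
import math
import random

def kmeans_wcss(points, k, rng, max_iter=100):
    """Run K-means once from k randomly chosen starting points; return the final WCSS."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def elbow_curve(points, k_values, restarts=50, seed=0):
    """Return {K: lowest WCSS found over several random restarts} for each K."""
    rng = random.Random(seed)
    return {k: min(kmeans_wcss(points, k, rng) for _ in range(restarts)) for k in k_values}

# Synthetic data: three well-separated groups of 20 points each.
data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    for _ in range(20)
]

curve = elbow_curve(points, range(1, 11))
for k, value in curve.items():
    print(f"K={k:2d}  WCSS={value:10.2f}")
```

On real data the bend is often less pronounced than on synthetic groups like these, so the choice of K usually still needs some judgment.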
Crime Analysis Using K-means Clustering
Steps of crime pattern analysis
- Determine the geospatial plot of crimes in the city: The first step is to collect crime information for a given city. This is usually available from multiple sources such as law enforcement reports, victimization surveys, and collated newspaper articles. The data can then be plotted on a geographical map such as the one shown above.
- Apply the K-means data mining approach to identify patterns: It is very difficult for humans to process large amounts of data and detect patterns in it, especially when some information is missing.
- Clusters are useful in identifying a crime spree committed by a single suspect or by the same group of suspects. These clusters are then presented to the detectives, who drill down using their domain expertise to solve the cases.
Use the following steps for cluster analysis:
- Sort the records: the first sort is done on the most important characteristics, based on the detective’s experience.
- Data mining is then used to detect more complex patterns, since in real life there are many attributes associated with a crime and often only partial information is available.
- Identification of significant attributes for clustering.
- Placing different weights on different attributes dynamically based on the crime types being clustered.
- Cluster the dataset for crime patterns and present the results to the detective or the domain expert along with the statistics of the important attributes.
- The detective looks at the clusters and gives recommendations.
- Unsolved crimes are clustered based on significant attributes and the result is given to the detective for inspection.
- In this article, we will use the K-means approach for generating the clusters. The K-means algorithm consists of the following steps:
- Decide the number of clusters, K. The K-means cluster analysis requires you to know how many clusters to generate before the start of the algorithm.
- Initialize the K cluster centers, either by choosing them or by generating them randomly. Different starting points may yield different results.
- Assign each observation to the nearest cluster center. This is an iterative technique which builds the clusters as we progress.
- Re-compute the cluster centers. Note that you need to specify the measure used to determine the distance between an observation and a cluster center.
- Repeat the assignment and re-computation steps until no observation changes its cluster membership in an iteration.
- An example of the K-means cluster analysis is shown in the figure below. In this example, we show the creation of 3 clusters (each in a different color).
- Analyzing patterns and drawing conclusions: This involves the analysis of each cluster formed. The computer is unable to understand what is unique about each cluster; this is where human expertise comes into play. For example, all the crimes in the red cluster may have been committed using a similar gun, or all the crimes in the blue cluster may be jewelry thefts in which the victims were walking on the road and the assailants were traveling on a motorbike. This helps to find crime patterns and trend correlations. Once a specific pattern is detected, law enforcement officers can deploy additional and suitable resources for the detection and suppression of criminal activities.
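The attribute weighting and per-cluster statistics mentioned in the steps above can be sketched as below. All of the data is invented for illustration: the incident records, the attribute choice (location in kilometres and hour of day), the weights and the starting centers are hypothetical and do not come from any real crime dataset. The sketch also treats hour of day as an ordinary number, which ignores the wrap-around at midnight.

```python
import math

# Hypothetical incident records: (x_km, y_km, hour_of_day). Invented for illustration only.
incidents = [
    (1.0, 1.2, 22), (1.3, 0.9, 23), (0.8, 1.1, 21),  # late evening, one part of town
    (6.0, 6.5, 9), (6.4, 6.1, 10), (5.8, 6.3, 8),    # morning, another part of town
]
# Hypothetical weights: location counts fully, time of day counts much less.
weights = (1.0, 1.0, 0.05)

def weighted_distance(a, b, weights):
    """Euclidean distance in which each attribute's squared difference is scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def cluster(records, centers, weights, max_iter=100):
    """K-means using the weighted distance for the assignment step."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: weighted_distance(r, centers[j], weights))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [r for r, label in zip(records, labels) if label == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels

centers, labels = cluster(incidents, [incidents[0], incidents[3]], weights)

# Summary statistics per cluster, for the detective or domain expert to review.
for j, c in enumerate(centers):
    print(f"cluster {j}: {labels.count(j)} incidents, "
          f"mean location=({c[0]:.2f}, {c[1]:.2f}), mean hour={c[2]:.1f}")
```

Changing the weights changes which differences matter most, which is how the same records can be clustered differently for different crime types. Interpreting what each cluster means remains the expert's job.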
Advantages of clustering for crime pattern analysis
There are several advantages to using this approach for crime pattern analysis:
- Analyze historical crime rates and improve the present crime resolution rate.
- Prevent future incidents by putting preventive mechanisms in place based on observed patterns.
- Reduce the training time of officers who are assigned to a new location and have no prior knowledge of site-specific crimes.
- Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.
Limitations of crime pattern detection
There are a few limitations to using this approach for crime pattern detection:
- Crime pattern analysis can only help the detectives and not replace them. It is up to the human experts to interpret what the clusters are telling us.
- Data mining is sensitive to the quality of the input data, which can sometimes be inaccurate. Missing information can also cause errors.
- Mapping data mining attributes is a difficult task, so it requires a skilled data miner and a crime data analyst with good domain knowledge.
End of the matter
Thanks for reading.