k-mean clustering algorithm and its real use case in the security domain

What is K-means Clustering?

K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.

Step 1: Choose the number of clusters k

The first step in k-means is to pick the number of clusters, k.

Step 2: Select k random points from the data as centroids

Next, we randomly select the centroid for each cluster. Let’s say we want to have 2 clusters, so k is equal to 2 here. We then randomly select the centroid:

Step 3: Assign all the points to the closest cluster centroid

Once we have initialized the centroids, we assign each point to the closest cluster centroid:

Step 4: Recompute the centroids of newly formed clusters

As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To choose the new centroids, we will compute the center of gravity of these centroids, and will find new centroids as below:

Step 5: Next, we will reassign each data point to the new centroid.

For this, we will repeat the same process of finding a median line. The median will be like below image:

Step 6: As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.

We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as shown in the below image:

How to choose the value of “K number of clusters” in K-means Clustering?

The performance of the K-means clustering algorithm depends upon highly efficient clusters that it forms. But choosing the optimal number of clusters is a big task. There are some different ways to find the optimal number of clusters, but here we are discussing the most appropriate method to find the number of clusters or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:

  • For each value of K, calculates the WCSS value.
  • Plots a curve between calculated WCSS values and the number of clusters K.
  • The sharp point of bend or a point of the plot looks like an arm, then that point is considered as the best value of K.

Crime Analysis Using K-means Clustering

Steps of crime pattern analysis

  • The use of K-means data mining approach helps us identify patterns since it is very difficult for humans to process large amounts of data, especially if there are missing information to detect patterns.
  • Clusters are useful in identifying a crime spree committed by a single or the same group of suspects. These clusters are then presented to the detectives who drill down using their domain expertise to solve the cases.

Use the following steps for cluster analysis:

  • Sorting of the records — the first sorting will be done on the most important characteristics based on the detective’s experience.
  • Data mining is then used to detect more complex patterns as in real life there are many attributes associated with the crime and we often have partial information available.
  • Identification of significant attributes for clustering.
  • Placing different weights on different attributes dynamically based on the crime types being clustered.
  • Cluster the dataset for crime patterns and present the results to the detective or the domain expert along with the statistics of the important attributes.
  • The detective looks at the clusters and gives recommendations.
  • Unsolved crimes are clustered based on significant attributes and the result is given to detective for inspection.
  • In this article, we will use the K-means approach for generating the clusters. The K-means algorithm consists of the following steps:
  • Decide the number of clusters, K. The K-means cluster analysis requires you to know how many clusters to generate before the start of the algorithm.
  • Initialize the K clusters or generate them randomly. Different starting points for the clusters may yield different results.
  • Assign each observation to the nearest cluster center. This is an iterative technique which builds the clusters as we progress.
  • Re-compute the new cluster centers. Note that you need to specify the algorithms for determining the distance between clusters.
  • Repeat the process until none of the observations changed their membership in the last iteration.
  • An example of the K-means cluster analysis is shown in the figure below. In this example, we show the creation of 3 clusters (each in a different color).

Advantages of clustering for crime pattern analysis

There are several advantages to using this approach for crime pattern analysis:

  • Take actions to prevent future incidents by using preventive mechanisms based on observed patterns.
  • Reduce the training time of the officers that are assigned to a new location and have no prior knowledge of site-specific crimes.
  • Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.

Limitations of crime pattern detection

There are a few limitations to using this approach for crime pattern detection:

  • Data mining is sensitive to the quality of input data and that can be inaccurate sometimes. Missing information can also cause errors.
  • Mapping data mining attributes is a difficult task and hence it requires a skilled data miner and a crime data analyst with good domain knowledge.