Suppose we’re preparing a campaign for our shop. During the campaign we want to sell some new products and focus on several customer groups. But… what are these groups?
We have some general knowledge, based mainly on daily observations, but how can we understand the whole picture?
The simplest solution is to stand in front of the shop and ask each customer what he or she likes.
Another is to conduct a survey and… hold on, we’re living in the 21st century! Let’s solve this problem using Machine Learning.
What is clustering?
Simply speaking, clustering is the task of assigning a set of objects into groups. Objects in the same group are in some sense more similar to each other than to those in other groups.
Clustering is an unsupervised machine learning task mainly used in image analysis, information retrieval and other fields related to statistical data analysis.
Clustering is also used in fields not related directly to IT, like biology (identification of species, genome analysis) and astronomy (aggregation of stars, galaxies).
As there is no precise definition of a cluster, there are more than 100 different clustering algorithms. Which one is the best? It has been proved that… there is no best method :-).
Everything depends on the type of problem and data model.
However, a good clustering method is one that produces high-quality clusters, with high intra-class similarity and low inter-class similarity at the same time. Quality also depends on the similarity measure used by the method and on its implementation.
But the most important measure is its ability to discover some or all of the hidden patterns.
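To make the intra-class vs. inter-class idea concrete, here is a small plain-JavaScript sketch (the helper names and toy data are mine, for illustration only). It computes the average pairwise distance within two toy clusters and the distance between their centroids; a good clustering keeps the first small and the second large.

```javascript
function dist(a, b) { return Math.hypot(a[0] - b[0], a[1] - b[1]); }

// average pairwise distance inside one cluster (intra-class)
function intraClusterDistance(cluster) {
  var total = 0, pairs = 0;
  for (var i = 0; i < cluster.length; i++)
    for (var j = i + 1; j < cluster.length; j++) {
      total += dist(cluster[i], cluster[j]);
      pairs++;
    }
  return pairs ? total / pairs : 0;
}

// mean point of a cluster, used for the inter-class distance
function centroid(cluster) {
  return [
    cluster.reduce((s, p) => s + p[0], 0) / cluster.length,
    cluster.reduce((s, p) => s + p[1], 0) / cluster.length
  ];
}

var a = [[0, 0], [1, 0], [0, 1]];
var b = [[10, 10], [11, 10], [10, 11]];

// a good clustering: small intra-class distance, large inter-class distance
var intra = (intraClusterDistance(a) + intraClusterDistance(b)) / 2;
var inter = dist(centroid(a), centroid(b));
```

With these two well-separated toy clusters, `intra` comes out far smaller than `inter`, which is exactly the property a good clustering maximises.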
Clustering methods can be divided into several main types:
- Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
- Centroid models: for example the K-means algorithm represents each cluster by a single mean vector.
- Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the EM algorithm.
- Density models: for example DBSCAN and OPTICS define clusters as connected dense regions in the data space.
- Subspace models: clusters are modeled with both cluster members and relevant attributes.
- Others: Group models and Graph-based models
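The centroid idea from the list above can be sketched in a few lines of plain JavaScript. This is a deliberately simplified K-means (naive initialization from the first k points, fixed iteration count), not a production implementation: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat.

```javascript
// Minimal K-means sketch (centroid model).
function kmeans(points, k, iterations) {
  var centroids = points.slice(0, k).map(p => p.slice()); // naive init
  var labels = new Array(points.length).fill(0);

  for (var it = 0; it < iterations; it++) {
    // assignment step: each point goes to its nearest centroid
    points.forEach(function (p, i) {
      var best = 0, bestDist = Infinity;
      centroids.forEach(function (c, j) {
        var d = Math.hypot(p[0] - c[0], p[1] - c[1]);
        if (d < bestDist) { bestDist = d; best = j; }
      });
      labels[i] = best;
    });

    // update step: each centroid becomes the mean of its assigned points
    for (var j = 0; j < k; j++) {
      var members = points.filter((_, i) => labels[i] === j);
      if (members.length > 0) {
        centroids[j] = [
          members.reduce((s, p) => s + p[0], 0) / members.length,
          members.reduce((s, p) => s + p[1], 0) / members.length
        ];
      }
    }
  }
  return { labels: labels, centroids: centroids };
}

var result = kmeans([[0,0],[0,1],[1,0],[10,10],[10,11],[11,10]], 2, 10);
```

Note that `k` must be supplied up front, which is exactly why K-means is a poor fit when the number of clusters is unknown, as in the shop example below.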
Let’s go back to our shop. The first thing we want to know is what kind of customers visit our shop every day. For simplicity, we will try to find groups based only on the customer’s age and the hour he/she came to our shop.
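Preparing such a dataset is just a matter of turning each visit record into a 2D point. The record shape below (`age`, `arrivalHour`) is a hypothetical example, not data from the original article:

```javascript
// Hypothetical raw visit records; field names are assumptions for illustration.
var visits = [
  { age: 34, arrivalHour: 9 },
  { age: 52, arrivalHour: 8 },
  { age: 51, arrivalHour: 19 },
  { age: 23, arrivalHour: 15 }
];

// Each customer becomes a 2D point: [age, hour of arrival],
// the format the clustering algorithms below expect.
var dataset = visits.map(v => [v.age, v.arrivalHour]);
```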
Once we have selected the data, we need to choose a suitable clustering method. K-means won’t be a good solution here, because we don’t know the final number of clusters in advance.
As we see above, we need a model which can handle data of varying densities.
Density-based models or distribution models would probably be the best solution here. For example, we will try the OPTICS algorithm.
Simply speaking, the OPTICS algorithm selects a random point in a given dataset, creates a cluster and tries to expand it over the nearest points. Points are grouped into clusters of various densities.
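The expansion idea can be sketched in plain JavaScript. This is a simplified DBSCAN-style core written only to illustrate the "grow a cluster through dense neighborhoods" mechanism; it is not the library’s implementation, and real OPTICS additionally orders points by reachability distance to handle varying densities.

```javascript
// Simplified density-based clustering sketch:
// pick an unvisited point, collect its neighbors within radius `eps`,
// and keep expanding the cluster through sufficiently dense neighbors.
function densityCluster(points, eps, minPts) {
  var dist = (a, b) => Math.hypot(a[0] - b[0], a[1] - b[1]);
  var neighbors = i => points
    .map((_, j) => j)
    .filter(j => j !== i && dist(points[i], points[j]) <= eps);

  var visited = new Array(points.length).fill(false);
  var clusters = [];

  for (var i = 0; i < points.length; i++) {
    if (visited[i]) continue;
    visited[i] = true;
    var seeds = neighbors(i);
    if (seeds.length + 1 < minPts) continue; // treated as noise in this sketch

    var cluster = [i];
    while (seeds.length > 0) {
      var j = seeds.shift();
      if (visited[j]) continue;
      visited[j] = true;
      cluster.push(j);
      var more = neighbors(j);
      if (more.length + 1 >= minPts) seeds = seeds.concat(more); // dense: keep expanding
    }
    clusters.push(cluster);
  }
  return clusters;
}

var toy = densityCluster([[0,0],[0,1],[1,0],[10,10],[10,11]], 2, 2);
```

On the toy input the sketch finds two clusters: the three points near the origin and the two points near (10, 10).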
Clustering gives us very important information – we can clearly see that people in their 50’s visit our shop in the morning and evening. Using this knowledge we can design a campaign to focus on this particular group.
We can use the above pattern with different data to find even deeper insight into customer behaviour.
Clustering random points using DBSCAN: http://lukaszkrawczyk.eu/clustering/
```javascript
var clustering = require('density-clustering');
var optics = new clustering.OPTICS();

var dataset = [
  [0,0],[6,0],[-1,0],[0,1],[0,-1],
  [45,45],[45.1,45.2],[45.1,45.3],[45.8,45.5],[45.2,45.3],
  [50,50],[56,50],[50,52],[50,55],[50,51]
];

// parameters:
// 6 - neighborhood radius
// 2 - number of points in neighborhood to form a cluster
var clusters = optics.run(dataset, 6, 2);

/* RESULT:
[
  [0, 2, 3, 4],
  [1],
  [5, 6, 7, 9, 8],
  [10, 14, 12, 13],
  [11]
]
*/
```