Anomaly Detection for JS

The Internet is an extremely dynamic environment for any application. Vast amounts of data and large numbers of users make management difficult. Besides managing the system, we also need to protect it from unexpected user behaviour and anomalies in the data.

For example, if the data we want to check is static and fairly easy to predict, we can use some kind of threshold-based alerting system. But what if the data we monitor depends on many conditions or changes irregularly over time? Then we need a system that changes together with the environment our application lives in. This is yet another field where machine learning can be applied.

In this post I will show you an example of a simple anomaly detector for JavaScript. Let’s go!

I did some research and found a few solutions to the above problem; which one fits depends mainly on the complexity of the data.

In the literature, the following techniques have been proposed:

  • Distance based techniques (k-nearest neighbor, global / local outlier detection)
  • One class support vector machines (SVM)
  • Replicator neural networks
  • Cluster analysis based outlier detection
  • Pointing at records that deviate from learned association rules

In this article I will describe how to implement a simple machine learning algorithm based on global outlier detection.

Three-sigma Rule

Let’s say we have some number of users who visit our website every day. Sometimes the number of users grows and sometimes it falls, depending on many different factors.

After analysing our data, it turned out that the most influential factors are the hour and the day of the week. We want to build a model which is sensitive to those factors: a model which knows what the average situation is on (for example) Monday at 10 o’clock.

To do this, we first need to group the data according to those factors. We will get 7 groups of daily data, one for each day of the week. This is what the “Monday group” looks like:

[Figure: Groups]

To get hourly data across the whole group, we need to cut the graphs perpendicularly, that is, take the value for the same hour from every day in the group. The hourly data from each day may then be treated as a random variable modeled by a normal distribution with an expected value and a standard deviation.

[Figure: probability]
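The grouping step described above can be sketched as follows. This is only a minimal sketch, assuming each record is an object with a `timestamp` (anything `Date` accepts) and a `visitors` count for that hour; the field names, the `groupByDayAndHour` helper, the key format and the sample values are all hypothetical and not part of the original detector.

```javascript
// Hypothetical helper: bucket hourly visitor counts by weekday and hour.
// Assumes records look like { timestamp: <Date-compatible>, visitors: <number> }.
function groupByDayAndHour(records) {
    var groups = {};

    for (var i = 0; i < records.length; i++) {
        var date = new Date(records[i].timestamp);
        // getUTCDay(): 0 = Sunday ... 6 = Saturday; getUTCHours(): 0-23
        var key = date.getUTCDay() + '-' + date.getUTCHours();

        if (!groups[key]) groups[key] = [];
        groups[key].push(records[i].visitors);
    }

    return groups;
}

// Made-up sample: two Mondays and one Tuesday, all at 10:00 UTC
var groups = groupByDayAndHour([
    { timestamp: '2024-01-01T10:00:00Z', visitors: 120 },
    { timestamp: '2024-01-08T10:00:00Z', visitors: 130 },
    { timestamp: '2024-01-02T10:00:00Z', visitors: 90 }
]);
```

Here `groups['1-10']` collects the Monday 10:00 values, which is the "perpendicular cut" through the Monday group. UTC is used only to keep the example independent of the local time zone.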

The three-sigma rule (also known as the 68–95–99.7 rule) states that almost all values (about 99.7%) lie within three standard deviations of the mean in a normal distribution. That means all values which lie outside this boundary should be considered anomalies.
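As a quick illustration of the rule, the snippet below computes the interval that would count as "normal" for a given mean and standard deviation. The `threeSigmaBounds` helper and the numbers are made up for this example.

```javascript
// Hypothetical helper: the interval that the three-sigma rule treats as "normal".
function threeSigmaBounds(mean, sigma) {
    return { lower: mean - 3 * sigma, upper: mean + 3 * sigma };
}

// Made-up numbers: on average 100 visitors with a standard deviation of 10
var bounds = threeSigmaBounds(100, 10);
// bounds.lower === 70, bounds.upper === 130
```

With these numbers, any hourly count below 70 or above 130 would be treated as an anomaly.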

Anomaly detector

OK, we can start training our detector. First of all, we need to calculate the expected value and standard deviation for each hour (e.g. for every Monday) using the following formulas:

Expected value:

$$EX = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Standard deviation:

$$\sigma = \sqrt{E(X^2) - (EX)^2}$$

/**
 * Calculating expected value (E) of a random variable
 *
 * @param {array} X - random variable (array of observed values)
 * @param {integer} pow - power applied to each value before averaging (optional, default = 1)
 * @returns {float}
 */
    expectedValue: function (X, pow) {
        var sum = 0,
            n = X.length;

        pow = pow || 1; // set default value if not set
        if (n === 0) return 0; // if random variable is empty, return 0

        // this.accuracy is a nonzero numeric property of the enclosing object (not shown here);
        // it scales each term and cancels out in the final division
        for (var i = 0; i < n; i++) {
            sum += Math.pow(X[i], pow) / this.accuracy;
        }

        return sum / (n / this.accuracy);
    },

/**
 * Calculating standard deviation (sigma) of a random variable
 *
 * @param {array} X - random variable (array of observed values)
 * @param {float} Ex - expected value of X (optional)
 * @returns {float}
 */
    standardDeviation: function (X, Ex) {
        var Ex2 = this.expectedValue(X, 2);
        Ex = Ex || this.expectedValue(X); // calculate expected value if not set
        // clamp at 0 so rounding errors cannot produce the square root of a negative number
        return Math.sqrt(Math.max(0, Ex2 - Math.pow(Ex, 2))); // square root of the variance
    }
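The two methods above are fragments of a larger object, so they cannot be run on their own. The following is a minimal standalone sketch of the same calculation, assuming plain functions without `this` and without the `accuracy` scaling; the sample visitor counts are made up.

```javascript
// Standalone variants of the two methods above (no `this`, no accuracy scaling).
function expectedValue(X, pow) {
    if (X.length === 0) return 0; // empty random variable
    pow = pow || 1;

    var sum = 0;
    for (var i = 0; i < X.length; i++) {
        sum += Math.pow(X[i], pow);
    }
    return sum / X.length;
}

function standardDeviation(X) {
    var Ex = expectedValue(X);
    var Ex2 = expectedValue(X, 2);
    // Clamp at 0: rounding can make the difference slightly negative
    return Math.sqrt(Math.max(0, Ex2 - Ex * Ex));
}

// Made-up visitor counts for five Mondays at 10:00
var mondayAt10 = [90, 100, 110, 100, 100];
var mean = expectedValue(mondayAt10);      // 100
var sigma = standardDeviation(mondayAt10); // square root of 40, about 6.32
```

For this sample the mean of the squares is 10040 and the squared mean is 10000, so the variance is 40.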

The results, associated with the hour and day of the week, should be stored in some kind of memory (RAM, a file or a database). To get the best results, we need to train our classifier on a wide variety of data and (most important!) on data without anomalies. To keep our detector smart & healthy, training should be repeated every day.
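One possible shape for the stored results is a lookup table keyed by weekday and hour. The sketch below is an assumption for illustration only: the `trainModel` helper, the `"weekday-hour"` key format and the sample values are hypothetical.

```javascript
// Hypothetical training step: turn grouped samples into a lookup table
// of { mean, sigma } per "weekday-hour" key, kept in memory.
function trainModel(groups) {
    var model = {};

    Object.keys(groups).forEach(function (key) {
        var X = groups[key];
        var n = X.length;
        if (n === 0) return; // nothing to learn for this slot

        var sum = 0, sumSq = 0;
        for (var i = 0; i < n; i++) {
            sum += X[i];
            sumSq += X[i] * X[i];
        }

        var mean = sum / n;
        // Variance as E(X^2) - (EX)^2, clamped at 0 against rounding errors
        var variance = Math.max(0, sumSq / n - mean * mean);
        model[key] = { mean: mean, sigma: Math.sqrt(variance) };
    });

    return model;
}

// Made-up, anomaly-free samples; key "1-10" stands for Monday, 10:00
var model = trainModel({ '1-10': [90, 100, 110, 100, 100] });

// The model is plain data, so it can be serialised for a file or a database
var stored = JSON.stringify(model);
```

Retraining every day then simply means rebuilding this table from the latest clean data and overwriting the stored copy.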

To detect an anomaly, check whether the number of visitors from the last hour lies within the boundaries

EX – 3σ ≤ y ≤ EX + 3σ

where EX is the expected value for the given hour (on the given weekday), y is the number of visitors we want to test and σ is the standard deviation.

/**
 * Classifier
 * true = value lies within three standard deviations of the expected value
 * false = value is an outlier
 *
 * @param {float} value - Random variable value
 * @param {float} E - Expected value of X
 * @param {float} sigma - Standard deviation of X
 * @returns {boolean}
 */
    test: function (value, E, sigma) {
        return (Math.abs(E - value) <= (3 * sigma));
    }
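To see the classifier in action, here is a small standalone sketch of the same check. The `isNormal` name and the model values are made up for the example.

```javascript
// Same check as the classifier above, written as a standalone function.
function isNormal(value, E, sigma) {
    return Math.abs(E - value) <= 3 * sigma;
}

// Made-up model entry for Monday 10:00: mean 100 visitors, sigma 10
var slot = { mean: 100, sigma: 10 };

var typical = isNormal(115, slot.mean, slot.sigma);  // true: 115 lies within [70, 130]
var spike = isNormal(180, slot.mean, slot.sigma);    // false: 180 lies above 130
var boundary = isNormal(130, slot.mean, slot.sigma); // true: the check is inclusive
```

Note that a slot whose training values were all identical has a sigma of 0, so every value other than the mean itself would be reported as an outlier.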

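As you can see, the classifier has a very low computation cost: classifying a value takes a single comparison, and training is one pass over the data for each hour. Cheap classification after a more expensive training step is common among machine learning algorithms.

If we need to check data more frequently (in real time), we can train our classifier on data from every minute or second of the day.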

Testing

When dealing with any kind of supervised learning algorithm, two kinds of data sets are necessary:

  1. training data set containing “clean” data without anomalies
  2. testing data set containing “dirty” data with anomalies

The testing data set should be tagged with labels saying whether each data point is an anomaly or not. While testing, it is necessary to calculate an error function and compare its results for different methods.

A simple error function is given by the sum of the squares of the errors between the result y(x_n) of the chosen method for each data point x_n and the corresponding target value t_n from the testing data set:

$$\mathrm{Err} = \sum_{n=1}^{N} \left( y(x_n) - t_n \right)^2$$
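The error function can be sketched in code as follows. This is only an illustration under assumptions: the `sumOfSquaredErrors` and `threeSigmaFlags` names, the shape of the test samples and the labelled values are all hypothetical. The classifier output and the label are both encoded as 1 (anomaly) or 0 (not an anomaly).

```javascript
// Hypothetical evaluation helper: sum of squared errors over a labelled test set.
// Each sample: { value, mean, sigma, isAnomaly } where isAnomaly is the label.
function sumOfSquaredErrors(samples, classify) {
    var error = 0;
    for (var i = 0; i < samples.length; i++) {
        var s = samples[i];
        var predicted = classify(s) ? 1 : 0; // 1 = flagged as anomaly
        var target = s.isAnomaly ? 1 : 0;    // 1 = labelled as anomaly
        error += Math.pow(predicted - target, 2);
    }
    return error;
}

// Three-sigma classifier: flags values more than 3 * sigma away from the mean
function threeSigmaFlags(s) {
    return Math.abs(s.mean - s.value) > 3 * s.sigma;
}

// Made-up labelled test set
var testSet = [
    { value: 105, mean: 100, sigma: 10, isAnomaly: false }, // correctly accepted
    { value: 180, mean: 100, sigma: 10, isAnomaly: true },  // correctly flagged
    { value: 125, mean: 100, sigma: 10, isAnomaly: true }   // missed: within 3 sigma
];

var error = sumOfSquaredErrors(testSet, threeSigmaFlags); // 1
```

With 0/1 outputs the sum of squared errors is simply the number of misclassified points, which makes it easy to compare different methods on the same test set.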

Conclusion

Anomaly detection is a very interesting task and can be solved in various ways. We need to remember that before we commit to one method, we should examine our data in detail and compare different algorithms.

The whole source code of the anomaly detector can be found on my GitHub account.