But how does this purely mathematical theory apply to real-world data? As the amount of data grows rapidly, we need to find new ways to understand and analyse it. In some cases, a good solution is to see the broader picture and move to a higher level of abstraction.

Topology gives us a novel approach to understanding data: analysing the shape of a multi-dimensional dataset.

In the following articles we will go through the most important and easiest-to-understand parts of the huge field of topology. We will build some examples based on real-world data to understand what is so remarkable about it.

The word topology has many meanings, so many that it can become vague what topology exactly is. The most general definition is that topology studies the connectedness and continuity of various structures. Simply speaking, if two objects are somehow close to each other, we group them together.

In topology, every shape is thought of as being made of clay. We can stretch it and squeeze it, and as long as we don’t make holes in it, it’s still considered the same object. Simply speaking, topology is about how many multi-dimensional holes a structure has.

From a topological point of view, a circle and a square are the same thing. The same goes for a doughnut and a coffee mug! (animation on the right).

Moreover, every shape can be approximated using n-dimensional “simplices”: a point, a line, a triangle, a tetrahedron, etc. We will use these properties to analyse the data.

In my previous post about clustering, I discussed different methods of dividing a dataset into several distinct groups. Most of the discussed methods are based mainly on the proximity of the data points in some n-dimensional Euclidean space.

We’re going to use a similar approach, but this time we will go a step further: reconstruct the shape of the data and see how things look from a higher perspective. We’ll be analysing the generalized shape of a dataset, a structure called a simplicial complex.

Formally speaking, **if we have a finite point set in R^{d} and we assume the data was sampled from an underlying space X (a manifold), the goal is to recover the topology of X**.

This task can be divided into a two-step process:

- Geometric: approximate X with a combinatorial representation (a simplicial complex)
- Topological: compute a topological invariant (e.g. persistent homology)

Both tasks are quite hard, and we still don’t have good, robust algorithms that solve them quickly.

The underlying space can be approximated in a few different ways. The first thing we need to do is build a structure called a “good cover” – a set of cells (vertices with a radius) which covers the whole dataset in a special way. For simplicity, we’ll skip the details of this step and assume that our dataset covers the underlying space in a “good” way. The next step is to generate a structure similar to an undirected graph, called a “nerve”. The most commonly used nerve types are the Čech complex and the Vietoris–Rips complex.

Generally speaking, the Čech complex represents a group of cells by a simplex if all of them have a non-empty intersection. Because it considers the actual intersections between cells, it faithfully represents the topology of the covered space. The Vietoris–Rips complex is a closely related construction based only on pairwise distance thresholds, which makes it much easier to build in practical applications.

OK, let’s build a simple nerve. We have a set of points in R^{2}. Imagine that the shape of the underlying data space resembles the number “8” (with two holes in it). Let’s draw a circle around each point. If the distance between two points is less than some threshold, the points become connected.

If we extend the radius, the nerve will cover the underlying space, revealing its shape.
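As a simplified sketch (not the library mentioned later), the construction described above can be written directly in JavaScript: connect every pair of points closer than the threshold, and add a triangle wherever three points are pairwise connected.

```javascript
// Build a Vietoris–Rips complex (edges and triangles) from 2D points.
// `epsilon` is the distance threshold: two points become connected
// when they are closer than epsilon.
function ripsComplex(points, epsilon) {
    const dist = (p, q) => Math.hypot(p[0] - q[0], p[1] - q[1]);
    const connected = (i, j) => dist(points[i], points[j]) < epsilon;
    const edges = [];
    const triangles = [];

    for (let i = 0; i < points.length; i++)
        for (let j = i + 1; j < points.length; j++)
            if (connected(i, j)) edges.push([i, j]);

    // A triangle appears whenever all three of its edges are present.
    for (let i = 0; i < points.length; i++)
        for (let j = i + 1; j < points.length; j++)
            for (let k = j + 1; k < points.length; k++)
                if (connected(i, j) && connected(j, k) && connected(i, k))
                    triangles.push([i, j, k]);

    return { edges, triangles };
}
```

Increasing `epsilon` fills in more edges and triangles, which is exactly the “extending the radius” step described above.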

The most exciting part is understanding what was and what wasn’t preserved about the original data space. The shape and the size (area) of the original space are not represented in the simplicial complex. On the other hand, the complex has two “holes” in it which match the two “holes” in the original shape; the rest has been “filled in” by triangles. Another fact is that our complex is a single connected component, just as the original space was.

To sum up, the simplicial complex has preserved the number of “holes” of the original space and the connectedness of its components. In my next blog posts, it will turn out that these properties are even more important than the size or volume of the original dataset.

Until then…

I’ve prepared an open-source library for computing the generalized Čech complex and the Vietoris–Rips complex in JavaScript. You can see a **demo** of the latter here:

**Happy coding!**

There are several things we can focus on:

- optimizing loops
- using TypedArrays instead of standard Arrays
- using WebWorkers and TransferableObjects

This time we will check which kind of loop we should choose to get the best performance.

We’re going to test the following code on Firefox 41 and the newest Microsoft Edge 20. Each result is the average execution time of the code running on both browsers.

Sample data:

```
var arr = [];
for (var i = 0; i < 100000; i++) {
    arr[i] = 'Value' + Math.random();
}
```

Test code:

```
// simple for loop
for (var i = 0; i < arr.length; i++) {
    arr[i];
}

// for loop with cached length
for (var i = 0, l = arr.length; i < l; i++) {
    arr[i];
}

// while loop
var i = 0;
var len = arr.length;
while (i < len) {
    arr[i];
    i++;
}

// reverse while loop
var i = arr.length;
while (i--) {
    arr[i];
}

// for..in loop
for (var i in arr) {
    arr[i];
}

// for..of loop (iterates over values, not indices)
for (var v of arr) {
    v;
}
```

Results for a dense array containing 100,000 items:

- **basic for loop**: 87.44 ms
- **cached for loop**: 85.9 ms
- **while (i < len)**: 86.48 ms
- **while (i--)**: **59.04 ms**
- **for..in**: 85.69 ms
- **for..of**: 79.05 ms

Interestingly, using block-scoped variables (the **let** keyword) gives us better results:

- **basic for loop**: 33.64 ms
- **cached for loop**: **22.84 ms**
- **while (i < len)**: 41.97 ms
- **while (i--)**: 70.71 ms
- **for..in**: 47.11 ms
- **for..of**: 50.05 ms

This is probably because of the way the JS engine allocates memory for block-scoped variables.

It gets even more interesting when we test the loops on a sparse dataset:

```
var arr = [];
for (let i = 0; i < 100000; i++) {
    if (Math.random() < 0.2) {
        arr[i] = 'Value' + Math.random();
    }
}
```

Results:

- **basic for loop**: 43.11 ms
- **cached for loop**: 33.36 ms
- **while (i < len)**: 61.06 ms
- **while (i--)**: 52.16 ms
- **for..in**: **10.55 ms**
- **for..of**: 53.91 ms

The reason why the **for..in** loop is the fastest may seem surprising, but when we think about how this kind of loop works, it makes sense: **for..in** iterates over the keys that actually exist in an object. If the array is sparse, there are far fewer indices to traverse.
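This behaviour is easy to verify: `for..in` visits only the indices that actually exist, while a plain `for` loop walks every index up to `length`.

```javascript
// A sparse array: only two elements exist, but length is 10000.
var sparse = [];
sparse[0] = 'first';
sparse[9999] = 'last';

var visitedByForIn = 0;
for (var key in sparse) visitedByForIn++; // only existing indices

var visitedByFor = 0;
for (var i = 0; i < sparse.length; i++) visitedByFor++; // every index

// visitedByForIn === 2, visitedByFor === 10000
```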

Next time I’m going to talk about low-level **binary arrays**, how to do a C-like **memcpy()** in JavaScript, and how it can hugely speed up the computation process.

Photo by **KhaOS**: https://commons.wikimedia.org/wiki/File:Jenson_Button_2006_Canada.jpg

We have some general knowledge, based mainly on daily observations, but how can we understand the whole picture?

The simplest solution is to stand in front of a shop and ask each customer what he or she likes.

Another is to create a survey and… hold on, we’re living in the 21st century! Let’s solve this problem using machine learning.

Simply speaking, clustering is **assigning a set of objects into groups** such that objects in the same group are in some sense more similar to each other than to those in other groups.

Clustering is an unsupervised machine learning task used mainly in image analysis, information retrieval and other fields related to statistical data analysis.

Clustering is also used in fields not related directly to IT, like biology (identification of species, genome analysis) and astronomy (aggregation of stars, galaxies).

As there is no precise definition of a cluster, there are more than 100 different clustering algorithms. Which one is the best? It has been shown that… there is no single best method :-).

Everything depends on the type of problem and data model.

However, a **good clustering** method is one that produces high-quality clusters with high intra-class similarity and low inter-class similarity at the same time. Quality also depends on the similarity measure used by the method and on its implementation.

But the most important measure is its ability **to discover some or all of the hidden patterns**.

Clustering methods can be divided into several main types:

- **Connectivity models**: for example, hierarchical clustering builds models based on distance connectivity.
- **Centroid models**: for example, the K-means algorithm represents each cluster by a single mean vector.
- **Distribution models**: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the EM algorithm.
- **Density models**: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
- **Subspace models**: clusters are modeled with both cluster members and relevant attributes.
- **Others**: group models and graph-based models.

Let’s go back to our shop. The first thing we want to know is what kind of customers visit our shop every day. For simplicity, we will try to find groups based only on the customer’s age and the hour he or she came to our shop.

Once we have selected the data, we need to select a suitable clustering method. K-means won’t be a good solution here, because we don’t know the final number of clusters.

As we see above, we need a model which can handle data of varying densities.

Density-based models or distribution models would probably be the best solution here. Here, we will try the OPTICS algorithm.

Simply speaking, the OPTICS algorithm selects a random point in a given dataset, creates a cluster and tries to expand it over the nearest points. Points are grouped into clusters of various densities.

Clustering gives us very important information – we can clearly see that people in their 50s visit our shop in the morning and in the evening. Using this knowledge we can design a campaign focused on this particular group.

We can apply the above pattern to different data to gain even deeper insight into customer behaviour.

Clustering random points using DBSCAN: http://lukaszkrawczyk.eu/clustering/

If you want to play with density clustering in JavaScript, I made an open source library: https://github.com/LukaszKrawczyk/clustering

Sample usage:

```
var clustering = require('density-clustering');
var optics = new clustering.OPTICS();
var dataset = [
    [0,0],[6,0],[-1,0],[0,1],[0,-1],
    [45,45],[45.1,45.2],[45.1,45.3],[45.8,45.5],[45.2,45.3],
    [50,50],[56,50],[50,52],[50,55],[50,51]
];

// parameters:
// 6 - neighborhood radius
// 2 - number of points in neighborhood to form a cluster
var clusters = optics.run(dataset, 6, 2);

/*
RESULT:
[
    [0, 2, 3, 4],
    [1],
    [5, 6, 7, 9, 8],
    [10, 14, 12, 13],
    [11]
]
*/
```

For example, if the data we want to monitor is static and fairly easy to predict, we can use some kind of threshold-based alerting system. But what if the data we monitor depends on many conditions, or changes erratically over time? Then we will need a system which changes together with the environment our application lives in. This is just another field where **machine learning** can be applied.

In this post I will show you an example of a simple anomaly detector in JavaScript. Let’s go!

I did some research and found a few solutions to the above problem; which one fits best mainly depends on the complexity of the data.

In the literature, the following techniques have been proposed:

- Distance based techniques (k-nearest neighbor, global / local outlier detection)
- One class support vector machines (SVM)
- Replicator neural networks
- Cluster analysis based outlier detection
- Pointing at records that deviate from learned association rules

In this article I will describe how to implement a simple machine learning algorithm based on global outlier detection.

Let’s say we’ve got some number of users who visit our website every day. Sometimes the number of users grows, sometimes it falls, depending on many different factors.

After analysing our data, it turned out that the most influential factors are the hour and the day of the week. We want to build a model which is sensitive to those factors – a model which will know what the average situation is on, for example, Monday at 10 o’clock.

To do this, we first need to group the data according to those factors. We will get 7 groups of daily data, one for each day of the week. This is how the “Monday group” looks:

To get hourly data across the whole group, we need to cut the graphs perpendicularly. The hourly data from each day can be considered a random variable modeled by a normal distribution with an expected value and a standard deviation.
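A minimal sketch of this grouping step (the record fields `timestamp` and `visitors` are hypothetical; UTC accessors are used so the result does not depend on the machine’s time zone):

```javascript
// Group hourly visitor counts by (day of week, hour), so each bucket becomes
// a random variable: all observed values for e.g. "Monday, 10 o'clock".
function groupByWeekdayAndHour(records) {
    const groups = {}; // key: "<weekday>-<hour>" -> array of counts
    for (const r of records) {
        const d = new Date(r.timestamp);
        const key = d.getUTCDay() + '-' + d.getUTCHours();
        (groups[key] = groups[key] || []).push(r.visitors);
    }
    return groups;
}
```

For example, `groups['1-10']` collects every Monday 10:00 observation, which we can then feed to the expected value / standard deviation code below.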

The three-sigma rule (also known as the 68–95–99.7 rule) states that in a normal distribution almost all values lie within three standard deviations of the mean. That means all values lying outside this boundary should be considered anomalies.

OK, we can start training our detector. First of all, we need to calculate the expected value and standard deviation for each hour (e.g. for every Monday) using the following formulas:

Expected value: E[X] = (1/n) · sum{i=1,n}(x_{i})
Standard deviation: σ = sqrt( E[X^{2}] - (E[X])^{2} )

```
/**
 * Calculating expected value (E) of a random variable
 *
 * @param {array} X - random variable
 * @param {integer} pow - power used in summation operator (optional, default = 1)
 * @returns {float}
 */
expectedValue: function (X, pow) {
    var sum = 0,
        n = X.length;
    pow = pow || 1; // set default value if not set
    if (n == 0) return 0; // if random variable is empty, return 0
    for (var i = 0; i < n; i++)
        sum += Math.pow(X[i], pow) / this.accuracy;
    return sum / (n / this.accuracy);
},

/**
 * Calculating standard deviation (sigma) of a random variable
 *
 * @param {array} X
 * @param {float} Ex - expected value of X (optional)
 * @returns {float}
 */
standardDeviation: function (X, Ex) {
    var Ex2 = this.expectedValue(X, 2);
    Ex = Ex || this.expectedValue(X); // calculate expected value if not set
    return Math.sqrt(Ex2 - Math.pow(Ex, 2)); // square root of the variance
}
```

The results, associated with the hour and day of the week, should be stored in some kind of memory (RAM, a file or a database). In order to get the best results, we need to train our classifier on a large variety of data and (most importantly!) data without anomalies. To keep our detector smart and healthy, training should be done every day.

To detect an anomaly, check whether the number of visitors from the last hour is within the boundaries of

E[X] – 3σ < y < E[X] + 3σ

where **E[X]** is the expected value for the given hour (on the given weekday), σ the corresponding standard deviation, and y the observed value:

```
/**
 * Classifier
 * true = value is correct
 * false = value is an outlier
 *
 * @param {float} value - Random variable value
 * @param {float} E - Expected value of X
 * @param {float} sigma - Standard deviation of X
 * @returns {boolean}
 */
test: function (value, E, sigma) {
    return (Math.abs(E - value) <= (3 * sigma));
}
```

As you can see, the classifier has a very low computational cost (a common property of machine learning algorithms: the expensive part is training, not prediction).

If we need to check the data more frequently (in real time), we can train our classifier on data from every minute or second of the day.

When dealing with any kind of supervised learning algorithm, testing on two kinds of data sets is necessary:

- training data set containing “clean” data without anomalies
- testing data set containing “dirty” data with anomalies

The testing data set should be tagged with labels saying whether each data point is an anomaly or not. While testing, it is necessary to calculate an error function and compare its results for different methods.

A simple error function is given by the sum of the squares of the errors between the result of the chosen method for each data point x_{n} and the corresponding target value t_{n} from the testing data set: E = sum{n}( y(x_{n}) - t_{n} )^{2}.
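As a sketch in JavaScript (where `predict` stands for whichever method is being evaluated):

```javascript
// Sum-of-squares error between predictions y(x_n) and targets t_n.
// `predict` is the method under test, `xs` the test inputs, `ts` the labels.
function sumOfSquaresError(predict, xs, ts) {
    let error = 0;
    for (let n = 0; n < xs.length; n++) {
        const diff = predict(xs[n]) - ts[n];
        error += diff * diff;
    }
    return error;
}
```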

Anomaly detection is a very interesting task and can be solved in various ways. We need to remember that before we stick to one method, we should examine our data in detail and compare different algorithms.

The whole source code of the anomaly detector can be found on my GitHub account.

Similarity can be calculated in various ways, using different mathematical models like vector spaces, probabilistic models or set theory.

How do we do it? The first idea which comes to mind is checking whether all attributes from one object (e.g. words in a document) exist in another one. Unfortunately, this method is very slow for large data sets, so more sophisticated methods are necessary.

Firstly, we need to understand that a document and its contents are abstract concepts for a machine. Therefore, measuring document similarity is in most cases about measuring the distance between all of its attributes (for example, represented as numbers).

The four most popular similarity measurement methods are:

- Euclidean distance
- Cosine similarity
- Jaccard / Tanimoto coefficient
- Pearson Correlation

The simplest method of measuring the similarity of two objects is to use the Euclidean distance, where each attribute of a document is represented as a separate dimension in a vector space, called the preference space. Every data object becomes a point in this space.

A good example is the RGB color space, where each color (object) is represented as a point in 3-dimensional space (red, green and blue).

To measure the distance between two colors we will use the following equation: d(p, q) = sqrt( sum{i=1,n}((q_{i} - p_{i})^{2}) )

This equation can also be used in another form, in order to get higher values for higher similarity: 1 / (1 + d(p, q)), where d() is the Euclidean distance function.
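The distance-to-similarity transform has no code sample below, so here is a small sketch in JavaScript: identical points get a similarity of 1, and the score falls towards 0 as the distance grows.

```javascript
// Euclidean distance between two equal-length vectors (e.g. RGB colours).
function euclidean(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) sum += (b[i] - a[i]) ** 2;
    return Math.sqrt(sum);
}

// Similarity in (0, 1]: 1 / (1 + d(p, q)).
function similarity(a, b) {
    return 1 / (1 + euclidean(a, b));
}
```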

This method is very fast and useful if all the objects we want to compare have the same number of attributes (e.g. a color always has 3 attributes: R, G and B) and this number is not too big. This is very important, because if the vector space contains a large number of dimensions, a strange phenomenon called the “curse of dimensionality” can appear and negatively influence distance measurements.

Implementation in PHP:

```
/**
 * Euclidean distance
 * d(a, b) = sqrt( summation{i=1,n}((b[i] - a[i]) ^ 2) )
 *
 * @param array $a
 * @param array $b
 * @return float|boolean false when vector lengths differ
 */
public function euclidean(array $a, array $b) {
    if (($n = count($a)) !== count($b)) return false;
    $sum = 0;
    for ($i = 0; $i < $n; $i++)
        $sum += pow($b[$i] - $a[$i], 2);
    return sqrt($sum);
}

// usage:
euclidean(array(120, 255, 0), array(130, 255, 10));
```

Another method, also based on the vector space model, is frequently used in text search engines, mainly because it is independent of the number of attributes (space dimensions) and of space sparsity – the situation when most of the words in two documents do not overlap.

Compared to other methods, cosine similarity creates a much better metric for determining the similarity of text documents.

The main idea behind this similarity metric is that all documents are represented as vectors and we try to find the cosine of the angle between them. If the value is close to 1, the two documents are highly similar; if the value is 0, the documents do not share any attributes; if the value is close to -1, the vectors point in completely opposite directions.

In order to measure cosine similarity we need two important mathematical tools: a norm and a dot product.

Implementation in PHP is quite simple:

```
/**
 * Euclidean norm
 * ||x|| = sqrt(x・x) // ・ is a dot product
 *
 * @param array $vector
 * @return mixed
 */
protected function norm(array $vector) {
    return sqrt($this->dotProduct($vector, $vector));
}

/**
 * Dot product
 * a・b = summation{i=1,n}(a[i] * b[i])
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
protected function dotProduct(array $a, array $b) {
    $dotProduct = 0;
    // to speed up the process, use keys with non-empty values
    $keysA = array_keys(array_filter($a));
    $keysB = array_keys(array_filter($b));
    $uniqueKeys = array_unique(array_merge($keysA, $keysB));
    foreach ($uniqueKeys as $key) {
        if (!empty($a[$key]) && !empty($b[$key]))
            $dotProduct += ($a[$key] * $b[$key]);
    }
    return $dotProduct;
}
```

Using above tools we can define cosine similarity:

```
/**
 * Cosine similarity for non-normalised vectors
 * sim(a, b) = (a・b) / (||a|| * ||b||)
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
public function cosinus(array $a, array $b) {
    $normA = $this->norm($a);
    $normB = $this->norm($b);
    return (($normA * $normB) != 0)
        ? $this->dotProduct($a, $b) / ($normA * $normB)
        : 0;
}

// usage:
cosinus(array(1, 1, 1, 0, 3), array(2, 3, 0, 0, 1));
```

The above implementation is very flexible and can be used for any kind of numerical data. For example, each word of a document can be represented by its importance to the document. This is called term frequency and is part of TF-IDF – a numerical document statistic widely used in machine learning.

In case we don’t want to play with numbers, below is a simple implementation working with arrays containing document words:

```
/**
 * Cosine similarity of sets with tokens
 * sim(a, b) = (a・b) / (||a|| * ||b||)
 *
 * @param array $tokensA
 * @param array $tokensB
 * @return mixed
 */
public function cosinusTokens(array $tokensA, array $tokensB) {
    $dotProduct = $normA = $normB = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $dotProduct += $x * $y;
        $normA += $x;
        $normB += $y;
    }
    return ($normA * $normB) != 0
        ? $dotProduct / sqrt($normA * $normB)
        : 0;
}

// usage:
cosinusTokens(array('this', 'is', 'my', 'car'), array('this', 'is', 'my', 'home'));
```

Of course, the implementation depends on your needs, so remember – keep it simple and straightforward!

As I mentioned above, every solution depends on the case and the needs. In some cases, each attribute has a binary value which describes the absence or presence of a characteristic. Then the best solution is to determine similarity by measuring the intersection of both data sets.

The Jaccard / Tanimoto coefficient is the ratio of the size of the intersection to the size of the union: J(A, B) = |A ∩ B| / |A ∪ B|.

PHP implementation:

```
/**
 * Jaccard similarity index
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
public function jaccard(array $a, array $b) {
    return count($this->intersection($a, $b)) / count($this->union($a, $b));
}

/**
 * Set intersection
 *
 * @param array $a
 * @param array $b
 * @return array
 */
protected function intersection(array $a, array $b) {
    return array_values(array_intersect($a, $b));
}

/**
 * Set union
 *
 * @param array $a
 * @param array $b
 * @return array
 */
protected function union(array $a, array $b) {
    return array_values(array_unique(array_merge($a, $b)));
}

// usage:
jaccard(array('cat', 'cow', 'dog'), array('cow', 'bird', 'cat'));
```

This rule is one of the most famous theorems in statistics and is widely used in many fields, from engineering and economics to medicine and law.

The Naive Bayes classifier is a rather simple algorithm among classification algorithms – other, more complex algorithms give better accuracy – but if NBC is trained well on a large data set, it can give surprisingly good results for much less effort.

Bayes’ rule is a way of looking at conditional probabilities that allows us to flip the condition around in a convenient way. A conditional probability, normally written as P(A|B), is the probability that event A will occur given the evidence B.

What makes it “naive” is the assumption that all observed values are independent of each other. For example, in the case of natural language, the algorithm won’t understand the connection between the words “delicious” and “cake”, which in the real world appear close to each other. Nevertheless, as I mentioned above – if the system is trained well, the results can be much better than we might expect.

Let’s imagine we want to detect language of a document – we will need to calculate the probability that a given document belongs to a given language (class). This probability can be written as follows:

Where:

- *p(S|D)* – the probability that document D belongs to class S
- *p(S)* – the probability of class S
- *p(w_{i}|S)* – the probability of each token (word) from the document appearing in class S
- *p(D)* – the probability of document D

As we don’t need to calculate a precise probability (we only need to rank classes from most to least probable) and the probability *P(D)* is always the same, we can drop it and rewrite the rule:

Let’s sum up – in order to compute the score of document D belonging to class S, we need to calculate:

Where *t*_{i} is a token (word) from document *D*. For simplicity, we assumed that the probability *p(S)* of each class is equal. The extra *1* and *count(all tokens)* terms are called *Laplace smoothing* and prevent multiplying by zero. If we didn’t have them, any document with an unseen token in it would score zero.

OK, after preparing the mathematical model it’s time to write some code. First of all we need to train our classifier – provide a large number of documents, each tagged with its class. Next we need to tokenise each document (divide it into separate words) and count the occurrences of each token in the document, the total number of tokens, documents, classes, etc.

```
public function train(IDataSource $dataSource, $class) {
    // class initialization
    if (!in_array($class, $this->classes)) {
        $this->classes[] = $class;
        $this->classTokenCounter[$class] = 0;
        $this->classDocumentCounter[$class] = 0;
    }

    // train class using provided documents
    while ($document = $dataSource->getNextDocument()) {
        $this->documentCounter++;
        $this->classDocumentCounter[$class]++;
        // add all document tokens to the global vocabulary
        foreach ($this->tokenise($document) as $token) {
            $this->vocabulary[$token][$class] =
                isset($this->vocabulary[$token][$class])
                    ? $this->vocabulary[$token][$class] + 1
                    : 1;
            $this->classTokenCounter[$class]++;
            $this->tokenCounter++;
        }
    }
}

private function tokenise($text) {
    return preg_split('/\s+/', mb_strtolower($text));
}
```

After we have successfully trained our classifier, we can classify text by calculating its score, as explained above.

```
public function classify($document, $showProbabilities = false) {
    $tokens = $this->tokenise($document);
    $posteriors = array();
    // for each class compute the posterior probability
    foreach ($this->classes as $class) {
        $posteriors[$class] = $this->posterior($tokens, $class);
    }
    arsort($posteriors);
    return ($showProbabilities) ? $posteriors : key($posteriors);
}

private function posterior($tokens, $class) {
    $posterior = 1;
    foreach ($tokens as $token) {
        $count = isset($this->vocabulary[$token][$class])
            ? $this->vocabulary[$token][$class]
            : 0;
        // multiply by token probability, add Laplace smoothing
        $posterior *= ($count + 1) / ($this->classTokenCounter[$class]
            + $this->tokenCounter);
    }
    return $this->prior($class) * $posterior;
}

private function prior($class) {
    return $this->classDocumentCounter[$class] / $this->documentCounter;
}
```

The whole source code with a working example can be found on my GitHub account:

https://github.com/LukaszKrawczyk/PHPNaiveBayesClassifier

Happy classifying!

In general, machine-learning-based classification is about learning how to separate two sets of data examples. Based on this knowledge, the system can correctly put unseen examples into one of the two sets. A spam filter is a very good example of an automatic classification system. Imagining a two-dimensional space with points, the SVM algorithm is about finding a line (hyperplane) that separates the points into two classes.

The main idea behind this algorithm is that the gap dividing the points should be as wide as possible.

SVMs work with vectors in n-dimensional space. The number of dimensions is equal to the number of “features” an object can have, so the whole space describes all possible configurations of objects (all possible solutions). For example, if we want to classify human gender based only on data containing height, weight and foot size, the vector space will have three dimensions.

So what is a “feature”, then? A feature is a property of an object (e.g. height, a word, etc.), and its value describes the importance of that particular property to the object. In the previous example (gender classification) a feature can be height or weight; in the case of a text document, it can be the number of occurrences of a particular word in the whole document.

Speaking of document classification, there are several ways to describe the importance of a property to an object:

- **TF (term frequency)**: how many times a word appears in a document.
- **boolean**: 1 if the term exists in the document, 0 if it does not.
- **logarithmic**: *tf(t,d) = 1 + log(f(t,d))*, where *t* is a term, *d* a document, *f* the term frequency.
- **normalized**: *tf(t,d) = f(t,d) / max(f(w,d))*, where *w* is any word of the document and *f* is the frequency of a word in the document.
- **IDF (inverse document frequency)**: a measure of whether the term (t) is common or rare across all documents (D): *idf(t,D) = log(|D| / |d containing t|)*.
- **TF-IDF**: *tfidf(t,d,D) = tf(t,d) × idf(t,D)* – this method is very popular and gives quite good results.
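A sketch of the first and last measures (raw TF, IDF and TF-IDF) in JavaScript, assuming each document is a plain array of tokens (and that the term occurs in at least one document, so the IDF denominator is non-zero):

```javascript
// tf(t, d): raw term frequency – how often `term` occurs in `doc`.
function tf(term, doc) {
    return doc.filter(t => t === term).length;
}

// idf(t, D) = log(|D| / |{d in D : t in d}|)
// Assumes `term` occurs in at least one document.
function idf(term, documents) {
    const containing = documents.filter(d => d.includes(term)).length;
    return Math.log(documents.length / containing);
}

// tfidf(t, d, D) = tf(t, d) * idf(t, D)
function tfidf(term, doc, documents) {
    return tf(term, doc) * idf(term, documents);
}
```

Note how a term that appears in every document gets an IDF of log(1) = 0: it carries no discriminating information, so its TF-IDF weight vanishes.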

As I mentioned above, in the case of two dimensions we are looking for a line dividing two sets of points. In two dimensions, any line can be defined as *y = ax + b*. In higher dimensions, *ax + b* becomes *w・x + b*, where *w* is the normal vector to the hyperplane and *w・x* is the dot product of the vectors *w* and *x*. In order to find such a hyperplane we need to find the values of the parameters *w* and *b* which maximize the margin between the hyperplane and the points of both classes.

The above task needs some sophisticated mathematical tools like quadratic programming optimization, the kernel trick or Lagrange multipliers. Therefore I can advise you to visit the amazing blog of Ian Barber, a developer at Google, where you’ll find more details about this algorithm and its implementation in PHP. On Wikipedia you can also find a very good explanation of the mathematical background of this task.

The strongest point of SVM compared to other classification algorithms is that the classification does not need to be linear. This means the data space can be divided by a hyperplane lying in a much higher dimension than the original data space.

Good visualization can be seen below:

Right – in most cases we will need an algorithm which can separate data into multiple classes.

As SVM is a binary classification algorithm, if we want to separate objects into more than two classes we need to train the algorithm separately for each class in one of the following ways: *OneToRest*, *OneToOne* or *OneToAll*.
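For example, the *OneToRest* (one-vs-rest) strategy can be sketched generically: train one binary classifier per class, treating that class as positive and everything else as negative, then pick the class whose classifier reports the highest score. Here `trainBinary` is a hypothetical stand-in for any binary trainer (an SVM in this post) that returns a scoring function.

```javascript
// One-vs-rest: reduce a multi-class problem to one binary problem per class.
// `trainBinary(xs, ys)` must return a function x -> score (higher score =
// more confident the example is positive); it is a hypothetical placeholder.
function trainOneVsRest(examples, labels, trainBinary) {
    const classes = [...new Set(labels)];
    const scorers = classes.map(c =>
        trainBinary(examples, labels.map(l => (l === c ? 1 : -1)))
    );
    return function classify(x) {
        let best = 0;
        for (let i = 1; i < classes.length; i++)
            if (scorers[i](x) > scorers[best](x)) best = i;
        return classes[best];
    };
}
```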

After training, such an algorithm can help us classify many different kinds of data relatively quickly.

The first thing we need to do is choose the area where we will be checking colours. This area is called the ROI (region of interest) and is usually a rectangle located in the center of the image, covering 5/6 of its size. The ROI is necessary because most images have their relevant data located in the center (faces, logos, etc.), so the colours at the edges won’t be important for us.

The next step is to decide the size of the detection mesh – the distance between the points where we will check the image colour. Of course, the more points we check the better, but for big images the algorithm can become very slow. Therefore we need to set up a fixed number of points per image width / height. I decided to check every 1/10^{th} of the width and 1/10^{th} of the height.
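These two steps can be sketched in JavaScript (the 5/6 ROI and the 1/10 spacing are the values chosen above; whether the spacing is taken over the image or over the ROI is a detail – here I take it over the ROI):

```javascript
// Compute the region of interest (a centred rectangle covering 5/6 of each
// dimension) and the sampling step (1/10 of the ROI width / height).
function detectionGrid(width, height) {
    const margin = { x: width / 12, y: height / 12 }; // (1 - 5/6) / 2 per side
    const roi = {
        x: { min: margin.x, max: width - margin.x },
        y: { min: margin.y, max: height - margin.y }
    };
    const pointDistance = {
        x: (roi.x.max - roi.x.min) / 10,
        y: (roi.y.max - roi.y.min) / 10
    };
    return { roi, pointDistance };
}
```

The resulting `roi` and `pointDistance` are exactly the values the pixel-scanning loop below iterates over.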

OK, let’s retrieve pixel colour:

```
// ... open image ...
for ($x = $roi['x']['min']; $x <= $roi['x']['max']; $x += $pointDistance['x']) {
    for ($y = $roi['y']['min']; $y <= $roi['y']['max']; $y += $pointDistance['y']) {
        // get color vector of current pixel
        $colorVector = imageColorsForIndex($img, imageColorAt($img, $x, $y));
        // ... do something with colour ...
    }
}
```

The next step is to decide whether we want to map pixel colours to a predefined palette or not. If not, there is nothing more to do than save the pixel colour data acquired using the above code. But if we want to map pixel colours to a palette, things get more complicated. In my case I had a palette of 12 colours: [red, orange, yellow, green, turquoise, blue, purple, pink, white, gray, black, brown].

The heart of the image colour detection algorithm is finding the shortest distance between two colours: one from the palette and one from the image pixel.

This leads us to checking the **shortest distance between two n-dimensional vectors in one of chosen colour space**.
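That nearest-distance search over the palette can be sketched as follows (JavaScript for illustration; `rgbDistance` is plain RGB Euclidean distance as a placeholder for whichever colour space you settle on, and the palette values are approximate):

```javascript
// Euclidean distance between two RGB colours.
function rgbDistance(a, b) {
    return Math.sqrt((a.r - b.r) ** 2 + (a.g - b.g) ** 2 + (a.b - b.b) ** 2);
}

// Find the palette entry closest to the given pixel colour.
function closestPaletteColor(palette, pixel) {
    let best = null;
    let bestDist = Infinity;
    for (const [name, color] of Object.entries(palette)) {
        const d = rgbDistance(color, pixel);
        if (d < bestDist) {
            bestDist = d;
            best = name;
        }
    }
    return best;
}

// A small subset of the 12-colour palette from the text (approximate values).
const palette = {
    red:   { r: 255, g: 0,   b: 0 },
    green: { r: 0,   g: 128, b: 0 },
    blue:  { r: 0,   g: 0,   b: 255 },
    white: { r: 255, g: 255, b: 255 },
    black: { r: 0,   g: 0,   b: 0 }
};
// e.g. closestPaletteColor(palette, { r: 250, g: 10, b: 10 }) → "red"
```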

There are three main colour spaces:

- RGB – red, green and blue
- HSV / HSB – hue, saturation and value
- LAB – lightness, a and b color-opponent dimensions based on XYZ color space coordinates

The easiest way is to compare colour vectors in RGB space – we just need to compute the Euclidean distance between each pixel's colour and each colour in our palette:
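For completeness, the Euclidean distance between two RGB colours $a$ and $b$ is:

$$d(a, b) = \sqrt{(R_a - R_b)^2 + (G_a - G_b)^2 + (B_a - B_b)^2}$$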

For most images this method will give good results, but unfortunately images containing pale colours like light brown/yellow/green/pink will be mismatched. This is because the RGB colour space does not reflect the way the human eye perceives colours. In RGB, pale pink and brown are much closer to each other than pink is to red – the opposite of how a human sees them.

Distance between colours will be as follows:

```
function colorDistance($a, $b) {
    return sqrt(pow($a['red'] - $b['red'], 2)
        + pow($a['green'] - $b['green'], 2)
        + pow($a['blue'] - $b['blue'], 2));
}
```

We can get much better results using the Hue-Saturation-Value colour space. HSV space is a cone with the colour wheel at the bottom.

In order to convert colours to the HSV colour space, we can use the following function:

```
function rgbToHsv($rgb) {
    $h = 0;
    $s = 0;
    $r = $rgb[0] / 255;
    $g = $rgb[1] / 255;
    $b = $rgb[2] / 255;
    $max = max($r, $g, $b);
    $min = min($r, $g, $b);
    // hue: use fmod(), as the integer "%" operator would truncate
    // the fractional part of the hue
    if ($max == $min)    $h = 0;
    else if ($max == $r) $h = fmod(60 * ($g - $b) / ($max - $min) + 360, 360);
    else if ($max == $g) $h = 60 * ($b - $r) / ($max - $min) + 120;
    else if ($max == $b) $h = 60 * ($r - $g) / ($max - $min) + 240;
    // saturation
    if ($max == 0) $s = 0;
    else $s = 1 - $min / $max;
    // value
    $v = $max;
    return array($h, $s * 100, $v * 100);
}
```

In the HSV colour space the brightest colours are at the bottom of the cone and the darkest at its top. Pale colours are close to the cone's axis. Hue is a degree on a colour wheel with values from 0–360, so when computing the distance between colours we need to "fold" this dimension to get the proper value.
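Written out, folding the hue just means taking the shorter way around the 0–360 wheel:

$$\Delta h = \min\big(|h_a - h_b|,\ 360 - |h_a - h_b|\big)$$

and the distance then becomes $\sqrt{\Delta h^2 + (s_a - s_b)^2 + (v_a - v_b)^2}$.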

```
function colorDistance($a, $b) {
    // folding the "H" dimension: take the shorter way around the colour wheel
    $hueDiff = min(abs($a[0] - $b[0]), 360 - abs($a[0] - $b[0]));
    return sqrt(pow($hueDiff, 2)
        + pow($a[1] - $b[1], 2)
        + pow($a[2] - $b[2], 2));
}
```

That's all! As a result of the above code you'll get the shortest distance between two colours, which will help you decide which palette colour is closest to the pixel's colour.

As far as I know, the best colour detection results can be achieved using the LAB colour space. Unfortunately, conversion between RGB and LAB is much more complicated, so I decided to write a separate post about that topic.

Useful links:

- http://www.emanueleferonato.com/2009/08/28/color-differences-algorithm/
- http://stevehanov.ca/blog/index.php?id=116
- http://en.wikipedia.org/wiki/Color_difference
- http://homepages.inf.ed.ac.uk/rbf/PAPERS/iccv99.pdf
- http://www.cs.cmu.edu/~har/visapp2006.pdf
- http://research.cs.wisc.edu/vision/piximilar/
- http://mattmueller.me/Piximilar/paper.pdf
- http://mattmueller.me/blog/creating-piximilar-image-search-by-color

```
$("input").bind("keydown", function (event) {
    console.log(event.type, event);
});
```

For the Latin alphabet there was no problem and the above code worked well, but for Japanese/Chinese characters it didn't. The keydown event didn't occur at all when the user typed text in Japanese. Fortunately, the problem can be solved quickly with the following code:

```
$("input").bind("keydown input", function (event) {
    console.log(event.type, event);
});
```

For Japanese/Chinese characters we always need to listen to two types of events: `keydown` and `input`.

- **keydown** – occurs when a key is pressed
- **keypress** – represents a character being typed
- **keyup** – fired after a key is released
- **input** – fired after the input field is changed
- **paste** – fired after the user pastes content into the field

A brief explanation of key events can be found here.

As there is no need to reinvent the wheel, I found a very good jQuery tagging and auto-completion plugin called Select2. This plugin replaces input boxes and drop-down lists, allowing the user to add tags with instant word auto-completion. The plugin was created by Igor Vaynberg. It also has the same problem with Japanese characters, but it can be easily fixed using the code above.

- Random array shuffle
- Non-random index shuffle – array indices are always re-organized in the same way
- Non-random value shuffle – the shuffle result depends on the array values

This can be done with the PHP core `shuffle()` function in the following way:

```
$data = array('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I');
shuffle($data);
var_dump($data);
```

Each time this function is called, the shuffle result will be different, so in my case this method wasn't very useful.
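A side note (my own addition, not part of the approach described here): if I had only needed *repeatable* pseudo-random results, seeding the random generator would make the order deterministic. Here is a JavaScript sketch of a Fisher–Yates shuffle driven by a small linear congruential generator (the LCG constants are the classic Numerical Recipes ones). I went with a different, index-based approach instead, described next.

```javascript
// Deterministic Fisher-Yates shuffle driven by a small linear congruential
// generator, so the same seed always yields the same order.
function seededShuffle(data, seed) {
    const result = data.slice();
    let state = seed >>> 0;
    const nextRandom = () => {
        // Numerical Recipes LCG constants; returns a float in [0, 1)
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        return state / 0x100000000;
    };
    for (let i = result.length - 1; i > 0; i--) {
        const j = Math.floor(nextRandom() * (i + 1));
        [result[i], result[j]] = [result[j], result[i]];
    }
    return result;
}
```

The same seed always reproduces the same permutation; a different seed (usually) gives a different one.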

The non-random index shuffle turned out to be the best solution, as I wanted to re-organize search results always in the same, static way.

I did some research and found out that the best method is a point reflection of the odd indices of an array. It guarantees the results will be shuffled and evenly distributed across the array.

Let's imagine we've got a one-dimensional array and we display it in a two-dimensional grid. The shuffle algorithm swaps each odd-indexed item in the first half of the array with its mirrored counterpart at the end of the array.

Let's see what the code looks like:

```
function array_shuffle(array $data) {
    if (empty($data) || count($data) < 3) return $data;
    $length = count($data);
    // for each odd index in the first half of the array...
    for ($i = 1; $i < floor($length / 2); $i += 2) {
        // ...swap the item with its mirrored counterpart
        // at the end of the array
        $tmp = $data[$length - 1 - $i];
        $data[$length - 1 - $i] = $data[$i];
        $data[$i] = $tmp;
    }
    // reset array indices
    return array_values($data);
}

// usage
$data = array(
    'A', 'B', 'C',
    'D', 'E', 'F',
    'G', 'H', 'I'
);
var_dump(array_shuffle($data));
```

For higher dimensions, all we have to do is flatten the array into a one-dimensional list using the following function:

```
function array_flatten(array $array) {
    $flatten = array();
    array_walk_recursive($array, function($value) use (&$flatten) {
        $flatten[] = $value;
    });
    return $flatten;
}

$multiDimensionalArray = array(...);
$flatArray = array_flatten($multiDimensionalArray);
var_dump(array_shuffle($flatArray));
```

And that’s all!
