Measuring document similarity in php

Similarity measurement is an important task in machine learning, used in searching engines and ranking algorithms.
Similarity can be calculated in various ways, using different mathematical models like vector space, probabilistic model or set theory.

How to do it? The first idea which came to our mind is checking whether all attributes from one object (e.g. words in document) exist in another one. Unfortunately, this method is very slow in case of large data sets, therefore more sophisticated methods are necessary.

Firstly, we need to understand that document and its contents are abstract concepts for a machine. Therefore measuring document similarity in most of the cases is about measuring distance between all of it’s attributes (for example, represented as numbers).

There are four most popular similarity measurement methods:

  • Euclidean distance
  • Cosine similarity
  • Jaccard / Tanimoto coefficient
  • Pearson Correlation

Continue readingContinue reading

Text classification using Naive Bayes Classifier

In previous post I wrote about SVM – data classification algorithm used in Machine Learning. Algorithm described there was a non-probabilistic method of classifying correlated data (data which depend on each other sometimes). This time I will write about one more classification algorithm which is called Naive Bayes Classifier. NBC is a probabilistic classifier of previously unseen data based around the Bayes theorem rule.
Bayes theoremThis rule is one of the most famous theorem in Statistics and is widely used in many fields from engineering and economics to medicine and law.

Naive Bayes Clasifier is rather simple algorithm among classification algorithms – other, more complex algorithms giving better accuracy, but if NBC is trained well on a large data set it can give surprisingly good results for much less effort.
Continue readingContinue reading

Support Vector Machines

Support Vector Machine is a machine learning algorithm used in automatic classification of previously unseen data. In this post I would like to explain how SVM works and where it’s usually used.

In general, machine learning based classification is about learning how to separate two sets of data examples. Basing on this knowledge system can correctly put unseen examples into one of the other set. Spam filter is a very good example of automatic classification system. Let’s imagine a two dimensional space with points, SVM algorithm is about finding a line (hyperplane) that separate points into two classes.


The main idea behind this algorithm is that the gap dividing points should be as wide as it’s possible.
Continue readingContinue reading