Similarity measurement is an important task in machine learning, used in search engines and ranking algorithms.
Similarity can be calculated in various ways, using different mathematical models such as vector spaces, probabilistic models, or set theory.
How do we do it? The first idea that comes to mind is to check whether every attribute of one object (e.g. each word in a document) also exists in the other. Unfortunately, this approach is very slow on large data sets, so more sophisticated methods are necessary.
First, we need to understand that a document and its contents are abstract concepts for a machine. In most cases, therefore, measuring document similarity comes down to measuring the distance between the documents' attributes (for example, represented as numbers).
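To make "attributes represented as numbers" concrete, here is a minimal sketch that turns two short documents into word-count vectors over a shared vocabulary. The function name `to_vectors` and the whitespace tokenization are assumptions made for illustration; real systems usually use more careful tokenization and weighting.

```python
from collections import Counter


def to_vectors(doc_a, doc_b):
    """Represent two documents as word-count vectors over a shared vocabulary.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # The shared vocabulary fixes the meaning of each vector position.
    vocab = sorted(set(counts_a) | set(counts_b))
    # Counter returns 0 for words that a document does not contain.
    vec_a = [counts_a[word] for word in vocab]
    vec_b = [counts_b[word] for word in vocab]
    return vocab, vec_a, vec_b


vocab, vec_a, vec_b = to_vectors("the cat sat", "the cat ran")
print(vocab)  # ['cat', 'ran', 'sat', 'the']
print(vec_a)  # [1, 0, 1, 1]
print(vec_b)  # [1, 1, 0, 1]
```

Once both documents live in the same numeric space, comparing them becomes a question of comparing two vectors.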
The four most popular similarity measures are:
- Euclidean distance
- Cosine similarity
- Jaccard / Tanimoto coefficient
- Pearson Correlation
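
As a rough illustration of the four measures listed above, the following sketch implements each one in plain Python using the standard textbook definitions. The function names and the handling of degenerate inputs (zero vectors, empty sets, constant vectors) are assumptions chosen for this example, not a fixed convention.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def jaccard_coefficient(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def pearson_correlation(a, b):
    """Linear correlation between two equal-length numeric vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: correlation with a constant vector is reported as 0
    return cov / math.sqrt(var_a * var_b)


print(euclidean_distance([0, 0], [3, 4]))                        # 5.0
print(jaccard_coefficient({"a", "b", "c"}, {"b", "c", "d"}))     # 0.5
```

Note that Euclidean distance is a distance, so smaller values mean more similar objects, while for the other three measures larger values mean more similar objects.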