Measuring document similarity in php

Similarity measurement is an important task in machine learning, used in searching engines and ranking algorithms.
Similarity can be calculated in various ways, using different mathematical models like vector space, probabilistic model or set theory.

How to do it? The first idea which came to our mind is checking whether all attributes from one object (e.g. words in document) exist in another one. Unfortunately, this method is very slow in case of large data sets, therefore more sophisticated methods are necessary.

Firstly, we need to understand that document and its contents are abstract concepts for a machine. Therefore measuring document similarity in most of the cases is about measuring distance between all of it’s attributes (for example, represented as numbers).

There are four most popular similarity measurement methods:

  • Euclidean distance
  • Cosine similarity
  • Jaccard / Tanimoto coefficient
  • Pearson Correlation

Euclidean Distance

The simplest method of measuring similarity of two objects is to use the Euclidean distance, where each attribute of a doument is represented as a separate dimension in vector space, called preference space. Every data object becomes a point in such space.
Good example is a RGB color space, where each color (object) is represented as a point in 3 dimensional space (red, green and blue).
To measure distance between two colors we will use following equation:

euclidean

This equation can be used in another form, in order to get higher values for high similarity: 1 / (1 + d(p, q)), where d() is euclidean distance function.

This method is very fast and useful if all objects we want to compare has same number of attributes (e.g. color always has 3 attributes: RGB) and this number is not too big. This is very important because if vector space contain large amount of dimensions a strange phenomenon can appear. It is called a “Curse of dimensionality” and may have negative influence on distance measurements.

Implementation in PHP:


/**
 * Euclidean distance
 * d(a, b) = sqrt( summation{i=1,n}((b[i] - a[i]) ^ 2) )
 *
 * @param array $a
 * @param array $b
 * @return boolean
 */
public function euclidean(array $a, array $b) {
    if (($n = count($a)) !== count($b)) return false;
    $sum = 0;
    for ($i = 0; $i < $n; $i++)
        $sum += pow($b[$i] - $a[$i], 2);
    return sqrt($sum);
}

// usage:
euclidean(array(120, 255, 0), array(130, 255, 10));

Cosine Similarity

Another method, also based on vector space model is frequently used in text search engines. Mainly because it is independed from the number of attributes (space dimensions) and space sparcity – situation when the most of words in two documments are incommon.
Comparing to other methods Cosine Similarity create much more better metric for determining similarities of text documents.

Mathematicaly:
cosine

The main idea behind this similarity metric is all documents are represented as a vectors and we are trying to find cosine of angle between them. If value is closer to 1 two documents are highly similar, if value is 0 documents does not share any attributes, if value is close to -1 vectors has completely opposite direction.

In order to measure cosine similarity we need two important mathematical tools: a norm and a dot product.
Implementation in PHP is quite simple:


/**
 * Euclidean norm
 * ||x|| = sqrt(x・x) // ・ is a dot product
 *
 * @param array $vector
 * @return mixed
 */
protected function norm(array $vector) {
    return sqrt($this->dotProduct($vector, $vector));
}

/**
 * Dot product
 * a・b = summation{i=1,n}(a[i] * b[i])
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
protected function dotProduct(array $a, array $b) {
    $dotProduct = 0;
    // to speed up the process, use keys with non-empty values
    $keysA = array_keys(array_filter($a));
    $keysB = array_keys(array_filter($b));
    $uniqueKeys = array_unique(array_merge($keysA, $keysB));
    foreach ($uniqueKeys as $key) {
        if (!empty($a[$key]) && !empty($b[$key]))
            $dotProduct += ($a[$key] * $b[$key]);
    }
    return $dotProduct;
}

Using above tools we can define cosine similarity:


/**
 * Cosine similarity for non-normalised vectors
 * sim(a, b) = (a・b) / (||a|| * ||b||)
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
public function cosinus(array $a, array $b) {
    $normA = $this->norm($a);
    $normB = $this->norm($b);
    return (($normA * $normB) != 0)
           ? $this->dotProduct($a, $b) / ($normA * $normB)
           : 0;
}

// usage:
cosinus(array(1, 1, 1, 0, 3), array(2, 3, 0, 0, 1));

Above implementation is very flexible and can be used for any kind of numerical data. For example, each word of a document can be represented as it’s importance for the document. This is called Term Frequency and is a part of TFIDF – a document numerical statistic widely used in Machine Learning.

In case we don’t want to play with numbers, below I’ve prepared simple implementation working with arrays containing document words:


/**
 * Cosine similarity of sets with tokens
 * sim(a, b) = (a・b) / (||a|| * ||b||)
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
public function cosinusTokens(array $tokensA, array $tokensB) {
    $dotProduct = $normA = $normB = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $dotProduct += $x * $y;
        $normA += $x;
        $normB += $y;
    }

    return ($normA * $normB) != 0
        ? $dotProduct / sqrt($normA * $normB)
        : 0;
}

// usage:
cosinusTokens(array('this', 'is', 'my', 'car'), array('this', 'is', 'my', 'home'));

Of course, implementation depend on needs so remember – keep it simple and straightforward!

Jaccard / Tanimoto Coefficient

As I mentioned above, every solution depend on a case and needs. In some cases each attribute has binary value which describe the absence or presence of a characteristic. Thus, the best solution will be to determine similarity through measuring intersection of both data sets.

Jaccard / Taniomoto Coefficient is a ratio of the intersecting set to the union set. It’s represented as follows:
jaccard

PHP implementation:


/**
 * Jaccard similarity index
 *
 * @param array $a
 * @param array $b
 * @return mixed
 */
public function jaccard(array $a, array $b) {
    return count($this->intersection($a, $b)) / count($this->union($a, $b));
}

/**
 * Set intersection
 *
 * @param array $a
 * @param array $b
 * @return array
 */
protected function intersection(array $a, array $b) {
    return array_values(array_intersect($a, $b));
}

/**
 * Set union
 *
 * @param array $a
 * @param array $b
 * @return array
 */
protected function union(array $a, array $b) {
    return array_values(array_unique(array_merge($a, $b)));
}

// usage:
jaccard(array('cat', 'cow', 'dog') array('cow', 'bird', 'cat'));

Links