Measuring document similarity in PHP

Similarity measurement is an important task in machine learning, used in search engines and ranking algorithms.
Similarity can be calculated in various ways, using different mathematical models such as the vector space model, probabilistic models or set theory.

How do we do it? The first idea that comes to mind is to check whether all attributes of one object (e.g. the words in a document) exist in another one. Unfortunately, this method is very slow for large data sets, so more sophisticated methods are necessary.

First, we need to understand that a document and its contents are abstract concepts for a machine. Measuring document similarity is therefore, in most cases, about measuring the distance between the documents' attributes (represented, for example, as numbers).
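One way to represent a document as numbers is a word-count vector. The sketch below is only an illustration, not the method from the full post: the helper names `termFrequencies` and `toVector` are hypothetical, and it assumes plain ASCII text so that `strtolower` is enough for case folding.

```php
<?php
// Hypothetical helper: turn a document into a word => count map.
function termFrequencies(string $text): array
{
    // Lower-case the text and split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Hypothetical helper: project a word => count map onto a fixed vocabulary,
// so that every document becomes a vector of the same length.
function toVector(array $frequencies, array $vocabulary): array
{
    $vector = [];
    foreach ($vocabulary as $word) {
        $vector[] = $frequencies[$word] ?? 0;
    }
    return $vector;
}

$a = termFrequencies('PHP is fun, PHP is fast');
$b = termFrequencies('Python is fun');

// The shared vocabulary is the union of all words seen in both documents.
$vocabulary = array_keys($a + $b);
sort($vocabulary);

$vectorA = toVector($a, $vocabulary);
$vectorB = toVector($b, $vocabulary);
```

Once both documents share the same vocabulary, their vectors have the same length and can be compared position by position.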

The four most popular similarity measurement methods are:

  • Euclidean distance
  • Cosine similarity
  • Jaccard / Tanimoto coefficient
  • Pearson Correlation

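As a rough illustration of the four measures listed above, here is a minimal sketch using their textbook definitions. The function names are hypothetical, the vector-based functions assume two numeric arrays of equal length with matching keys, and the Jaccard function treats its inputs as sets of values.

```php
<?php
// Euclidean distance between two equal-length numeric vectors (0 means identical).
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

// Cosine of the angle between two vectors; 0.0 is returned for a zero vector.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Jaccard coefficient: size of the intersection divided by size of the union.
function jaccardCoefficient(array $a, array $b): float
{
    $a = array_unique($a);
    $b = array_unique($b);
    $union = count(array_unique(array_merge($a, $b)));
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect($a, $b)) / $union;
}

// Pearson correlation; 0.0 is returned when either vector is empty or constant.
function pearsonCorrelation(array $a, array $b): float
{
    $n = count($a);
    if ($n === 0) {
        return 0.0;
    }
    $meanA = array_sum($a) / $n;
    $meanB = array_sum($b) / $n;
    $covariance = 0.0;
    $varianceA = 0.0;
    $varianceB = 0.0;
    foreach ($a as $i => $value) {
        $deltaA = $value - $meanA;
        $deltaB = $b[$i] - $meanB;
        $covariance += $deltaA * $deltaB;
        $varianceA += $deltaA ** 2;
        $varianceB += $deltaB ** 2;
    }
    if ($varianceA == 0.0 || $varianceB == 0.0) {
        return 0.0;
    }
    return $covariance / sqrt($varianceA * $varianceB);
}
```

Note that Euclidean distance is a distance (smaller means more similar), while the other three are similarity scores (larger means more similar).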

Colour detection algorithm in PHP

For the last few weeks I’ve been working on a colour detection algorithm, so in this post I will share a few ideas on how to solve this problem.

The first thing we need to do is choose the area in which we will check colours. This area is called the ROI (region of interest), and usually it’s a rectangle located in the center of an image, covering 5/6 of its size. The ROI is necessary because most images have their relevant data located in the center (faces, logos, etc.), so colours at the edges won’t be important for us.
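A centered ROI can be computed from the image dimensions alone. The sketch below is an assumption-laden illustration: `centeredRoi` is a hypothetical helper, and "5/6 of its size" is read here as 5/6 of the width and 5/6 of the height (the post may instead mean 5/6 of the area).

```php
<?php
// Hypothetical helper: a centered ROI rectangle for an image of the given size.
// Assumption: the ratio is applied to each dimension, not to the area.
function centeredRoi(int $width, int $height, float $ratio = 5 / 6): array
{
    $roiWidth = (int) round($width * $ratio);
    $roiHeight = (int) round($height * $ratio);

    return [
        // The top-left corner is offset by half of the margin left on each axis.
        'x' => intdiv($width - $roiWidth, 2),
        'y' => intdiv($height - $roiHeight, 2),
        'width' => $roiWidth,
        'height' => $roiHeight,
    ];
}

$roi = centeredRoi(600, 300);
// $roi is ['x' => 50, 'y' => 25, 'width' => 500, 'height' => 250]
```

Only pixels inside this rectangle would then be sampled when counting colours.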


Shuffling arrays in PHP

Recently, while working on a portfolio search system, I ran into a problem with displaying results. It turned out that portfolios belonging to the same user appeared close to one another. This was not good, as I wanted to show a variety of different users’ portfolios on the search results page, so I needed to shuffle them somehow. Generally, there are three ways to shuffle arrays in PHP:

  • Random array shuffle
  • Non-random index shuffle – array indices are always re-organized in the same way
  • Non-random value shuffle – shuffle result depending on array values
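The three approaches above could be sketched roughly as follows. This is one possible interpretation, not necessarily the one used in the full post: the function names are hypothetical, the non-random variants are built on a Fisher-Yates shuffle driven by a seeded `mt_rand()`, and the value shuffle assumes the array holds scalar values that can be joined into a string.

```php
<?php
// Random shuffle: a thin wrapper around PHP's built-in shuffle().
function randomShuffle(array $items): array
{
    shuffle($items); // shuffles in place and re-indexes the array
    return $items;
}

// Non-random index shuffle: the same seed always produces the same permutation
// of positions, regardless of what the values are.
function seededShuffle(array $items, int $seed): array
{
    $items = array_values($items);
    mt_srand($seed);
    // Fisher-Yates: swap each position with a pseudo-random earlier one.
    for ($i = count($items) - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);
        [$items[$i], $items[$j]] = [$items[$j], $items[$i]];
    }
    mt_srand(); // re-seed randomly so later mt_rand() calls are not predictable
    return $items;
}

// Non-random value shuffle: the seed is derived from the values themselves,
// so the same input array always yields the same order.
function valueShuffle(array $items): array
{
    return seededShuffle($items, crc32(implode('|', $items)));
}

$portfolios = ['anna-1', 'anna-2', 'bob-1', 'bob-2', 'carl-1'];
$stable = seededShuffle($portfolios, 42);
$byValue = valueShuffle($portfolios);
```

The seeded variants are handy when the shuffled order has to stay stable between requests, for example across pages of the same search result.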
