Topology is already more than 100 years old, but until the late 20th century it was considered a purely theoretical branch of mathematics. Recently, however, it has gained attention and is gradually becoming a new trend in data analysis.
But how does this purely mathematical theory apply to real-world data? As the amount of data grows rapidly, we need to find new ways to understand and analyse it. In some cases, a good solution is to look at the broader picture and move to a higher level of abstraction.
Topology gives us a novel way to understand data: analysing the shape of a multi-dimensional dataset.
When building an application that requires fast array computation (e.g. analysing a video or sound stream in real time, or drawing on a canvas), performance becomes a very important part of development.
There are several things we can focus on:
This time we will check which kind of loop we should choose to get the best performance.
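As a rough sketch of how such a comparison can be set up (shown here in Python; the original post may target another language, and exact timings will vary by machine), the key points are to verify that all loop variants produce the same result before timing them, and to repeat each measurement enough times to be meaningful:

```python
import timeit

data = list(range(100_000))

def sum_for_range(xs):
    # classic index-based loop
    total = 0
    for i in range(len(xs)):
        total += xs[i]
    return total

def sum_for_each(xs):
    # iterate directly over elements, no index lookups
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    # delegate the whole loop to the optimized built-in
    return sum(xs)

# all variants must agree before we compare their speed
assert sum_for_range(data) == sum_for_each(data) == sum_builtin(data)

for fn in (sum_for_range, sum_for_each, sum_builtin):
    t = timeit.timeit(lambda: fn(data), number=20)
    print(f"{fn.__name__}: {t:.3f}s")
```

On most interpreters the built-in wins by a wide margin, because the loop body runs in native code instead of the interpreter.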
Suppose we’re preparing a campaign for our shop. During the campaign we want to sell some new products and focus on several customer groups. But… what are these groups?
We have some general knowledge, based mainly on daily observations, but how can we understand the whole picture?
The simplest solution is to stand in front of a shop and ask each customer what he or she likes.
Another is to conduct a survey and… hold on, we’re living in the 21st century! Let’s solve this problem using Machine Learning.
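Finding such groups is a clustering problem, and k-means is the classic starting point. Below is a minimal, self-contained sketch (the customer data and feature choice are made up for illustration):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels

# hypothetical customers described as (age, weekly_spend) pairs
customers = [(22, 15), (25, 18), (24, 20), (61, 80), (58, 75), (65, 90)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

With clearly separated groups like these, the algorithm recovers the two clusters in a few iterations; real customer data would need more features and some preprocessing (scaling, choosing k).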
The Internet is an extremely dynamic environment for any application. Vast amounts of data and users make management difficult. Besides management, we also need to protect our system from unexpected user behaviour and anomalies in the data.
For example, if the data we want to check are static and fairly easy to predict, we can use some kind of threshold-based alerting system. But what if the data we monitor depend on many conditions, or change inconsistently over time? Then we need a system that changes together with the environment our application lives in. This is just another field where machine learning can be applied.
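A small step up from a fixed threshold is a rolling statistical baseline: the alert level adapts to recent data instead of being hard-coded. Here is a sketch of that idea (the window size, threshold, and traffic numbers are illustrative assumptions):

```python
import statistics

def rolling_anomalies(series, window=5, threshold=3.0):
    """Flag points more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.stdev(recent) or 1e-9  # avoid division by zero
        if abs(series[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged

# steady traffic with one sudden spike at index 10
traffic = [100, 102, 99, 101, 100, 103, 98, 100, 101, 99, 500, 100, 102]
print(rolling_anomalies(traffic))  # [10]
```

Because the baseline is recomputed from the recent window, the detector tracks gradual drift automatically; only sudden deviations trigger an alert.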
Similarity measurement is an important task in machine learning, used in search engines and ranking algorithms.
Similarity can be calculated in various ways, using different mathematical models such as vector spaces, probabilistic models, or set theory.
How do we do it? The first idea that comes to mind is checking whether all attributes of one object (e.g. words in a document) exist in another one. Unfortunately, this method is very slow for large data sets, so more sophisticated methods are necessary.
First, we need to understand that a document and its contents are abstract concepts to a machine. Therefore, measuring document similarity is in most cases about measuring the distance between all of its attributes (represented, for example, as numbers).
The four most popular similarity measurement methods are:
- Euclidean distance
- Cosine similarity
- Jaccard / Tanimoto coefficient
- Pearson Correlation
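All four measures are short enough to implement directly; a minimal sketch, assuming the objects have already been turned into numeric vectors (or sets, for Jaccard):

```python
import math

def euclidean(a, b):
    # straight-line distance between two vectors (lower = more similar)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # angle-based similarity; ignores vector magnitude
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def jaccard(a, b):
    # set overlap: |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pearson(a, b):
    # linear correlation between two equally long series
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(cosine((1, 0), (0, 1)))      # 0.0 (orthogonal)
print(jaccard("abc", "bcd"))       # 0.5
```

Note the different scales: Euclidean is a distance (0 means identical), while cosine, Jaccard, and Pearson are similarities bounded by 1.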
In a previous post I wrote about SVM, a data classification algorithm used in Machine Learning. The algorithm described there is a non-probabilistic classification method. This time I will write about one more classification algorithm, the Naive Bayes Classifier (NBC). NBC is a probabilistic classifier of previously unseen data, built around Bayes’ theorem.
Bayes’ theorem is one of the most famous theorems in statistics and is widely used in many fields, from engineering and economics to medicine and law.
The Naive Bayes Classifier is a rather simple classification algorithm – other, more complex algorithms give better accuracy – but if NBC is trained on a large data set, it can give surprisingly good results for much less effort.
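The whole classifier fits in a few dozen lines. Below is a minimal multinomial Naive Bayes over bag-of-words documents, with Laplace smoothing so unseen words don’t zero out the probability (the tiny spam/ham corpus is made up for illustration):

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes for bag-of-words documents."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.split()
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.priors[c])
            for w in doc.split():
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                )
            scores[c] = score
        return max(scores, key=scores.get)

docs = ["win money now", "cheap pills win", "meeting at noon", "project meeting schedule"]
labels = ["spam", "spam", "ham", "ham"]
nb = NaiveBayes()
nb.fit(docs, labels)
print(nb.predict("win cheap money"))  # spam
```

The “naive” part is the assumption that words occur independently given the class; it is rarely true, yet the classifier still works well in practice.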
Support Vector Machine is a machine learning algorithm used in automatic classification of previously unseen data. In this post I would like to explain how SVM works and where it’s usually used.
In general, machine-learning-based classification is about learning how to separate two sets of data examples. Based on this knowledge, the system can correctly assign unseen examples to one of the two sets. A spam filter is a very good example of an automatic classification system. Imagine a two-dimensional space of points: the SVM algorithm finds a line (a hyperplane) that separates the points into two classes.
The main idea behind this algorithm is that the gap dividing the points should be as wide as possible.
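One simple way to realize this idea is to minimize the hinge loss plus a regularization term by gradient descent; points inside the margin (margin < 1) pull the hyperplane, while the regularizer keeps the gap wide. A minimal 2-D sketch with made-up, linearly separable data:

```python
def train_linear_svm(points, labels, lam=0.001, lr=0.01, epochs=1000):
    """Soft-margin linear SVM via batch sub-gradient descent
    on: lam * ||w||^2 + mean hinge loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(epochs):
        # gradient of the regularization term
        gw, gb = [2 * lam * wj for wj in w], 0.0
        for x, y in zip(points, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:
                # point violates the margin: sub-gradient of hinge loss
                gw[0] -= y * x[0] / n
                gw[1] -= y * x[1] / n
                gb -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# two hypothetical, linearly separable clusters
pos = [(2, 2), (3, 3), (2.5, 3.5)]
neg = [(-2, -2), (-3, -1), (-2.5, -3)]
w, b = train_linear_svm(pos + neg, [1, 1, 1, -1, -1, -1])
```

The regularization weight `lam` controls the trade-off: larger values widen the margin at the cost of tolerating some misclassified points.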
For the last few weeks I’ve been working on a colour detection algorithm, so in this post I will share a few ideas on how to solve this problem.
The first thing we need to do is choose the area where we will check colours. This area is called the ROI (region of interest); usually it’s a rectangle located in the centre of the image, covering about 5/6 of its size. The ROI is useful because most images have their relevant content (faces, logos, etc.) located in the centre, so the colours at the edges won’t be important for us.
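Cropping a central ROI and picking the most frequent colour inside it can be sketched like this (the image here is a toy nested list of RGB tuples; a real implementation would use an image library, and exact quantization of colours is a separate problem):

```python
from collections import Counter

def dominant_colour(image, roi_fraction=5 / 6):
    """Most frequent pixel value inside a central ROI
    covering `roi_fraction` of each image dimension."""
    h, w = len(image), len(image[0])
    # margin to trim from each side so the ROI is centred
    mh = round(h * (1 - roi_fraction) / 2)
    mw = round(w * (1 - roi_fraction) / 2)
    roi_pixels = [px for row in image[mh:h - mh] for px in row[mw:w - mw]]
    return Counter(roi_pixels).most_common(1)[0][0]

# toy 12x12 "image": 1-pixel blue border, red interior
B, R = (0, 0, 255), (255, 0, 0)
image = [[B] * 12] + [[B] + [R] * 10 + [B] for _ in range(10)] + [[B] * 12]
print(dominant_colour(image))  # (255, 0, 0)
```

Here the 5/6 ROI trims exactly the one-pixel border, so the border colour is ignored and the dominant colour of the relevant central area is returned.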
Recently, while working on a portfolio search system, I faced a problem with displaying results. It turned out that portfolios of the same users appeared close to one another. This was not good, as I wanted to show a variety of different users’ portfolios on the search results page, so I needed to shuffle them somehow. Generally, in PHP there are three ways to shuffle arrays:
- Random array shuffle
- Non-random index shuffle – array indices are always reorganized in the same way
- Non-random value shuffle – the shuffle result depends on the array values
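The post is about PHP, but the three strategies are language-independent; a sketch in Python (function names and the seed/hash choices are mine, not PHP’s APIs):

```python
import hashlib
import random

def random_shuffle(items):
    """Plain random shuffle: a (usually) different order on every call."""
    items = list(items)
    random.shuffle(items)
    return items

def index_shuffle(items, seed=42):
    """Non-random index shuffle: a fixed seed means the indices
    are reorganized the same way on every call."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items

def value_shuffle(items):
    """Non-random value shuffle: the order is derived from the values
    themselves, so equal inputs always produce the same order."""
    return sorted(items, key=lambda v: hashlib.md5(str(v).encode()).hexdigest())

portfolios = ["alice_1", "alice_2", "bob_1", "carol_1"]
```

For the search-results case, the last two are attractive: the order looks shuffled to users but stays stable across page reloads.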