Topology theory is already more than 100 years old but until late 20th century it was considered purely theoretical part of mathematics. However, recently it has gained attention and gradually becoming a new trend in Data Analysis.
But how this purely mathematical theory applies to the real world data? As the amount of data grows rapidly, we need to find new ways to understand it and analyse it. In some cases, good solution is to see the broader image, go to the higher level of abstraction.
Topology gives us novel approach to understand the data by analysing the shape of the multi-dimensional dataset.
In the next following articles we will go through the most important, easiest to understand parts of huge Topology Theory. We will try to build some examples based on real-world data to understand what’s amazing about it.
Topology has a lot of meanings, so many that it becomes vague what topology exactly is. Most general definition is that topology essentially studies the connectedness and continuity of various structures. Simply speaking, if two objects are somehow close to each other, we group them together.
In topology every shape is thought of to be made of clay. We can stretch it, squeeze it, and as long as we don’t make holes in it, it’s still considered to be the same object. Simply speaking, topology is about how many multi-dimensional holes a structure has.
From topological point of view a circle and a square is the same thing. Same is with a doughnut and a coffee mug! (animation on right).
Moreover, every shape can be approximated using n-dimensional “simplices“: point, line, triangle, tetrahedron, etc. We will use above properties to analyse the data.
In my previous post about clustering, I’ve been discussing different methods of dividing dataset into several distinct groups. Most of discussed methods is based mainly on proximity of the data points in some n-dimensional Euclidean space.
We’re going to use similar approach, but this time we will go step further, reconstruct the shape of the data and see how the things looks like from higher perspective. We’ll be analysing generalized shape of a dataset, something which is called a simplical complex.
Formally speaking, if we have an finite point set in Rd and we assume the data was sampled from underlying space X (manifold), the goal is to recover the topology of X.
This task can be divided into two step process:
- Geometric: approximate X with combinatorial representation (simplical complex)
- Topologic: compute topological invariant, (e.g. persistent homology)
Both tasks are quite hard and still we don’t have good, robust algorithms to solve them quickly.
Approximation of underlying space can be done in few different ways. The first thing we need to do is to build a structure called “good cover” – it’s a set of cells (vertices with radius) which covers our whole dataset in a special way. For simplicity, we’ll skip the details of this step and assume that the dataset we have is covering underlying space in a “good” way. Next step is go generate a structure similar to undirected graph which is called a “nerve”. Most commonly used nerve types are Čech complex and Vietrois-Rips complex.
Generally speaking, the Čech complex represents a group of cells by a simplex if all of them have a non-empty intersection. The Čech complex considers the intersection between cells. As a result, it always represents exactly the topology of the dataset. Vietrois-Rips complex is a subset of Cech complex, based only on computing distance threshold, which makes it much much easier to construct in practical applications.
OK, let’s build a simple nerve. We have the set of points in R2. Imagine that the shape of the underlying data space resembles the number “8” (with two holes in it). Let’s draw a circle around each point. If the distance between two points is less than some threshold points becomes connected.
If we extend the radius, the nerve will cover underlying space revealing its shape.
Most exciting part is understanding what was and what wasn’t preserved about the original data space. The shape and the size, area of the original space is not represented in the simplical complex. But on the other hand, the complex has two “holes” in it which match the two “holes” in the original shape. The rest has been “filled in” by a triangles. Another fact is that our complex is one whole component and so was the original space.
So sum up, simplical complex has preserved the number of “holes” of the original space and the connectedness of the components. In my next blog posts, it will turn out that these properties are even more important then the size or volume of original dataset.
Show me the code!