Text classification using Naive Bayes Classifier

In a previous post I wrote about SVM – a data classification algorithm used in machine learning. The algorithm described there was a non-probabilistic method of classifying correlated data (data points that may depend on each other). This time I will write about one more classification algorithm, called the Naive Bayes Classifier. NBC is a probabilistic classifier of previously unseen data, based on Bayes' theorem.
Bayes' theorem is one of the most famous theorems in statistics and is widely used in many fields, from engineering and economics to medicine and law.

The Naive Bayes Classifier is a rather simple algorithm compared to other classification algorithms – more complex algorithms give better accuracy – but if NBC is trained well on a large data set it can give surprisingly good results for much less effort.

Using Bayes theorem

Bayes' rule is a way of looking at conditional probabilities that allows us to flip the condition around in a convenient way. A conditional probability, normally written as P(A|B), is the probability that event A will occur, given the evidence B.
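
Written as a formula, Bayes' rule states:

$$ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} $$

where P(B|A) is the probability of observing evidence B given that A occurred, and P(A) and P(B) are the probabilities of A and B on their own.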
What makes the classifier “naive” is the assumption that all observed values are independent from each other. For example, in the case of natural language the algorithm won't understand the connection between the words “delicious” and “cake”, which in the real world often appear close to each other. Nevertheless, as I mentioned above, if the system is trained well the results can be much better than we expect.

Let's imagine we want to detect the language of a document – we will need to calculate the probability that a given document belongs to a given language (class). This probability can be written as follows:

$$ p(S|D) = \frac{p(S) \prod_{i} p(w_i|S)}{p(D)} $$

Where:

  • p(S|D) is the probability that document D belongs to class S
  • p(S) is the probability of class S
  • p(w_i|S) is the probability that token (word) w_i from the document appears in class S
  • p(D) is the probability of document D

As we don't need to calculate a precise probability (we only need to rank classes from most to least probable) and the probability p(D) is always the same, we can drop it and rewrite the rule:

$$ p(S|D) \propto p(S) \prod_{i} p(w_i|S) $$

Let's sum up – in order to compute the score of document D belonging to class S, we need to calculate:

$$ score(S, D) = \prod_{i} \frac{count(t_i, S) + 1}{count(\text{tokens in } S) + count(\text{all tokens})} $$

Where t_i is a token (word) from document D. For simplicity, we assumed that the probability p(S) of each class is equal. The extra 1 and count(all tokens) are called Laplace smoothing and prevent us from multiplying by zero. If we didn't have them, any document containing an unseen token would score zero.
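
As a quick illustration with made-up counts: suppose the token “cake” appears 3 times among the 50 tokens seen for class S, and the whole training set contains 200 tokens. Its smoothed factor is (3 + 1) / (50 + 200) = 0.016, while a token never seen in class S still contributes (0 + 1) / (50 + 200) = 0.004 instead of zeroing out the whole product.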

Give me some code

OK, now that we have prepared the mathematical model it's time to write some code. First of all we need to train our classifier – provide a large number of documents, each tagged with a class. Next we need to tokenise each document (divide it into separate words) and count the occurrences of each token in the document, the total number of tokens, documents, classes, etc.


public function train(IDataSource $dataSource, $class) {
    // class initialization
    if (!in_array($class, $this->classes)) {
        $this->classes[] = $class;
        $this->classTokenCounter[$class] = 0;
        $this->classDocumentCounter[$class] = 0;
    }

    // train class using provided documents
    while ($document = $dataSource->getNextDocument()) {
        $this->documentCounter++;
        $this->classDocumentCounter[$class]++;

        // add all documents tokens to global vocabulary
        foreach ($this->tokenise($document) as $token) {
            $this->vocabulary[$token][$class] = 
                isset($this->vocabulary[$token][$class])
                ? $this->vocabulary[$token][$class] + 1
                : 1;
            $this->classTokenCounter[$class]++;
            $this->tokenCounter++;
        }
    }
}

private function tokenise($text) {
    // PREG_SPLIT_NO_EMPTY skips empty strings caused by leading/trailing whitespace
    return preg_split('/\s+/', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
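
To make the training step concrete, here is a minimal usage sketch. The ArrayDataSource class below is purely illustrative – it assumes that IDataSource declares only the getNextDocument() method used above and that the classifier class is called NaiveBayesClassifier; the actual repository may name these differently.


// Hypothetical data source wrapping a plain array of documents.
class ArrayDataSource implements IDataSource {
    private $documents;

    public function __construct(array $documents) {
        $this->documents = $documents;
    }

    public function getNextDocument() {
        // returning false ends the while loop in train()
        return empty($this->documents) ? false : array_shift($this->documents);
    }
}

$classifier = new NaiveBayesClassifier();

// one call to train() per class – here, per language
$classifier->train(new ArrayDataSource(array(
    'the cake was delicious',
    'this is a sentence written in english',
)), 'english');

$classifier->train(new ArrayDataSource(array(
    'le gâteau était délicieux',
    'ceci est une phrase écrite en français',
)), 'french');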

After we have successfully trained our classifier, we can classify text by calculating its score as I explained above.


public function classify($document, $showProbabilities = false) {

    $tokens = $this->tokenise($document);
    $posteriors = array();

    // for each class count posterior probability
    foreach ($this->classes as $class) {
        $posteriors[$class] = $this->posterior($tokens, $class);
    }

    arsort($posteriors);
    return ($showProbabilities) ? $posteriors : key($posteriors);
}

private function posterior($tokens, $class) {
    $posterior = 1;
    foreach ($tokens as $token) {
        $count = isset($this->vocabulary[$token][$class])
            ? $this->vocabulary[$token][$class]
            : 0;
        // multiply by token probability, add Laplace smoothing
        $posterior *= ($count + 1) / ($this->classTokenCounter[$class] 
            + $this->tokenCounter);
    }
    $posterior = $this->prior($class) * $posterior;
    return $posterior;
}

private function prior($class) {
    return $this->classDocumentCounter[$class] / $this->documentCounter;
}
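
Continuing the sketch above (same illustrative class and data), classification is then a single call:


// returns the name of the best-matching class, e.g. 'french'
$language = $classifier->classify('ceci est un petit exemple');

// pass true as the second argument to get the raw score of every class
$scores = $classifier->classify('ceci est un petit exemple', true);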

The whole source code, with a working example, can be found on my GitHub account:
https://github.com/LukaszKrawczyk/PHPNaiveBayesClassifier

Happy classifying!

Links