At this stage we have a Naive Bayes classifier up and running on a subset of our dataset. In this post we walk through our work to improve and analyze the results of the 17-class classifier described previously:
- Investigation of adding bigrams as features
- Investigation of the most informative terms based on the trained Naive Bayes classifier, in order to understand and explain the results
- Investigation of using the trained Naive Bayes model as a multi-label classifier
To this end, we carried out the following steps:

1. Cleaning the documents (as described in our previous post)
2. Extracting tf/idf parameters for both unigram and bigram features
3. Selecting 2000 features using the chi-square test
4. Implementing the classification pipeline described in our previous post
5. Comparing the new results with our previous results
6. Finding the top 10 terms for each class by selecting the features with the highest coefficients in the trained Naive Bayes classifier
7. Analyzing the effect of bigrams and the confusion matrix based on the terms found in step 6
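The steps above can be sketched as a scikit-learn pipeline. This is a hedged sketch: the exact vectorizer settings and tokenization are not given in the text, only the feature counts (2000) and the unigram+bigram range are.

```python
# Sketch of the classification pipeline (illustrative; exact settings
# such as tokenization and stop-word handling are assumptions).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    # tf/idf over both unigrams and bigrams (step 2)
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # keep the 2000 features scoring highest on the chi-square test (step 3)
    ("chi2", SelectKBest(chi2, k=2000)),
    # the 17-class Naive Bayes classifier (step 4)
    ("nb", MultinomialNB()),
])
```

Fitting this pipeline on the cleaned documents and their topic labels yields the classifier analyzed in the rest of this post.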
The dataset used in this experiment is the same as in the last report: 8816 documents from 17 legal topics (shown in Table 1).
Note that the terms found in step 6 are different from those discussed in the feature selection step. In feature selection, we use the chi-square test to find the most distinguishing features in order to train the classifier. In this report, by contrast, we analyze the trained classifier to find out which features have the highest weights when calculating the probability of a class in the linear model formed during training. In other words, the goal of feature selection is to reduce the dimension of the feature vector and decrease the cost of the classification task, while the goal of this work is to analyze the results of a trained classifier and gain a better understanding of the model formed during the training process.
3.1 Classification Summary
Steps 1-4 described above were carried out and the following classification results were obtained. Adding bigrams to the features improved macro-averaged precision, recall, and F1-score by about 5% compared with the previous results.
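Macro-averaged metrics of this kind can be computed with scikit-learn; the label arrays below are placeholders, not our actual predictions.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder true labels and predictions (illustrative only).
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

# "macro" averages the per-class scores with equal weight per class,
# regardless of class size.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(p, r, f1)
```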
This improvement can be explained by looking at the probability of each feature given a class and investigating the top terms for each class, the top 10 of which are shown in Table 1. Note that we use 2000 features for training the classifier, so the 170 terms shown in Table 1 do not cover all the features, only the most informative ones. Of these 170 terms, 11 are bigrams, which explains how adding bigrams improves classification accuracy.
Table 1: Top 10 terms for each class with the highest probability given the class, obtained from the trained Naive Bayes classifier
| # | Class | Top 10 terms |
|---|-------|--------------|
| 0 | Administrative Law | judicial - judicial review - hearing - tribunal - review - commission - appeal - application - applicant - board |
| 1 | Aliens | application - minister - visa - board - canada - citizenship - officer - refugee - immigration - applicant |
| 2 | Bankruptcy | debtor - registrar - payment - income - debt - discharge - creditor - trustee - bankruptcy - bankrupt |
| 3 | Contracts | pay - clause - purchase - term - company - party - agreement - defendant - plaintiff - contract |
| 4 | Criminal Law | criminal code - code - judge - trial - crown - appeal - criminal - offence - sentence - accuse |
| 5 | Damage Awards | work - general damage - left - loss - neck - injury - damage - pain - accident - plaintiff |
| 6 | Damages | benefit - judge - accident - trial judge - award - defendant - trial - loss - damage - plaintiff |
| 7 | Family Law | mother - support - maintenance - parent - marriage - respondent - petitioner - divorce - custody - child |
| 8 | Food and Drug Control | notice - allegation - medicine - notice compliance - minister - noc - regulation - apotex - drug - patent |
| 9 | Guarantee and Indemnity | company - agreement - surety - loan - mortgage - plaintiff - defendant - bank - guarantor - guarantee |
| 10 | Income Tax | taxation year - taxation - appeal - income tax - tax court - appellant - minister - taxpayer - income - tax |
| 11 | Injunctions | balance convenience - applicant - irreparable harm - interlocutory injunction - irreparable - interlocutory - harm - defendant - plaintiff - injunction |
| 12 | Master and Servant | salary - termination - work - notice - employer - defendant - dismissal - employee - employment - plaintiff |
| 13 | Motor Vehicles | suspension - traffic - speed - highway - offence - driver - drive - motor - motor vehicle - vehicle |
| 14 | Municipal Law | section - power - plaintiff - land - town - municipal - council - municipality - city - bylaw |
| 15 | Real Property | deed - possession - defendant - easement - owner - plaintiff - lot - property - title - land |
| 16 | Workers' Compensation | review - employer - injury - tribunal - commission - appeal - worker compensation - compensation - board - worker |
3.2 Analysis of confusion matrix
The confusion matrix of this 17-class classifier is shown below. As marked in the table, the highest rate of misclassification occurs for cases from the class “Contracts” that are labeled “Guarantee and Indemnity” by the trained classifier. Table 1 shows that 4 of the 10 top terms are common between these two classes (company, agreement, defendant, and plaintiff), which explains the high misclassification rate between them. The second-highest misclassification rate occurs between “Damages” and “Damage Awards”, for the same reason: they have many terms in common.
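Building the confusion matrix and locating its largest off-diagonal cell can be sketched as follows; the labels and predictions here are small placeholders, not the real 8816-document output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels mimicking the confusions discussed above.
y_true = ["Contracts", "Contracts", "Guarantee and Indemnity", "Damages"]
y_pred = ["Guarantee and Indemnity", "Contracts",
          "Guarantee and Indemnity", "Damage Awards"]
labels = ["Contracts", "Guarantee and Indemnity", "Damages", "Damage Awards"]

# cm[i, j] counts documents of class labels[i] predicted as labels[j];
# off-diagonal cells are misclassifications.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Find the largest off-diagonal cell, i.e. the most confused class pair.
i, j = np.unravel_index(np.argmax(cm - np.diag(np.diag(cm))), cm.shape)
print(f"most confused pair: {labels[i]} -> {labels[j]}")
```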
3.3 Investigation of using the trained Naive Bayes as a multi-label classifier
Out of the 8816 documents used in this experiment, only 4 are multi-label, i.e. more than one label is assigned to the document. These documents were replicated in all of their related classes. Although this dataset is not well suited for exploring the performance of the trained classifier on a multi-label classification task, we still analyzed the output of the classifier for these 4 documents. Figure 2 shows the probabilities assigned to one of these documents by the trained classifier.
The true topics of this document, taken from the annotations, are ['Food and Drug Control', 'Administrative Law', 'Criminal Law'], which in terms of the numerical class labels of Table 1 is equivalent to [8, 0, 4]. The trained Naive Bayes model predicts that this document belongs to the class “Administrative Law”, by assigning the highest probability to class #0. However, the classifier also gives relatively high probabilities to “Food and Drug Control” (class #8), “Criminal Law” (class #4), and “Motor Vehicles” (class #13). This observation leads to the hypothesis that a trained Naive Bayes model can potentially be used as a multi-label classifier by looking at the class probabilities and setting a threshold on them, instead of relying only on the highest probability.
This hypothesis needs to be investigated further on a dataset richer in multi-topic documents. However, this method is not straightforward to evaluate and may increase the misclassification rate.
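The thresholding idea can be sketched as below: instead of taking only the argmax of `predict_proba`, return every class whose posterior probability exceeds a threshold. The tiny training set and the threshold value 0.1 are illustrative assumptions, not values from the experiment.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def multilabel_predict(clf, X, threshold=0.1):
    """Return, per document, every class whose probability >= threshold."""
    probs = clf.predict_proba(X)
    return [[c for c, p in zip(clf.classes_, row) if p >= threshold]
            for row in probs]

# Toy demonstration (hypothetical feature counts, not the real corpus).
X_train = np.array([[3, 0], [0, 3], [2, 1]])
y_train = ["Administrative Law", "Criminal Law", "Administrative Law"]
clf = MultinomialNB().fit(X_train, y_train)

# A document with mixed evidence receives both labels under this rule.
print(multilabel_predict(clf, np.array([[1, 1]])))
```

Choosing the threshold is the crux: too low a value inflates the label set and the misclassification rate, which is why this method needs careful evaluation.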
This experiment shows that adding bigrams to the classifier's feature set increases classification accuracy without increasing the number of features. Using 2000 features drawn from both unigrams and bigrams, selected by the chi-square test, we achieved 89% average precision, which is 5% better than using 2000 unigram features selected in the same manner.
We also confirmed, by looking at the most probable terms in each class, that misclassification occurs mostly among classes that have common content.
This experiment also showed that the class probabilities calculated by the trained Naive Bayes classifier may be used to produce a set of labels for multi-label documents. The feasibility of this method needs to be investigated further in future experiments.