1. Introduction

In many research works, it has been shown that word2vec models (also referred to as W2V or word embeddings) provide a very effective semantic representation of words. These representations convey the meanings of words in relation to the contexts in which those words are most likely to appear. However, these models learn from a training dataset and can be biased (or specialized) towards the specific knowledge that is most represented in that dataset.

Since we work in a specific domain, namely the understanding of legal cases, we are interested in investigating the efficacy of a W2V model trained only on legal cases, in comparison with pre-trained models trained on very large datasets from various domains. The advantage of training a legal word embedding is that such a model can specialize in legal terms and the specific meanings that words convey in a legal context. On the other hand, the pre-trained models benefit from learning from huge datasets, which makes them well tuned and comprehensive.

In this experiment, we will train a legal word embedding and compare it with the pre-trained W2V model provided in the spaCy package.

2. Training a legal word embedding

Here we trained a skip-gram W2V model using TensorFlow. The dimension of this embedding space is 128 and the skip window spans 10 words (5 to the left and 5 to the right). Further reading about skip-gram W2V is available.
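As a rough, minimal sketch of an equivalent training setup: the model above was trained with TensorFlow, but gensim is used here only for brevity, and the corpus variable is a placeholder rather than the actual training script.

# Minimal sketch of training a skip-gram embedding on the legal corpus.
# The model described above was trained with TensorFlow; gensim is used here
# only as a compact, equivalent illustration. judgment_texts is a placeholder.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

judgment_texts = [
    "The application for judicial review is dismissed.",   # placeholder documents;
    "The appeal is allowed and the matter is remitted.",    # replace with the 51K judgments
]
sentences = [simple_preprocess(text) for text in judgment_texts]

model = Word2Vec(
    sentences,
    vector_size=128,   # embedding dimension used in this experiment
    window=5,          # skip window of 10: 5 words to the left and 5 to the right
    sg=1,              # skip-gram rather than CBOW
    min_count=1,       # keep rare words so this toy corpus is not filtered out
    workers=4,         # max_final_vocab=5000 would roughly match the 5,000-word vocabulary used below
)
model.save("legal_w2v.model")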

2.1 Training Dataset

Our corpus contains 51K judgments of the Federal Court of Canada, which can be found here. This is a relatively small dataset, which is unbalanced and biased towards immigration decisions; therefore, the resulting model will have limitations. Results are shown for an immigration case which can be found here.

2.2 Visualization of 400 words in legal word embedding

We trained the W2V model for the 5,000 most frequent words of the legal corpus. The vectors assigned to the 400 most frequent words are shown in the following figure, using t-SNE from the scikit-learn package. The actual vectors are 128-dimensional; t-SNE maps these high-dimensional vectors to a 2-dimensional space. It can be observed that semantically close words are neighbors in this embedding space.
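A minimal sketch of how such a projection can be produced with t-SNE from scikit-learn, assuming the gensim model saved in the sketch above (the published figure was generated from the TensorFlow model):

# Project the vectors of the 400 most frequent words to 2-D with t-SNE and plot them.
# Assumes the full 5,000-word vocabulary; with the toy corpus above, fewer words exist.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("legal_w2v.model")
words = model.wv.index_to_key[:400]        # the 400 most frequent words
vectors = model.wv[words]                  # shape (400, 128)

coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=4)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=7)
plt.savefig("tsne_legal_w2v.png", dpi=150)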

Figure 1: Vectors assigned to 400 most frequent words of legal corpus in a 2-dimensional space.

3. Finding closest words

In this section we use the dot product of normalized vectors to calculate the similarity between words and to find the most similar words to a given term.
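Concretely, this is the cosine similarity: the dot product of the two vectors after L2 normalization. A minimal illustration:

import numpy as np

def similarity(u, v):
    # dot product of L2-normalized vectors, i.e. cosine similarity in [-1, 1]
    return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))

# e.g. similarity(model.wv["immigration"], model.wv["citizenship"]) with the model from section 2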

3.1 Goal

Comparing general word embedding with legal embedding in finding semantically close words.

3.2 Method

1) Finding the nearest words using the similarity function of the pre-trained word embedding in spaCy

2) Finding the nearest words using the dot product of vectors formed in the legal word embedding (a combined sketch of both lookups follows)
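A minimal sketch of the two lookups. The spaCy model name en_core_web_md is an assumption (any pre-trained package with word vectors works), and the legal embedding is the gensim model saved in section 2; both libraries rank candidates by cosine similarity, i.e. the normalized dot product described above.

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_md")            # assumed pre-trained spaCy model with vectors
legal = Word2Vec.load("legal_w2v.model").wv   # legal embedding trained in section 2

def closest_in_spacy(word, topn=4):
    # nearest words in the spaCy vector table by cosine similarity
    query = nlp.vocab[word].vector.reshape(1, -1)
    keys, _, scores = nlp.vocab.vectors.most_similar(query, n=topn + 1)
    hits = [(nlp.vocab.strings[int(k)], round(float(s), 2))
            for k, s in zip(keys[0], scores[0])]
    return [h for h in hits if h[0].lower() != word.lower()][:topn]

def closest_in_legal(word, topn=4):
    # nearest words in the legal embedding (dot product of normalized vectors)
    return [(w, round(s, 2)) for w, s in legal.most_similar(word, topn=topn)]

print(closest_in_legal("immigration"))
print(closest_in_spacy("immigration"))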

3.3 Result

The following table shows the closest words to some example terms and their similarity measures.

immigration
  Legal embedding: ('citizenship', 0.51), ('fcj', 0.34), ('fct', 0.32), ('irpa', 0.32)
  spaCy: ('immigrants', 0.76), ('citizenship', 0.64), ('reform', 0.64), ('legislation', 0.60)

immigrants
  Legal embedding: ('ingredients', 0.32), ('refusal', 0.29), ('measure', 0.29), ('fire', 0.27)
  spaCy: ('immigration', 0.76), ('foreigners', 0.70), ('citizens', 0.66), ('minorities', 0.66)

allowed
  Legal embedding: ('dismissed', 0.52), ('costs', 0.38), ('ordered', 0.35), ('assessments', 0.37)
  spaCy: ('allow', 0.75), ('allowing', 0.73), ('not', 0.65), ('unless', 0.65), ('they', 0.65), ('would', 0.64), ('only', 0.64), ('should', 0.64)

conclusion
  Legal embedding: ('decision', 0.42), ('unreasonable', 0.40), ('finding', 0.38), ('conclusions', 0.37)
  spaCy: ('conclude', 0.82), ('conclusions', 0.78), ('explanation', 0.68), ('contrary', 0.67)

It is observed that in the legal word embedding, the word “immigration” is closest to “fcj”, a frequently occurring component of Federal Court citations, and “irpa”, the short form of the Immigration and Refugee Protection Act. This is because the legal word embedding conveys the specific legal meaning of the word immigration, whereas the word embedding provided by spaCy carries the general meaning of “immigration”. The similarity scores in the legal embedding are much lower than in the pre-trained embedding, though, which could be due to the limitations of the dataset or the training algorithm. The legal embedding also returns much more legally relevant words for the terms “allowed” and “conclusion” than the general embedding.

For the word “immigrants”, the legal embedding does not do an impressive job, since this word is not used frequently in the legal corpus.

3.4 Conclusion

Training a specialized word embedding can be very useful in processing legal cases. However, larger datasets are needed to form a reliable word embedding with high similarity measures between relevant words.

4. Averaging of an entire document

In this section we will assess semantic summarization of documents by averaging the vectors assigned to nouns. We will also look into emphasizing the thematic words of a document by using tf-idf scores as averaging weights.

4.1 Goals

1) Assessing semantic averaging as a way of extracting keywords of a legal document

2) Assessing the effect of tf-idf weighting while averaging the semantic content of a document

4.2 Method

Four ways of averaging the semantic meaning of a relatively long document have been tested and compared in this experiment. spaCy is used to tag words and to separate nouns from the other words in a document. A sketch of the tf-idf weighted legal variant is given after the method descriptions below.

Averaging the general meaning of all nouns in the document

  1. Vectors of nouns were loaded from spaCy and averaged.

  2. The closest words to this average were found in the vocabulary of the spaCy package.

Averaging the general meaning of thematic words of a document

  1. A tf-idf vectorizer from scikit-learn was trained on the legal corpus.

  2. Vectors of nouns were loaded from spaCy and averaged with tf-idf scores as weights. In this method, nouns that are specific to a single document are emphasized, provided a vector is assigned to them. These words can be related to the topic of the legal case or specific issues raised in the case.

  3. The closest words to this average were found in the vocabulary of the spaCy package.

Averaging legal meaning of all nouns

  1. Vectors of nouns are loaded from the legal embedding space trained in section 2 and the vectors are averaged.

  2. The closest words to this average were found in the vocabulary of the legal embedding space.

Averaging legal meaning of specific words of the case

  1. A tf-idf vectorizer from scikit-learn was trained on the legal corpus.

  2. Vectors of nouns are loaded from the legal embedding space trained in section 2 and averaged with tf-idf scores as weights. In this method, nouns that are specific to a single document are emphasized, provided a vector is assigned to them. These words can be related to the topic of the legal case or specific issues raised in that case. Compared with method 2 described above, fewer words have assigned vectors; in particular, non-legal words might not have vectors in this case.

  3. The closest words to this average were found in the vocabulary of the legal embedding space.
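A minimal sketch of method 4, assuming the gensim legal embedding from section 2, an assumed spaCy model (en_core_web_md) for part-of-speech tagging, and a placeholder in place of the real 51K-judgment corpus; methods 1 and 2 follow the same pattern with spaCy's own vectors and vocabulary instead of the legal embedding.

# Sketch of tf-idf weighted averaging of the legal vectors of a document's nouns.
import numpy as np
import spacy
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")            # assumed model; used here for POS tagging
legal = Word2Vec.load("legal_w2v.model").wv   # legal embedding trained in section 2

legal_corpus = ["The application for judicial review is dismissed."]  # placeholder: the 51K judgments
tfidf = TfidfVectorizer(lowercase=True).fit(legal_corpus)

def summarize(document, topn=40, weighted=True):
    # average the legal vectors of the document's nouns, optionally weighted by
    # their tf-idf scores in this document, and return the closest vocabulary words
    scores = tfidf.transform([document])
    vocab = tfidf.vocabulary_
    vecs, weights = [], []
    for tok in nlp(document):
        w = tok.text.lower()
        if tok.pos_ == "NOUN" and w in legal and w in vocab:
            vecs.append(legal[w])
            weights.append(scores[0, vocab[w]] if weighted else 1.0)
    if not vecs:
        return []
    centroid = np.average(vecs, axis=0, weights=weights)
    return [w for w, _ in legal.similar_by_vector(centroid, topn=topn)]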

4.3 Results

Results of applying the four methods described in section 4.2 to this case are shown in the following table.

1- General Embedding without tfidf ['whether', 'that', 'concerned', 'reasons', 'not', 'matter', 'however', 'because', 'decision', 'reason', 'fact', 'circumstances', 'if', 'would', 'regard', 'should', 'matters', 'particular', 'certain', 'there', 'concern', 'given', 'any', 'consideration', 'when', 'law', 'but', 'regardless', 'legal', 'consider', 'rather', 'same', 'clearly', 'even', 'regarding', 'what', 'possible', 'court', 'therefore', 'could']
2- General Embedding with tfidf ['decision', 'whether', 'court', 'law', 'concerned', 'officer', 'legal', 'appeal', 'enforcement', 'that', 'criminal', 'judge', 'authorities', 'matters', 'government', 'authority', 'reasons', 'federal', 'circumstances', 'immigration', 'officers', 'justice', 'informed', 'civil', 'concern', 'however', 'not', 'would', 'because', 'case', 'matter', 'police', 'concerns', 'general', 'decisions', 'responsibility', 'consideration', 'stating', 'should', 'regard']
3- Legal Embedding without tfidf ['application', 'review', 'appeal', 'irpa', 'decision', 'board', 'officer', 'visa', 'applicant', 'rpd', 'applicants', 'para', 'prra', 'judge', 'immigration', 'act', 'refugee', 'decisions', 'judicial', 'case', 'protection', 'paras', 'division', 'determination', 'boards', 'citizenship', 'matter', 'rpds', 'court', 'panel', 'status', 'permanent', 'convention', 'standard', 'issue', 'supreme', 'charter', 'iad', 'provisions', 'canada']
4- Legal Embedding with tfidf ['officer', 'irpa', 'appeal', 'applicant', 'visa', 'application', 'board', 'decision', 'immigration', 'review', 'applicants', 'rpd', 'prra', 'judge', 'refugee', 'sponsor', 'rpds', 'paras', 'para', 'permanent', 'citizenship', 'supreme', 'minister', 'convention', 'charkaoui', 'iad', 'act', 'assessment', 'removal', 'panel', 'status', 'officers', 'boards', 'request', 'division', 'determination', 'protection', 'ministers', 'judicial', 'compassionate']

4.4 Discussion

As observed in the table above, both tf-idf weighting and using the legal embedding instead of the general embedding are ways to extract important information from a legal document.

  • In method 1, we only find very general words. Based on these words we can cautiously guess that this is a legal document.

  • In method 2, using tf-idf weighting, more legal terms show up. Here we can say with certainty that this is a legal case and cautiously guess that it is an immigration case.

  • In method 3, using the legal embedding space, many legal words show up, and it becomes obvious that this is an immigration case. Words such as visa, rpd (Refugee Protection Division), prra (pre-removal risk assessment), and iad (Immigration Appeal Division) make this clear.

  • In method 4, using tf-idf weighting of the legal meanings of words, we get very specific information from this document. 31 words are common between methods 3 and 4. However, two words, “sponsor” and “Charkaoui”, are interesting terms that only show up with tf-idf weighting. They suggest that this is probably a family case, which is true, and that the well-known case “Charkaoui v. Canada (Citizenship and Immigration), [2008] 2 S.C.R. 326, 2008 SCC 38” was cited in this document.

5. Semantic averaging of paragraphs

Considering how much information is lost when many words are averaged, we now consider shorter segments of the document, namely paragraphs, and summarize them using averaging.

5.1 Goal

1) Comparing semantic summarization of paragraphs using legal word embeddings and pre-trained word embeddings

5.2 Method

  1. Cleaning and parsing the document to paragraph level

  2. Averaging the general meaning of thematic words of paragraphs

    a. A tf-idf vectorizer from scikit-learn was trained on the legal corpus.

    b. Vectors of nouns were loaded from spaCy and averaged with tf-idf scores as weights, for each paragraph.

    c. The closest words to this average were found in the vocabulary of the spaCy package for each paragraph.
     
  3. Averaging legal meaning of specific words of the case for each paragraph

    a. A tf-idf vectorizer from scikit-learn was trained on the legal corpus.

    b. Vectors of nouns are loaded from the legal embedding space trained in section 2 and averaged with tf-idf scores as weights.

    c. The closest words to this average were found in the vocabulary of the legal embedding space.
     
  4. The sets of words found by these two methods are visualized and compared. For both methods, the word vectors are shown in the pre-trained word embedding space for the sake of comparison (a brief sketch of the per-paragraph loop follows).
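A brief sketch of the per-paragraph loop, reusing the summarize helper sketched after section 4.2; case_text and the blank-line splitting rule are placeholders for the actual cleaning and parsing step.

# Split the cleaned judgment into paragraphs and summarize each one with the
# tf-idf weighted legal-embedding average; case_text is a placeholder.
case_text = "Because Duy was excluded from the family class ...\n\nAn exemption was requested ..."

paragraphs = [p.strip() for p in case_text.split("\n\n") if p.strip()]
for par in paragraphs:
    closest = summarize(par, topn=4)
    print(closest, "<-", par[:60], "...")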

5.3 Result

Here is an example of summarization of a paragraph using the methods described in section 5.2.

Paragraph:

Because Duy was excluded from the family class as a non-disclosed dependant by reason of s.117(9)(d) of the Immigration and Refugee Protection Regulations, SOR/2002-227 (hereafter the Regulations), an exemption was requested on humanitarian and compassionate (H&C) grounds under s.25 of the Immigration and Refugee Protection Act, SC 2001, c 26 (hereafter the IRPA).

Results of summarization:

General embedding with tf-idf: ['immigration', 'law', 'laws', 'legislation']

Legal embedding with tf-idf: ['immigration', 'irpa', 'regulations', 'citizenship']

It is observed that more important information is extracted using the legal word embedding than the pre-trained general word embedding. The words extracted from all paragraphs of this case are shown for the general embedding in figures 2 and 3 and for the legal embedding in figures 4 and 5.

Figure 2: Vectors of words found by tf-idf weighted averaging of each paragraph in pre-trained word embedding space.

appeal- application- because- brother- case- chief- child- children- citizenship- concerned- consideration- court- daughter- decision- decisions- employee- enforcement- existence- family- father- ground- grounds- husband- immigrants- immigration- informed- justice- law- laws- legal- legislation- matter- matters- mother- officer- officers- officials- opinion- parents- police- principal- regard- review- right- son- that- visa- whether

Figure 3: Set of words found by tf-idf weighted averaging of each paragraph in pre-trained word embedding space.

Figure 4: Pre-trained vectors of words found by tf-idf weighted averaging of each paragraph in legal word embedding space. Note that words are found in legal word embedding but shown in pre-trained word embedding so that we can compare figures 2 and 4.

appeal- applicant- applicants- application- board- canada- case- child- children- citizenship- class- considerations- decision- deschamps- exemption- existence- family- father- fcj- grounds- immigration- irpa- issue- judge- judicial- justice- member- minister- montigny- mother- officer- officers- para- paragraph- permanent- principal- question- refugee- regulations- residence- resident- review- son- sponsor- status- supra- visa

Figure 5: Set of words found by tf-idf weighted averaging of each paragraph in legal word embedding space.

applicant- board- canada- class- considerations- deschamps- exemption- fcj- irpa- issue- judge- judicial- member- minister- montigny- para- paragraph- permanent- question- refugee- regulations- residence- resident- sponsor- status- supra

Figure 6: Set of words that appear using the legal embedding but do not appear using the pre-trained word embedding.

Comparing figures 2 and 4 shows that, using the legal embedding, we can find a cluster of words that are highly relevant to immigration in a legal context. In both figures, words related to family relationships appear, since this case is about a family going through the immigration process. However, in figure 4 there is an emphasis on legal terms and immigration-related concepts. Figure 6 shows the set of words that can only be found when paragraphs are summarized in the legal embedding space.

6. Conclusions

  • Averaging vector representations using pre-trained word embeddings results in a loss of information and preserves only the general concepts included in a document.

  • Tf-idf weighted averaging of words using pre-trained W2V representations results in much more case-specific information.

  • Training a W2V model on a legal corpus in order to build a legal word embedding is an effective way of processing legal documents at a semantic level. Such a model results in vectors that convey the legal meaning of words. We were able to extract some useful information about the theme of a document by averaging nouns using vectors from the legal word embedding. The downside of this approach is that some specific words of a document which carry important meaning might not have vectors in the legal word embedding, since this model has been trained on a much smaller dataset than the pre-trained word embeddings.

  • Tf-idf weighted averaging of vectors from the legal word embedding yields very specific information about a document, such as the cases that are cited or the legal acts that are mentioned.