Many research works have shown that word2vec models (also referred to as W2V or word embeddings) provide a very effective semantic representation of words. These representations convey the meanings of words in relation to the contexts where those words are most likely to appear. However, these models learn from a training dataset and can be biased (or specialized) towards the knowledge that is most represented in that dataset.
Since we work in a specific domain, the understanding of legal cases, we are interested in investigating the efficacy of a W2V model trained only on legal cases, in comparison with pre-trained models trained on very large datasets from various domains. The advantage of training a legal word embedding is that such a model can specialize in legal terms and the specific meanings that words convey in a legal context. On the other hand, pre-trained models benefit from learning from huge datasets, which makes them well tuned and comprehensive.
In this experiment, we train a legal word embedding and compare it with the pre-trained W2V model provided in the spacy package.
2. Training a legal word embedding
Here we trained a skip-gram W2V model using tensorflow. The dimension of the embedding space is 128 and the length of the skip window is 10 (5 words to the left and 5 to the right). Further reading about skip-gram W2V.
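To illustrate what the skip window means in practice, the following hypothetical helper generates the (target, context) pairs a skip-gram model trains on, assuming 5 words on each side. This is a sketch for illustration only, not the actual tensorflow training code used in the experiment:

```python
def skipgram_pairs(tokens, window=5):
    """Generate (target, context) training pairs with up to `window`
    words on each side, i.e. a skip window of length 2 * window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the application for judicial review is allowed".split()
print(skipgram_pairs(tokens)[:3])
# → [('the', 'application'), ('the', 'for'), ('the', 'judicial')]
```

The model then learns to predict each context word from its target word, which is what pushes words appearing in similar contexts towards nearby vectors.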
2.1 Training Dataset
Our corpus contains 51K judgments of the Federal Court of Canada, which can be found here. This is a relatively small dataset, unbalanced and biased towards immigration decisions, so the resulting model will have limitations. Results are shown for an immigration case which can be found here.
2.2 Visualization of 400 words in legal word embedding
We trained the W2V model on the 5000 most frequent words of the legal corpus. The vectors assigned to the first 400 words are shown in the following figure, using t-SNE from the scikitlearn package. The actual vectors are 128-dimensional; t-SNE maps these high-dimensional vectors to a 2-dimensional space. It can be observed that semantically close words are neighbors in this embedding space.
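The projection step can be sketched as follows; random vectors stand in here for the 400 trained 128-dimensional embeddings, so this is illustrative only:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the first 400 trained embedding vectors (128-dimensional)
embeddings = rng.normal(size=(400, 128)).astype(np.float32)

# t-SNE maps the 128-dimensional vectors down to 2-D for plotting
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (400, 2)
```

The 2-D `coords` can then be scattered with the word labels to produce a figure like the one below.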
3. Finding closest words
In this section we use the dot product of normalized vectors (i.e. cosine similarity) to calculate the similarity between words and to find the words most similar to a given term.
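A minimal sketch of this similarity search, with random unit vectors and a toy vocabulary standing in for the trained embedding (all names here are illustrative):

```python
import numpy as np

# Toy vocabulary with random stand-in vectors for the learned embedding
rng = np.random.default_rng(1)
vocab = ["immigration", "citizenship", "fcj", "irpa", "appeal"]
emb = rng.normal(size=(len(vocab), 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize each row

def closest_words(word, k=4):
    """Return the k nearest words by dot product of normalized vectors
    (equivalently, cosine similarity)."""
    q = emb[vocab.index(word)]
    sims = emb @ q  # similarity with every vocabulary word
    order = np.argsort(-sims)
    return [(vocab[i], round(float(sims[i]), 2)) for i in order
            if vocab[i] != word][:k]

print(closest_words("immigration"))
```

Because all rows are unit-normalized, a single matrix-vector product gives the similarity of the query with the whole vocabulary at once.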
Comparing the general word embedding with the legal embedding in finding semantically close words:
1) Finding the nearest words using the similarity function of the pre-trained word embedding in spacy
2) Finding the nearest words using the dot product of vectors formed in the legal word embedding
The following table shows the four closest words to some example words and their similarity measures.
| Word | Legal embedding (closest words, similarity) | spaCy pre-trained embedding (closest words, similarity) |
|---|---|---|
| immigration | ('citizenship', 0.51), ('fcj', 0.34), ('fct', 0.32), ('irpa', 0.32) | ('immigrants', 0.76), ('citizenship', 0.64), ('reform', 0.64), ('legislation', 0.60) |
| immigrants | ('ingredients', 0.32), ('refusal', 0.29), ('measure', 0.29), ('fire', 0.27) | ('immigration', 0.76), ('foreigners', 0.70), ('citizens', 0.66), ('minorities', 0.66) |
| allowed | ('dismissed', 0.52), ('costs', 0.38), ('ordered', 0.35), ('assessments', 0.37) | ('allow', 0.75), ('allowing', 0.73), ('not', 0.65), ('unless', 0.65), ('they', 0.65), ('would', 0.64), ('only', 0.64), ('should', 0.64) |
| conclusion | ('decision', 0.42), ('unreasonable', 0.40), ('finding', 0.38), ('conclusions', 0.37) | ('conclude', 0.82), ('conclusions', 0.78), ('explanation', 0.68), ('contrary', 0.67) |
It is observed that in the legal word embedding, the word “immigration” is closest to “fcj”, a frequently occurring component of Federal Court citations, and “irpa”, the short form of the Immigration and Refugee Protection Act. This is because the legal word embedding conveys the specific legal meaning of the word, whereas the embedding provided by spacy carries the general meaning of “immigration”. The similarity measures in the legal embedding are much lower than in the pre-trained embedding, though, which could be due to the limitations of the dataset or the training algorithm. The legal embedding also yields much more legally relevant words for the terms “allowed” and “conclusion” than the general embedding.
For the word “immigrants”, the legal embedding does not do an impressive job, since this word is not used frequently in the legal corpus.
Training specialized word embeddings can be very useful in the processing of legal cases. However, larger datasets are needed to form a reliable word embedding with high similarity measures between relevant words.
4. Averaging of an entire document
In this section we assess semantic summarization of documents by averaging the vectors assigned to their nouns. We also look into emphasizing the thematic words of a document by using tf-idf scores as averaging weights.
1) Assessing semantic averaging as a way of extracting keywords of a legal document
2) Assessing the effect of tf-idf weighting while averaging the semantic content of a document
Four ways of averaging the semantic meaning of a relatively long document have been tested and compared in this experiment. Spacy is used to tag words and to separate nouns from other words in a document.
Averaging the general meaning of all nouns in the document
Vectors of nouns were loaded from spacy and averaged.
The closest words to this average were found in the vocabulary of spacy package.
Averaging the general meaning of thematic words of a document
A tfidf vectorizer from scikitlearn was trained using the legal corpus.
Vectors of nouns were loaded from spacy and averaged with tfidf scores as weights. In this method, nouns that are specific to a single document are emphasized, provided a vector is assigned to them. These words can be related to the topic of the legal case or to specific issues raised in the case.
The closest words to this average were found in the vocabulary of spacy package.
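The vectorizer step can be sketched as follows, with a three-sentence toy corpus standing in for the 51K judgments (illustrative, not the actual training code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the legal corpus of 51K judgments
corpus = [
    "the application for judicial review is allowed",
    "the appeal is dismissed with costs",
    "the applicant seeks an exemption on compassionate grounds",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# tf-idf scores of one document, usable as averaging weights
row = vectorizer.transform([corpus[2]])
weights = {word: row[0, col] for word, col in vectorizer.vocabulary_.items()
           if row[0, col] > 0}
print(sorted(weights))
```

Words that occur in many documents (like “the”) receive low scores, while document-specific words (like “exemption”) receive high ones, which is exactly what makes these scores useful as averaging weights.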
Averaging legal meaning of all nouns
Vectors of nouns are loaded from the legal embedding space trained in section 2 and the vectors are averaged.
The closest words to this average were found in the vocabulary of legal embedding space.
Averaging legal meaning of specific words of the case
A tfidf vectorizer from scikitlearn was trained using the legal corpus.
Vectors of nouns are loaded from the legal embedding space trained in section 2 and averaged with tfidf scores as weights. In this method, nouns that are specific to a single document are emphasized, provided a vector is assigned to them. These words can be related to the topic of the legal case or to specific issues raised in that case. In comparison to method 2 described above, fewer words have assigned vectors; in particular, non-legal words might not have vectors in this case.
The closest words to this average were found in the vocabulary of the legal embedding space.
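The weighted averaging common to these methods can be sketched as follows, with hypothetical noun vectors and tf-idf scores standing in for the spacy / legal-embedding vectors and the trained vectorizer:

```python
import numpy as np

# Hypothetical noun vectors and tf-idf scores for one document (stand-ins
# for spacy / legal-embedding vectors and a scikit-learn TfidfVectorizer)
rng = np.random.default_rng(2)
nouns = ["officer", "visa", "applicant", "board"]
vectors = {w: rng.normal(size=128) for w in nouns}
tfidf = {"officer": 0.8, "visa": 0.6, "applicant": 0.5, "board": 0.3}

def weighted_average(words):
    """tf-idf weighted average of the document's noun vectors;
    out-of-vocabulary words (no vector or no score) are skipped."""
    vecs, weights = [], []
    for w in words:
        if w in vectors and w in tfidf:
            vecs.append(vectors[w])
            weights.append(tfidf[w])
    return np.average(np.stack(vecs), axis=0, weights=weights)

doc_vec = weighted_average(nouns)
print(doc_vec.shape)  # (128,)
```

The resulting `doc_vec` lives in the same space as the word vectors, so the nearest-word search from section 3 can be reused to read off its closest vocabulary words.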
Results of applying the four methods described above to this case are shown in the following table.
| Method | Closest words to the average vector |
|---|---|
| 1- General Embedding without tfidf | ['whether', 'that', 'concerned', 'reasons', 'not', 'matter', 'however', 'because', 'decision', 'reason', 'fact', 'circumstances', 'if', 'would', 'regard', 'should', 'matters', 'particular', 'certain', 'there', 'concern', 'given', 'any', 'consideration', 'when', 'law', 'but', 'regardless', 'legal', 'consider', 'rather', 'same', 'clearly', 'even', 'regarding', 'what', 'possible', 'court', 'therefore', 'could'] |
| 2- General Embedding with tfidf | ['decision', 'whether', 'court', 'law', 'concerned', 'officer', 'legal', 'appeal', 'enforcement', 'that', 'criminal', 'judge', 'authorities', 'matters', 'government', 'authority', 'reasons', 'federal', 'circumstances', 'immigration', 'officers', 'justice', 'informed', 'civil', 'concern', 'however', 'not', 'would', 'because', 'case', 'matter', 'police', 'concerns', 'general', 'decisions', 'responsibility', 'consideration', 'stating', 'should', 'regard'] |
| 3- Legal Embedding without tfidf | ['application', 'review', 'appeal', 'irpa', 'decision', 'board', 'officer', 'visa', 'applicant', 'rpd', 'applicants', 'para', 'prra', 'judge', 'immigration', 'act', 'refugee', 'decisions', 'judicial', 'case', 'protection', 'paras', 'division', 'determination', 'boards', 'citizenship', 'matter', 'rpds', 'court', 'panel', 'status', 'permanent', 'convention', 'standard', 'issue', 'supreme', 'charter', 'iad', 'provisions', 'canada'] |
| 4- Legal Embedding with tfidf | ['officer', 'irpa', 'appeal', 'applicant', 'visa', 'application', 'board', 'decision', 'immigration', 'review', 'applicants', 'rpd', 'prra', 'judge', 'refugee', 'sponsor', 'rpds', 'paras', 'para', 'permanent', 'citizenship', 'supreme', 'minister', 'convention', 'charkaoui', 'iad', 'act', 'assessment', 'removal', 'panel', 'status', 'officers', 'boards', 'request', 'division', 'determination', 'protection', 'ministers', 'judicial', 'compassionate'] |
As observed in the table above, both tf-idf weighting and using the legal embedding instead of the general embedding are effective ways to extract important information from a legal document.
In step 1, we find only very general words. Based on these words we can cautiously guess that this is a legal document.
In step 2, using tf-idf weighting, more legal terms show up. Here we can say with certainty that it is a legal case and cautiously guess that it is an immigration case.
In step 3, using the legal embedding space, many legal words show up. Here it is obvious that this is an immigration case. Words such as visa, rpd (Refugee Protection Division), prra (pre-removal risk assessment) and iad (Immigration Appeal Division) make this clear.
In step 4, using tf-idf weighting of the legal meaning of words, we get very specific information from this document. 31 words are common between steps 3 and 4. However, two interesting terms, “sponsor” and “Charkaoui”, show up only with tf-idf weighting. These words suggest that this case is probably about a family, which is true, and that the famous case “Charkaoui v. Canada (Citizenship and Immigration), [2008] 2 S.C.R. 326, 2008 SCC 38” was cited in this document.
5. Semantic averaging of paragraphs
Considering how much information is lost when many words are averaged together, we now turn to shorter segments of the document, its paragraphs, and summarize each of them by averaging.
1) Comparing semantic summarization of paragraphs using legal word embeddings and pre-trained word embeddings
Cleaning and parsing the document to the paragraph level.
Averaging the general meaning of thematic words of each paragraph:
a. A tfidf vectorizer from scikitlearn was trained using the legal corpus.
b. Vectors of nouns were loaded from spacy and averaged with tfidf scores as weights, for each paragraph.
c. The closest words to this average were found in the vocabulary of the spacy package for each paragraph.
Averaging the legal meaning of specific words of the case, for each paragraph:
a. A tfidf vectorizer from scikitlearn was trained using the legal corpus.
b. Vectors of nouns are loaded from the legal embedding space trained in section 2 and averaged with tfidf scores as weights.
c. The closest words to this average were found in the vocabulary of the legal embedding space.
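The per-paragraph pipeline can be sketched as follows, with a stand-in vector lookup and tf-idf weights (the real pipeline uses spacy or the legal embedding and the scikit-learn vectorizer):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in vector lookup and tf-idf weights for illustration
vectors = {w: rng.normal(size=128) for w in
           ["exemption", "grounds", "regulations", "applicant", "visa"]}
tfidf = {w: rng.uniform(0.1, 1.0) for w in vectors}

def summarize_paragraph(text):
    """tf-idf weighted average over the paragraph's in-vocabulary words;
    returns None for paragraphs with no known words."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return None
    weights = np.array([tfidf[w] for w in words])
    return np.average(np.stack([vectors[w] for w in words]),
                      axis=0, weights=weights)

document = ("An exemption was requested on compassionate grounds.\n\n"
            "The Regulations exclude the applicant.")
paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
summaries = [summarize_paragraph(p) for p in paragraphs]
print(len(summaries))  # 2
```

Each paragraph summary can then be matched against the vocabulary with the nearest-word search from section 3 to produce a few-word summary per paragraph.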
The sets of words found by these two methods are visualized and compared. For both methods, the word vectors are shown in the pre-trained word embedding space for the sake of comparison.
Here is an example of the summarization of one paragraph using the two methods described above:
Because Duy was excluded from the family class as a non-disclosed dependant by reason of s.117(9)(d) of the Immigration and Refugee Protection Regulations, SOR/2002-227 (hereafter the Regulations), an exemption was requested on humanitarian and compassionate (H&C) grounds under s.25 of the Immigration and Refugee Protection Act, SC 2001, c 26 (hereafter the IRPA).
Results of summarization:
General Embedding with tfidf ['immigration', 'law', 'laws', 'legislation']
Legal Embedding with tfidf ['immigration', 'irpa', 'regulations', 'citizenship']
It is observed that more important information is extracted using the legal word embedding than the pre-trained general word embedding. The sets of words extracted from all paragraphs of this case are shown for the general and legal embeddings in figures 3 and 5, respectively.
|appeal- application- because- brother- case- chief- child- children- citizenship- concerned- consideration- court- daughter- decision- decisions- employee- enforcement- existence- family- father- ground- grounds- husband- immigrants- immigration- informed- justice- law- laws- legal- legislation- matter- matters- mother- officer- officers- officials- opinion- parents- police- principal- regard- review- right- son- that- visa- whether|
Figure 3: Set of words found by tf-idf weighted averaging of each paragraph in the pre-trained word embedding space.
|appeal- applicant- applicants- application- board- canada- case- child- children- citizenship- class- considerations- decision- deschamps- exemption- existence- family- father- fcj- grounds- immigration- irpa- issue- judge- judicial- justice- member- minister- montigny- mother- officer- officers- para- paragraph- permanent- principal- question- refugee- regulations- residence- resident- review- son- sponsor- status- supra- visa|
Figure 5: Set of words found by tf-idf weighted averaging of each paragraph in legal word embedding space.
|applicant- board- canada- class- considerations- deschamps- exemption- fcj- irpa- issue- judge- judicial- member- minister- montigny- para- paragraph- permanent- question- refugee- regulations- residence- resident- sponsor- status- supra|
Figure 6: Set of words that appear using the legal embedding but do not appear using the pre-trained word embedding.
Comparing figures 3 and 5 shows that using the legal embedding we can find a cluster of words that are highly relevant to immigration in a legal context. In both figures, words related to family relationships appear, since this case is about a family going through the immigration process. However, in figure 5 there is an emphasis on legal terms and immigration-related concepts. Figure 6 shows the set of words that can only be found when paragraphs are summarized in the legal embedding space.
- Averaging vector representations using pre-trained word embeddings results in loss of information and preserves only the general concepts in a document.
- Tf-idf weighted averaging of words using pre-trained W2V representations yields much more case-specific information.
- Training a W2V model on a legal corpus in order to build a legal word embedding is an effective way of processing legal documents at a semantic level. Such a model produces vectors that convey the legal meaning of words. We were able to extract useful information about the theme of a document by averaging nouns using vectors from the legal word embedding. The downside of this approach is that some specific words of a document with important meaning might not have vectors in the legal word embedding, since this model was trained on a much smaller dataset than the pre-trained word embeddings.
- Tf-idf weighted averaging of vectors from the legal word embedding yields very specific information in a document, such as the cases that are cited or the legal acts that are mentioned.