Editor’s Note: The author of this post leads product teams to develop legal technology.
By Mark Kerzner, Chief Product Architect, LexInnova
The term “Big Data” is so pervasive and so full of unknowns that it often sounds like “Boo! Data!” Along these lines, you find articles that talk about scary repercussions of Big Data for law, starting with the fact that much more data will be available for collection, and ending with industries being ruined in the wake of automation. At the other extreme, you find optimistic articles vaguely claiming that “Big Data will solve all our problems,” including Facebook outages and those we do not even know of. Very often these extremes are due to a lack of clear understanding of what Big Data is and how it relates to the legal industry.
In this article I will define exactly, and with examples, what machine learning is, what artificial intelligence is, and how they are relevant to the problems that lawyers face. This will allow the reader to judge for herself and to form an opinion on what they portend for the legal industry.
“Machine learning” grew out of statistics. For example, if we take the Patent Application Backlog from the USPTO site and draw a straight line approximating last year’s numbers, we can “predict” that by the end of 2015 the backlog will be around 530,000 to 550,000 applications.
How accurate is such a prediction? It depends on two factors. First, it assumes that “what was is what will be.” Second, I only went back one year. If we took the prior year into account, we might have come up with a cyclical curve, resulting in a different prediction.
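The straight-line “prediction” can be sketched in a few lines of Python. This is only an illustration: the monthly figures below are made up for the example and are not real USPTO data, and `fit_line` and `predict` are hypothetical helper names.

```python
# Made-up monthly backlog figures (NOT real USPTO data), one per month.
months = [1, 2, 3, 4, 5, 6]
backlog = [600_000, 595_000, 590_000, 585_000, 580_000, 575_000]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(months, backlog)

def predict(month):
    """Extend the fitted line to a month we have not observed yet."""
    return slope * month + intercept

print(f"trend: {slope:.0f} applications per month")
print(f"predicted backlog in month 12: {predict(12):.0f}")
```

The prediction is only as good as the assumption that the trend continues. Feeding in two years of figures instead of one would change the slope, and a cyclical pattern would not be captured by a straight line at all.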
I have intentionally oversimplified. Machine learning really deals with more complex data. For example, it may want to consider not just the month and the backlog size, but also the number of examiners. This would give us a three-dimensional chart, and a two-dimensional plane to approximate the trend. Or it may deal with four or more dimensions.
Finding the best approximation will then be harder than just fitting a line to a few points. It will require a computational algorithm. These algorithms are grouped in the area called “mathematical optimization.” Thus, machine learning is a combination of advanced statistics and mathematical optimization. Nevertheless, the picture of a line fitted to a few points will always be a helpful mental model.
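To give a flavor of what “mathematical optimization” means here, the sketch below fits a plane to three-dimensional data using gradient descent, one of the simplest optimization algorithms. Everything in it is an assumption made for illustration: the data is synthetic (generated from a known plane so that the result can be checked), the two inputs merely stand in for “month” and “number of examiners,” and the learning rate and step count are arbitrary choices that happen to work for this data.

```python
# Synthetic inputs: (month, examiners) pairs on a small, arbitrary scale.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
# Synthetic outputs generated from a known plane: y = 2*x1 - 3*x2 + 5.
targets = [2.0 * x1 - 3.0 * x2 + 5.0 for x1, x2 in points]

def fit_plane(points, targets, learning_rate=0.1, steps=2000):
    """Fit y = w1*x1 + w2*x2 + b by gradient descent on the mean squared error."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, targets):
            error = (w1 * x1 + w2 * x2 + b) - y
            # Partial derivatives of the mean squared error.
            g1 += 2 * error * x1 / n
            g2 += 2 * error * x2 / n
            gb += 2 * error / n
        # Take a small step "downhill" to reduce the error.
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
        b -= learning_rate * gb
    return w1, w2, b

w1, w2, b = fit_plane(points, targets)
print(f"recovered plane: y = {w1:.3f}*x1 + {w2:.3f}*x2 + {b:.3f}")
```

With real data the points would not lie exactly on a plane, and the algorithm would settle on the plane with the smallest average error instead of recovering an exact answer.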
A “yes” or “no” decision
Now let us look at an example of real machine learning. Say we are looking for a way to filter out spam emails. We can start by collecting a good number of emails (they will be called a training set), and by manually marking those that are spam. Now we need to describe emails by using attributes, such as the length of the email, the presence and frequency of keywords indicating spam, the place of origination, etc. Usually you find a few dozen such characteristics, each of which can be expressed as a number. And then we need to find a “line of separation,” which is drawn in such a way that all spam emails stay on one side of the line, and the good emails on the other.
This “simple” approach has a pretty complicated name: Support Vector Machines (SVM). Even experts need an explanation of why it is called this way. The fact that we need to manually label all emails makes it “supervised” machine learning. The real algorithm is more complex, and the points are in a multi-dimensional space. (In real life spam detection is usually done with another algorithm, called Naive Bayes, but SVM provides a very good illustration.)
As you can see, the idea is pretty simple, and it works surprisingly well. We all know that the spam filters today are pretty good. What is unexpected is that the same idea can be used to “predict” whether a given document will be privileged or not, and can form a basis for automated privilege review.
A privileged document is one that is protected from disclosure during eDiscovery. There are a few reasons for a document to be privileged: communication between lawyer and client is the most common, but there are also others, such as doctor-patient and husband-wife privileges. Finding out whether a certain email is privileged is the same yes-or-no decision as spam-or-not classification, and can be approached with the same machine learning technique, Support Vector Machines, as above.
However, don’t get your hopes up too high just yet. Even if you were to have a huge training set (and spam filters have very large ones, and still get things wrong sometimes), you can only promise a “high probability.” One can argue that humans make even more mistakes than this, but the fact remains that machine learning does not really “understand” the documents. It does not even imitate this. It is just using statistics to come up with a good probability estimate.
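The “line of separation” can be sketched with a perceptron, a simpler relative of the SVM. A perceptron settles for any line that separates the two groups, whereas an SVM looks for the line with the widest margin between them. The two attributes used here (a count of spam keywords and a count of links) and all the numbers are made up for illustration.

```python
# Each email is described by two made-up attributes:
# (number of spam keywords, number of links). Label +1 = spam, -1 = good.
emails = [(5, 4), (6, 5), (7, 3), (6, 6), (0, 1), (1, 0), (1, 1), (0, 0)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

def train_perceptron(samples, labels, max_epochs=1000):
    """Find weights w and offset b so that sign(w.x + b) matches every label."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of the line: nudge the line
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training email is on the correct side
            break
    return w, b

def classify(w, b, x):
    """Return +1 (spam) or -1 (good) depending on the side of the line."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w, b = train_perceptron(emails, labels)
print("line of separation:", w, b)
print("new email (8 keywords, 5 links):", classify(w, b, (8, 5)))
```

The manual labels in `labels` are what make this “supervised” learning: the algorithm only adjusts the line until it agrees with the human decisions it was given.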
Converting documents to numbers
There are other methods of machine learning. For example, a common request in legal document review is “give me more documents like this.” This is accomplished in the following way: all documents are “clustered,” that is, assigned to groups of similar documents. More documents are then provided from the cluster to which the current one belongs.
In order to measure document similarity, we need to convert documents to numbers. One of the ways of doing this is based on term frequency. Let us say a document of a hundred words has the word “contract” occurring ten times in it. The frequency of this term is 10%. This is pretty high, and we might think that the document is about contracts. However, we must also consider the overall frequency of the term “contract” in the complete collection of our documents. Say, for example, that the overall frequency of the word “contract” in our collection is only 5%. Then this document indeed stands out as “contract-related.”
In the same way, we can compute these frequencies for all words in all documents. To make it sound more scientific, this calculation is called TF-IDF, or “term frequency – inverse document frequency”, and more complex frequency formulas are used.
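As a rough illustration, here is one common textbook variant of TF-IDF: the frequency of a word within a document, multiplied by the logarithm of how rare the word is across documents. The three “documents” are invented for the example, and production tools typically use more elaborate weighting and smoothing.

```python
import math
from collections import Counter

# Three tiny made-up documents.
documents = [
    "the contract was signed and the contract was filed",
    "the meeting was moved to friday",
    "the invoice was paid on friday",
]

def tf_idf(documents):
    """Return one {word: score} dictionary per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            # term frequency * inverse document frequency
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(documents)
top_word = max(scores[0], key=scores[0].get)
print("most distinctive word in the first document:", top_word)
print("score of 'the' in the first document:", scores[0]["the"])
```

Words such as “the,” which occur in every document, receive a score of zero, while “contract,” which is frequent in one document and absent from the others, stands out.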
The question now becomes: how close is this to document understanding? I think you will conclude with me that it is far. We are just computing some statistics, and declaring documents similar if they have similar statistics.
This solution leaves statisticians and data scientists quite happy, because it is elegant and familiar to them, but lawyers may understandably remain distrustful. The solution does not explain why and how it arrived at the result. Even from the mathematical point of view it leaves something to be desired. If three words, “powerful,” “strong” and “Paris,” occur in the same document, our method sees them as equally distant from one another, although the first two are much closer in meaning. Moreover, the order of words is completely lost. A nonsensical document with the same words all mixed up will look in every way the same as the original document. We will be back to these problems with machine learning later, but for now we can conclude that:
- Machine learning is based on statistics
- Machine learning is an art, and can be applied differently by different people, leading to different results
- It is not based on linguistic information but on word frequencies and other similar measures.
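Both shortcomings are easy to demonstrate with a small sketch. It uses plain word counts (a “bag of words”) and cosine similarity, a common way to compare such counts. The sentences are invented, and `bag_of_words` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count the words in a text, ignoring their order."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """1.0 means identical word statistics, 0.0 means no words in common."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

original = bag_of_words("the client signed the contract")
shuffled = bag_of_words("contract the signed client the")

# Word order is lost: the scrambled text has exactly the same statistics.
print(original == shuffled)
print(cosine_similarity(original, shuffled))

# Every distinct word is its own dimension, so "powerful" is exactly as far
# from "strong" as it is from "paris".
print(cosine_similarity(bag_of_words("powerful"), bag_of_words("strong")))
print(cosine_similarity(bag_of_words("powerful"), bag_of_words("paris")))
```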
Training in machine learning
Before we proceed to artificial intelligence, we need to touch on “training.” In machine learning, “training” means labeling the data with the desired classification. For example, in the case of spam detection, it means assigning the category of “spam” or “not spam” to each of the emails which we use to “train” our system. Training also includes finding the parameters which will best approximate our classification. For example, in the case of spam, it would be the line that separates the two groups of points.
In the case of legal review, training may include selecting, say, 10% of the complete document set and allowing a group of human reviewers to label them as “relevant” or “not relevant,” perhaps “hot” or “not hot,” and so on. Then we “train” the model, and allow the computer to complete the classification for the remaining 90% of the collection.
Apart from the machine learning considerations above, this training raises two questions:
- The reviewers have seen just a small subset of all the documents, chosen perhaps at random. Should they see the complete selection, they might have labeled the training set differently.
- The training set might not be representative of the complete document set. That means that the directions that the computer receives during this training will be incomplete and may result in incorrect classification.
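The 10%/90% workflow can be sketched as follows. This is a toy under loudly stated assumptions: each “document” is just a pair of made-up numbers, the list `reviewer_labels` stands in for decisions that human reviewers would make, and a nearest-centroid classifier (assign each document to the label whose average training example is closest) is used only because it is short, not because review tools necessarily work this way.

```python
import random

def fit_centroids(samples, labels):
    """Average the feature vectors of every label present in the training set."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic collection of 100 "documents", each a pair of made-up features.
collection = (
    [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    + [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
)
reviewer_labels = ["not relevant"] * 50 + ["relevant"] * 50

# Reviewers label a random 10% sample; the model classifies the other 90%.
train_ids = sorted(rng.sample(range(len(collection)), k=len(collection) // 10))
rest_ids = [i for i in range(len(collection)) if i not in train_ids]

centroids = fit_centroids(
    [collection[i] for i in train_ids],
    [reviewer_labels[i] for i in train_ids],
)
predictions = {i: predict(centroids, collection[i]) for i in rest_ids}

print("labels seen in training:", sorted(centroids))
print("documents classified by the model:", len(predictions))
```

Note that the model can only ever assign labels that appeared in the sample. If the random 10% happened to miss a category, or to contain only unusual examples of it, the remaining 90% would be classified accordingly, which is exactly the representativeness concern raised above.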
The promise of machine learning
Please don’t get me wrong. Mathematically the model is correct, and we can always tell the lawyers that “if you want better quality, expressed as the probability of correct classification, choose a bigger training set.”
Machine learning is useful and it is the best offering that we have available. There are many more algorithms in machine learning than the examples I have shown. You can further improve it by having people knowledgeable in machine learning help you with your data analytics and with the use of the existing tools.
However, I hope that in the examples above you can also see its limitations. Everybody will agree that computers have the potential of getting ever smarter, and that their applications in law still have a long way to go.
Artificial Intelligence, or AI, is broader than machine learning. It deals with logical reasoning and with rules that humans use. It tries to imitate what humans do, or to emulate it, while using perhaps a different way of achieving it than humans.
Machine learning is usually considered part of AI, and AI may use some machine learning algorithms as part of a specific solution. A classical example of AI is IBM’s Deep Blue chess computer, and more recently Watson, an IBM system that beat human champions at the game of Jeopardy. Watson combined the reading of all of Wikipedia and search technology with the latest in Natural Language Processing and some machine learning. It is definitely AI, and this technology is now offered in the IBM cloud.
So, how does AI work? As one example let’s take grammar analysis. AI algorithms detect verbs and nouns, sentences and sentence structure, and can attempt to formulate the meaning of the text. One of the tools that AI researchers use is called “General Architecture for Text Engineering,” or GATE.
In this screenshot GATE shows the dates, people, organizations, locations and sentences that it has detected. You can immediately see that this approach aims at understanding, or at least imitating understanding, and is very promising. GATE is still based on rules, without an admixture of statistics.
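To show what “based on rules” means, here is a deliberately tiny sketch of rule-based extraction using regular expressions. It is not GATE and does not use GATE’s API (GATE is a separate toolkit with far richer rules); the three patterns and the sample text are invented for illustration.

```python
import re

# Rule: a month name, a day, a comma and a four-digit year.
DATE_RULE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
# Rule: one or more capitalized words followed by a company suffix.
ORG_RULE = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|LLC|Ltd)\b\.?")
# Rule: a sentence ends at ., ! or ? when followed by space and a capital letter.
SENTENCE_RULE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = (
    "Acme Widgets Inc. signed the lease on March 3, 2014. "
    "The tenant moved in on April 1, 2014."
)

dates = DATE_RULE.findall(text)
organizations = ORG_RULE.findall(text)
sentences = SENTENCE_RULE.split(text)

print("dates:", dates)
print("organizations:", organizations)
print("sentences:", sentences)
```

Each rule encodes a piece of human knowledge about language, which is why the results are easy to explain. The same property makes such rules brittle: a date written as “3 March 2014” or a company named without a suffix would slip through until someone writes another rule.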
Pure Natural Language Processing, or NLP, algorithms have their own problems. They require more work than machine learning, and they are usually targeted at very specific applications. Moreover, they are still fragile. Why? Let me give an example.
An early attempt at machine translation was undertaken by IBM in the late 1980s and early 1990s. They used a few million verified English/French parallel texts taken from the transcripts of the Canadian parliament. The effort showed much promise but could not reach more than 80-85% accuracy. It exemplified “small data” thinking: using small but clean and verified data to produce the algorithms.
The current leader in machine translation, Google, took the Big Data approach. They also used NLP, but they added statistical methods, and the training set was all the world’s data. The result is well-known as https://translate.google.com/. It also shows another Big Data trait: it learns from the users’ critique of incorrect translations.
As it turns out, the most robust systems involve a combination of NLP and advanced machine learning. Google has recently published an article on how this is done. If you remember, the machine learning algorithms which were used for text analysis ignored word order. Google researchers suggested an algorithm that would take the word order and the context into account. They called it “Paragraph Vectors” (see diagram) to indicate that the analysis is on the level of whole paragraphs. This approach is implemented in machine translation and it beats all other known algorithms.
What does this mean for the legal industry?
My goal in this article was to show exactly what machine learning is, and what its limitations are. Then we looked at AI, with its promises. Now let me formulate my conclusions.
- Most of the current TAR (technology assisted review) in eDiscovery implements existing machine learning algorithms. This is helpful but limited, and to achieve the best possible results it helps to have the assistance of a data scientist.
- The Artificial Intelligence approach requires more work, and results in systems for very specific applications. However, it allows unprecedented precision, as in Google machine translation.
- The most prominent part of AI applicable to legal problems is called text analytics. The greatest quality here comes from combining NLP with statistical analysis and machine learning, and from using all data (Big Data) as training sets.
The amount of data available for eDiscovery and analysis continues to grow. One is hard pressed to do review without the use of analytics tools, and will need these tools even more in the near future. Just as in a regular war, where the quality of one’s weapons is of extreme importance, so in future legal battles success will be strongly influenced by the analytics tools that the parties use, and by their aptitude in applying these tools to the matter at hand.