Since the onset of electronic discovery, lawyers, service providers and technologists have grappled with the ever-growing volumes of data subject to discovery requests. Larger volumes of data increase the cost of discovery at every stage, from collection through production. However, it is widely agreed that the most expensive and time-consuming component of the overall discovery process is document review itself. In response, the electronic discovery industry has introduced a range of solutions to reduce the cost and burden of document review in the “post FRCP amendment” era. Two frequently demonstrated technologies meant to shrink the number of documents requiring human review are Concept Analytics and Predictive Coding.
Concept Analytics, backed primarily by Latent Semantic Indexing or similar technology, has been available to reviewers for at least ten years. In short, Concept Analytics solutions offer lawyers a set of tools to find or group documents based on their meaning rather than their literal text. Concept Clustering is one example of the capabilities frequently found in Concept Analytics tool suites. Concept Analytics has several major uses. This article concerns its use in large-scale discovery to reduce the number of documents that subsequently require human review, whether by culling documents or by identifying documents for inclusion. Although litigants and counsel have shown interest in Concept Analytics from the start, service providers almost universally report a low adoption rate among their clients. However, a new crop of Concept Analytics tools with updated interfaces and better integration with review platforms is renewing excitement for the technology.
Predictive Coding, the relative newcomer, relies on Machine Learning technology, especially a group of tools called Classifiers. Put simply, Predictive Coding requires lawyers to review a sample of the document collection; the machine then learns to mimic the lawyers’ decision-making and automatically codes the balance of the population. Predictive Coding is often employed only to predict whether documents are responsive or not. This is the Predictive Coding function considered in this article. When the Predictive Coding process is complete, the legal team can choose whether or not to review the documents predicted as responsive. In many instances, if not most, the documents predicted as responsive are reviewed by humans. Applied this way, Predictive Coding functions as a culling tool. Remarkably, although there is significant industry concern regarding the defensibility of Predictive Coding, and relatively little concern about the defensibility of Concept Analytics, it is evident that Predictive Coding is gaining traction at a far greater rate. In my opinion, there are two leading reasons why this is so.
“…Concept Analytics requires humans, who have proven to be inconsistent, to draw the line between responsive and unresponsive documents. With Predictive Coding, the line between responsive and unresponsive is modeled mathematically.” Jonathan Kafka, iControl ESI Vice President of Operations
The inner workings of Predictive Coding tools are no more easily understood than the technology behind Concept Analytics. Here, it is important to note that the judicial debate around Predictive Coding’s defensibility does not center on how Machine Learning tools work. The focus is on the practical processes employed to train the system and on the statistical validation of its results. Predictive Coding relies on a well-defined workflow to move lawyers through the process. In contrast, Concept Analytics tools allow reviewers to interact with the documents more freely, which opens the door to a very subjective process. Very little about Predictive Coding is subjective because major decisions, like when to stop training, are informed by rigorous statistics. More specifically, Concept Analytics requires humans, who have proven to be inconsistent, to draw the line between responsive and unresponsive documents while navigating multiple views of variable data. With Predictive Coding, the entire population is considered at once and the line between responsive and unresponsive is modeled mathematically. I submit that, even absent defensibility concerns, the desire to follow a methodical process while culling documents from the review and the need for confidence in the results are the two key drivers of Predictive Coding’s rapid acceptance by the market.
First, to reinforce the legitimacy of the Predictive Coding process and to manage the training procedure efficiently, a pre-project workflow diagram is created, including plans to identify and handle problematic subpopulations. The companion piece to the workflow diagram is a rigorously detailed decision map that documents the sampling methods and each coding decision made during the project, whether by human or machine, tied to specific Document IDs. The documented pre-project workflow, combined with the promise of iron-clad decision tracking, instills a high initial level of confidence in the Predictive Coding process. Further, the decision map answers certain potential challenges to the process’s defensibility.
Second, Predictive Coding provides statistically measurable results throughout the project. These measurements inform every aspect of the process, including the size of the sample sets, the number of sample sets, how to handle problematic subpopulations and when to stop training. Training is an iterative process that must be tightly managed. It essentially consists of human coding, from which the machine learns, and evaluation, which determines how closely machine decisions match human decisions. By this process, a mathematical model is formed that consistently draws the line between responsive and unresponsive documents. Key measurements taken during training provide an indication of the achievable recall and precision rates. Validation steps, taken after training is completed and automated coding is applied, estimate, within a defined measure of certainty, the precision and recall actually achieved by the process. In short, recall indicates what share of the responsive documents in the population have been identified; precision indicates what share of the documents predicted as responsive are in fact responsive. These measurements, especially final validation, tangibly demonstrate the effectiveness of the process and bolster defensibility.
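The recall and precision definitions above can be made concrete with a short calculation. What follows is only a minimal sketch in Python, not the procedure of any particular Predictive Coding product: the function names, the document counts and the choice of a Wilson score interval at roughly 95 percent confidence are all assumptions made for illustration. It assumes that lawyers have coded a validation sample, so that the number of correct and incorrect machine predictions in that sample is known.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from counts in a human-coded validation sample.

    true_pos:  predicted responsive and confirmed responsive by reviewers
    false_pos: predicted responsive but coded unresponsive by reviewers
    false_neg: predicted unresponsive but coded responsive by reviewers
    """
    predicted_responsive = true_pos + false_pos
    actually_responsive = true_pos + false_neg
    precision = true_pos / predicted_responsive if predicted_responsive else 0.0
    recall = true_pos / actually_responsive if actually_responsive else 0.0
    return precision, recall


def wilson_interval(successes, sample_size, z=1.96):
    """Wilson score interval for a sampled proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = successes / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical counts, invented purely for illustration.
precision, recall = precision_recall(true_pos=340, false_pos=60, false_neg=85)

# Uncertainty around the precision estimate: 340 of 400 sampled
# "predicted responsive" documents were confirmed responsive.
low, high = wilson_interval(successes=340, sample_size=400)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"precision interval: {low:.3f} to {high:.3f}")
```

With these invented counts, precision is 340 of 400 and recall is 340 of 425. The interval shows why sample size matters: a larger validation sample narrows the range within which the true rate is likely to fall, which is the sense in which the results hold "within a defined measure of certainty."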
Although the reasons to use Predictive Coding are compelling, including its ability to drastically reduce the overall cost of review, it would be wrong to imply that Predictive Coding is a perfect science. There are document populations that are not well suited to Predictive Coding, either in whole or in part, and distinguishing good candidate projects from bad can be difficult. Even with a strong candidate project, squeezing the most value from a Predictive Coding project requires the support of a team well versed in Machine Learning, adept with statistics, and able to communicate the process, statistical results and practical options effectively. In its present state, Predictive Coding is not a wizard-based “set it and forget it” solution. In fact, in the hands of inexperienced operators, Predictive Coding tools are dangerous. It is imperative that buyers select a proven technology administered by a capable provider.
Yet, even with these potential pitfalls, Predictive Coding’s immediate value proposition and enormous potential all but guarantee that it is here to stay. Predictive Coding can help level the playing field between companies of different sizes. And it can restore sanity to discovery practices lost in the mist between guesswork and analysis.