In 2011, I published an article titled “Slashing Discovery Budgets with Analytics.” At that time, we had been working with text analytics for several years. While I was personally convinced that the classifier technologies at Predictive Coding’s core were very likely the most beneficial text-based tools for document review, we (both the industry and iControl ESI) were still getting our arms around the potential.
Earlier this year, iControl ESI released a proprietary predictive coding tool set backed by our process consulting services. With our offering enjoying rapid adoption, it is time for me to revisit my 2011 article with a specific focus on Predictive Coding. This time, I’ll use a series of questions and answers to address the topic.
Is predictive coding a good fit for your matter?
Do you want to:
- Identify potentially responsive documents quickly?
- Minimize review of non-responsive documents?
- Minimize review cost and time?
Assuming you have more than a few thousand documents in your population and you answer these questions in the affirmative, predictive coding is likely a good approach for your review. The types of documents in your population may also impact this decision. More on that later.
Does predictive coding replace attorney review?
No… and yes. In a traditional linear review, an attorney who is an expert in the matter trains a group of contract attorneys or junior associates so that they can churn through the documents for the weeks or months it takes to complete the review. Any predictive coding process still needs that expert attorney to review documents and “train” the system. It is the weeks and months of churning through the linear review that predictive coding minimizes.
Our goal in applying predictive coding is to achieve a review result that matches what an expert reviewer would do, if she had the time to review the entire document population AND could remain alert and effective enough to make consistent review decisions throughout the process…and to accomplish this at a fraction of the cost, in time and money. An effective predictive coding process eliminates nearly all attorney review of non-responsive documents.
How many documents does it take to adequately train the system?
While the actual answer is “it depends,” experience tells us that for the majority of reviews, the necessary training population will likely fall in the 2,500 to 5,500 document range. Generally speaking, a larger training set will lead to better classification, but we’ve found that training sets in this range allow the protocol to clear the necessary hurdles, in most cases, for a demonstrably effective review/production.
How do we know predictive coding works?
In the most common example, predictive coding helps us find responsive documents in a population that is not fully responsive. We measure the process’s effectiveness primarily on two criteria:
- Recall – The fraction of the responsive documents that the process correctly identifies
- Precision – The fraction of the identified documents that truly are responsive
For example, let’s assume we have a population of 1 million documents that contains 200K responsive documents. Furthermore, we’ll assume a predictive coding process identifies 210K documents as responsive. However, 20K of those identified are false positives, so the process found only 190K of the actual responsive documents.
In this example, our review Recall is 95% (190K/200K), and Precision is 90.5% (190K/210K).
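The arithmetic above can be sketched in a few lines of Python. The figures are the article’s hypothetical example, not data from a real matter:

```python
# Recall and precision for the hypothetical review above:
# 1M documents, 200K truly responsive, 210K identified by the
# process, 20K of those identified being false positives.

truly_responsive = 200_000
identified = 210_000
false_positives = 20_000

true_positives = identified - false_positives  # 190,000 correctly identified

recall = true_positives / truly_responsive  # share of responsive docs found
precision = true_positives / identified     # share of identified docs truly responsive

print(f"Recall:    {recall:.1%}")     # Recall:    95.0%
print(f"Precision: {precision:.1%}")  # Precision: 90.5%
```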
In truth, we can use these measures to evaluate the effectiveness of any review/production process, including a traditional linear review.
So, how do we measure Recall and Precision, without examining every document in the population?
Statistics! Don’t let that scare you. The statistics involved are no more complicated than those behind every election or approval-rating poll. With surprisingly small sample sizes, we can determine, with a very high confidence level and low margin of error, the Recall and Precision achieved by the predictive coding process.
The expert usually reviews an additional 500 to 1,000 randomly selected documents as input to this measurement. For most reviews, that number of evaluation documents provides an acceptable confidence level that the results we find in the evaluation will match the results for the entire population.
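For readers curious about the underlying math, here is a minimal sketch of the standard normal-approximation margin-of-error calculation for a sampled proportion. The 95% confidence level (z = 1.96) and the worst-case proportion of 0.5 are illustrative assumptions, not parameters of any specific review protocol:

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Normal-approximation margin of error for an estimated proportion.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is the
    worst case, giving the widest possible interval.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Sample sizes in the range the article mentions:
for n in (500, 1000):
    print(f"n={n}: ±{margin_of_error(n):.1%}")
# n=500: ±4.4%
# n=1000: ±3.1%
```

This is why a sample of roughly 500 to 1,000 documents is enough: the margin of error shrinks with the square root of the sample size, largely independent of how big the overall population is.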
How long does predictive coding take?
While a traditional human document review can take many months to complete, a predictive coding process, from training through results and evaluation, can be completed in two to three weeks, even on a population with millions of documents. Of course, opting to conduct further evaluation review and/or review of all “responsive” documents can increase this time-frame.
What is predictive coding good for?
The most common application is simple responsiveness designations. It can also be very helpful in the classification of documents according to matter issues (Issue Coding).
Our experience tells us that predictive coding is not (at least not yet) useful for privilege or confidentiality designations.
Is predictive coding defensible?
Case law is still evolving. Judge Andrew Peck of the United States District Court for the Southern District of New York has stated that computer assisted review is judicially approved for use in appropriate cases. Other courts have also weighed in, asking the parties to cooperate in establishing a predictive coding protocol.
Additionally, several cases have outlined the protocols that the parties used in those particular matters. As the courts have accepted these processes, they now serve as court approved exemplars against which we can measure the protocol we use in other cases.
Why is iControl ESI’s predictive coding better?
It has the flexibility to adapt.
There are several publicly available machine-learning algorithms developed specifically for data classification, and each has input settings that further affect results. Other solutions typically use a single “one-size-fits-all” approach. While that works fine for some cases, it simply cannot be optimal for all of them. It is not possible to know, in advance, which approach will work best.
The iControl ESI predictive coding protocol allows our qualified machine learning experts and statisticians to select the appropriate algorithm and fine tune it against your specific data set and training results.
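As an illustration of what per-matter algorithm selection can look like, here is a generic sketch using scikit-learn. This is not iControl ESI’s proprietary protocol; the candidate classifiers, the F1 scoring choice, and the `train_texts`/`train_labels` names are all assumptions made for the example:

```python
# Generic sketch: score several candidate text classifiers on the
# expert-coded training set via cross-validation, keep the best one.
# (Hypothetical example only, not any vendor's actual protocol.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def pick_classifier(train_texts, train_labels):
    """Return the name of the best-scoring candidate and all scores."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, clf in candidates.items():
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        # 5-fold cross-validated F1 balances recall against precision.
        scores[name] = cross_val_score(
            pipeline, train_texts, train_labels, cv=5, scoring="f1"
        ).mean()
    best = max(scores, key=scores.get)
    return best, scores
```

In practice, the settings of each candidate would be tuned as well, and the scoring metric chosen to reflect the review’s goals (for instance, weighting recall more heavily when completeness of production matters most).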
What types of datasets can predictive coding classify effectively?
The available text is the key, and more text is generally better. Your typical populations of office documents, email, attachments, etc. lend themselves well to a predictive coding process.
Largely numeric documents (like some spreadsheets), images, and very small documents (instant messages, etc.) are riskier; you should consider different or additional steps for them. We can certainly help with those too.
On a side note, we’re working with some exciting new technologies that will potentially reduce this absolute reliance on text, but that is another topic for another day. Stay tuned!
At the end of the day, iControl ESI’s approach to predictive coding can help you:
- Understand the process and its output – making it feel like less of a “black box”
- Find the important discovery documents quickly – before the cost of a lengthy document review
- Identify the specifics of your data that may require additional attention – and help you solve for them
- Coordinate predictive coding with the other technologies you will want, or need, to use