Thursday, March 24, 2011

Predictive Coding: Angel or Demon?

Everywhere you turn, it seems, someone is opining that predictive coding is either a blessing or a potential risk. Like Early Case Assessment (ECA), the definition of predictive coding changes depending upon the commentator or audience. As a result, like ECA, predictive coding is misunderstood and increasingly misused. Some think it is pure technology. Others think of it as a complex workflow. Predictive coding really should be a combination of the two. So, what is predictive coding? A recent Forbes Law & Technology post, “EDiscovery and the Rise of Predictive Coding” by Ben Kerschberg, quotes a recent Law.com webinar:
"According to a highly informative webinar presented by Carpenter and Trenchard hosted by Law.com, predictive coding is defined by at least three defining traits. First, predictive coding leverages small samples to find other relevant documents. Second, it reduces the amount of non-relevant documents that attorneys must review and cull, leaving the reviewer to look at approximately five to 20 percent of any set of documents. And third, the results generated by predictive coding can be validated statistically.”
So, predictive coding is reviewing more substance, getting to the hot documents faster with less junk, and not having to review everything? Wait, you might ask, isn’t that the early case assessment and data reduction we have been doing for years? Yes, it most certainly is. The difference today is that software applications are beginning to build mathematical algorithms into their products that identify like content and streamline the organization of similar documents prior to review, not just duplicates or near duplicates. These algorithms can now seek out and tag documents that are similar in content based on concepts, as opposed to individual terms or strings of terms; in other words, documents with similar content that are not identical or nearly identical (a rough sketch of this idea follows the list below). Different software applications and service providers call this process by different names. A recent “eDiscovery Institute Survey on Predictive Coding” surveyed 11 companies, a mix of software providers and eDiscovery services offering predictive coding. On the topic of what to call this new animal, 8 of the 11 providers thought their own term described the process better than predictive coding:

• Prognostic Data Profiling
• Predictive Ranking
• Relevance Assessment
• Suggestive Coding
• Predictive Categorization
• Automatic Categorization
• “Propagated Coding” or “Replicated Coding”
• Automated Document Categorization
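
For the technically curious, here is a minimal sketch, in Python, of the kind of concept-style similarity scoring described above. To be clear, this is not any vendor’s actual algorithm: the documents and the similarity cutoff are made up, and real engines use far more sophisticated math (Lucene-style TF-IDF weighting is just the simplest member of the family). The point is only to show how documents can group together as “similar” without being identical or near-duplicates.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by its scarcity across the collection (smoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three made-up documents: the first two say much the same thing in
# different words; the third is unrelated.
docs = [
    "board approved the merger agreement at the quarterly meeting",
    "the merger agreement was approved by the board yesterday",
    "cafeteria lunch menu for next week",
]
vecs = tfidf_vectors(docs)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine(vecs[i], vecs[j])
        label = "similar" if sim > 0.3 else "not similar"  # illustrative cutoff
        print(f"doc {i} vs doc {j}: {sim:.2f} -> {label}")
```

The first two documents score as similar even though no duplicate-detection hash would ever match them, which is the whole point of concept-based grouping.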

Yikes! Just what the market needs: more confusion. As you read this survey and each provider’s description of its version of “predictive coding,” you soon learn that, like ECA, the term is being applied broadly across what are essentially data reduction techniques. You also begin to realize that “predictive coding,” like ECA, is really NOT new. Rather, it is another marketing buzzword created to describe a new spin on an old process that is now (in some cases) being automated by technology. In fact, one service provider responded that it has been delivering predictive coding services since 2003 via the Attenex application (now owned by FTI). Wait, that means FCS (the company I work for) has been providing predictive coding services since 2002, since we were the second Attenex partner! One software company, Recommind, would not describe the basis for its technology, saying:
All software, processes and workflow are the proprietary intellectual property of Recommind and cannot, therefore, be disclosed.
And these guys wonder why lawyers are not falling all over their software. Trust me, they are not.
Interestingly enough, everyone else described their process or technology, some in great detail. What none of them tell you is that these software applications use what are essentially mathematical relevance-ranking algorithms that have been in use for decades in other industries for other purposes. When you conduct a Google search, for example, Google ranks content by relevance using its own proprietary ranking algorithm. The secret sauce of these relevancy-ranking applications, however, is usually based upon open source technology like Lucene. Yes, even Google started with some code written by someone else. Programmers use open source code whenever possible, primarily to avoid having to pay for a license. Equivio explains its secret sauce thus:
Equivio>Relevance enables organization of a document collection by relevance. Based on initial input from an attorney knowledgeable of the case, Equivio>Relevance uses statistical and self-learning techniques to calculate graduated relevance scores for each document in the data collection. As an expert-guided system, Equivio>Relevance works as follows: An expert reviews a sample of documents, ranking them as relevant or not. Based on the results, Equivio learns how to score documents for relevance. In an iterative, self-correcting process, Equivio feeds additional samples to the expert. These statistically generated samples allow Equivio>Relevance to progressively improve the accuracy of its relevance scoring. Once the sampling process has optimized, Equivio scores the entire collection, calculating a graduated relevance score for each document. The product includes a statistical model which monitors the software training process, ensuring validation and optimization of the sampling and training effort. 
What Equivio has done, as have others, is take a manual sampling process and automate the workflow. The devil, however, is in the details. Technology like this in the wrong hands, with the wrong workflow, can be very dangerous. Proper human-driven audits and documentation must be present, independent of what the software suggests is relevant. Relying upon a software-driven audit trail simply is not enough. You must have a defensible and repeatable workflow that leverages sound technology. If you find yourself in a spot where you have to defend the technology in court, you’re using the wrong workflow. As I have written here many times, technology is an organizational tool. How you organize your review is work product. As long as what you produce, or don’t produce, is based upon a process that ultimately arrives at decisions using “objective” criteria, like a transparent search term, you should not have to disclose how you arrived at your production. We did not do it in the days before technology, and we should not place ourselves in the position of having to do so now.
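For those curious what “automating the workflow” looks like under the hood, here is a minimal, hypothetical sketch of that expert-guided loop in Python. The crude term-overlap scorer, the keyword stand-in for the attorney, the batch sizes and the fixed five rounds are all illustrative assumptions, not Equivio’s implementation; real products use proper statistical models and stopping criteria.

```python
import random

random.seed(1)

def expert_review(doc):
    """Stand-in for the attorney's relevance call (a made-up keyword rule)."""
    return "merger" in doc or "acquisition" in doc

def score(doc, rel_terms, irr_terms):
    """Crude stand-in for the statistical model: terms seen in relevant
    training docs count for a document, terms from non-relevant docs against."""
    words = doc.split()
    return sum((w in rel_terms) - (w in irr_terms) for w in words) / len(words)

# A synthetic 500-document "collection" for illustration.
hot = "merger acquisition board approval price diligence".split()
noise = "lunch picnic schedule holiday newsletter parking".split()
collection = [
    " ".join(random.choices(hot if random.random() < 0.3 else noise, k=8))
    for _ in range(500)
]

labeled, unlabeled = {}, set(range(len(collection)))

# Round 0: the expert codes a small random seed sample.
for i in random.sample(sorted(unlabeled), 20):
    labeled[i] = expert_review(collection[i])
    unlabeled.discard(i)

# Iterative, self-correcting rounds: retrain, then feed the least certain
# documents (scores nearest zero) back to the expert.
for _ in range(5):
    rel = {w for i, r in labeled.items() if r for w in collection[i].split()}
    irr = {w for i, r in labeled.items() if not r for w in collection[i].split()}
    for i in sorted(unlabeled, key=lambda i: abs(score(collection[i], rel, irr)))[:20]:
        labeled[i] = expert_review(collection[i])
        unlabeled.discard(i)

# Final pass: a graduated relevance score for every uncoded document.
rel = {w for i, r in labeled.items() if r for w in collection[i].split()}
irr = {w for i, r in labeled.items() if not r for w in collection[i].split()}
ranking = sorted(unlabeled, key=lambda i: -score(collection[i], rel, irr))
print("Top of the predicted-relevant pile:", collection[ranking[0]])
```

Notice that the loop produces a ranked pile, not a production set; everything that follows in this post about audits and human review applies to what you do with that pile.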

There has been a lot of discussion about technology replacing human review. It has been suggested by some, including recently the New York Times, that one can even produce documents by reviewing samples and then producing the like files (identified by Equivio, for example) without further review! Interesting concept if you’re a software programmer, eDiscovery salesperson or consultant who has never worked the business end of a lawsuit. Those of us with actual trial experience know this is a dispute waiting to happen, or worse, an inadvertent production of privileged material.

Privileged material is being inadvertently produced with greater frequency. You’ve seen the cases. It is no coincidence that such instances have increased as the use of technology has become widespread. You have a clawback agreement, you say? Well, once that skunk is in the jury box, the stink is hard to remove. You have a privilege screen using email addresses and search terms? So did the parties in many of those famous privilege waiver cases, which lacked a defensible process that included actual review of what had been identified for production. And setting aside the great risk of inadvertently producing privileged documents, isn’t it important to review the documents you produce, if for no other reason than to know the facts of your case? What a novel idea: learning your case by reviewing relevant documents! Remember that old cliché: a lawyer never asks a question to which he or she does not know the answer. (Although I have seen trial lawyers ask those questions, and they made me squirm in my chair.) How can any trial lawyer worth his or her salt produce documents that have never been reviewed, having simply been tagged by a piece of software because they are similar to something that has been reviewed?

The right approach, and fortunately the approach most take, is to review BEFORE production those documents that, based on content, have the anatomy of documents already reviewed and classified by human reviewers. What is missing from many of these workflows, however, are audits: sampling review rounds that validate that no relevant document is left behind (a sketch of such a validation sample appears below). The goal is to increase the percentage of relevant documents being reviewed and reduce the number of irrelevant ones, thereby reducing the cost of review. In a recent matter, for example, more than 10 million documents were available for review. Using a simple “predictive coding” workflow, only 20% of those had to be reviewed. Of those reviewed, almost 70% were relevant, compared to the usual 10-15% at best when reviewing everything. All of this was accomplished without software that tags automatically using a predictive coding algorithm; rather, the workflow utilized sampling, analytics, concepts, conversation threads and, finally, human-driven review audits. The process is understood and driven by lawyers, not programmers.

I don’t want to send the wrong message, however. Predictive coding is not a demon. This commentator is encouraged and excited by the advancements in technology. It is very beneficial to our ECA and data reduction process to have portions of these workflows automated.
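As an illustration of the audit rounds described above, here is a minimal sketch of a validation sample over the culled documents: draw a random sample from what the workflow set aside as irrelevant, have humans review it, and estimate the rate of relevant documents left behind, with a confidence interval. The pile size, sample size, review result and normal-approximation interval below are made-up numbers for illustration only, not a prescription.

```python
import math
import random

random.seed(7)

# Assume the workflow culled 8,000,000 documents as not relevant.
discard_pile_size = 8_000_000

# Draw a random sample of the discards for human review. 2,401 gives a
# 95% confidence level with a +/- 2% worst-case margin of error
# (n = z^2 * 0.25 / e^2 with z = 1.96, e = 0.02).
sample_size = 2_401
sample_ids = random.sample(range(discard_pile_size), sample_size)

# Suppose the human reviewers find this many relevant documents in the
# sample (a made-up result for illustration).
relevant_found = 12

# Estimate the fraction of relevant documents left in the discards,
# with a normal-approximation 95% confidence interval.
p = relevant_found / sample_size
half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
low, high = max(0.0, p - half_width), p + half_width

print(f"Sampled {len(sample_ids)} of {discard_pile_size:,} discarded documents")
print(f"Estimated rate of missed relevant docs: {p:.2%} (95% CI {low:.2%} to {high:.2%})")
print(f"Relevant documents likely left behind: roughly {int(p * discard_pile_size):,}")
```

If that estimate comes back too high for comfort, the answer is another review round, not a bigger claim about the software.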
However, it is discouraging that some are applying this great technology in the wrong way and placing too much reliance upon technology that, frankly, few outside the programmers using the same basic open source algorithms understand. If you’re using “predictive coding” technology and workflows, be certain you follow a tried and true process that does not place too much reliance on technology you will never understand. Be careful out there!

1 comment:

Roy's Common Sense said...

Mark - This is an awesome blog!

It clearly speaks to the subject. As we all know, the market confusion and "vendor-created mystery" around this synthetic intelligence (SI) technology often lead to its misuse, rather than an organization deploying the technology properly to solve a challenging discovery concern.

I love to explore what the technology offers; however, as one who has worked the business end of a lawsuit, there still is no substitute for legal professionals adding gray matter to the discovery population.

Great Blog!

