Thursday, November 10, 2016


Why the Pundits Failed to Predict Trump

Contributing bloggers:  Susan Kavanagh and Mark Walker

Democrats and Republicans alike are looking back at this election and asking, “How did we miss this?”  Democrats took great pride in their “analytics,” and much of how Hillary positioned herself to the public was based upon polling and analysis of that polling.  Obviously, the polling was not accurate, so the entire Democratic strategy built on those analytics was, well, way off base. In the Trump camp, Donald J. simply ignored the polling and went with his gut.  Turns out Trump’s gut was right and the Democratic pundits were wrong -- really wrong!  Here’s what happened.
To understand, at a very high level, how this technology works, some background is necessary.  In our world of litigation, we use analytical algorithms that were originally developed for political polling.  In fact, the FBI used the very analytics we use here every day to analyze email in connection with the Clinton email investigation.  That practice is called “eDiscovery.”
Those of us in the eDiscovery industry have struggled for years with telling family or friends “not in the business” what it is that we do for a living.  Our elevator pitch for our customers, mostly attorneys, goes something like this: “We help pull ‘Electronically Stored Information’ (ESI) from your client’s servers, process it, analyze it, and then use analytics so you can decide what to produce, or, if you are the receiving party, decipher what is important.”  Well, actually, most lawyers already know what eDiscovery is, albeit at a very high level, and most don’t really want to know how the sausage is made.  They are simply interested in the result.
Explaining to family and friends what eDiscovery is, is an entirely different matter.  Sometimes I just say I do data forensics with stuff like email.  “You do what with email?” they ask. Then the explanation goes something like this: “We collect electronic data from corporations, process that information, and then, using ‘high-tech’ tools, we help the lawyers find what’s important in what is usually a great sea of information.” Most folks don’t really have any point of reference because they almost never deal with lawyers and certainly don’t have anyone looking at their email (or at least they think they don’t).
Enter the Hillary Clinton email scandal.  Now we have a point of reference that pretty much everyone has heard of and directly involves eDiscovery, identical to what I do day in and day out.  At some point, someone – likely a service provider like the company I work for – collected Hillary’s email from her private email server.  That service provider then processed the email.  When ESI is processed, information about that data is extracted from those files.  A great deal of information is extracted beyond just the text of the email.
So, how is this relevant to the email scandal and the new information from the FBI?  Reports vary, but Hillary appears to have produced approximately 30,000 emails to the FBI.  The FBI apparently reviewed those and decided not to recommend prosecution to the DOJ.  Then, on Friday, October 28, the FBI announced that it had found more email (roughly 650,000 messages) that “might be relevant” to the Clinton email investigation, and that it “needed time” to sort it all out.  This email was found on the laptop of a Hillary Clinton adviser in connection with an unrelated matter.  We won’t give that unrelated matter any new press here, as it is of no consequence. Of particular interest is whether there are any new emails on that laptop that are relevant to the Clinton investigation, and, of those that are relevant, whether they are new or just duplicates.
The FBI initially indicated that it might take months to review the data, but those of us who live in the eDiscovery world know that this is a small universe of information and, with the appropriate technology, should take only about 24 hours to analyze and review. As it turns out, it took the FBI only a few days to determine that there simply wasn’t anything there that changed the recommendation not to prosecute.  The FBI didn’t review 650,000 emails one by one.  They simply applied analytics.
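For readers who want to see what “applying analytics” can look like in practice, here is a minimal sketch of hash-based de-duplication, one of the first passes that shrinks a collection like that 650,000-email laptop set. It is only an illustration under assumed message fields, not the FBI’s actual workflow or any particular vendor’s tool.

# Hypothetical sketch: hash-based de-duplication of a newly found email collection
# against a set of already-reviewed messages. The field names and normalization
# rules below are illustrative assumptions, not anyone's actual processing spec.
import hashlib

def email_fingerprint(msg):
    """Build a stable fingerprint from normalized metadata and body text."""
    key_fields = (
        msg.get("from", "").strip().lower(),
        msg.get("to", "").strip().lower(),
        msg.get("subject", "").strip().lower(),
        msg.get("sent", ""),                     # timestamp string as extracted
        " ".join(msg.get("body", "").split()),   # collapse whitespace in the body
    )
    return hashlib.sha256("|".join(key_fields).encode("utf-8")).hexdigest()

def find_new_emails(new_collection, already_reviewed):
    """Return only messages whose fingerprint has not been seen before."""
    seen = {email_fingerprint(m) for m in already_reviewed}
    return [m for m in new_collection if email_fingerprint(m) not in seen]

Everything that drops out as a duplicate never needs another look, which is a big part of why a scary-sounding document count can collapse to a manageable one in days, not months.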
With that oversimplified explanation of how analytics are used to analyze information in litigation, how does any of this tell us that the pundits got it wrong? The error has to do with how those analytics were used, not with the underlying algorithms.  The algorithms are based upon proven mathematical science that has been used for decades: if you provide the technology with the right inputs, you get the right answer. Just as “garbage in equals garbage out,” bad input equals a bad result.  When we first began using analytics, as the FBI did during its email investigation, some of us who grew up in the legal world (as opposed to the technical world) began asking questions about how the math was being applied.  Specifically, some of our questions were about sample sizes.  Are we getting the appropriate samples that will let the technology “learn” and model whatever it is we are trying to find?
ESI Advantage wrote about this problem in May 2012 – “Are your samples Random? Are you just getting random results?” 
The problem is very simple in both the legal world and the world of political polling: it is very easy to get your inputs wrong.  The problem has to do with both the sample size and the actual nature of the sample.  As explained more fully in Ralph Losey’s work, and in the many posts on ESIAdvantage, the problem is with how sampling is being performed.  In political polling, pollsters randomly select potential voters to call and ask a very short list of questions that require a “Yes,” “No,” or “Undecided” answer.  The math tells the pollster, based upon the size of the population, how large the sample needs to be to meet a specified margin of error.  Pollsters also collect additional information about the profiles of those interviewed, such as race, religion, how they have voted in the past, and so on.  That information is input into the technology, and modeling is created that should predict who is leading the polls and the demographics of voters who are likely to vote for one candidate or the other. Those analytics help the campaigns decide how to message issues directly to voters like those who were polled.
The sample size needed to meet a specified margin of error (say, +/-3%) assumes that a large percentage of those sampled, roughly 20% to 50%, will answer Yes or No, with the remainder categorized as “Undecided.”  The composition of the sample is critically important.  The sample size in political polling is usually a few thousand out of many millions, so the nature of those sampled, and the answers they give, matter enormously.  All downstream analytics are based upon those answers.  Again, a bad sample equals a bad result.
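For the curious, here is the textbook sample-size math behind a +/-3% margin of error. The 95% confidence level and the worst-case 50/50 split used below are standard statistical assumptions for illustration, not figures from any particular pollster.

# Standard sample-size formula for estimating a proportion at a given margin of error.
# A textbook illustration, not any particular pollster's methodology.
import math

def required_sample_size(margin_of_error, confidence_z=1.96, p=0.5, population=None):
    """n = z^2 * p * (1 - p) / E^2, with an optional finite-population correction."""
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population:
        n = n / (1 + (n - 1) / population)   # finite-population correction
    return math.ceil(n)

# Roughly 1,068 respondents gets you +/-3% at 95% confidence,
# even when the voting population runs to many millions.
print(required_sample_size(0.03))
print(required_sample_size(0.03, population=120_000_000))

Notice that the math only promises an accurate estimate if the people sampled actually look like the people who will vote, and answer the way they will actually behave. That is exactly where things can go sideways.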
In the litigation world, we are also usually sampling from many millions of “documents.”  In the early days, some of us immediately saw the problem: the sampling math doesn’t work.  Why? A document drawn in a purely random sample had only about a 1% chance of being relevant, while the standard sample-size math assumes a much higher “richness” of positive answers.  In litigation, our richness is very low, so the sample size calculation doesn’t work and the likelihood of getting relevant documents into your sample is very low.  The technology therefore has a much harder time determining what is relevant in the population as a whole, because the algorithm makes those predictions based upon the textual content and concepts contained within the documents being sampled.  The legal experts have literally spent years debating this problem with the technology experts who design the tools and tweak the algorithms to fit our needs.  Many argued early on that we need to “stack the deck” and raise our ability to locate relevant documents in our sample so that the math will work.  Many technologists countered that this does not fit how the technology is designed.  The approach we recommended is now referred to as “judgmental” sampling: we select known relevant and irrelevant samples, and very often we use search terms that raise the proportion of relevant documents in the sample.  This allows us to sample the right number of documents to “train” the technology on a relevant document profile, so that the analytics work.  Today, the debate over “judgmental” vs. “statistically random” sampling is largely over, with just a few technologists still holding out, largely because they are still peddling outdated technology and methods.
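A quick back-of-the-envelope illustration of the richness problem: at low richness, even a “statistically valid” random sample hands the technology almost nothing to learn from, while a judgmental sample seeded with search-term hits delivers far more positive examples for the same review effort. The 1% and 30% richness figures below are assumptions for illustration.

# Why purely random sampling struggles at low richness: the expected number of
# relevant documents in a "big enough" random sample is tiny.
# Illustrative numbers only; 1% and 30% richness are assumptions.

def expected_relevant(sample_size, richness):
    return sample_size * richness

random_sample = 1_068
print(expected_relevant(random_sample, 0.01))   # ~11 relevant documents: very little to train on

# A judgmental sample seeded with search-term hits might run at 30% richness,
# giving the classifier hundreds of positive examples from the same review effort.
print(expected_relevant(random_sample, 0.30))   # ~320 relevant documents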
Of course, you rely on the answers you get during sampling to be truthful and accurate, and you assume they don’t change until you take a new sample.  The trouble is that the person reviewing the sample, usually a lawyer at a very high hourly billing rate, changes their mind about what is relevant as they learn from reviewing.  What is relevant can and sometimes does change dramatically.  If we are using what’s called an “active learning” approach, the technology adjusts the kinds of samples being presented to the lawyers with each new sample reviewed, choosing what it needs to learn next as the reviewers’ definition of relevance shifts.
So, what went wrong with how the pundits projected the election?  As it turns out, there were many more “undecided” voters than the pollsters predicted, and the wrong profiles were being sampled.  Many of those polled either didn’t reveal that they were going to vote for Trump, or they simply changed their minds.  As we did in the early days of using analytics in litigation, the pollsters used the wrong methodology, not the wrong technology.
In litigation, most of us have resolved the problem.  We were driven to do so sooner rather than later, because when we miss, it costs our clients millions of dollars in review costs: we end up sending too many documents deemed relevant to lawyers who are billing by the hour.  We have to use the latest technology; we do not have the luxury of being wrong in litigation.  More importantly, we have to use the right methodology.
In summary, the Democrats were using the wrong consultants, and the wrong approach.  In the months and years to come, we hope to see vast improvements in how political polling is conducted. 

Tuesday, July 12, 2016

TAR - Not Just For Big Data Volume Cases

The events of the last couple of weeks have given me a great real-life example to share with you regarding Technology Assisted Review (TAR).  These use-case anecdotes are right in line with our educational program on TAR this month.  It’s our duty to continue to educate ourselves on the technology available, and on the risks and benefits of its use, and below are two great examples demonstrating that TAR is valuable, delivering ROI, not only in big data volume cases, but in small ones as well.

The use of TAR and its work flows is now nearly common practice (and in fact almost mandatory in BIG data volume cases).  Indeed, in our shop, we just completed a large 8.5-million-record case where the lawyers reviewed only 6,000 documents (less than 1%) to achieve technology training stabilization.   What is stabilization?  Stabilization is the point where stability scores tell us that the technology has learned all it is likely going to learn from a sample review.  Because of how well TAR worked in that case, we measured over $1.4M in ACTUAL review cost savings just based upon the documents TAR indicated would not be relevant.  The vast majority of what was identified as relevant by this process, over 350,000 documents, was produced without review (a claw-back agreement was used to protect any privileged documents produced).  There were about 30,000 documents for priority custodians that had to be reviewed before production.  The legal team chose to review only what TAR determined was relevant.  Precision was measured at 77%.  What does that mean?  77% of what the TAR process deemed relevant was in fact relevant, as confirmed by human review.  This precision rate is very good, and the savings remarkable, right?
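For readers who like to see the arithmetic, here is what the precision measurement means in code, along with the general shape of the savings math. The confirmed/predicted counts and the idea of a per-document cost below are illustrative assumptions chosen only to mirror the 77% figure, not numbers taken from the matter itself.

# Precision: the share of documents TAR called relevant that human review confirmed.
# The counts below are illustrative, chosen only to reproduce a 77% precision rate.

def precision(confirmed_relevant, predicted_relevant):
    """Fraction of TAR-predicted relevant documents confirmed relevant by reviewers."""
    return confirmed_relevant / predicted_relevant

print(precision(23_100, 30_000))   # 0.77, i.e. 77% precision

def review_savings(docs_set_aside_by_tar, assumed_cost_per_doc):
    """Documents TAR scores as not relevant never go to hourly reviewers;
    savings scale with that count times whatever blended per-document cost applies."""
    return docs_set_aside_by_tar * assumed_cost_per_doc

The savings number in a real matter comes out of this second calculation: every document the technology reliably sets aside is a document nobody bills an hour against.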

Well, that wasn’t the only remarkable thing we learned about TAR this week.  I ran into a lawyer at an event a few weeks back and we exchanged greetings.  I gave him my business card and told him, “Call me if you ever need help with eDiscovery.”  A week later, my phone rang and the conversation began, “I have your card here, and I remember you said to call if I need help with this eDiscovery stuff.”  He needed help indeed, and fast.  He represented a client who had been sued over a trademark issue.  They were sitting on the receiving end of a motion-to-compel ruling that required them to collect, filter, review, and produce in less than two weeks.  The attorney had a 3-person staff to get the work done and knew that the normal approach would not meet the deadline, and an extension was not available.  He asked if I had any idea what he should do.  We were looking at what most shops would consider a small case, with one custodian, which traditionally means not a great number of documents.  The attorney was from a small firm, with limited resources, limited budget, and limited time.  I decided to advise that we treat this matter as if it were the 8.5-million-record case I talked about above, and use TAR and its work flows.   I am sharing the steps we took below.  Again, this feeds directly back to my opening paragraph: some lawyers today are not familiar with the technology, which is one of the primary drivers behind the amendment to the ABA Model Rules of Professional Conduct.  In those cases, we use a defined step-by-step process to educate and inform them about how the process works.

The upshot in this “small” case is that the deadline was met.  In fact, we were a day early.  Documents reviewed: 650.  Documents produced: 12,211.

Step 1:  Collect Data.  Ooooops – we discovered the custodian in this small case had much more data than expected -- more than 300 GB!  Finding more data was not conducive to meeting the tight deadline in a standard approach!
Step 2:  Filter out all the file types we do not want or need – the lawyer decided to focus on a few very specific file types.  Process and deduplicate.  Weed out whatever we can by other judgmental means.  The result: 210,000 documents remain.  OK, that is better than the original collection, but still way too many to review!

Step 3:  The lawyer indicated he wanted to try using search terms. The result?  28,000 documents came back as hitting the terms, which surprised the lawyer.  What surprised us even more was that it would take north of 250 hours to review those documents (the quick math behind that estimate is sketched below).  There was neither the time nor the money to follow that process.  What now?  Step 4!
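Where does an estimate like “north of 250 hours” come from? Simple arithmetic under an assumed review pace; the pace and the hourly rate below are assumptions for illustration, not figures from this matter.

# Back-of-the-envelope linear-review estimate. The review pace and hourly rate
# are assumptions for illustration only.
docs_hitting_terms = 28_000
docs_per_reviewer_hour = 100        # a brisk linear-review pace (assumed)
hourly_rate = 60                    # assumed contract-reviewer rate, USD

hours = docs_hitting_terms / docs_per_reviewer_hour
print(f"{hours:.0f} review hours, roughly ${hours * hourly_rate:,.0f} before QC")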
Step 4:  Enter TAR and EnvizeTM, our machine-learning tool with Active Learning.  We use the initial (completely untested) terms and run analytics on just the 28,000 documents hitting those terms. We create a few “judgmental” random samples and launch into review/training.  No control batch, because EnvizeTM doesn’t need one, at least not at this stage.

[Figure 1: EnvizeTM training stabilization, reached after 815 documents reviewed]

Step 5:  The terms are not bad – about 30% of the training documents reviewed in the first judgmental random sample were actually responsive.  That is about what we expect with untested terms and exactly what we hope for to train the technology: a good mix of relevant and not-relevant documents.  We created judgmental random samples to start and then used Active Learning to feed the reviewers what EnvizeTM said it needed to learn – that is the beauty of active learning.
Step 6:  Stabilization occurred very quickly. Figure 1 above shows the result after just 815 documents.  At this point, we switch to Continuous Active Learning (CAL) to feed the reviewers highly relevant content: the documents with the highest relevance scores.
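For the technically inclined, here is a rough sketch of what that switch from training batches to CAL batches looks like in generic terms. This is not the EnvizeTM implementation; it is only an illustration that assumes a fitted scikit-learn-style model exposing a decision_function, and a score_history holding full-population scores recorded after each training round.

# Generic sketch of the two review phases: uncertainty-based batches while the model
# is still learning, then highest-scoring documents first (CAL) once scores stabilize.
# Not the Envize implementation; names and thresholds are assumptions.
import numpy as np

def next_batch(model, X_unreviewed, ids_unreviewed, phase, batch_size=50):
    scores = model.decision_function(X_unreviewed)
    if phase == "training":
        order = np.argsort(np.abs(scores))   # closest to the decision boundary = most uncertain
    else:                                    # "cal" phase: most likely relevant first
        order = np.argsort(-scores)
    return [ids_unreviewed[i] for i in order[:batch_size]]

def stabilized(score_history, window=3, tolerance=0.02):
    """Treat the model as stable when full-population scores stop moving between rounds."""
    if len(score_history) < window + 1:
        return False
    recent = score_history[-(window + 1):]
    drift = [float(np.mean(np.abs(a - b))) for a, b in zip(recent, recent[1:])]
    return max(drift) < tolerance

The design idea is the one described above: spend reviewer time where the model is unsure while it is learning, then spend it on the likeliest-relevant documents once the learning curve flattens.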
Step 7:  After just a few hundred CAL documents are reviewed, the lawyers report that they are confident the technology has done its job and ask that we run a priv screen and produce.  We suggest QC and audits.  The lawyer says he is not looking for precision, just looking to make sure we are not missing anything, and doesn’t care if we are a bit over-inclusive.  We ultimately review a random sample of the “left behind” documents, just to make sure we are not missing anything.  We were not.

Step 8: DONE – everybody is happy.

Conclusion?  TAR has utility beyond big-data volume cases.  Almost any case of any size that has ESI can benefit from using machine learning technology and a sound TAR work flow. 

Want to learn more?  See the July Webinar replay here:
TAR:  A Peek Inside the Black Box.  




Wednesday, May 18, 2016

What's inside the Black Box?

A recent study reported that more than half of Fortune 1000 and American Lawyer 200 attorneys noted concern about effectively defending the results of predictive coding.  Predictive Coding, to many, is a black box. EnvizeTM is the latest from iControl ESI, publisher of Recenseo, and it will change the way you use Predictive Coding.  EnvizeTM allows you, the user, to see inside the black box and control the process yourself with a UI unlike any on the market. You won’t need a PhD to guide you. EnvizeTM is self-guided, and you can use Recenseo OR your existing review tool.




The iControl intellectual property utilized in this tool is not new.  iControl has been using the underlying technology in our own software for several years, only recently giving it a name and productizing the technology for use by anyone.   EnvizeTM utilizes either passive or active learning, allowing you to have visibility into exactly where you are in the process and how well the technology is learning which documents you think are important.  EnvizeTM is based on sound, scientifically scrutinized underlying technology that has been accepted by the academic community.  In 2015, the Computer Science department of Indiana and Purdue Universities co-authored and published an academic paper on iControl ESI’s methods.  This impressive academic paper, Batch-Mode Active Learning for Technology-Assisted Review*, was submitted to, accepted by, and presented at the IEEE Big Data 2015 Industry & Government Conference (Submission N216).  The underlying technology is the work of years of research and testing.  Below is the abstract of that academic paper.
"Abstract—In recent years, technology-assisted review (TAR) has become an increasingly important component of the document review process in litigation discovery. This is fueled largely by dramatic growth in data volumes that may be associated with many matters and investigations. Potential review populations frequently exceed several hundred thousands documents, and document counts in the millions are not uncommon. Budgetary and/or time constraints often make a once traditional linear review of these populations impractical, if not impossible—which made “predictive coding” the most discussed TAR approach in recent years. A key challenge in any predictive coding approach is striking the appropriate balance in training the system. The goal is to minimize the time that Subject Matter Experts spend in training the system, while making sure that they perform enough training to achieve acceptable classification performance over the entire review population. Recent research demonstrates that Support Vector Machines (SVM) perform very well in finding a compact, yet effective, training dataset in an iterative fashion using batch-mode active learning. However, this research is limited. Additionally, these efforts have not led to a principled approach for determining the stabilization of the active learning process. In this paper, we propose and compare several batchmode active learning methods which are integrated within SVM learning algorithm. We also propose methods for determining the stabilization of the active learning method. Experimental results on a set of large-scale, real-life legal document collections validate the superiority of our method over the existing methods for this task."

You don’t need a PhD behind the scenes working the levers.  EnvizeTM allows multiple sampling methods and easy setup.


EnvizeTM provides multiple ways to keep score, including our own EnvizeTM Score that tells you exactly where you stand at any given moment.

So, What Makes This Different?
  • Start Training Faster (with or without control set)
  • Finish Training Faster (with or without control set)
  • Better handling of rolling population changes
  • Envize Automated Project Analysis and Recommendations
  • Better performance measure
  • Better Review Quality Estimates
Software and Services Since 1999
To Learn More....