Thursday, November 10, 2016


Why the Pundits Failed to Predict Trump

Contributing bloggers:  Susan Kavanagh and Mark Walker

Democrats and Republicans alike are looking back at this election and asking, “How did we miss this?”  Democrats took great pride in their “analytics,” and much of how Hillary positioned herself to the public was based upon polling and analysis of polling.  Obviously, the polling was not accurate, so the entire Democratic strategy built on those analytics was, well, way off base.  In the Trump camp, Donald J. simply ignored the polling and went with his gut.  Turns out Trump’s gut was right and the Democratic pundits were wrong, really wrong!  Here’s what happened.
To understand at a very high level how this technology works, some background is necessary.  In our world of litigation, we use analytical algorithms that were originally developed for political polling.  In fact, the FBI used the very same analytics we use every day to analyze email in connection with the Clinton email investigation.  That field is called “eDiscovery.”
Those of us in the eDiscovery industry have struggled for years to tell family or friends “not in the business” what it is we do for a living.  Our elevator pitch for our customers, mostly attorneys, goes something like this: “We help pull ‘Electronically Stored Information’ (ESI) from your client’s servers, process it, analyze it and then use analytics so you can decide what to produce, or, if you are the receiving party, decipher what is important.”  In truth, most lawyers know what eDiscovery is only at a very high level, and most don’t really want to know how the sausage is made.  They are simply interested in the result.
Explaining eDiscovery to family and friends is an entirely different matter.  Sometimes I just say I do data forensics with stuff like email.  “You do what with email?” they ask.  Then the explanation goes something like this: “We collect electronic data from corporations, process that information and then, using ‘high-tech,’ help the lawyers find what’s important in what is usually a great sea of information.”  Most folks don’t really have any point of reference, because they almost never deal with lawyers and certainly don’t have anyone looking at their email (or at least they think they don’t).
Enter the Hillary Clinton email scandal.  Now we have a point of reference that pretty much everyone has heard of, and it directly involves eDiscovery identical to what I do day in and day out.  At some point, someone – likely a service provider like the company I work for – collected Hillary’s email from her private email server.  That service provider then processed the email.  When ESI is processed, information about each file is extracted: a great deal of metadata beyond just the text of the email.
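For the technically curious, here is a minimal sketch of the kind of metadata extraction that happens during processing, using nothing but Python’s standard email library.  The file name is just a placeholder, and commercial processing tools extract far more than this:

# A sketch of email metadata extraction using only Python's
# standard library.  "message.eml" is a placeholder file name.
from email import policy
from email.parser import BytesParser

with open("message.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# Header metadata captured alongside the body text during processing.
metadata = {
    "from": msg["From"],
    "to": msg["To"],
    "cc": msg["Cc"],
    "subject": msg["Subject"],
    "date": msg["Date"],
    "message_id": msg["Message-ID"],
}

# The plain-text body, which later feeds the text analytics.
body = msg.get_body(preferencelist=("plain",))
text = body.get_content() if body is not None else ""

print(metadata)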
So, how is this relevant to the email scandal and the new information from the FBI?  Reports vary, but Hillary appears to have produced approximately 30,000 emails to the FBI.  The FBI apparently reviewed those and decided not to recommend prosecution to the DOJ.  Then, on Friday, October 28, the FBI announced that it had found more email (roughly 650,000 messages) that “might be relevant” to the Clinton email investigation, and that it “needed time” to sort it all out.  This email was found on the laptop of a Hillary Clinton adviser in connection with an unrelated matter, which we won’t give any new press here, as it is of no consequence.  Of particular interest is whether any emails on that laptop are relevant to the Clinton investigation and, of those that are, whether they are new or just duplicates of what has already been reviewed.
The FBI initially indicated that it might take months to review the data, but those of us who live in the eDiscovery world know that this is a small universe of information that, with the appropriate technology, should take only about 24 hours to analyze and review.  As it turns out, it took the FBI only a few days to determine that there simply wasn’t anything there that changed the recommendation not to prosecute.  The FBI didn’t read 650,000 emails one by one.  They simply applied analytics.
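How can 650,000 emails shrink to a reviewable set in a few days?  A large share of any email collection is duplicates, and deduplication is a standard first analytics pass.  Here is a minimal sketch of hash-based deduplication in Python; the choice of fields to hash is illustrative, and we obviously don’t know exactly which tools the FBI used:

import hashlib

def email_fingerprint(sender, recipients, subject, body):
    # Hash normalized fields so identical messages collapse to one
    # key.  The fields chosen here are illustrative; real tools vary.
    normalized = "|".join([
        sender.strip().lower(),
        ",".join(sorted(r.strip().lower() for r in recipients)),
        subject.strip().lower(),
        " ".join(body.split()),  # collapse whitespace differences
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(emails):
    # Keep only the first copy of each unique message.
    seen, unique = set(), []
    for e in emails:
        key = email_fingerprint(e["from"], e["to"], e["subject"], e["body"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

Against a collection where most messages are copies of mail already reviewed, a pass like this alone can eliminate the bulk of the 650,000 before any human looks at anything.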
With that oversimplified explanation of how analytics are used in litigation, what does this tell us about how the pundits got it wrong?  The error lies in how the analytics were used, not in the underlying algorithms.  The algorithms are based upon proven mathematical science that has been used for decades: provide the technology with the right inputs and you get the right answer.  Just as “garbage in equals garbage out,” bad input equals a bad result.  When we first began using analytics of the kind the FBI used during its email investigation, some of us who grew up in the legal world (as opposed to the technical world) began asking questions about how the math was being applied – specifically, about sample sizes.  Are we drawing the appropriate samples to let the technology “learn” and model whatever it is we are trying to find?
ESI Advantage wrote about this problem in May 2012 – “Are your samples Random? Are you just getting random results?” 
The problem is very simple in both the legal world and the world of political polling: it is very easy to get your inputs wrong.  The trouble lies in both the sample size and the actual composition of the sample.  As explained more fully in Ralph Losey’s work and in many posts on ESIAdvantage, the problem is how the sampling is performed.  In political polling, pollsters randomly select potential voters to call and ask a short list of questions that require a “Yes,” “No,” or “Undecided” answer.  The math tells the pollster how large the sample needs to be to meet a specified margin of error.  Pollsters also collect information about the profiles of those interviewed, such as race, religion, and how they have voted in the past.  That information is fed into the technology, and a model is created that should predict who is leading the polls and the demographics of voters likely to vote for one candidate or the other.  Those analytics then help the campaign decide how to message issues directly to the groups that were polled.
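To make that modeling step concrete, here is a toy sketch in Python of post-stratification weighting, one common way pollsters adjust raw answers so each demographic group counts in proportion to its share of the expected electorate.  Every number and group name below is invented for illustration:

# Toy post-stratification: reweight poll responses so each group
# counts in proportion to its share of the expected electorate.
population_share = {"urban": 0.30, "suburban": 0.50, "rural": 0.20}

# (group, supports_candidate_A) for each respondent in the raw sample.
responses = [("urban", True)] * 40 + [("suburban", True)] * 25 + \
            [("suburban", False)] * 20 + [("rural", False)] * 15

sample_share = {g: sum(1 for r, _ in responses if r == g) / len(responses)
                for g in population_share}

# Weight each respondent by how under- or over-represented
# their demographic group is in the raw sample.
weighted_support = sum(
    population_share[g] / sample_share[g]
    for g, supports in responses if supports
)
total_weight = sum(population_share[g] / sample_share[g]
                   for g, _ in responses)

# Raw support is 65%; weighting pulls it down to about 58%.
print(f"Weighted support for A: {weighted_support / total_weight:.1%}")

Notice how sensitive the output is to the assumed population shares: if the pollster’s picture of who will actually turn out is wrong, the weighting faithfully amplifies that mistake.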
The sample size calculated to meet a specified margin of error – say +/-3% – assumes that a large percentage of those sampled, roughly 20% to 50%, will answer Yes or No, with the rest categorized as “Undecided.”  The sample in political polling is usually a few thousand people standing in for many millions, so its composition is critically important: all downstream analytics are based upon those answers.  Again, a bad sample equals a bad result.
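The arithmetic behind that margin of error is short.  A common formula for the required sample size is n = z^2 * p(1-p) / e^2, which assumes the worst case of an even split (p = 0.5).  A quick sketch:

import math

def sample_size(margin_of_error, z=1.96, p=0.5):
    # Classic formula n = z^2 * p * (1 - p) / e^2.
    # z = 1.96 corresponds to 95% confidence; p = 0.5 is the
    # worst case and yields the largest required sample.
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# For +/-3% at 95% confidence, about 1,068 respondents suffice,
# and for large populations that number barely depends on whether
# the population is one million or one hundred million.
print(sample_size(0.03))  # 1068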
In the litigation world, we are also usually sampling from many millions of “documents,” and in the early days some of us immediately saw the problem: the sampling math doesn’t work.  Why?  The standard sample-size calculation assumes a high “richness” of positive answers, but in litigation a purely random sample typically has only about a 1% chance of turning up a relevant document.  With richness that low, the sample-size calculation breaks down and the sample contains very few relevant documents.  The technology then has a much harder time determining what is relevant in the population as a whole, because the algorithm makes its predictions from the textual content and concepts in the documents it has sampled.

The legal experts spent literally years debating this problem with the technology experts who design the tools and tweak the algorithms to fit our needs.  Many of us argued early on that we needed to “stack the deck” and raise the proportion of relevant documents in our sample so that the math would work; many technologists countered that this does not fit how the technology is designed.  The approach we recommended is now referred to as “judgmental” sampling: we select known relevant and irrelevant examples, and very often use search terms to raise the concentration of relevant documents in the sample.  This lets us sample the right number of documents to “train” the technology on a relevant-document profile so that the analytics work.  Today, the debate over “judgmental” versus “statistically random” sampling is over, with just a few technologists still holding out, largely because they are still peddling outdated technology and methods.
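The low-richness problem is easy to demonstrate with a simulation.  The sketch below compares a purely random sample against a “judgmental,” keyword-seeded one at 1% richness; the hit rate of the imagined search term is invented:

import random

random.seed(42)

# One million "documents"; 1% are relevant (richness = 0.01).
N, RICHNESS = 1_000_000, 0.01
docs = [{"id": i, "relevant": random.random() < RICHNESS}
        for i in range(N)]

# Purely random sample: at 1% richness, only about 15 of 1,500
# sampled documents are relevant -- far too few positive examples
# to train a classifier.
random_sample = random.sample(docs, 1_500)
print(sum(d["relevant"] for d in random_sample),
      "relevant of 1,500 randomly sampled")

# Judgmental sample: pretend a search term hits a slice of the
# population in which relevance runs around 40% (invented number).
# Sampling from those hits yields hundreds of positive training
# examples instead of a handful.
hits = [d for d in docs if d["relevant"] or random.random() < 0.015]
judgmental_sample = random.sample(hits, 1_500)
print(sum(d["relevant"] for d in judgmental_sample),
      "relevant of 1,500 judgmentally sampled")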
Of course, you rely on the answers you get during sampling to be truthful and accurate, and you assume they don’t change until you take a new sample.  In practice, though, the person reviewing the sample, usually a lawyer at a very high hourly billing rate, changes his or her mind about what is relevant while learning from the review.  What is relevant can and sometimes does change dramatically.  If we use what’s called an “active learning” approach, the technology adjusts the kinds of samples presented to the lawyers with each new sample reviewed, deciding what it needs to learn next based on how those relevance calls are shifting.
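In code, an “active learning” loop can be sketched in a few lines: train on whatever the lawyer has coded so far, then surface the documents the model is least certain about for the next round of review.  A minimal version, assuming scikit-learn is available and glossing over real-world feature engineering:

# Minimal uncertainty-based active learning sketch.  labels holds
# 1 (relevant), 0 (not relevant), or None (not yet reviewed).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def next_batch(texts, labels, batch_size=10):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    labeled = [i for i, y in enumerate(labels) if y is not None]
    unlabeled = [i for i, y in enumerate(labels) if y is None]
    # Requires at least one relevant and one irrelevant example so far.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], [labels[i] for i in labeled])
    # Uncertainty sampling: pick documents whose predicted probability
    # of relevance is closest to 0.5, where the model learns the most.
    probabilities = model.predict_proba(X[unlabeled])[:, 1]
    order = np.argsort(np.abs(probabilities - 0.5))
    return [unlabeled[i] for i in order[:batch_size]]

Because each round of lawyer review retrains the model, the system keeps up with a reviewer whose sense of relevance is evolving, which is exactly the problem described above.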
So, what went wrong with how the pundits projected the election?  As it turns out, there were many more “undecided” voters than the pollsters predicted, and the wrong profiles were being sampled.  Many of those polled either didn’t reveal that they were going to vote for Trump or simply changed their minds.  As we did in the early days of using analytics in litigation, the pollsters used the wrong methodology, not the wrong technology.
In litigation, most of us have resolved the problem.  We were driven to do so sooner rather than later: a miss costs our clients millions of dollars in review costs, because too many documents deemed relevant get sent to lawyers who bill by the hour.  We have to use the latest technology, and we do not have the luxury of being wrong in litigation.  More importantly, we have to use the right methodology.
In summary, the Democrats were using the wrong consultants and the wrong approach.  In the months and years to come, we hope to see vast improvements in how political polling is conducted.