Sunday, May 13, 2012

Are your samples Random? Are you just getting random results?



Introduction


There has been much discussion of late about a variety of "emerging" technologies: predictive coding (err, um, technology assisted review, or TAR), statistical random sampling (SRS), and what some algorithm does or does not do, complete with panels of experts to explain this, that or the other. Entire days of testimony are being devoted to peeking behind the curtain of “predictive coding” algorithms. The Digital Advantage continues to ask, why all the fuss? Shouldn't we be focused on the merits, some say? Why all this math? After all, most lawyers will tell you they are lawyers today because of the supposed lack of math, but we digress.

It’s Just Math


Ralph Losey’s most recent treatise on random sampling is quite the read. Ralph predicts (trumpets sound)…

“….in the year 2022 a random sample polling of American lawyers will show that 20% of the lawyers in fact use random sampling in their legal practice. I make this prediction with an 95% confidence interval and an error rate of only 2%. I even predict how the growth will develop in a year by year basis, although my confidence in this detail is lower.” (Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022)
Ralph’s prediction, of course, is tongue-in-cheek. Well, sort of. The math behind sampling is serious. But consider that, unlike the broader world where statistical models are applied to studies or polls, in the document world the all-important baseline changes from one matter to the next. Put another way, statistical modeling is only as good as the information you feed into it and any assumptions that you apply. By way of example, Ralph’s conclusion that 300,000 lawyers will utilize random sampling by 2022 may be flawed.

“Assuming that by the year 2022 there are 1.5 Million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000.” (Id.)
  
Flawed, you say? The equations for determining sample size from known and desired factors, such as population size and tolerable margin of error, are well settled. No, the math is not questionable. It works. We have seen it over and over. Trust those math folks. Be careful with your inputs.


In terms of the numbers you selected above, the sample size n and margin of error E are given by
$$x = Z(c/100)^2 \, r(100 - r)$$
$$n = \frac{N x}{(N-1)E^2 + x}$$
$$E = \sqrt{\frac{(N - n)\,x}{n(N-1)}}$$
where N is the population size, r is the fraction of responses that you are interested in (expressed, like E, as a percentage), and Z(c/100) is the critical value for the confidence level c.
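For readers who would rather see the formulas run than read them, here is a minimal Python sketch. It is an illustration under stated assumptions, not the calculator's actual code: the function names are made up for this example, Z(c/100) is assumed to be the two-sided critical value of the standard normal distribution (about 1.96 at 95%), r defaults to the conservative 50%, and the result is rounded up to a whole document.

```python
import math
from statistics import NormalDist


def critical_value(confidence_pct: float) -> float:
    """Two-sided normal critical value Z(c/100) (assumed interpretation)."""
    return NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)


def sample_size(population: int, margin_pct: float,
                confidence_pct: float = 95.0, response_pct: float = 50.0) -> int:
    """Sample size n for population N, with E and r given in percent."""
    x = critical_value(confidence_pct) ** 2 * response_pct * (100.0 - response_pct)
    n = population * x / ((population - 1) * margin_pct ** 2 + x)
    return math.ceil(n)  # round up: you cannot review a fraction of a document


def margin_of_error(population: int, n: int,
                    confidence_pct: float = 95.0, response_pct: float = 50.0) -> float:
    """Margin of error E, in percent, achieved by a sample of size n."""
    x = critical_value(confidence_pct) ** 2 * response_pct * (100.0 - response_pct)
    return math.sqrt((population - n) * x / (n * (population - 1)))


if __name__ == "__main__":
    # 1.5 million is the population in the quoted prediction;
    # 500,000 is a purely hypothetical smaller population for comparison.
    for population in (1_500_000, 500_000):
        n = sample_size(population, margin_pct=2.0)
        print(population, n, round(margin_of_error(population, n), 3))
```

Under these assumptions, cutting the population to a third changes the required sample by only a handful of documents, which is why the choice of inputs deserves more attention than the arithmetic.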
Rather, the math only works if your inputs and assumptions are sound. Here, not all of the 1.5 million lawyers Ralph assumes will exist in 2022 are litigators. We don’t expect that real-estate and tax lawyers will be utilizing random sampling for document review. Those contract lawyers are unlikely to be interested in sampling as well. So the population Ralph starts with may be far smaller than 1.5 million. We have not audited Ralph’s results, though, and that is not really the point. Here, the result of overstating the population would be sampling more than you need, which is not necessarily a bad thing. The result would be better. But that simply reinforces Ralph’s overarching point, and ours here: it is not precise statistics that are important.

Statistical sampling is a tool among many other reinforcing tools. You don’t have to be a Ralph Losey type of lawyer and gain an understanding of statistical sampling (the underlying math), or hire an expert to explain it to a judge or jury. Sample size is important so that you gather enough inputs and those inputs carry the least risk that you will miss important information. The process used should measure objective information. Results should be validated and audited, so getting a precise sample size is not as important as using some rule of thumb that is repeatable. Statistical sampling is simply a method by which you organize the documents you will use to decide what to tell the machine. When you consider that less than 1% of all documents have any value at trial, reviewing everything simply is not possible, nor necessary, in virtually all cases.

“I saw one analysis that concluded that .0074% of the documents produced actually made their way onto the trial exhibit list - less than one document in ten thousand. And for all the thousands of appeals I’ve evaluated, email appears more rarely as relevant evidence.” DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader).

Follow a Simple Process

Unlike the use cases for which random sampling models were built, in the document review and production world, we are not shooting in the dark. While it is true that in almost all cases the relevant material is very small in proportion to the amount of material available for analysis, we have a pretty good idea what words and phrases appear in relevant documents, providing at least a start. The subject matter is known. Filter parameters can be narrowed by date, authors and recipients and any number of other known factors. In the old days – those days before technology – we just knew where to look. Today is no different except that we now have technology to help us. Technology helps us increase the odds in our favor. Audits will identify new terms, phrases and concepts for the technology to use to find new sources. Sampling is not so random.

It is becoming commonplace to agree upon and use search terms, often without any testing or validation of those terms whatsoever. Wouldn’t it be important to know with confidence, say with a 2% margin of error, that the term you chose would return relevant documents? Don’t you want to know what percentage of your effort will be wasted if you review all documents hitting a specific term? Why not take a “statistical” sample of all documents hitting that term and measure the relevancy rate? You don’t need to prove what’s statistically appropriate; there are ample calculators that will “do the math” for you. The math has been proven. See the sample size calculator by RAOsoft.com and Ralph Losey’s “worst case scenario” sample predictor. Using statistical sampling calculators inside a well-reasoned process to, as an example, test the validity and recall rates of terms that are being contemplated is not something that should have to be defended. You are simply using a calculator to help you find the best possible samples upon which to measure the effectiveness of a term or terms. Ultimately, it is the terms (along with other objective filter parameters) that are agreed upon, not what constitutes a sound statistical sample. In other words, what matters is the result and the confidence in that result, not necessarily how that sausage was made. Humans, not machines, are deciding what terms, topics and concepts are relevant. The technology simply finds documents with content similar to that which a human has decided is relevant. That’s why some call this emerging technology “machine learning”.
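The term-testing idea above (sample the documents that hit a term, have a human code the sample, measure the relevancy rate) can be sketched in a few lines of Python. Treat this as a rough illustration rather than a defensible protocol: the function name, the `is_relevant` callback standing in for a human reviewer's decision, and the document IDs are all hypothetical, and the margin of error uses the usual normal approximation with a finite population correction, which is an assumption on our part.

```python
import math
import random
from statistics import NormalDist


def estimate_relevancy_rate(hit_ids, is_relevant, sample_size,
                            confidence_pct=95.0, seed=None):
    """Estimate the share of relevant documents among the hits for one term.

    hit_ids     -- IDs of all documents hitting the term (hypothetical input)
    is_relevant -- callback recording the human reviewer's call for one ID
    Returns (rate, margin), both as fractions between 0 and 1.
    """
    hits = list(hit_ids)
    if not hits:
        raise ValueError("the term returned no documents to sample")
    rng = random.Random(seed)
    n = min(sample_size, len(hits))
    sample = rng.sample(hits, n)  # simple random sample, without replacement
    relevant = sum(1 for doc_id in sample if is_relevant(doc_id))
    rate = relevant / n
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)
    # Finite population correction: the margin shrinks to zero
    # when the sample covers every hit.
    total = len(hits)
    fpc = math.sqrt((total - n) / (total - 1)) if total > 1 else 0.0
    margin = z * math.sqrt(rate * (1.0 - rate) / n) * fpc
    return rate, margin


if __name__ == "__main__":
    # Stand-in data: 20,000 hits, with every tenth document "relevant".
    rate, margin = estimate_relevancy_rate(
        range(20_000), lambda doc_id: doc_id % 10 == 0, sample_size=2_000, seed=1)
    print(f"relevancy rate ~ {rate:.1%} +/- {margin:.1%}")
```

A low measured rate tells you roughly how much review effort a term would waste before you agree to it; the relevance decisions themselves still come from the human doing the coding.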

Today, agreeing upon a set of terms or phrases remains the only reliable objective filter that can be agreed upon and easily transferred from one technology to the next. Terms that are validated utilizing a repeatable and quantifiable methodology are going to make it much easier to defend the choice of terms. And oh, by the way, these are not things about which we are guessing. Don’t guess, get help.


