TAR - Is Your Seed Sound?
This is a republished TDA post from June 2012, slightly revised. Not much has changed over the past 3 years, except that conventional wisdom now holds that seed sets do not have to be created by "subject matter experts". See Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 (the Grossman-Cormack study). Indeed, the Grossman-Cormack study validates what many commentators have been writing for years: purely random sampling is rife with risk and error. A combination of judgmental sampling (using search terms) and random sampling to create seed sets is the superior approach.
Those who are using Technology Assisted Review (“TAR”) already know that the technology is sound. As The Digital Advantage (TDA) has written before, it is not necessarily the technology we should question. Rather, success relies upon how the technology is used and what one does with the results. Most applications deploying TAR use sophisticated “find similar” algorithms to compare the content patterns of one group of documents to another. Using sampling techniques, small groups of documents are reviewed by investigators/reviewers and QC’d by subject matter experts. The results are then compared by the technology to the larger corpus, which is defined by the investigator, whoever that might be. The technology then ranks each document in the corpus against the small reviewed sample. Some have referred to this as a Google ranking of sorts. This small reviewed sample is generally referred to as a “seed,” and large projects may use many seeds. Seeding, by the way, is a technique that has been used for just about as long as we have had search and review technology. What’s different today? The technology has gotten much better, both in sophistication and in our ability to define and implement workflows.
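To make the “find similar” idea concrete, here is a minimal sketch of ranking a corpus against a reviewed seed using TF-IDF vectors and cosine similarity. It assumes scikit-learn is available; the documents, coding calls, and centroid-based scoring are illustrative stand-ins, not any particular vendor’s algorithm.

```python
# Rank a corpus against a reviewed seed set -- a minimal "find similar" sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Reviewed seed documents and their coding calls (1 = relevant, 0 = not relevant).
seed_docs = [
    "pricing agreement between the two regional distributors",
    "minutes of the quarterly pricing committee call",
    "holiday party catering invoice",
]
seed_labels = [1, 1, 0]

# The larger corpus to be ranked against the seed.
corpus = [
    "email thread discussing distributor pricing terms",
    "IT ticket about a broken laptop dock",
    "draft amendment to the distribution agreement",
]

# Fit one vocabulary over seed + corpus so the vectors are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(seed_docs + corpus)
seed_vecs, corpus_vecs = matrix[: len(seed_docs)], matrix[len(seed_docs):]

# Score each corpus document by its similarity to the centroid of the
# relevant seed documents -- a crude stand-in for a TAR ranking engine.
relevant_rows = [i for i, label in enumerate(seed_labels) if label == 1]
relevant_centroid = np.asarray(seed_vecs[relevant_rows].mean(axis=0))
scores = cosine_similarity(corpus_vecs, relevant_centroid).ravel()

# Print the corpus ranked from most to least similar to the relevant seed.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```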
So, the seed is critical, right? Bad seed, bad result. And when we say the quality of the seed is important, we are not just talking about the quality of the review and QC performed by the subject matter expert. The make-up of the seed can make or break the result in terms of quality and/or time. TAR technology and sampling methods are based upon randomly sampling a percentage of the target population of documents. Traditional sampling methods assume a high rate of relevant content within the overall population being investigated; technically, the lower the relevant content, the larger your sample size should be. For example, some sampling calculations assume that 50% of the sampled population is relevant. A relevancy rate that high is seldom the case in our world. Relevancy rates are commonly below 1% and rarely over 10%. A purely random sample drawn from a universe of 1,000,000 documents, for example, is unlikely to yield very many relevant documents. So, your purely random sample is a bit like a shotgun blast in the dark.
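A back-of-the-envelope sketch shows how quickly low prevalence erodes a purely random sample. The sample size of 1,500 and the target of 100 relevant examples are assumptions for illustration, not figures from the post or any study.

```python
# A rough illustration of why low prevalence undermines purely random seeds.

def expected_relevant(sample_size: int, prevalence: float) -> float:
    """Expected number of relevant documents in a simple random sample."""
    return sample_size * prevalence

def sample_needed(target_relevant: int, prevalence: float) -> int:
    """Approximate random sample size needed to surface that many relevant docs."""
    return round(target_relevant / prevalence)

for prevalence in (0.50, 0.10, 0.01):
    hits = expected_relevant(1_500, prevalence)
    need = sample_needed(100, prevalence)
    print(f"prevalence {prevalence:>4.0%}: a 1,500-doc random sample yields "
          f"~{hits:.0f} relevant docs; expect to review ~{need:,} docs "
          f"to find 100 relevant examples")

# prevalence  50%: a 1,500-doc random sample yields ~750 relevant docs; ~200 docs for 100 relevant
# prevalence  10%: a 1,500-doc random sample yields ~150 relevant docs; ~1,000 docs for 100 relevant
# prevalence   1%: a 1,500-doc random sample yields ~15 relevant docs; ~10,000 docs for 100 relevant
```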
Does that mean we should have less confidence in TAR and sampling methodologies? No, it most certainly does not in our view. Rather, doesn't it make sense to create a better seed and increase the accuracy? Utilizing a proven search term validation methodology, so that seeds rich in relevant content are reviewed, is the better course. By the way, even a bad seed is superior to the common practice of throwing terms at the wall and seeing what sticks, but we digress.
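Here is a minimal sketch of the judgmental-plus-random approach to seed construction: documents that hit validated search terms form the relevant-rich core of the seed, with a smaller random slice to guard against blind spots the terms miss. The terms, the 80/20 split, and the helper function are hypothetical, for illustration only.

```python
# Blend judgmental (search-term) and random selection when drawing a seed set.
import random

# Search terms assumed to have already been validated against the collection.
validated_terms = ["pricing", "distributor", "agreement"]

def build_seed(corpus: list[str], seed_size: int,
               judgmental_share: float = 0.8) -> list[str]:
    """Combine term-hit documents with a random slice of the remainder."""
    rng = random.Random(42)  # fixed seed so the draw is reproducible

    hits = [doc for doc in corpus
            if any(term in doc.lower() for term in validated_terms)]
    misses = [doc for doc in corpus if doc not in hits]

    n_judgmental = min(len(hits), int(seed_size * judgmental_share))
    n_random = min(len(misses), seed_size - n_judgmental)

    # Term hits give the seed its relevant-rich core; the random slice keeps
    # the seed from being blind to documents the terms do not reach.
    return rng.sample(hits, n_judgmental) + rng.sample(misses, n_random)
```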
As TDA has opined many times, outside of date and file type filters, search terms are the only truly transportable, objective content filters that the parties can agree upon. Using seeds constructed and reviewed from validated search terms dramatically increases the impact of the seeds and the success of any TAR workflow. Far fewer documents will be reviewed. Do you have sound auditing methodologies? Are you just throwing technology at the problem, or are you using tried and true workflows?