Friday, June 22, 2012

TAR - Is Your Seed Sound?

Those who are using Technology Assisted Review (“TAR”) already know that the technology is sound.  As The Digital Advantage (TDA) has written before, it is not necessarily the technology we should question.  Rather, success depends upon how the technology is used and what one does with the results.  Most applications deploying TAR use sophisticated “find similar” algorithms to compare the content patterns of one group of documents to another.  Using sampling techniques, small groups of documents are reviewed by subject matter experts, and the results are then compared against the larger corpus by the technology.  The corpus is defined by the investigator, whoever that might be.  The technology then ranks each document in the corpus against the small expert-reviewed sample.  Some have referred to this as a Google ranking of sorts.  This small sample is generally referred to as a “Seed,” and large projects may use many seeds.  Seeding, by the way, is a technique that has been used for just about as long as we have had search and review technology.  What’s different today?  The technology has gotten much better, both in terms of sophistication and our ability to define and implement workflows.
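To make the ranking step concrete, here is a minimal sketch in Python of one way a “find similar” comparison can work: score every document in the corpus against an expert-reviewed seed set using TF-IDF cosine similarity. Commercial TAR tools use their own, often proprietary, algorithms, so this is only an illustration of the general idea, not any vendor’s method; the document texts are made up.

# Minimal "find similar" sketch: rank corpus documents by their
# TF-IDF cosine similarity to an expert-reviewed seed set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_against_seed(seed_relevant_docs, corpus_docs):
    """Return (corpus index, score) pairs ranked by similarity to the seed."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on everything so the seed and the corpus share one vocabulary.
    all_vectors = vectorizer.fit_transform(seed_relevant_docs + corpus_docs)
    seed_vectors = all_vectors[:len(seed_relevant_docs)]
    corpus_vectors = all_vectors[len(seed_relevant_docs):]
    # Score each corpus document by its best match against any seed document.
    scores = cosine_similarity(corpus_vectors, seed_vectors).max(axis=1)
    ranked = sorted(range(len(corpus_docs)), key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in ranked]

# Documents most like the reviewed seed float to the top of the review queue.
seed = ["defective widget failure report", "customer complaint widget overheating"]
corpus = ["quarterly earnings summary", "widget overheating incident log", "holiday party memo"]
print(rank_against_seed(seed, corpus))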
So, the seed is critical, right?  Bad seed, bad result.  And when we say the quality of the seed is important, we are not just talking about the quality of the expert’s review of the seed.  The makeup of the seed can make or break the result in terms of quality and/or time.  TAR technology and its sampling methods are based upon randomly sampling a percentage of the target population of documents.  Traditional sampling methods assume a high rate of relevant content within the overall population being investigated; the lower the prevalence of relevant content, the larger your sample needs to be to capture enough relevant examples.  For example, some standard sample-size calculations assume, as a worst case, that 50% of the sampled population is relevant.  A relevancy rate that high is seldom the case in our world: rates below 1% are common, and rates over 10% are rare.  So, your random sample is a bit like a shotgun blast in the dark. 
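A rough sketch of the standard proportion sample-size formula shows why low prevalence matters. The 95% confidence and plus-or-minus 5% margin figures below are illustrative assumptions, not a prescription; the point is simply how few relevant documents a random draw yields when prevalence drops toward 1%.

# Classic sample-size formula n = z^2 * p * (1 - p) / e^2, with p = 0.5 as
# the conservative worst-case default, and what low prevalence does to the
# number of relevant documents a random seed actually contains.
import math

def sample_size(confidence_z=1.96, margin_of_error=0.05, expected_prevalence=0.5):
    p = expected_prevalence
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

n = sample_size()  # roughly 385 documents at 95% confidence, +/- 5%
for prevalence in (0.50, 0.10, 0.01):
    expected_relevant = n * prevalence
    print(f"prevalence {prevalence:>5.0%}: a {n}-doc random sample yields "
          f"~{expected_relevant:.0f} relevant documents to learn from")

At 50% prevalence that sample contains roughly 190 relevant exemplars; at 1% it contains about four, which is the shotgun blast in the dark described above.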
Does that mean we should have less confidence in TAR and sampling methodologies?  No, it most certainly does not, in our view.  Rather, doesn’t it make sense to create a better seed and increase the accuracy?  By the way, even a bad seed is superior to the common practice of throwing terms at the wall and seeing what sticks, but we digress.
As TDA has opined many times, search terms are the only truly transportable, objective content filters that the parties can agree upon.  Using seeds constructed and reviewed from validated search terms dramatically increases the impact of the seeds and the success of any TAR workflow, and far fewer documents will need to be reviewed; a rough sketch of the idea follows below.  Do you have sound auditing methodologies?  Are you just throwing technology at the problem, or are you using tried and true workflows? 
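Here is a minimal sketch of the seed-construction idea, assuming validated search terms are already in hand: instead of drawing the seed purely at random, draw it from the documents that hit at least one validated term, so the experts review richer material. The term list, document texts, and seed size are hypothetical.

# Build a seed candidate pool from validated search-term hits rather than
# from a purely random draw over the whole corpus.
import random

def build_seed_candidates(documents, validated_terms, seed_size=500, rng=None):
    """Sample the seed from documents that match at least one validated term."""
    rng = rng or random.Random(42)
    hits = [doc_id for doc_id, text in documents.items()
            if any(term.lower() in text.lower() for term in validated_terms)]
    return rng.sample(hits, min(seed_size, len(hits)))

documents = {
    "DOC-001": "Widget overheating complaint from customer field report",
    "DOC-002": "Minutes of the holiday party planning committee",
    "DOC-003": "Engineering memo on widget thermal failure testing",
}
validated_terms = ["overheating", "thermal failure"]
print(build_seed_candidates(documents, validated_terms, seed_size=2))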
