Part 2: Combining Predictive Coding and Search Term Classification in 5 Easy Steps
By Mark G. Walker, VP Advisory Services and
Robin Athlyn Thompson, VP Marketing | Business Development
This week we present Part 2 of our 5-part series on combining search term classification and predictive coding. In case you missed Part 1, you can find it here.
Step 2: Dump the Junk
ESI collections include acquisitions of ESI from laptops, 3rd party sites, file servers, and wherever else users keep potentially relevant ESI. In some cases, entire user hard drives are collected. In other cases, just user files are collected. Whatever the collection method, thousands, millions, even billions of files are collected. Experience teaches us that less than 1% of the information collected will prove to be valuable to your case, which leaves an enormous number of collected files that are of no value. Here are three common objective filters that can be applied to eliminate known garbage before you do any downstream indexing, analysis or classification. This is not intended to be an exhaustive list.
- De-NIST – NIST is an acronym for the National Institute of Standards and Technology. The National Software Reference Library (National Institute of Standards and Technology, n.d.) is a NIST project that maintains a master list of known computer application and system files. To De-NIST means you use this list to eliminate known application or system files, which have no value in most cases.
- File Type Filter – Eliminate known file types of no interest that the NIST list does not cover. In most cases, an inclusive file filter ingests into processing only specific file types of interest. Audio, video, image and other specific file types may be set aside, or not used at all. These file types are very heavy, which drives up cost. They also contain little or no text content and are difficult to analyze, often requiring a different process and workflow. Create a special process for audio/video files that may be relevant. Your eDiscovery budget will thank you.
- Date Range Filter – We would urge caution when applying a date filter BEFORE processing. Processing is the act of extracting metadata and content. It also expands container files such as PST email archives and ZIP files. If you apply a date filter before processing and the collection includes container files, you are virtually guaranteed to miss files of interest. By way of example, if you create a PST archive of my email today, it will contain months and even years of email, yet the date of the PST will be today’s date. If your date range filter does not include today’s date, that PST will be eliminated from processing consideration, even though emails within the date range are inside the archive.
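To make the three filters concrete, here is a minimal Python sketch of how they might be ordered. Everything in it is an illustrative assumption rather than a description of any particular processing tool: the `CollectedFile` class, the `KNOWN_FILE_HASHES` set (a stand-in for a known-file hash list such as one built from the NSRL data), the extension lists, and the `expand` helper that simulates container expansion. It shows De-NIST and the file type filter running before processing, and the date filter running only after containers have been expanded.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date
from pathlib import PurePath

# Hypothetical known-file hash set; a real one would be loaded from a reference list.
KNOWN_FILE_HASHES = {hashlib.sha1(b"known system file").hexdigest()}
# Hypothetical inclusive list of file types of interest.
INCLUDED_EXTENSIONS = {".msg", ".docx", ".xlsx", ".pdf", ".txt"}
# Containers are kept so that processing can expand them.
CONTAINER_EXTENSIONS = {".pst", ".zip"}


@dataclass
class CollectedFile:
    name: str
    content: bytes
    file_date: date  # file system date, or sent date for an email
    children: list = field(default_factory=list)  # items inside a container


def is_known_file(f):
    """De-NIST check: is the file's hash on the known-file list?"""
    return hashlib.sha1(f.content).hexdigest() in KNOWN_FILE_HASHES


def passes_type_filter(f):
    """Inclusive file type filter; containers always pass."""
    ext = PurePath(f.name).suffix.lower()
    return ext in INCLUDED_EXTENSIONS or ext in CONTAINER_EXTENSIONS


def pre_processing_filter(files):
    """Apply De-NIST and the file type filter. No date filter at this stage."""
    return [f for f in files if not is_known_file(f) and passes_type_filter(f)]


def expand(f):
    """Simulate processing: replace a container with the items inside it."""
    if not f.children:
        return [f]
    return [item for child in f.children for item in expand(child)]


def date_filter(files, start, end):
    """Keep files whose date falls within the range, inclusive."""
    return [f for f in files if start <= f.file_date <= end]


collected = [
    CollectedFile("kernel32.dll", b"known system file", date(2014, 1, 1)),
    CollectedFile("holiday.mp4", b"video bytes", date(2012, 6, 1)),
    CollectedFile("archive.pst", b"pst bytes", date(2015, 3, 1), children=[
        CollectedFile("offer.msg", b"message one", date(2012, 5, 10)),
        CollectedFile("lunch.msg", b"message two", date(2014, 8, 2)),
    ]),
    CollectedFile("memo.docx", b"memo text", date(2012, 7, 4)),
]
start, end = date(2012, 1, 1), date(2012, 12, 31)

# Date filter too early: the PST is dated 2015, so it is dropped unopened.
too_early = [f.name for f in date_filter(collected, start, end)]

# Date filter after expansion: the 2012 email inside the PST survives.
kept = pre_processing_filter(collected)
expanded = [item for f in kept for item in expand(f)]
final = [f.name for f in date_filter(expanded, start, end)]

print(too_early)  # ['holiday.mp4', 'memo.docx']
print(final)      # ['offer.msg', 'memo.docx']
```

In this toy collection, filtering by date first discards archive.pst along with the 2012 email inside it, while the second ordering keeps that email. A real workflow would likely re-apply De-NIST and the file type filter to the expanded items as well; the sketch leaves that out to stay short.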
Next week: Part 3 "Process the Good Stuff"