Wednesday, March 1, 2017

Part 3:  Combining Predictive Coding and Search Term Classification in 5 Easy Steps

By Mark G. Walker, VP Advisory Services and 
Robin Athlyn Thompson, VP Marketing | Business Development

This week we continue with Part 3 of our 5-part series on combining search term classification and predictive coding. In case you missed Part 1, you can find it here. You can find Part 2 here.


Step 3: Process the Good Stuff


Once you’ve eliminated everything that you can objectively eliminate, it’s time to process.  Processing is the act of extracting metadata and content, indexing, analyzing, and staging ESI for review and production.  Some steps, such as indexing content, can happen in a second or third stage, depending on the service provider’s capabilities.  The first stage of ingesting ESI is often referred to as pre-processing.  As noted in Step 2, all container files are opened, and individual files are created during processing.  Emails and attachments, for example, are pulled from the PST container and presented as individual files rather than a single container file.
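To make the idea of a staged intake concrete, here is a minimal pre-processing sketch in Python. It assumes the individual files have already been expanded out of their containers into a folder, and it only captures basic metadata and a hash (deferring text extraction and indexing to a later stage). The field names and two-stage split are illustrative, not any particular vendor's pipeline.

```python
import hashlib
import os
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileRecord:
    """Lightweight inventory record captured during pre-processing."""
    path: str
    extension: str
    size_bytes: int
    modified: datetime
    md5: str  # used later for deduplication


def capture_metadata(source_dir: str) -> list[FileRecord]:
    """Walk the expanded files and record metadata; text extraction and
    indexing are intentionally deferred to a later processing stage."""
    records = []
    for root, _dirs, names in os.walk(source_dir):
        for name in names:
            path = os.path.join(root, name)
            stat = os.stat(path)
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            records.append(FileRecord(
                path=path,
                extension=os.path.splitext(name)[1].lower(),
                size_bytes=stat.st_size,
                modified=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                md5=digest,
            ))
    return records
```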

Once processing is complete, apply the “objective” filters you identified in Step 2 again, so that files extracted from containers can also be suppressed from downstream processes.
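As a rough illustration, re-applying objective filters to records like the FileRecord sketched above might look like the following. The excluded extensions and date range are hypothetical placeholders standing in for whatever objective criteria you settled on in Step 2.

```python
from datetime import datetime, timezone

# Hypothetical objective criteria carried over from Step 2.
EXCLUDED_EXTENSIONS = {".exe", ".dll", ".tmp"}
DATE_START = datetime(2012, 1, 1, tzinfo=timezone.utc)
DATE_END = datetime(2016, 12, 31, tzinfo=timezone.utc)


def passes_objective_filters(record) -> bool:
    """Keep a file only if it survives the same objective filters used in
    Step 2; anything else is suppressed from downstream processes."""
    if record.extension in EXCLUDED_EXTENSIONS:
        return False
    return DATE_START <= record.modified <= DATE_END
```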

Unlike prior workflows that centered on applying search term filters at this stage, you SHOULD NOT filter by search terms during processing unless the terms have been validated using a process like the one outlined in Step 4 and will not change going forward.  Even those of us expert at developing search terms should remember that filtering with those terms during processing may pull in a large percentage of irrelevant documents.  The fact is we can’t be certain how well search terms perform until we conduct sample review and testing.  At minimum, we encourage you to perform the tasks discussed here.
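The kind of sample review and testing we have in mind (covered more fully in Step 4) can be as simple as drawing a random sample of term hits, reviewing them, and estimating what share are actually responsive. A minimal sketch follows; the sample size and the normal-approximation margin of error are illustrative defaults, and it assumes the sampled documents have already been reviewed and coded.

```python
import math
import random


def estimate_precision(hit_doc_ids: list[str],
                       reviewed_responsive: set[str],
                       sample_size: int = 385,
                       seed: int = 42) -> tuple[float, float]:
    """Estimate the share of search-term hits that are responsive, with an
    approximate 95% margin of error, from a reviewed random sample."""
    rng = random.Random(seed)
    sample = rng.sample(hit_doc_ids, min(sample_size, len(hit_doc_ids)))
    if not sample:
        return 0.0, 0.0
    responsive = sum(1 for doc_id in sample if doc_id in reviewed_responsive)
    p = responsive / len(sample)
    margin = 1.96 * math.sqrt(p * (1 - p) / len(sample))
    return p, margin
```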

Finally, as processing extracts domains, we recommend you request a report of the domains present in the ESI and filter out emails from domains that are clearly junk.  Emails from cnn.com, for example, are likely news alerts or other junk rather than substantive correspondence.  Some processing applications have rudimentary review and tag functions designed precisely for this purpose.  Be careful, however: anything you filter out during processing can have a negative impact downstream.  Regardless of whether you filter out junk domains during processing, you will want to repeat that step (again, if you did so during processing) once the ESI resides in the review/analysis platform.  Here are a few things to consider during processing; this is not intended to be an exhaustive list.

  1. Apply Objective Filters – Apply again any objective filters that were applied during Step 2.
  2. Consider “Pre-Processing” Steps – It may dramatically speed up processing to use a multi-stage processing workflow.  For example, you may not want to extract text and build indexes for files that may be filtered out.
  3. Be Careful with Search Terms – Before applying search term filters during processing, consider the consequences very carefully.  There are serious ramifications for deduplication, for example, if your search terms later change and newly received data is filtered with a different set of terms.
  4. Domain Filters – Identify junk domains and eliminate files associated with clearly junk emails (see the sketch after this list).
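A domain report of the kind described above can be approximated with a simple tally of sender domains. The sketch below assumes each processed email record exposes a sender address, and the junk-domain list is a hypothetical example of a case-by-case judgment call, not a recommendation.

```python
from collections import Counter

# Hypothetical junk domains identified from the report; every matter differs.
JUNK_DOMAINS = {"newsletters.example.com", "alerts.example.com"}


def domain_report(sender_addresses: list[str]) -> list[tuple[str, int]]:
    """Tally sender domains, most frequent first, so clearly junk domains
    can be reviewed and suppressed (or tagged for later filtering)."""
    domains = Counter(
        addr.rsplit("@", 1)[-1].lower()
        for addr in sender_addresses
        if "@" in addr
    )
    return domains.most_common()


def is_junk(sender: str) -> bool:
    """Flag an email whose sender domain appears on the junk list."""
    return "@" in sender and sender.rsplit("@", 1)[-1].lower() in JUNK_DOMAINS
```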

Stay tuned next week for Part 4:  Validate Key Terms Before You Agree to Them
