Any attorney who has been through an electronic review of a client’s documents has likely heard of de-duplication, the automatic process of identifying and removing duplicate documents from an electronic data collection. Some may also have found the process confusing and experienced frustration with this essential, but incomplete, culling technique.
This article is intended to explain de-duplication – what it does, how it does it, why it is so often imperfect – and to describe vdiscovery’s solution for making de-duplication much more effective, without adding costs.
What is de-duplication?
As mentioned above, de-duplication is the automatic process of identifying and removing duplicate documents in an e-data collection, usually prior to attorney review. It is generally performed as part of data processing, and at no additional cost.
De-duplication is to be distinguished from “near de-duplication”, which looks at a document’s text and order to determine similarity, and can thus spot and group the documents that are “near-dupes.” Unlike de-duplication, “near de-duplication” is only run after the data has already been processed and loaded to the system. Near de-duplication is generally run using Structured Analytics, and is normally charged on a per gigabyte basis at a cost that is distinct from standard processing.
By contrast, standard de-duplication uses a completely different technique: it identifies duplicates by comparing the digital fingerprint, or Hash value, of each document. Because even trivial differences between two copies produce different Hash values, this approach can leave duplicates in the data set available for review, as will be further described below.
Why should I care about de-duplication?
One reason de-duplication is of interest is an attorney’s obligation, per the Model Rules of Professional Conduct, to “provide competent representation” with the reasonable “knowledge, skill, thoroughness and preparation” required for the case and client. Since de-duplication is now a standard part of electronic discovery, and could affect whether duplicative documents are reviewed and/or produced, it is ethically incumbent upon attorneys to understand how de-duplication works.
Another, perhaps better, reason to know about de-duplication is the time and cost savings this technology offers, benefiting both attorney and client. De-duplication is used to detect exact duplicate documents, which can be voluminous, and to remove those documents from review. For voluminous reviews requiring multiple reviewers, de-duplication will additionally help avoid inconsistent coding of duplicate documents, by preventing the duplicates from ever entering the review set.
Thus, de-duplication has the potential to save a party significant time and money, and to avoid reviewer mistakes.
MD5 Hash: The Science
MD5 is a message digest hashing algorithm that generates a 128-bit cryptographic hash value, presented as a 32-digit hexadecimal value, which is used for data integrity and for uniquely identifying documents. Essentially, the MD5 serves as a “digital fingerprint,” which means that two documents with exactly the same content and metadata will each be assigned the same hash value. On the other hand, if there are any changes to the content or metadata, the document will get a different hash value.
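The “digital fingerprint” behavior described above can be illustrated with a minimal sketch using Python’s standard hashlib module. The sample text and the helper name md5_fingerprint are assumptions made for illustration only; a processing tool would hash the actual file or a combination of its fields.

```python
import hashlib


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


# Illustrative content only; not taken from any real collection.
original = b"Please see the attached agreement."
exact_copy = b"Please see the attached agreement."
altered = b"Please see the attached agreement. "  # one trailing space added

# Identical bytes always yield the identical fingerprint.
print(md5_fingerprint(original) == md5_fingerprint(exact_copy))  # True

# A single added character yields a different fingerprint.
print(md5_fingerprint(original) == md5_fingerprint(altered))  # False

# The fingerprint is always 32 hexadecimal characters (128 bits).
print(len(md5_fingerprint(original)))  # 32
```

The point of the sketch is that the comparison is all-or-nothing: the hash does not indicate how similar two documents are, only whether their hashed bytes are identical.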
In eDiscovery, one of the most commonly applied algorithms is the MD5 (MD is short for message digest). An MD5 hash is rendered as a string of 32 hexadecimal characters, made up of the digits 0-9 and the letters a-f.
MD5 is just one type of Hash value used for identifying duplicates. A more modern alternative is the SHA1 hash value, which is generally considered a more secure cryptographic hash function. In contrast to the MD5, the SHA1 hash value is a 160-bit message digest and renders as a hexadecimal number that is 40 digits long. Relativity uses a still newer hash value from the SHA2 family, specifically the SHA-256 Hash function, which is a 256-bit message digest and renders as a hexadecimal number that is 64 digits long.
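The digest sizes mentioned above can be checked with a short sketch, again using Python’s standard hashlib module. The sample bytes are an arbitrary assumption; the lengths are the same for any input.

```python
import hashlib

# Arbitrary sample input; digest length does not depend on the content.
sample = b"Quarterly report, final version"

# Map each algorithm name to the length of its hexadecimal digest.
digest_lengths = {
    name: len(hashlib.new(name, sample).hexdigest())
    for name in ("md5", "sha1", "sha256")
}

for name, hex_digits in digest_lengths.items():
    # Each hexadecimal digit encodes 4 bits.
    print(f"{name}: {hex_digits * 4} bits, {hex_digits} hex digits")
```

This prints 128 bits and 32 digits for MD5, 160 bits and 40 digits for SHA1, and 256 bits and 64 digits for SHA-256.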
Some Examples of Exact Duplicates
One common example of duplicate documents is emails exchanged between two or more parties.
Where emails are collected for two individuals (also called email “custodians”) who communicated frequently, there will be two copies of each email they exchanged. One copy of the email will be found in the sent box of the sender and the other will be in the inbox of the recipient. Likewise, where an email has multiple recipients or ‘CC’ recipients, a copy will reside in each recipient’s inbox.
In data collections comprising multiple custodians, “global de-duplication” will remove the most duplicates, leaving only one version of each email in the review platform. In such a case, the metadata, e.g. To, From, and CC, will still indicate which other parties received the message.
If de-duplication is run “per custodian”, only duplicates identified within the data of an individual custodian will be removed. For example, if a custodian was listed twice in the To or CC fields of an email, whether under the same or different email addresses (and both mailboxes were collected), only one copy will be reviewed.
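The difference between the two scopes can be sketched as follows. This is a simplified illustration, not a description of how any particular processing platform works: the custodian names, the sample messages, and the function names are all hypothetical, and each document is reduced to raw bytes that are assumed to be identical across copies.

```python
import hashlib
from typing import List, Tuple

# Hypothetical record: (custodian name, raw document bytes).
Document = Tuple[str, bytes]


def doc_hash(content: bytes) -> str:
    """Fingerprint a document by hashing its bytes."""
    return hashlib.sha256(content).hexdigest()


def dedupe_global(docs: List[Document]) -> List[Document]:
    """Keep the first copy of each document across ALL custodians."""
    seen = set()
    kept = []
    for custodian, content in docs:
        key = doc_hash(content)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, content))
    return kept


def dedupe_per_custodian(docs: List[Document]) -> List[Document]:
    """Keep the first copy of each document WITHIN each custodian's data."""
    seen = set()
    kept = []
    for custodian, content in docs:
        key = (custodian, doc_hash(content))
        if key not in seen:
            seen.add(key)
            kept.append((custodian, content))
    return kept


collection = [
    ("alice", b"Lunch on Friday?"),  # sender's copy
    ("bob", b"Lunch on Friday?"),    # recipient's copy
    ("bob", b"Lunch on Friday?"),    # second copy in the recipient's data
    ("bob", b"Sounds good."),
]

print(len(dedupe_global(collection)))         # 2
print(len(dedupe_per_custodian(collection)))  # 3
```

Global de-duplication keeps one copy of the shared message for the whole collection, while per-custodian de-duplication keeps one copy for each custodian who held it.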
Duplicates could also exist in the “loose e-docs” of one or multiple custodians, such as their Word, Excel, PDF, or video files, or in program files, where multiple copies were stored on the local hard drive or on a fileshare.
While de-duplication is a standard, widely accepted process, it is far from perfect.
This is particularly true for emails, which contain numerous metadata fields that are used to generate a hash value, and the same email may have slight differences depending on the method by which it is preserved. For example, an email sent from a Gmail account to a Microsoft Exchange email address might have no footer as stored on Gmail, while the copy preserved in the Microsoft Exchange environment may contain a footer or other characters not found in the original. If the same email was archived from Outlook Desktop, that copy could contain still other characters or metadata, resulting in that email obtaining a different Hash value from the other duplicate emails.
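A minimal sketch of this “false negative” is shown below. The field names, addresses, and footer text are hypothetical, and hashing every field in sorted order is an assumption made for illustration; actual processing tools choose their own fields and encoding.

```python
import hashlib
from typing import Mapping


def email_hash(fields: Mapping[str, str]) -> str:
    """Hash every field of an email record, visiting fields in sorted order."""
    digest = hashlib.md5()
    for name in sorted(fields):
        # A zero byte separates names and values so they cannot run together.
        digest.update(name.encode("utf-8") + b"\x00")
        digest.update(fields[name].encode("utf-8") + b"\x00")
    return digest.hexdigest()


# The message as stored at the sending service (hypothetical values).
gmail_copy = {
    "from": "sender@example.com",
    "to": "recipient@example.com",
    "subject": "Contract draft",
    "body": "Draft attached.",
}

# The same message as stored by the receiving server, which appended a footer.
exchange_copy = dict(
    gmail_copy,
    body="Draft attached.\n\nThis message may be confidential.",
)

# To a reviewer these are the same email; to the hash they are different.
print(email_hash(gmail_copy) == email_hash(exchange_copy))  # False
```

Because the two hash values differ, standard de-duplication would load both copies to the review platform.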
These duplicate “false negatives” are most frequently problematic when processing large sets of data from multiple sources and custodians. In such cases, after running de-duplication on millions of documents, there may still be many thousands of apparent duplicate documents loaded to the review platform, which can result in significantly extended review times and inflated costs.
A related issue arises for the recipient of a “data dump”, where the producing party (intentionally or otherwise) processes and produces a large volume of data, including all duplicate documents. The production of numerous duplicate documents might force an attorney to spend an inordinate amount of time and money on onerous attorney review of the duplicates. Thus the receiving party finds itself stuck between the proverbial “rock and a hard place”.
Solution: vdiscovery’s BatchGuru
Enter vdiscovery’s proprietary BatchGuru suite of tools (recognized by kCura as a best-in-breed Relativity Ecosystem development). BatchGuru provides a powerful solution for these duplicate doldrums, which can otherwise stymie a party’s required review, productions, and trial preparation: BatchGuru’s “Custom Hashing” tool.
BatchGuru’s “Custom Hashing” tool enables vdiscovery to generate custom hash values based on a careful selection or exclusion of fields used to identify duplicates. Thus, where Outlook and Gmail sources were collected, and differential characters prevent the identification of duplicates, our analysts can identify the offending field, and use our Custom Hashing tool to isolate and resolve it. The resulting Hash value allows us again to detect and remove the documents that, “but for” the incidental differences in source processing, are true duplicates.
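The general idea of hashing only a chosen set of fields can be sketched as below. To be clear, this is not BatchGuru’s implementation, which is proprietary and not described in this article; the function custom_hash, the field names, the sample values, and the whitespace trimming are all assumptions made purely to illustrate the concept.

```python
import hashlib
from typing import Iterable, Mapping


def custom_hash(record: Mapping[str, str], fields: Iterable[str]) -> str:
    """Hash only the selected fields of a record (hypothetical helper)."""
    digest = hashlib.sha256()
    for name in sorted(fields):
        # Assumed normalization: trim surrounding whitespace from each value.
        value = record.get(name, "").strip()
        digest.update(f"{name}={value}".encode("utf-8") + b"\x00")
    return digest.hexdigest()


# Two copies of one email from different sources (hypothetical values).
gmail_copy = {
    "from": "sender@example.com",
    "to": "recipient@example.com",
    "subject": "Contract draft",
    "body": "Draft attached.",
    "footer": "",
}
exchange_copy = dict(gmail_copy, footer="This message may be confidential.")

all_fields = ["from", "to", "subject", "body", "footer"]
selected_fields = ["from", "to", "subject", "body"]  # offending field excluded

# Hashing every field treats the copies as different documents.
print(custom_hash(gmail_copy, all_fields) == custom_hash(exchange_copy, all_fields))  # False

# Excluding the offending field lets the copies match.
print(custom_hash(gmail_copy, selected_fields) == custom_hash(exchange_copy, selected_fields))  # True
```

The trade-off in any such approach is that each excluded field is a difference the hash can no longer see, so the selection has to be made with care.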
vdiscovery’s “Custom Hashing” tool is not only good in theory; vdiscovery employs it frequently to the benefit of our clients. In one recent instance, vdiscovery applied custom de-duplication to a set of nearly 600,000 emails, effectively reducing the reviewable document collection by over 100,000 emails.
Not only can Custom Hashing help reviewers spot and remove duplicates, it can also be used to spot problematic or incomplete productions by an adversary. This can be accomplished by generating a customized hash value for critical documents that should be expected in outside productions, and then generating the same hash value on the outside production to see which, if any, of those critical documents are actually received.
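The completeness check described above amounts to comparing two sets of hash values. The sketch below assumes a plain SHA-256 over document bytes as a stand-in for a customized hash, and the document identifiers, contents, and function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(content: bytes) -> str:
    """Stand-in for a customized hash: SHA-256 over the document bytes."""
    return hashlib.sha256(content).hexdigest()


def missing_from_production(
    critical: Dict[str, bytes], produced: Iterable[bytes]
) -> List[str]:
    """Return the IDs of critical documents whose hash is absent from the production."""
    produced_hashes = {fingerprint(doc) for doc in produced}
    return sorted(
        doc_id
        for doc_id, content in critical.items()
        if fingerprint(content) not in produced_hashes
    )


# Documents a party expects to see in the other side's production.
critical_docs = {
    "memo-001": b"Board memo re: merger",
    "email-017": b"Pricing discussion",
    "contract-003": b"Signed supply agreement",
}

# Documents actually received in the outside production.
outside_production = [
    b"Pricing discussion",
    b"Unrelated newsletter",
    b"Board memo re: merger",
]

missing = missing_from_production(critical_docs, outside_production)
print(missing)  # ['contract-003']
```

Any identifier returned is a critical document with no matching hash in the production, which flags it for follow-up rather than proving it was withheld.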
In summary, Hash values serve an essential role in allowing parties to authenticate documents and to identify and remove exact duplicates where appropriate. Still, Hash values are not without their shortcomings, and the peculiarities of email servers will often keep apparently exact duplicate documents within a collection, cluttering review databases with extraneous documents, much to the chagrin of the reviewing attorneys and paying clients.
In either instance, vdiscovery’s Custom Hashing tool can help attorneys contain costs, meet court deadlines, and strengthen the client’s case.
Custom Hashing is just one example of vdiscovery’s “Strength in Fact”. To find out more about Custom Hashing, or any of the other value-adding tools of BatchGuru, check out our feature page and request a demo at email@example.com.