Bloomberg Law
December 23, 2015, 9:33 PM UTC

The E-Discovery Challenge: Finding a Needle As the Haystack Grows

Greg Schodde

Editor’s Note: The author of this post is a shareholder at a Chicago-based intellectual property law firm and co-chair of the education subcommittee of the Seventh Circuit E-Discovery Pilot Program.

By Greg Schodde, McAndrews, Held & Malloy

I first heard the old saw, “Every case comes down to 20 documents,” working on my first big patent case. I didn’t believe it. How could boxes of highly technical and relevant documents be reduced to such a small set?

Well, a year later, with one box of about 20 critical documents, we were ready for trial. We had started with nearly 200,000 pages of documents, with no electronic database of images; every document had been pulled by hand from filing cabinets.

Today, all of the documents from that case would easily fit on a disk drive that is electronically searchable from virtually anywhere. E-discovery is not constrained by the physical limits of paper and human performance. Every document can be pulled up, annotated, and displayed in the courtroom. What has not changed, though, is the need to find those 20 critical documents.

However, the technology that should make finding these critical documents easier has, in some ways, actually made it harder. From high costs to excessive production, e-discovery remains, for many, a substantial burden.

Feeds the temptation to demand and collect large compilations of data

E-discovery tools and companies are plentiful, and many in the legal industry feel the need to get their hands on as much information as possible to build the best possible case. While e-discovery allows law firms to collect more information in a shorter amount of time, our ability to meaningfully process what comes out at the end has not changed. The average reader reads about 200 words per minute, and a typical civil jury trial cannot be scheduled for more than a few weeks and is usually shorter. That is not enough time to review and truly understand more than a tiny fraction of the documents in a case.
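A rough back-of-the-envelope calculation makes the mismatch concrete. The reading speed comes from the paragraph above; the trial length, review hours, words per page and production size below are illustrative assumptions, not figures from any particular case.

```python
# Rough illustration of how little of a large production one reader can cover.
# Reading speed is from the article; every other figure is an assumption.

words_per_minute = 200          # average reading speed (from the article)
trial_weeks = 3                 # assumed length of a typical civil jury trial
reading_hours_per_day = 6       # assumed hours per day available for review
words_per_page = 300            # assumed average words on a document page
pages_produced = 1_000_000      # assumed size of a modern electronic production

readable_words = words_per_minute * 60 * reading_hours_per_day * 5 * trial_weeks
readable_pages = readable_words / words_per_page
fraction = readable_pages / pages_produced

print(f"Pages a single reader could cover: {readable_pages:,.0f}")   # 3,600
print(f"Fraction of the production: {fraction:.2%}")                 # well under 1%
```

Even with generous assumptions, a single reader gets through a fraction of one percent of a million-page production over the life of a short trial.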

Fuels the tendency to distrust the opposing side

Confronted with growing “haystacks” of raw data, there is a tendency to suspect that the other side has not actually disgorged the required materials. Handling larger and larger volumes of data inflicts high dollar costs while yielding less and less return per unit of data collected. And the larger the starting data set, the harder it is to demonstrate with confidence that the “best” documents are or are not being produced. This can lead to demands for still more information, or to more time, energy and cost being expended to process the data. Electronic search tools are also poor at retrieving non-text documents and documents that are important only because of where they are, not because of what they say.

Leads to counter-productivity

Another issue is the practical limit of search tools for filtering ever-larger collections of data, along with the physical infrastructure that makes it possible to review more documents electronically than ever before. The standard metrics for evaluating the “goodness” of a search, precision and recall, continue to improve. But improvement on those metrics does not necessarily translate into actually finding the 20 critical documents, because when the number of truly critical documents (the “needle”) stays the same, the ability to find them declines as the database (the “haystack”) grows.
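A small sketch, with made-up numbers, shows the mechanism. If a search keeps finding the same share of the 20 critical documents but also keeps matching a fixed small fraction of everything else, the false positives scale with the haystack while the needle does not, and precision collapses. The recall and false-positive figures below are assumptions chosen only to illustrate the point.

```python
# Illustration (assumed numbers) of a fixed search degrading as the collection grows:
# false positives scale with the haystack, the 20 critical documents do not.

RELEVANT = 20                # the "needle": critical documents (from the article)
RECALL = 0.80                # assumed: the search finds 80% of the critical documents
FALSE_POSITIVE_RATE = 0.01   # assumed: 1% of irrelevant documents also match the search

for collection in (10_000, 100_000, 1_000_000):
    true_hits = RELEVANT * RECALL
    false_hits = (collection - RELEVANT) * FALSE_POSITIVE_RATE
    retrieved = true_hits + false_hits
    precision = true_hits / retrieved
    print(f"{collection:>9,} docs: retrieve {retrieved:>9,.0f}, "
          f"precision {precision:.2%}, still only {true_hits:.0f} needles")
```

The same search that looks respectable against a 10,000-document collection buries the needle under roughly 10,000 retrieved documents once the haystack reaches a million.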

Results in higher costs for advanced search tools

Another way to express this characteristic of searching is to add a third metric: utility. The utility calculation assigns a per-item cost to “missed” relevant documents and to retrieved-but-irrelevant documents. A given search applied to a bigger haystack will generate more of both, so one simple way to make the utility metric worse in any situation is to assemble the largest possible haystack around the 20 documents that actually matter. The normal biases in commercial cases push in that direction. Counsel in high-stakes cases assign high values to “missed” important documents because that cost is going to be assigned to them. Counsel tend to assign low or no cost to retrieved but irrelevant documents because those extra handling costs are usually assigned to the client. Client representatives may likewise worry more about accusations that they failed to forward critical information than about the efficiency of the search process, and the electronic format makes it physically much easier to “send everything” than to be selective. They may even feel they are helping their case by imposing a lower-utility search on the opponent.
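A minimal sketch of a per-item utility calculation of the kind described above makes the incentive problem visible. The dollar figures are purely illustrative assumptions, chosen only to reflect the asymmetry the author describes: a heavy penalty on counsel for a missed document, a light and largely externalized cost for each irrelevant document retrieved.

```python
# A minimal sketch of a per-item utility metric, with assumed costs.
# Missed relevant documents are costed heavily (counsel bears that risk);
# irrelevant-but-retrieved documents are costed lightly (the client bears review cost).
# Bigger haystacks generate more of both, so utility falls.

COST_PER_MISS = 10_000     # assumed cost assigned to each missed critical document
COST_PER_FALSE_HIT = 2     # assumed cost assigned to each irrelevant document reviewed

def utility(missed_relevant, irrelevant_retrieved):
    """Negative total cost of a search outcome: closer to zero is better."""
    return -(missed_relevant * COST_PER_MISS
             + irrelevant_retrieved * COST_PER_FALSE_HIT)

# Same search quality, two haystack sizes (numbers are illustrative only).
print(utility(missed_relevant=4, irrelevant_retrieved=1_000))    # -42,000
print(utility(missed_relevant=4, irrelevant_retrieved=10_000))   # -60,000
```

Note that if counsel sets the per-item cost of irrelevant documents to zero, because the client pays for the extra review, the metric no longer discourages over-collection at all, which is exactly the bias the article describes.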

In most cases, the client has a good understanding of where those 20 critical documents are likely to be found. Initial review of that core information establishes relationships with the key people involved in the case and educates the lawyer about how the documents fit together, before the discovery machinery becomes so full of data that context and core files are lost. A comprehensive review of those files at the beginning, before large “haystacks” have been assembled, is likely to capture 80 percent or more of the 20 key documents. This allows your legal campaign to be focused and effective from the start, and your opponent may be surprisingly receptive to accepting a much smaller initial production, and might even agree not to demand assembly of the large haystack at all.
