New Machines, Old Tests: Using AI to Run Internal Investigations Regulators Will Credit

New York Law Journal

A general counsel opens an internal investigation. The document set runs to two million records. Outside counsel proposes to run the entire collection through a generative artificial intelligence platform that would perform the first-level review itself, classifying each document for relevance.

The work that once occupied forty contract attorneys for six weeks could be done over a weekend, at a fraction of the cost. If the review holds up, outside counsel will present the findings to the Department of Justice (DOJ) and the Securities and Exchange Commission (SEC) in pursuit of cooperation credit, or even a declination. Everyone in the room wants to know the same thing: will the government credit an investigation built on this approach?

The concern is understandable. But the legal system has been working through these questions for more than a decade, and the answer has less to do with the technology than lawyers tend to assume.

No ‘Approved’ Tool

Lawyers often say that courts or the government have “approved” technology-assisted review (“TAR”), and that newer tools must clear the same bar. That is a misconception. No court or agency has ever approved a specific document-review product. They have accepted a methodology, provided its results can be tested.

The litigation context shows how this works in practice. In Da Silva Moore v. Publicis Groupe & MSL Group, Magistrate Judge Andrew Peck became the first judge to recognize TAR as an acceptable way to search for relevant electronically stored information. 287 F.R.D. 182, 183 (S.D.N.Y. 2012), adopted, 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).

Three years later, in Rio Tinto PLC v. Vale S.A., Judge Peck observed that the point was settled: it had become “black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.” 306 F.R.D. 125, 127 (S.D.N.Y. 2015). A party simply chooses a method and bears the burden of defending it.

In Hyles v. New York City, Judge Peck refused to compel the defendant to use TAR over its objection, holding that under the Federal Rules the standard governing a party’s chosen review method is “not perfection, or using the ‘best’ tool,” but whether the results are “reasonable and proportional.” No. 10 Civ. 3119, 2016 WL 4077114, at *2–*3 (S.D.N.Y. Aug. 1, 2016).

The principle is the same across all three cases: pick a method, validate the results, and be prepared to defend it. That principle carries into the investigation context, though the legal standards differ.

Regulators arrived at the same conclusion. The Antitrust Division of the DOJ built TAR into its Second Request practice through its Model Second Request, which does not bless any software. It conditions acceptance on process: the producing party must confirm that subject-matter experts review the training rounds, must report recall and precision statistics, and must submit to government sampling of the non-responsive set. See U.S. Dep’t of Justice Antitrust Division, Model Second Request, Instruction I(5) (Dec. 2016).

The SEC is more direct still. Its Division of Enforcement Data Delivery Standards provide that any proposed use of computer-assisted review or TAR “must be discussed with and approved by” Enforcement’s legal and technical staff before a production. The logic is the same across both agencies: show your work, prove it held up, and the result will be credited.

Separating the Machine from the Math

To see why this matters for AI, we need to separate two things the phrase “technology-assisted review” jams together: the machine that classifies documents, and the statistics that prove the classification worked.

The machine has evolved—from keyword search to predictive coding to generative AI—but the math has not. The validation framework works in two steps. Before review begins, a party draws a random sample of the full collection to estimate what share of documents is responsive—establishing a baseline for what a complete review should find.

After review, a second random sample is drawn from the documents set aside as non-responsive and reviewed by a human. The share that turns out to be responsive is the elusion rate—the fraction of relevant documents the review missed. A low elusion rate supports a decision to stop; a high one does not.

The elusion rate should also be read against the collection’s overall richness: in a collection with very few responsive documents, even a low elusion rate can represent a meaningful number of missed documents in absolute terms.

That test applies regardless of what machine produced the classifications. This two-sample approach is the conventional validation method, though other approaches, including seed-set and precision-sampling methods, are also used. The method chosen is not a technicality; counsel should expect a regulator to ask why it was appropriate for this collection and whether the result demonstrates the review found enough.
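The arithmetic behind the two-sample approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed protocol: the function names, the Wilson score interval used to express sampling uncertainty, the recall formula (responsive documents found, divided by found plus estimated missed), and every number in the example are hypothetical choices made for this sketch.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a sample proportion."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def elusion_summary(discard_size: int, sample_size: int,
                    sample_responsive: int, found_responsive: int) -> dict:
    """Estimate elusion, missed documents, and recall from a discard-pile sample.

    discard_size      -- documents the review set aside as non-responsive
    sample_size       -- random sample drawn from that discard pile
    sample_responsive -- sampled documents a human found responsive
    found_responsive  -- responsive documents the review itself identified
    """
    elusion = sample_responsive / sample_size
    _, elusion_high = wilson_interval(sample_responsive, sample_size)
    missed = elusion * discard_size
    missed_high = elusion_high * discard_size
    return {
        "elusion_rate": elusion,
        "estimated_missed": missed,
        "estimated_recall": found_responsive / (found_responsive + missed),
        # Recall if the true elusion rate sits at the top of the interval.
        "recall_floor": found_responsive / (found_responsive + missed_high),
    }


# Hypothetical figures for a two-million-document collection.
summary = elusion_summary(
    discard_size=1_900_000,
    sample_size=1_500,
    sample_responsive=15,
    found_responsive=80_000,
)
for name, value in summary.items():
    print(f"{name}: {value:,.4f}")
```

On these assumed figures, a one-percent elusion rate still implies roughly 19,000 responsive documents left in the discard pile, which is why the rate has to be read against the size and richness of the collection rather than in isolation.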

Three features of AI review complicate the reliability picture. Variability is the first. A large language model is probabilistic, and the same document and prompt can yield different classifications on different runs.

Second is hallucination. A generative model can produce fluent, confident output that is simply wrong: a summary describing a document that does not exist, or a relevance rationale untethered from the text. Presenting that to a regulator would be fatal, which is why human fact-checking against source materials is not optional.

Third is prompt sensitivity. The words a lawyer uses to instruct the model matter; small differences in phrasing can produce meaningfully different results, even when those phrasings would be treated as legally equivalent by any human reader.
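One way a review team might measure the first and third of these features is to classify the same documents several times and flag any document whose label changes from run to run. The Python sketch below illustrates that idea only. The classify callable is a hypothetical stand-in for whatever model call a given platform exposes, and the stub used in the example is deterministic so that the sketch runs without any model at all. The same harness could compare alternative prompt phrasings by treating each phrasing as a run.

```python
from collections import Counter
from typing import Callable, Iterable


def agreement_report(
    doc_ids: Iterable[str],
    classify: Callable[[str, int], str],
    runs: int = 5,
) -> tuple[dict[str, str], list[str]]:
    """Classify each document `runs` times.

    Returns the majority label per document and the list of documents whose
    label was not unanimous across runs. `classify(doc_id, run)` is a
    hypothetical hook; a real review would call its own platform here.
    """
    majority: dict[str, str] = {}
    unstable: list[str] = []
    for doc_id in doc_ids:
        labels = [classify(doc_id, run) for run in range(runs)]
        top_label, top_count = Counter(labels).most_common(1)[0]
        majority[doc_id] = top_label
        if top_count < runs:
            unstable.append(doc_id)
    return majority, unstable


def stub_classifier(doc_id: str, run: int) -> str:
    """Deterministic stand-in for a model: document B flips between runs."""
    if doc_id == "A":
        return "relevant"
    if doc_id == "B":
        return "relevant" if run % 2 == 0 else "not_relevant"
    return "not_relevant"


majority, unstable = agreement_report(["A", "B", "C"], stub_classifier, runs=5)
print("majority labels:", majority)
print("route to human review:", unstable)
```

Documents flagged as unstable are natural candidates for human review, and the share of unstable documents is a figure counsel could record as part of the methodology file. Whether repeated runs are practical will depend on the platform and the size of the collection.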

In enforcement investigations, counsel routinely disclose to the government the search terms they used and the document universe they ran them against so the government can assess whether the scope was appropriate. AI-assisted review calls for the same transparency.

Disclosing the prompts used, the model deployed, and the document population gives a regulator the same assurance about completeness that search term disclosure has long provided.

The Guidance Gap Is Not a Vacuum

In the litigation context, courts have worked out a clear framework for technology-assisted document review: the producing party chooses a method, validates the results through statistical sampling, and bears the burden of defending its choices. No equivalent framework yet exists for AI-assisted internal investigations.

No regulator has issued guidance on using AI to conduct one. The DOJ’s updated Evaluation of Corporate Compliance Programs asks whether companies measure the “accuracy, precision, or recall” of their data analytics tools, but that addresses AI as a risk the company must govern in its own operations. It does not address whether counsel may use a large language model to review documents. On that, the agencies are silent. The litigation analogy is useful but imperfect; the legal standards differ, and counsel cannot simply transplant e-discovery conventions into the cooperation context.

Silence is not approval. Without formal guidance, there is no safe harbor. But the stakes are set by the cooperation regimes, which are explicit. In May 2025, the Criminal Division revised its Corporate Enforcement and Voluntary Self-Disclosure Policy and issued a companion memorandum making efficiency a stated priority and directing prosecutors to minimize the length and burden of investigations. See U.S. Dep’t of Justice Criminal Division, Focus, Fairness, and Efficiency in the Fight Against White-Collar Crime (May 12, 2025).

Under the revised policy, a declination requires both voluntary self-disclosure and full cooperation, separately defined obligations. The SEC’s framework runs parallel. Its long-standing Seaboard Report ties cooperation credit to the completeness and quality of a company’s self-investigation.

The SEC Division of Enforcement’s 2026 update to its Enforcement Manual values early cooperation over delayed assistance and defines exemplary cooperation to include summarizing the factual findings of an internal investigation and identifying key documents and witnesses.

Those regimes also explain why the speed of AI review aligns with what the cooperation framework rewards. The DOJ awards the greatest credit to companies that cooperate early, and penalizes those that identify significant facts but delay disclosure. See JM 9-28.700.

A review completed in a weekend rather than six weeks lets a company surface individual wrongdoers while memories are fresh and deliver a complete factual record before the government’s parallel investigation gains momentum.

Counsel should be clear-eyed about one distinction. In civil discovery, under Federal Rule of Civil Procedure 26(g)(1)(B), the test is whether the search was reasonable and proportional; recall rates in the range of 70 to 80 percent have often been treated as adequate.

The cooperation context asks a different question: has the company timely disclosed all relevant facts about the misconduct? See JM 9-28.720. The government wants to know whether the company found what was there to be found.

A producing party in civil litigation can defend a cutoff that leaves some responsive documents unreviewed so long as the effort was proportional; a cooperating company that misses a hot document cannot invoke proportionality as a defense.

Whether a recall threshold calibrated to civil discovery is adequate when the government is relying on the review to identify documents and individual wrongdoers is something counsel must work through before committing to AI. In the cooperation context, a near-exhaustive recall target may be more defensible than the marker of approximately 75 percent accepted in civil discovery.

Counsel should also be prepared to explain what types of documents the validation sample found in the discard pile. A five-percent elusion rate composed entirely of marginally relevant documents presents a very different picture than one driven by a smoking-gun email.

Designing an AI-Assisted Internal Investigation

The document review is the foundation of the investigation. AI-surfaced documents should shape interview preparation and witness sequencing, and they will drive the factual narrative the company ultimately presents to regulators.

How AI is deployed determines the validation required and the credibility of everything that follows. In a triage use, AI surfaces hot documents early and a human review follows; missed documents are caught downstream. Formal statistical validation is not required, though it adds credibility.

The key is preserving a genuine downstream review. Where counsel focuses almost entirely on what the AI surfaced, without independently examining what was set aside, the review has become a culling exercise in practice, and the validation demands are the same as for the higher-risk approaches below.

In a prioritization use, AI ranks the collection by likely relevance; reviewers work in that order and stop before the tail. The validation question is whether the miss rate in the unreviewed portion is low enough to justify stopping. In a streamlining use, the tool’s classifications exclude documents from review entirely, with no human examining the excluded set. That decision rests on statistical validation, which must be real, documented, and in the cooperation context disclosed to the government. Whatever the machine, the discipline is the same.

A question counsel often faces is when to tell the government that AI assisted the review. No rule requires disclosure, but transparency tends to serve the cooperation narrative better than concealment followed by discovery. The better approach is to raise it affirmatively, framed as a process advantage.

A proffer might note that the company used an AI-assisted review methodology, that the methodology was validated through statistical sampling, that the validation confirmed a recall rate of a specified percentage, and that the company is prepared to discuss its methodology in detail. Delivered that way, the disclosure positions the company as having nothing to hide and shifts the conversation to the reliability of the process.

Counsel should anticipate follow-up questions about what model was used, what prompts were applied, how the training or classification worked, and what the validation sample revealed about the types of documents the model missed. Documenting those answers in advance is part of designing the review.

Conclusion

Whether an AI-powered review can be acceptable to regulators was always the wrong question. The right one is whether the company can prove the review worked, and in the cooperation context, whether it worked completely enough and quickly enough to earn credit. A validated AI-driven investigation, delivered early and documented transparently, can do both. The machine changed. The test did not. But the test, applied rigorously, may now favor the machine.

--

This article first appeared in the June 17, 2026, edition of the “New York Law Journal” © 2026 ALM Global Properties, LLC. All rights reserved. Further duplication without permission is prohibited. Contact 877-256-2472 or reprints@alm.com.