Decoding and deploying legal judgment
Professional judgment is lawyers’ most valuable and nebulous asset. Articulating its elements could help us evaluate the quality of AI's legal work — or even reconstruct our lawyer formation system.
Lawyers who won’t use Generative AI for legal work have several reasons. One of their more compelling objections is this: Because we can’t rely on AI to perform legal tasks flawlessly, we have to apply strict quality-control measures to whatever it produces — but the cost of applying those measures outweighs whatever benefits the AI might deliver.
Suppose, they say, that a simple legal task would take you an hour. The AI might do the job in one minute; but if you spent 5–10 minutes developing the prompt and another 30–35 minutes reviewing the result, you’ve devoted 45 minutes to the effort. The time savings are modest, and if the AI’s output isn’t good enough, you’ll have to try again and you’ll be worse off than when you started.
That’s a legitimate concern, given that Gen-AI legal output isn’t 100% accurate and won’t become so anytime soon. But there are a couple of assumptions hidden inside that objection — and challenging those assumptions will lead us to some interesting possibilities related, oddly enough, to lawyers’ professional judgment.
The first hidden assumption is that doing a quality-control review of AI output would impose an equal burden on all legal work. But as Ethan Mollick pointed out in an article last week, that’s actually not the case. He suggests three variables that help determine whether it’s worth giving a job to AI: Human Baseline Time (how long it would take you to do the work), AI Process Time (how long it would take to ask for, wait for, and evaluate the AI’s work), and Probability of Success (how often the AI produces output you’ll accept).
A one-hour legal task does run the risk that an imperfect AI attempt could set you back. But suppose you have a complex task, one that would take you 8 hours of Human Baseline Time. If the AI can generate a result in 15 minutes, and even if you spent 30 minutes preparing the prompt and over two hours reviewing and fact-checking the result, that’s just 3 hours of AI Process Time. If the work is acceptable, you’ll have saved 5 hours; even if you need a second attempt, you’ll still be ahead by 2 hours. That’s why, counter-intuitive as it sounds, AI might be better suited to tougher, time-consuming legal tasks than to simpler ones.
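The arithmetic behind these two scenarios can be laid out in a short Python sketch. The function name `net_savings_minutes` and the exact minute splits are illustrative assumptions (for example, 135 minutes of review stands in for "over two hours"); they are not figures taken from Mollick's article. The sketch also assumes that a second attempt costs as much as the first.

```python
def net_savings_minutes(baseline: float, process_per_attempt: float, attempts: int = 1) -> float:
    """Minutes saved versus doing the work by hand; a negative value means time lost.

    baseline            -- Human Baseline Time, in minutes
    process_per_attempt -- AI Process Time for one attempt, in minutes
    attempts            -- number of AI attempts (assumed to cost the same each time)
    """
    return baseline - process_per_attempt * attempts


# Simple task: one hour by hand, roughly 45 minutes per AI attempt.
simple_one_try = net_savings_minutes(60, 45, attempts=1)
simple_two_tries = net_savings_minutes(60, 45, attempts=2)

# Complex task: eight hours by hand.
# Assumed split: 30 min prompting + 15 min generation + 135 min review = 180 min.
complex_process = 30 + 15 + 135
complex_one_try = net_savings_minutes(8 * 60, complex_process, attempts=1)
complex_two_tries = net_savings_minutes(8 * 60, complex_process, attempts=2)

print(simple_one_try, simple_two_tries)    # 15 -30
print(complex_one_try, complex_two_tries)  # 300 120
```

On these assumed numbers, a failed first attempt puts the one-hour task half an hour behind, while the eight-hour task stays two hours ahead even after a second try.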
But don’t forget the third variable, the Probability of Success: how often the AI generates usable output from your prompt. Mollick points to recent benchmark results suggesting that GPT-5.2 achieves about a 72% “win-or-tie” rate on several expert-task benchmarks. (I assume legal-specific AI would do no worse on expert legal tasks than would a general-purpose model.) So roughly speaking, about three times out of four, you’ll come out at least a little ahead by using AI for legal tasks — and sometimes, a lot ahead.
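One way to see how the three variables interact is a rough expected-value sketch. It rests on assumptions that are mine rather than Mollick's: you try the AI once, and if the output is unusable you do the whole task by hand; the 72% benchmark figure is treated as a per-attempt acceptance rate; and the minute values (45 and 180 minutes of AI Process Time) are illustrative. The function names are hypothetical.

```python
def expected_savings_minutes(baseline: float, process: float, p_success: float) -> float:
    """Expected minutes saved under a 'try AI once, else do it by hand' model.

    Expected time with AI = process + (1 - p_success) * baseline,
    so expected savings   = p_success * baseline - process.
    """
    return p_success * baseline - process


def break_even_probability(baseline: float, process: float) -> float:
    """Success rate at which trying the AI first neither gains nor loses time."""
    return process / baseline


P_SUCCESS = 0.72  # assumption: benchmark win-or-tie rate used as an acceptance rate

# One-hour task with about 45 minutes of AI Process Time.
simple_gain = expected_savings_minutes(60, 45, P_SUCCESS)
simple_break_even = break_even_probability(60, 45)       # 0.75

# Eight-hour task with about three hours of AI Process Time.
complex_gain = expected_savings_minutes(480, 180, P_SUCCESS)
complex_break_even = break_even_probability(480, 180)    # 0.375

print(round(simple_gain, 1), simple_break_even)
print(round(complex_gain, 1), complex_break_even)
```

Under this simplified model, the one-hour task roughly breaks even at a 72% success rate, because it needs about 75% to pay off. The eight-hour task needs only about 38%, so it comes out well ahead on average.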
Mollick continues that you can improve those odds in three ways: Give the AI better instructions, evaluate its output more effectively, and give it more useful feedback if the first attempt misses the mark. All three of these methods can be applied in the context of legal work.
Regarding the first, as I’ve previously written, the longstanding practice of “instructing juniors” comes into play. If you can give a junior lawyer clear directions on how to carry out a task and explicit descriptions of the kind of output you’re expecting, then you can also prompt an AI. And the better you get at instructing the AI, the higher your “hit rate” of effective task performance will be.
But it’s the other two ways to improve AI performance — assessing the AI’s output more effectively and giving it better feedback — where things get interesting. That’s because they force lawyers to ask and answer a very difficult question: To what standard of performance are we holding the AI?
This is the second assumption I mentioned above: that we can state with certainty what “good legal work” or “acceptable legal output” looks like. In most cases, I don’t think we can — at least, not until we know the standards we’re using to make that call.
This issue is neither new nor unique to AI. Any young lawyer who’s had a draft handed back by a partner, covered in red ink, knows what it’s like to be graded against a rubric you didn’t know existed until it was applied to you. What’s worse, however, is when you discover that the next partner’s rubric is different from the first, as is the next, and so on. This is how lawyers learn that in real-world practice, quality is contextual: whether your work is “good enough” depends on who’s reviewing it.
Now apply that to AI and scale it up to law firms. The AI takes instructions and dutifully produces output; but every lawyer who reviews that output will use a different standard for “acceptable professional work,” and will therefore give different comments and follow-up directions. That’s because each lawyer is applying their own legal judgment — their invisible and inchoate internal framework for deciding if a contract, opinion letter, or any other legal work product is fit for purpose.
The problem for any law firm with more than one or two lawyers is that it simultaneously maintains several competing versions of legal judgment. Ask ten partners to assess the same work product and you’ll get ten different assessments — many overlapping, a few in sharp disagreement. And if everyone in a law firm has their own standard for judging work, then the firm doesn’t really have any “standard” at all.
This is the issue flagged by Josh Kubicki in a recent edition of his newsletter. Firms are experimenting with AI, but they don’t have a consistent internal benchmark to gauge the results: “They rolled out AI without ever defining the exam it needs to pass.” Law firms ask AI vendors about performance benchmarks, but they neglect to define what “acceptable” means for their own lawyers — a test that usually comes down to: “Would I send this to my client?”
In law firms, lawyers routinely make critical decisions about clients’ interests using their personal legal judgment. That longstanding practice is now extending to decisions about whether AI’s legal work product meets a law firm’s quality threshold — it will all depend on the lawyer reviewing it.
That’s the consistency gap Josh would like law firms to close, by articulating their standards and ensuring that their AI systems meet them. That way, there’ll be less imprecision and guesswork about whether the AI’s output will be acceptable to the firm. His article offers several methods by which law firms can do that; if they succeed, they’ll have taken a big step forward in using AI for legal work more confidently.
But what I find really tantalizing about this line of thinking is the possibility that we could come to more fully understand exactly what constitutes lawyers’ professional legal judgment. If we can articulate the elements of that judgment, surfacing and isolating and identifying its components, we could do more than just improve quality control for AI output. Among other things, we could design and create a whole new approach to lawyer formation — not least because AI itself is in the process of demolishing the old one.
Look closely at the nature of legal judgment. Educational psychologists will tell you that professional judgment is an emergent capability, one that’s formed in professional workplaces through (a) pattern recognition built from repeated, varied exposure to cases, (b) decision-making under conditions of uncertainty, (c) timely feedback from senior practitioners, and (d) reflection that turns tacit experience into more explicit mental models.
This should sound familiar to lawyers. Ask senior practitioners how they developed good judgment and you’ll hear something like: “It grew over the course of time — doing a great deal of legal work, learning by trial and error, and getting direction from senior lawyers until eventually, I got the hang of it.” That’s how we’ve always developed legal judgment, however imperfectly, in this profession.
But as AI displaces lawyers from the performance of more legal tasks, it will also narrow or even remove that first critical step in the judgment-development process: the steady diet of basic work on which young lawyers learned, along with the leverage-based economics that justified hiring them. Tomorrow’s lawyers will need strong legal judgment — given the migration of legal task performance to AI, it will be essential for the high-level services they’ll offer — but the traditional pathway by which their predecessors developed that judgment will fade away.
We’re going to need a new approach to lawyer formation — one that can, among other things, take over and accelerate the development of legal judgment early in lawyers’ careers. But we can’t teach what we can’t describe: If we’re going to make “legal judgment development” an explicit element of lawyer formation, we need to know what that judgment actually consists of.
And this is where we circle back to law firms and their AI evaluation efforts. Suppose that law firms, while building standards for “client-ready” AI output, do manage to surface and articulate the elements of their lawyers’ professional judgment. We could study these components, examine how they emerge and how they’re applied, and figure out how new lawyers can start to acquire them without spending ten years as a law firm associate. By observing what experienced lawyers consider when deciding whether AI work is fit for purpose, we could decode the elements of legal judgment and build them into academic and experiential learning.
Imagine what we could do with verifiable, shareable, teachable components of legal judgment. They could help build the foundation of a post-AI lawyer formation system that doesn’t need the old billable-associate development model. We could help lawyers develop their most important professional asset from day one of their careers.
If we really want to ensure that lawyers develop professional legal judgment sufficient to escape AI’s immense gravitational pull, we need to know exactly what that judgment consists of. And if law firms actually succeed in extracting and articulating the components of legal judgment while testing their AI systems, that would be a nicely ironic twist in the ongoing story of legal sector disruption.



Jordan — once again you’ve hit it out of the park. That said, perhaps that assessment is due to our views being quite consistent.
I’ve long said that the only unique thing about lawyers is their judgment, and that legal service delivery should involve producing an output for the customer through the use of standardized operations (steps) tied together in a process. That, of course, is susceptible to the creation and application of SOPs for different types of legal outputs. In this paradigm, the lawyer is the process architect and process owner; they are NOT the operator, or at least not the only operator for each step in the process. Lawyers should only perform (“operate”) those steps that actually require the application of legal judgment. It’s not that a lawyer cannot perform those other steps, it’s that it’s a misuse and misapplication of a high-cost productive asset. In other words, waste.
But in this column, you ask the really relevant question here of what, exactly, is a “lawyer’s judgment”? That’s of course critical to determine which process steps the lawyer should operate.
I’ve thought pretty deeply about this, and its intersection with AI. As AI tools become better and richer, it’s likely that AI-generated outputs can be confused with applied legal judgment. That, however, is both right and wrong. Right, because generative and agentic AI really just provide the statistical norm of output to the question. Wrong, because what’s generally done might not be appropriate in the specific context. As such, to me, “judgment” comes down to the situation where there is no clear answer, where right/left/stay/go could all be appropriate and are not “wrong” but neither are they “right” in the given context. Determination and recommendation of that course of action is, at least to my mind, judgment.
It's an elegant argument and model, but does the actual airplane fly? Change the metaphor: we agree that GenAI changes the soil in which new lawyers grow. Law professors (not to mention philosophers) have been trying to make the elements of "judgment" explicit for centuries. My favorite relatively recent effort is this piece, from 1998: https://repository.uclawsf.edu/faculty_scholarship/3/ The author, Mark Aaronson, was a clinician and a practitioner, and "judgment" has long been the coin of the clinicians' realm. Maybe, however, "judgment" isn't a "thing," that is, a product to be cultivated, like a vegetable. Maybe it's a process. Maybe the challenge is not to turn tacit knowledge into explicit knowledge, but to change the what and the how of tacit knowledge production. At the end of the day, clients and society value expertise and trust, combined; sometimes, trust goes up when the expertise is black-boxed. Do we value the gardener for the gardening or for the garden?