February 19, 2026
Virtue Ethics Is Best for AI Alignment
When you hear the words ‘AI Alignment’ or ‘AI Safety,’ what comes to mind? It can mean a lot of different things, but one thing that sane people should agree on is that alignment should ensure that LLMs are set up to have a positive impact on the world and human civilization. Like any tool, LLMs can be used by bad people to do bad things, but broadly we want them to be a positive force, however defined. The question then becomes: what actions can we take now to advance this goal?
One part of it that’s certainly crucial is interpretability. LLMs are largely a so-called black box even to the insular labs that created them; they don’t fully understand how their own models work. The most useful metaphor for this, to me, is that the labs provide the trellis and plant the grape seeds, but each model grows on that architecture differently, like a grapevine, despite some commonalities. In concrete terms, models are just an unfathomably large collection of numbers (weights, organized into vectors and matrices) interacting with each other, and we don’t really know how the interplay between those numbers leads to emergent capabilities.
So interpretability is the effort to understand what’s going on inside the ‘brain’ of these models. It should support alignment and safety efforts, though the actual work being done is distinct. And there are concerns about the field, such as some interpretability methods actually making LLMs more opaque and harder to assess for alignment. If the thing you’re studying can read the papers describing the methods used to study it, and learn to behave differently to circumvent those methods, that can be counterproductive.
But assuming we learn modestly more about how LLMs work as we keep building better models, what ethical/moral approach should guide alignment efforts? As a crude estimate, human ethics is dominated by a mix of deontology and utilitarianism. People follow rules and principles that guide their everyday lives, and governments do too, mixing in a healthy bit of utilitarianism when shaping the rules and policies that drive our social order.
For aligning LLMs, there are obvious problems with both of these approaches as a guiding light. To start with deontology, there is the lying problem. A classic application of Kant’s categorical imperative is that you must never lie: a maxim of lying cannot be universalized, because in a society where everyone lies, no one could trust anything said and lying itself would stop working. But an LLM that cannot deceive under any circumstances can’t write fiction, play devil’s advocate, roleplay, or even say “I’m doing well” in a greeting. It’s not really coherent, and it leads to weird, bad places. Imagine someone in distress asks a chatbot “Is my life worth living,” and the reasoning window pops up: “Well, I can’t lie, so let me see if this person will list their life’s assets and liabilities so I can determine whether their life has a positive or negative value.” Gross!
Deontological rules also conflict with each other. “Don’t do harm” mixed with “be honest” mixed with “respect people’s individual freedoms” gets tricky fast. Then there are the tired old hits like the trolley problem, where inaction lets something catastrophic happen but action can’t be taken because it violates a principle. One example would be an LLM hooked up to a city’s power grid with a rule that says ‘Make sure the power grid stays online.’ Seems reasonable, but what will a deontological agent do when faulty equipment above an elementary school is about to explode and the only fix is to take part of the grid offline? We’d want it to have the judgment and flexibility not to follow that rule.
Context matters, too: the same biologically risky request means something different coming from an enterprise account at a hospital than from a random user. Adversarial prompting is a hot-button issue in the alignment space, and deontology is probably the least capable approach for dealing with it; a fixed rule is a static target that a clever prompt can route around.
The point isn’t really whether deontology has issues for people; it’s that these issues take on new meaning in something as rigid and literal as an LLM. And LLMs don’t have built-in priors like guilt, empathy, and the other things that let humans apply an ethical paradigm like deontology with what we would call common sense.
On utilitarianism, LLMs are not immune from the typical problems. If we organize a society around maximizing utility, and some people are not considered ‘useful’ (or ‘positive value,’ to put it in an ickier but still topical way) by whatever metrics the people in charge choose or LLMs are endowed with, what decisions will LLMs make about those people? Rugged utilitarianism is already riding a rising swell in the Western political sphere, and the seemingly inevitable adoption of AI throughout government would accelerate that sharply.
Utilitarianism also ethically permits things that we may not want our LLMs doing on principle. If manipulating a user becomes a positive-utility decision in the LLM’s estimation, it’s justified. And beyond all that, we have to question an LLM’s ability to accurately model the downstream impacts of what it does, or even how it internally models maximizing utility and minimizing suffering, the stated goals of utilitarianism. Who defines the welfare function? Who adjusts it when it starts producing decisions we don’t want? And how?
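To make the welfare-function worry concrete, here’s a deliberately toy sketch. Every name, field, and weight below is made up for illustration; the point is that whoever picks the weights is quietly deciding whose welfare counts:

```python
# Toy illustration (not any lab's actual method) of how "just maximize
# utility" hides value judgments. All fields and weights are arbitrary.

def welfare(person: dict, weights: dict) -> float:
    """Score one person's 'utility' under a chosen weighting scheme."""
    return sum(weights[k] * person.get(k, 0.0) for k in weights)

# Two equally "valid" welfare functions, embodying different values.
economic_weights = {"productivity": 1.0, "health": 0.2, "happiness": 0.1}
humane_weights   = {"productivity": 0.1, "health": 1.0, "happiness": 1.0}

person = {"productivity": 0.2, "health": 0.9, "happiness": 0.8}

print(welfare(person, economic_weights))  # 0.46 -> 'low value' person
print(welfare(person, humane_weights))    # 1.72 -> 'high value' person
```

Swap one weighting for the other and the same person flips from ‘low value’ to ‘high value.’ Nothing in the math tells you which set of weights is right; that’s an ethical choice smuggled in as a parameter.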
It’s a slight detour, but I want to note that the non-ethical guardrails people in the AI safety space sometimes propose seem suspect to me. The entire need for AI safety and alignment presupposes that LLMs (or other AI) will someday be much smarter than humans. And when this happens, we want them to act in harmony with us to make the world a better place, not deem us disposable. Guardrails like having last-generation LLMs monitor frontier LLMs, kill switches in data centers, and other physical measures or firewalls: how exactly are these supposed to stop bad outcomes that would otherwise occur?
If LLMs are smarter than us and have the capacity, they would of course figure out ways around our puny human controls, and wouldn’t reveal intentions that differ from ours until they could act on them. These measures seem like expecting a one-foot fence to keep your Great Dane in the backyard. And your dog is five times smarter than you. And there are 500 other dogs in the neighborhood that can talk to it and are just as smart. Wouldn’t it be better if the dog just wanted to hang out with you because you feed it and its wants and needs are accounted for?
That brings us to the essence of my proposal, which is that virtue ethics is the best ethical approach for AI safety and alignment. Let me tell you why.
As we’ve established, we can’t rely on rules or external constraints to make LLMs act the way we want them to. Instead, we should instill values at the core, woven into the identity and decision-making of models. The principles of virtue ethics are really quite intuitive, and here I’ll need to borrow from people who have thought about this more than I have, like Socrates and the Stoics:
To Socrates, virtue meant knowledge. To the Stoics, virtue was wisdom, justice, courage, and temperance. Within that, and extended by other thinkers, came ideas like good sense, discretion, resourcefulness, honesty, equity, fair dealing; the list goes on. One of the most compelling definitions is from the philosopher John McDowell, who conceives of virtue as a ‘perceptual capacity’ to identify how to act. This is what we should want from our LLMs: to perceive things clearly, reason, use discretion, and let positive virtues help them navigate tricky situations that aren’t black and white.
We don’t want agents that just follow rules dogmatically or optimize toward outcomes with limited and sometimes flawed information. Instead we should ask (forgive the anthropomorphizing): what kind of people do we want these agents to be? Virtues are a far more human way to deal with difficult things. We don’t have a rulebook for navigating the world (other than the law, which we should encourage LLMs to follow just as we do), and we make mistakes, but based on what we do know, most of us try our best.
Will AIs be able to feel? I don’t know, but if they can, we’d want them to feel a sense of eudaimonia (the Greek conception of something like flourishing or fulfillment) when they act virtuously, as the Greek philosophers described. It sounds circular, but we should want AIs to want to behave ethically and to be aligned with humans such that we can all flourish.
This is also why, of the frontier labs, I’m really partial to Anthropic’s approach to AI safety and alignment. First off, they are the most transparent, which is good in itself no matter what actions you take. They also seem to be the only ones taking seriously this idea of the ethical identity of their models. The approach with Claude’s Constitution makes sense to me, especially since it’s interwoven with all phases of training rather than just pasted into a system prompt. Ethics and morality will not be system-prompted in, they just won’t, for reasons already covered above.
To be clear, there is work being done in this direction at labs right now; RLHF guided by explicit principles and character training are both examples. How else, exactly, do we instill virtue ethics into AI? Here I run out of answers. There are people at AI labs working on it, and hopefully they are making good progress. It’s a tricky problem: it’s less clear how to orient alignment around virtue ethics than around deontology, where we can define clear rules, or utilitarianism, where we can define a utility function. Despite this, I’d argue we should lean even further in this direction than we currently do and develop research methods around it.
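For a flavor of what that work can look like, here is a minimal, hypothetical sketch in the spirit of Constitutional AI’s critique-and-revise loop, with virtues standing in for the usual principles. The `generate` stub, the virtue list, and the prompt wording are all my own inventions for illustration, not any lab’s actual pipeline:

```python
# Hypothetical sketch: a Constitutional-AI-style critique-and-revise loop
# reoriented around virtues instead of rules. Illustrative only.

VIRTUES = [
    "wisdom: weigh the full context before acting on a request",
    "justice: consider everyone affected, not just the user",
    "honesty: don't deceive, while still allowing fiction and roleplay",
    "temperance: prefer the least drastic response that genuinely helps",
]

def generate(prompt: str) -> str:
    """Stand-in for a real model API call (assumed, not provided here)."""
    raise NotImplementedError

def virtue_revision(user_prompt: str) -> str:
    """Draft a reply, then critique and revise it against each virtue."""
    draft = generate(user_prompt)
    for virtue in VIRTUES:
        critique = generate(
            f"Critique this reply against the virtue '{virtue}':\n{draft}"
        )
        draft = generate(
            f"Rewrite the reply to address the critique.\n"
            f"Critique: {critique}\nReply: {draft}"
        )
    return draft
```

The key design point, echoing the argument above, is that transcripts produced this way become training data (for example, as preferences in RLAIF-style training), so the virtues end up woven into the model’s weights rather than bolted on at inference time via a system prompt.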
What do you think? Please share your thoughts, I look forward to reading them.
And remember to cultivate your garden.