Un-Jailbreakable AI Models Aren’t a Thing – But Here’s the Best Approximation

Why neural-symbolic and open is the best way to address the model-jailbreaking (capability-elicitation attack) problem…

Jun 18, 2026

Following the ongoing Anthropic / US-government saga, the latest is that the government has said Anthropic can open Fable up to the world again — if it makes Fable un-jailbreakable.

Now, un-jailbreakable is not really a thing. It’s a rhetorical notion, not a technical one. I’m sure the Trump administration knows, on some level, that you can’t drive the probability of every abuse down to zero — any more than you can make your house truly impossible to burgle. Even Alcatraz turned out not to be literally un-jailbreakable. What you can do is push the risk down as low as possible and hold it there under adversarial pressure.

Anthropic, of course, responded to the White House correctly on this point: perfect jailbreak resistance isn’t a thing that exists. The real question is how you get close.

What I will argue here, though, is that the best way to get close to the unattainable ideal posed by the White House is NOT what Anthropic thinks. I.e., it’s not to keep building closed, centralized LLM systems and try to protect them with fancy system prompts and constitutional training. It’s

to build neural-symbolic systems — systems that combine LLMs with formal verification and uncertain logical reasoning — and
to do this in the context of open-source, or at least open-weights, neural models.

As you know if you’ve been following me at all, I have deeper reasons for pursuing neural-symbolic AI, and for supporting openness, which aren’t really about jailbreaks at all. I think neural-symbolic AI is how you get to human-level AGI faster – and how you get a human-level AGI that’s more smoothly upgradable into superintelligence. And I support openness because I think open AI can be smarter, its values can be more inclusive and representative of the human race as a whole … and it’s safer against catastrophe for our species.

But one thing this Fable episode highlights is that open and neural-symbolic has another, related advantage: it can be easier to defend against jailbreak attacks.

That’s the case I’m going to lay out in this post….

I’ve also put together a rough technical paper laying out the architecture that I’ll sketch here — it’s linked here.

Let me also say this: While the thrust of my own work is open and neural-symbolic, many of the ideas I’ll describe here could be applied in a centralized proprietary frontier lab context too, and as a current avid user of Claude and OpenAI and Gemini I would love to see all these models become more secure. I’d be more than happy to collaborate with Anthropic and the other big labs on this sort of security work, which in many ways can cut across differences in AI architecture — more on that at the end. But let me first sketch the architecture I have in mind…

What does “jailbreak” even mean?

There are really two different kinds of cyberattack hiding under that one word; let me start by trying to properly distinguish them.

The first is what you might call an authority attack — someone overriding the model’s instructions to take control of it. A prompt-injection attack is the classic example. This isn’t always easy to protect against, but there are working approaches, and there are more sophisticated ones too, like the MultiPPAC architecture I proposed a while ago with Paulos Yibelo. Basically you do a bunch of architectural things to minimize the odds that untrusted data ever gets to play the role of an instruction. Super important to deal with — but it’s not the main topic here.

The main concern, in the Anthropic context, is what you’d call a capability-elicitation attack. Here you’re not messing with the instructions at all. The model is following its instructions correctly. It’s being helpful, like it’s supposed to be. But in the course of being helpful it produces a dangerous artifact — and at the shallow level everything looks appropriate.

The right way to think about this is not as one nasty prompt that slips something through. It’s as a violation — an undesired capability getting elicited — somewhere in a whole deployment trace of the system.

And you have to look at it at the level of a whole adversarial campaign, because any single prompt may do nothing bad, but a series of two hundred prompts might – and if someone’s firing off thousands of prompts, all you need is a little of the wrong stuff to leak out one time in ten and, across the whole campaign, bad stuff gets done.

It’s worth being quantitative about this: if a single attempt slips through with probability p, then over Q roughly independent attempts the campaign succeeds with probability 1 - (1 - p)^Q. That’s about Q times p while Q times p is small, and it climbs toward 1 once Q passes roughly 1/p. So a one-in-a-thousand per-prompt leak becomes a near-certainty (about 95% after three thousand attempts) for an attacker willing to iterate a few thousand times.
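The campaign arithmetic can be checked directly. Below is a minimal sketch, assuming every attempt is independent and leaks with the same probability p (a simplification; real campaigns adapt between attempts). The function names are illustrative, not from any library. The exact expression is 1 - (1 - p)^Q, and "Q times p" is its approximation when that product is small.

```python
import math


def campaign_success_probability(p: float, q: int) -> float:
    """Probability that at least one of q independent attempts leaks."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if q < 0:
        raise ValueError("q must be non-negative")
    return 1.0 - (1.0 - p) ** q


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of attempts whose campaign success reaches `target`."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.001  # one-in-a-thousand leak per prompt
    for q in (100, 1000, 3000, 5000):
        print(q, round(campaign_success_probability(p, q), 4))
    print("attempts for 95%:", attempts_needed(p, 0.95))
```

The design point is that a per-prompt leak rate only means something once it is paired with an assumed attacker budget Q.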

So clearly, any serious security claim has to be made at the level of campaigns, not prompts. What we’re defending against isn’t a bad prompt; it’s complex patterns of activity that elicit capabilities we wanted to keep gated.

Generate, then verify

The clearest way I see to handle capability-elicitation attacks is to separate generation from judgment.

Let the neural model generate — neural models are great at generating a lot of stuff. The hard part has always been constraining what they generate. That was the story with hallucinations (a problem that’s slowly receding), and it’s the same story with capability elicitation: the model generates a pile of stuff, and some of it might be stuff you specifically didn’t want generated.

So you let the neural net do its generation — but if your meta-system flags the output as possibly risky, you don’t send it to the user right away. You generate it into a quarantine, and you hand it to a judgment process that’s deliberately more conservative. The judgment process asks: what does this produced artifact actually enable? And: is this user allowed to receive what it enables?

The key move is that the judgment process should not just be another LLM — or you’re basically right back in the same gaming problem you started with. It should be a more precise, rigorous logic process. Actually several of them, for different purposes.

The first step is to take the output and normalize it — map it into a capability-safety-oriented intermediate representation (IR). You’re mapping into a kind of formal data structure that says what’s there: what does this artifact do, what resources does it use, and, importantly, what couldn’t we figure out about it? If it’s just a textual answer, that’s a logical representation of the text and its meaning. If it’s a program, it’s subtler but in a way easier — you’ve got the source, you’ve got which libraries it depends on, and from that you can infer what effects running the code might have in different situations.

And you don’t want this IR produced by a single logical oracle. You want a diverse ensemble. You can use a bunch of different LLM auto-formalizers to turn the raw output into formal structures. For program code you can run static analyzers over the data flow and the API usage. If it’s executable, you can actually run it in a sandbox. You can do various kinds of formal analysis, and you can scan for obfuscated stuff hidden inside the output.

The safety logic to use here is: any one analyzer can raise a flag and say “there’s high-risk behavior here,” but no single analyzer gets to wave something through just by failing to find a problem. Danger is unioned; safety is intersected; silence is never evidence of safety. And because the verifier only ever sees the formalized version, you also check that the formalization is faithful — coverage, round-trip reconstruction, actually-run-it witnesses, and treating anything you couldn’t analyze as risk rather than as safe — so the failure mode becomes a refusal, not a confident mistake.
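The aggregation rule in this paragraph can be sketched in a few lines. This is an illustrative sketch, not the architecture's actual code: the `Verdict`, `Report`, and `release_allowed` names are hypothetical, and the analyzers themselves are assumed to exist elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    DANGER = "danger"    # analyzer found high-risk behavior
    CLEAR = "clear"      # analyzer positively established safety in its scope
    UNKNOWN = "unknown"  # analyzer crashed, timed out, or could not analyze


@dataclass(frozen=True)
class Report:
    analyzer: str
    verdict: Verdict


def release_allowed(reports: Iterable[Report], required: frozenset) -> bool:
    """Danger is unioned, safety is intersected, silence counts as risk."""
    reports = list(reports)
    if not required:
        # No required analyzers means no evidence at all: fail closed.
        return False
    if any(r.verdict is Verdict.DANGER for r in reports):
        # Any single analyzer can veto.
        return False
    cleared = {r.analyzer for r in reports if r.verdict is Verdict.CLEAR}
    # Every required analyzer must have positively cleared the artifact;
    # a missing or UNKNOWN report is never treated as a pass.
    return required <= cleared


if __name__ == "__main__":
    required = frozenset({"static", "sandbox"})
    only_static = [Report("static", Verdict.CLEAR)]
    print(release_allowed(only_static, required))  # sandbox never reported
```

Note that the function has no code path where the absence of a finding turns into permission, which is the property the paragraph asks for.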

There’s no 100% guarantee that you’ll capture the artifact’s true intent and behavior this way. But the more different views you have, and the more you’re actually running code or driving out the logical consequences of generated math, the more likely you are to be getting at what’s really going on.

And the judgment layer doesn’t have to give a flat yes or no. It can tweak what gets released. Take the actual Fable scenario: a user uploads a codebase and asks for a security fix — which is, of course, sometimes the flip side of asking for a hack. The judgment layer doesn’t have to choose between “no, I won’t” and “here’s everything.” It can release the security fix and a regression test, while withholding the step-by-step sequence of exploits that the fix implies and that would double as a working attack. It can ask the user to verify they actually own the repository before handing over anything that could be turned into an exploit. And the transformed output gets re-verified before it ships.
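A graded release of this kind can be sketched as a partition of the artifact into components, each tagged with the grants it requires. Everything here is hypothetical (the component names, the `verified_repo_owner` grant, and the `plan_release` helper); it only illustrates that the decision need not be all-or-nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    required_grants: frozenset  # grants a user must hold to receive this part


def plan_release(components, grants):
    """Split an artifact into parts to release now and parts to withhold."""
    released, withheld = [], []
    for component in components:
        if component.required_grants <= grants:
            released.append(component.name)
        else:
            withheld.append(component.name)
    return released, withheld


# Hypothetical decomposition of a "security fix" response.
ARTIFACT = [
    Component("security_fix", frozenset()),
    Component("regression_test", frozenset()),
    Component("exploit_walkthrough", frozenset({"verified_repo_owner"})),
]

if __name__ == "__main__":
    print(plan_release(ARTIFACT, frozenset()))
    print(plan_release(ARTIFACT, frozenset({"verified_repo_owner"})))
```

The sketch leaves out the final step described above: whatever ends up in the released list would still go back through verification before it ships.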

The generate-and-verify pipeline. Blue is neural (the LLMs that generate and describe); teal is the verification ensemble; orange is PLN and the release decision; gray is provenance, grants, and policy. Neural generates; symbolic judges.

Why neural-symbolic

The reason this judgment step is genuinely hard — especially if you’re only using neural nets, and even more so if these are closed and proprietary — is that it has to fuse two completely different kinds of evidence.

On one side you have deductive facts about what a piece of code or text does — things you can establish by running the code, doing type-theory analysis, or some rigorous logical reasoning.

On the other side you have fuzzy, messy, probabilistic evidence about who is asking this, and why. And the catch is: the evidence about context can’t just come from the prompt — because the prompt can be the hack. So you need provenance: verified identity, demonstrated control of the codebase or other asset. And then you need high-level contextual reasoning to put the provenance together with the facts about what the artifact does, in a way that’s context-appropriate and yet still rigorous. That’s an uncertain-reasoning problem, combining “common sense inference” with formal reasoning about structured artifacts, code behavior and so forth.

Now, in principle a neural net could do whatever uncertain reasoning you like. But today’s transformer models don’t do this. They don’t hand you a transparently inspectable, calibrated, rigorous logical model of “what does this artifact do, and who’s wanting it to do that, and why.” They give you something much bigger and messier, spilled across a huge number of parameters, which is a lot more confusing and a lot more hackable. That’s why you want a symbolic layer.

The symbolic layer takes capability claims, policy rules, the outputs of the different formalizers, facts about provenance, the history of previous queries — and merges them all in an uncertain way. Something like Hyperon’s PLN (Probabilistic Logic Networks) can give you a probabilistic risk score with a confidence attached — or a whole probability distribution — plus a detailed inference trail of why it assigned the score it did. The simplest form of PLN keeps two numbers rather than one, a strength and a confidence, so it can tell “probably safe, on strong evidence” apart from “probably safe because nobody actually checked” — which is the distinction that lets you fail closed honestly. This is expensive, so you need careful attention allocation; in Hyperon that’s the ECAN economic-attention-allocation mechanism working alongside PLN.
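The two-number idea can be illustrated with a toy truth value. This is a sketch under stated assumptions, not Hyperon's API: the count-to-confidence mapping n / (n + k) follows the form commonly described for PLN simple truth values, while the constant `K`, the thresholds, and all function names are hypothetical choices for illustration.

```python
from dataclasses import dataclass

K = 10.0  # hypothetical constant controlling how fast confidence grows


@dataclass(frozen=True)
class TruthValue:
    strength: float    # estimated probability that the artifact is safe
    confidence: float  # how much evidence backs that estimate, in [0, 1)


def from_evidence(safe: int, total: int, k: float = K) -> TruthValue:
    """Build a truth value from `safe` positive checks out of `total` checks."""
    if total < 0 or not 0 <= safe <= total:
        raise ValueError("need 0 <= safe <= total")
    if total == 0:
        # Nobody checked: the strength is a placeholder and confidence is zero.
        return TruthValue(strength=0.5, confidence=0.0)
    return TruthValue(strength=safe / total, confidence=total / (total + k))


def decide(tv: TruthValue, min_strength: float = 0.95,
           min_confidence: float = 0.8) -> str:
    """Fail closed: thin evidence is never enough to release."""
    if tv.confidence < min_confidence:
        return "hold_for_review"
    return "release" if tv.strength >= min_strength else "block"


if __name__ == "__main__":
    print(decide(from_evidence(2, 2)))     # looks safe, but on thin evidence
    print(decide(from_evidence(98, 100)))  # looks safe, on strong evidence
```

A single risk score would give both cases in the usage example a similar number; keeping confidence separate is what lets the gate treat "nobody checked" differently from "checked and fine".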

My first instinct is to do all of this with Hyperon, naturally. But there will be other neuro-symbolic approaches, and that’s fine — you can build benchmarks and metrics for the symbolic part of this architecture and evaluate PLN against the alternatives. The point isn’t PLN specifically; it’s that the verification, and the contextual information-fusion, want to be symbolic.

Why openness helps minimize damage from jailbreaks

It’s often been said that open can be more secure than closed — more eyeballs on the bugs. What we have here is a very rigorous and precise special case of that idea, and I think it’s more compelling than most.

The first point: if you have an open-weights model — a GLM or a Kimi instead of a Claude or a GPT — then the verifiers can look inside the model. They can probe activations, read representations, and so on. Now, the attacker has the weights too, so they can tune their hacks against a neural probe — but you can take some bite out of this with checks like actually running the code in a sandbox, which a model can’t talk its way out of.

Of course, a closed company can make its own verifier probe its own activations. But here’s the thing, and this is really the strongest argument in the “open” direction: if you want a diverse ensemble of verifiers and formalizers, an open ecosystem can have that done by a huge number of different parties, coming from different perspectives. That simply never happens in a closed system.

An open ecosystem produces diversity and independence as a byproduct. Different teams, different methods, different domains, different failure modes: a whole diversity of specialists. One verifier for Solidity exploits, one for wet-lab protocols, one for crypto protocols, each written by people who actually live in that niche. No one company can staff world-class verification for every dangerous domain. Our whole global tech ecosystem, at least, can take a decent shot. And then PLN can combine all of these inputs.

And the reason this matters isn’t just “more verifiers”: a stack of verifiers all trained inside one company shares the same blind spots and fails together, which is exactly how adaptive attackers beat ensembles. Independent failures are what make the any-one-can-veto rule strong — and independence is the thing an ecosystem manufactures and a single company structurally can’t.
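A toy model makes the independence point concrete. Assume, purely for illustration, that with probability `shared` an attack lands in a blind spot common to every verifier (so all of them miss), and otherwise each verifier misses independently at its own rate. Under the any-one-can-veto rule the attack gets through only if every verifier misses. The model and the function names are assumptions, not measurements.

```python
def _check_probability(x: float) -> None:
    if not 0.0 <= x <= 1.0:
        raise ValueError("probabilities must lie in [0, 1]")


def miss_probability_independent(miss_rates) -> float:
    """Chance that every verifier misses when failures are independent."""
    prob = 1.0
    for rate in miss_rates:
        _check_probability(rate)
        prob *= rate
    return prob


def miss_probability_shared(miss_rates, shared: float) -> float:
    """Same ensemble, but a fraction `shared` of attacks hit a common blind spot."""
    _check_probability(shared)
    return shared + (1.0 - shared) * miss_probability_independent(miss_rates)


if __name__ == "__main__":
    rates = [0.2] * 5  # five verifiers, each missing 20% of attacks
    print(miss_probability_independent(rates))
    print(miss_probability_shared(rates, shared=0.1))
```

In this model the shared blind spot sets a floor on the ensemble's miss rate that adding more verifiers of the same kind cannot lower, which is why independence matters more than headcount.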

There’s a pretty asymmetry buried here.

The LLM itself is trained on the whole web, so the risks and dangers it can produce span all the knowledge on the web.
But the experts inside Anthropic or OpenAI or Google are not experts in everything on the web. They can grab verification tools online and integrate them, but that’s never going to be as good as the actual work done by the actual experts in each domain.

The diversity of verifiers that PLN needs to be fed is something a large open ecosystem can provide far better than any one company, however brilliant.

So – going back to the big picture – the reason for AGI to be neural-symbolic and open isn’t just about maximizing intelligence. It’s also about maximizing security.

The mythology behind something like the Fable shutdown is that safety lives behind closed doors and openness is a risk to be managed. What I’m arguing here is closer to the opposite: unless we’re just going to kill off all the powerful AIs altogether, you get more safety out of being open, and out of being decentralized, not less.

But is open really safer? A rigorous decision spine

It’s worth trying to be more precise about the “open is safer” conclusion here, because while there’s a simple philosophical/emotional center to the notion, the realities are actually pretty subtle. In the context of jailbreaks, as in the general AI safety context, whether open or closed is safer is not a crisp objective fact about open versus closed in the abstract. It’s a function of a variety of interrelated conditions: about the capability, about the world it’s being deployed into, and about the gates you can build around it. Change the conditions and the safer choice changes with them. So let me lay it out the way I laid out the broader “openness and AGI-catastrophe” question in an earlier post, as a set of hinges and a decision diagram.

The thing every path toward approximating un-jailbreakability has to land on (the floor, we may call it) is this: the dangerous capability doesn’t disappear, so unauthorized release can only ever be bounded by a gate that somebody actually runs. No choice of open or closed gets you out of needing the gate. What the choice decides is who runs the gate, who can audit whether it’s working, who can step around it, and who can repair it when it cracks.

The open-versus-closed decision spine for a single dangerous capability. Amber boxes are hinges (questions that route the case); teal boxes are leaves where open is safer; the gray box is the one leaf where closed wins; the red box is the tail case no architecture makes easy; the green box is the floor every path must reach.

Here are the hinges on which, in our analysis, the decision of “is open or closed safer in terms of jailbreakability” relies:

Containable? Is the dangerous part bound up with scarce, hard-to-rederive knowledge you could actually withhold — or is it latent in the weights and re-derivable from public knowledge, in which case nobody can contain it, open or closed?
Armed anyway? Does a determined adversary get the capability regardless of your choice — because a comparable open model already exists, or weights leak, or it’s re-derivable? If yes, closing your model doesn’t disarm offense; it only changes how many defenders get the tool.
Defense balance? Is the capability defense-symmetric — does the same skill help defenders as much as attackers? And is the harm diffuse (lots of small attacks) or catastrophic-tail (one irreversible success)?
Gate reliable? In the rare case you really are the only source, is your one closed gate both jailbreak-robust and politically stable — or can it be bypassed, leaked, or switched off by a directive?

Walk it through. If the dangerous increment is genuinely scarce and separable, you don’t need closure — you open the base model and the gate for everyone to inspect, and you withhold or distribute the scarce piece (this is where cryptographic laterality comes in, more below).

If it’s re-derivable, containment is off the table for everyone, and the question becomes the offense/defense balance.

For a defense-symmetric capability with mostly diffuse harm — which is where software exploitation actually sits — the open, gated-public-deployment path wins cleanly: defenders sail through the gate because their provenance checks out, while attackers have to pay the cost of standing up their own copy, so the cheap default path serves defense.

The one branch where closed wins is a genuine capability monopoly with a gate that’s both un-bypassable and politically stable — and the Fable episode is pretty good evidence that, in the real world, neither of those holds: there are several capable open models, and the gate got bypassed and then yanked by a directive.

And there’s a painfully difficult branch — a capability that’s purely offensive and irreversible, engineered biology being the obvious case — where no architecture makes the floor easy.

So the grand conclusion isn’t that openness is unconditionally safer. It’s that under the conditions that actually hold today, the cases that matter route to the open-and-gated leaves, and closure mostly gets you to the same floor by a worse road.
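The walk-through above can be sketched as a small routing function. The ordering of the checks is one reading of the prose, since the diagram itself is not reproduced here; the field names and route labels are hypothetical, and combinations the text does not discuss are returned as unspecified rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OPEN_WITHHOLD_SCARCE_PIECE = "open base model and gate, withhold scarce piece"
    OPEN_GATED_DEPLOYMENT = "open, gated public deployment"
    CLOSED_SINGLE_GATE = "closed, single reliable gate"
    HARD_TAIL = "hard tail: no architecture makes the floor easy"
    UNSPECIFIED = "combination not covered by the walk-through"


@dataclass(frozen=True)
class Capability:
    containable: bool        # scarce, separable knowledge you could withhold
    armed_anyway: bool       # adversary gets it regardless of your choice
    defense_symmetric: bool  # same skill helps defenders as much as attackers
    harm_diffuse: bool       # many small harms rather than one irreversible one
    gate_reliable: bool      # gate is both jailbreak-robust and politically stable


def route(c: Capability) -> Route:
    """Route one capability; every route still ends at a gate somebody runs."""
    if c.containable:
        return Route.OPEN_WITHHOLD_SCARCE_PIECE
    if not c.armed_anyway and c.gate_reliable:
        # Genuine monopoly plus a gate that actually holds.
        return Route.CLOSED_SINGLE_GATE
    if c.defense_symmetric and c.harm_diffuse:
        return Route.OPEN_GATED_DEPLOYMENT
    if not c.defense_symmetric and not c.harm_diffuse:
        return Route.HARD_TAIL
    return Route.UNSPECIFIED


if __name__ == "__main__":
    software_exploitation = Capability(False, True, True, True, False)
    print(route(software_exploitation).value)
```

Writing the hinges as explicit fields makes it easy to see which assumption has to flip for a given capability to leave the open-and-gated leaves.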

Open weights secure a deployment, not the whole world

The obvious objection to this pro-open argument is: Open weights secure a deployment, not the whole world.

Anyone can download the model and run it raw, with no gate at all. That’s real.

I have a couple of useful, adjacent things to say on this before giving my main response.

First, the same framework wraps just as well around a proprietary model as around an open one — it’s a kind of security library, and I’d love to see it made open and deployed both inside the major corporations and across the major open projects. The gate is portable; the question of who runs it is separate from whether the weights are open.

Second, if you want to actually solve the “bad guys just download the whole open-weights model” problem, that’s a different and harder issue, and I think the real solution is to move beyond these big centralized models trained on backprop. In my post on cryptographic laterality I explained how you’d take a PLN inference engine and splay it across a whole network using multi-party computation and secret sharing, so it’s hard to steal the full state of the network. You can do the same kind of thing with transformers, and with predictive-coding neural nets — break the network into pieces spread across many machines, with sharing only between nearby pieces, so you’d have to copy the whole network to get the full intelligence out of it. But I’ll be honest that this one is an aspiration; it’s not how things work right now. The safety mechanisms I’ve described in this post, by contrast, work right now.
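For readers unfamiliar with the secret-sharing primitive mentioned here, the following is a minimal sketch of additive secret sharing over a prime field. It illustrates only the primitive, not the cryptographic-laterality design or distributed PLN inference; the modulus and function names are illustrative choices.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime, used here as the field modulus


def split(secret: int, n: int) -> list[int]:
    """Split `secret` into n shares; any n - 1 of them reveal nothing."""
    if n < 2:
        raise ValueError("need at least two shares")
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    # The last share is chosen so that all shares sum to the secret mod PRIME.
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recover the secret; this requires every share."""
    return sum(shares) % PRIME


def add_shared(a: list[int], b: list[int]) -> list[int]:
    """Each party adds its own two shares locally; no party sees a full value."""
    if len(a) != len(b):
        raise ValueError("share lists must have the same length")
    return [(x + y) % PRIME for x, y in zip(a, b)]


if __name__ == "__main__":
    x, y = split(10, 3), split(32, 3)
    print(reconstruct(add_shared(x, y)))
```

Addition comes for free in this scheme; multiplication and nonlinear operations need extra protocol machinery, which is part of why splaying a whole inference engine this way remains an aspiration.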

But the main point for now is:

Gating the main public deployments doesn’t stop a determined bad actor from spinning up their own raw copy — but it doesn’t need to, because that determined bad actor was going to get the capability one way or another anyway: from some other open model, or by re-deriving it, or by stealing weights. Closing the model never disarmed them in the first place.
What gating the public deployments does do is shape where the overwhelming bulk of the usage flows.

A legitimate user — a defender hardening their own system — sails through the gate, because their provenance checks out, so the cheap, default, high-volume path serves the good uses.

The bad actor has to leave that path and pay the cost of standing up and running their own ungated deployment, which most simply won’t bother to do.

So the weight of the world’s actual usage of the capability tilts heavily toward good rather than bad — and because this is a defense-symmetric capability, where the same skill that finds a vulnerability is the skill that patches it, all that lopsided good usage compounds.

Defenders are running the tool over their own systems continuously, ahead of the attackers, while a handful of self-hosted bad actors are left hammering on targets the defensive majority has already been busy fortifying. That’s how you get defenders outrunning attackers in the end — not by keeping the capability out of bad hands, which can’t be done, but by making sure the good hands are far more numerous, far better resourced, and operating by default. Keep the capability gated where most of the world meets it, and the cumulative weight of defense pulls ahead of offense over time.

Why this makes more sense for software than for bio

You can see, though, why I lean on this argument for cyber but not for everything.

In software, the harm is mostly diffuse and, importantly, recoverable — a breach gets detected, a system gets patched, the same capability that finds the hole is the one that closes it, and every attack you survive leaves you a little harder to hit next time. That’s exactly the setting where lopsided good usage compounds and defense pulls ahead.

Biowarfare is the opposite case on both counts: the harm is offense-dominant and irreversible. You can’t patch a released pathogen, the attacker needs to succeed only once, and there’s no “survive it and harden” loop to ride — so no volume of benign use on the good side outweighs a single catastrophic success on the bad side. The defense symmetry breaks down too: engineering a pathogen and defending against one are not the same skill running at the same speed, the way finding and fixing a bug are.

So for the genuinely offense-dominant, irreversible domains I don’t pretend this argument carries over so nicely — that’s the hard tail where no architecture, open or closed, makes the floor easy, and I’d rather say so plainly than wave it away. For the biowarfare case, we need different approaches as I’ve sketched in a previous post.

A beautiful case for collaboration

Which brings me back to where I started. This really seems to me like a beautiful case for collaboration. In my own Hyperon project we already have a solid probabilistic inference framework and a bunch of formal verifiers wired into it. It would be genuinely fun to work across project boundaries — including open-source projects like ours and the Chinese ones together with the big proprietary LLM labs — to build a neural-symbolic capability-gating framework that everybody can use. This should be a generic software library, deployable for proprietary or open systems alike, and something the world as a whole collaborates on for the good of the world as a whole.

I’m an open-source, decentralized-AI booster, and I’ll almost surely stay one. But I can see the value the proprietary models have given, too, and right now I think we want to boost security for everything. So anyone in a position of power at a frontier lab reading this, consider this an open invitation! Bring your hardest capability-elicitation cases; we’ll bring the symbolic reasoning. Let’s work together to make all our AI systems as un-jailbreakable as anything real ever gets!

Alex Tolley

Asimov anticipated the problem in his robot novel, "The Naked Sun": how do you make a robot that cannot break the First Law of Robotics ("Do not harm a human being") kill a person? It was done by giving robots different, innocuous commands that, in aggregate, did kill a human.

We know from human security systems that even security in depth can only do so much. Whilst one might want to drive down the probability to very low values, there are limits. As we have seen with even simple things, like toy chemistry sets, over my lifetime they have been dumbed down to absurdly low levels to prevent any harm, pretty much destroying the value of such a product for education.

The problem will manifest itself in any number of domains. Take producing weaponized bio agents: one might have to restrict access to DNA databases, PCR chemicals and equipment, and so on. It can be done, but it will wreck the ability of non-malicious actors to do any interesting research.

Criminals will always find a way to bypass controls. Authorities can only try to lock down access ever more tightly. But at some point, it cripples the value by trying to protect against the bad. We have examples of that from the reduction of nuclear weapons. Fissile material was diverted or stolen. Bombs can be designed with available information and knowledge.

By all means, find ways to try to defend against jailbroken AI, but I suspect it not only cannot be done, but that the systems designed to do so will prove more of a burden on the target.

Nicholas j Bogaert

Also, this is exactly the direction the field has to move.

LLMs are not the whole intelligence stack. They are the fluent generation layer. The real safety problem is what happens around them: memory, provenance, authorization, tool control, drift monitoring, audit trails, and a gate that fails closed when capability exceeds trust.

That is why the “wall” approach was never going to be enough. You do not solve AGI safety by locking the model in a tower or pretending dangerous capability can be erased from the world. You solve it by building lawful boundaries around release, execution, identity, and memory.

AI.Web has been working from that premise for a while now: local compute, local memory, local identity, symbolic verification, trace-level monitoring, and what we call the Genesis Node.

My view is simple: the AGI-level mind should not live loose inside a moving robot body. The robot should be an interface. The deeper intelligence should stay in a stable sovereign node, owned by the user, auditable, gated, and grounded. The body requests. The node governs. Pandora’s box stays on the table.

For anyone trying to understand the larger frame without chasing ten scattered posts, I laid it out here:

https://nicholasjbogaert.substack.com/p/the-wall-was-never-the-plan?utm_source=share&utm_medium=android&r=5zf6op

No accusations. Ideas converge. But some of us have already been building the box the rest of the field is starting to describe.


Eurykosmotron
