Un-Jailbreakable AI Models Aren’t a Thing …

Ben Goertzel

Jun 18

Why neural-symbolic and open is the best way to address the model-jailbreaking (capability-elicitation attack) problem…


6 Comments

jaycee

What happens when the gate operator becomes the problem?

Alex Tolley

12h

Asimov anticipated the problem in his robot novel, "The Naked Sun". How do you make a robot, which cannot break the First Law of Robotics ("Do not harm a human being"), kill a person? It was done by giving robots different, innocuous commands that, in aggregate, did kill a human.

We know from human security systems that even security in depth can only do so much. Whilst one might want to drive down the probability to very low values, there are limits. We have seen this with even simple things like toy chemistry sets: over my lifetime they have been dumbed down to absurdly low levels to prevent any harm, pretty much destroying their value for education.

The problem will manifest itself in any number of domains. Take the production of weaponized bio agents: one might have to restrict access to DNA databases, PCR chemicals and equipment, and so on. It can be done, but it will wreck the ability of non-malicious actors to do any interesting research.

Criminals will always find a way to bypass controls. Authorities can only try to lock down access ever more tightly. But at some point, it cripples the value by trying to protect against the bad. We have examples of that from the reduction of nuclear weapons. Fissile material was diverted or stolen. Bombs can be designed with available information and knowledge.

By all means, find ways to try to defend against jailbroken AI, but I suspect not only that it cannot be done, but also that the systems designed to do so will prove more of a burden on the target.

Nicholas j Bogaert

Also, this is exactly the direction the field has to move.

LLMs are not the whole intelligence stack. They are the fluent generation layer. The real safety problem is what happens around them: memory, provenance, authorization, tool control, drift monitoring, audit trails, and a gate that fails closed when capability exceeds trust.
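A gate that "fails closed when capability exceeds trust" can be sketched in a few lines. This is only an illustration under stated assumptions: the `Request` fields, the integer trust and capability scales, and the `audit_log` list are hypothetical names invented here, not part of AI.Web or any system described in the post.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    # Hypothetical fields: higher numbers mean more trust / more capability.
    user_trust: Optional[int]        # None = identity or authorization unknown
    capability_level: Optional[int]  # None = capability could not be assessed


# Minimal audit trail: every decision is recorded, allowed or not.
audit_log: List[Tuple[Request, bool]] = []


def gate(req: Request) -> bool:
    """Return True only when release is affirmatively justified."""
    if req.user_trust is None or req.capability_level is None:
        allowed = False  # missing information fails closed
    else:
        allowed = req.capability_level <= req.user_trust
    audit_log.append((req, allowed))
    return allowed
```

The design choice worth noticing is that the default branch denies: the gate has to find a reason to release, not a reason to block.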

That is why the “wall” approach was never going to be enough. You do not solve AGI safety by locking the model in a tower or pretending dangerous capability can be erased from the world. You solve it by building lawful boundaries around release, execution, identity, and memory.

AI.Web has been working from that premise for a while now: local compute, local memory, local identity, symbolic verification, trace-level monitoring, and what we call the Genesis Node.

My view is simple: the AGI-level mind should not live loose inside a moving robot body. The robot should be an interface. The deeper intelligence should stay in a stable sovereign node, owned by the user, auditable, gated, and grounded. The body requests. The node governs. Pandora’s box stays on the table.

For anyone trying to understand the larger frame without chasing ten scattered posts, I laid it out here:

https://nicholasjbogaert.substack.com/p/the-wall-was-never-the-plan?utm_source=share&utm_medium=android&r=5zf6op

No accusations. Ideas converge. But some of us have already been building the box the rest of the field is starting to describe.

Nicholas j Bogaert

You mean like the thing I've been hounding you about collaborating with https://github.com/BogaertN/forge

We are literally already building this; no idea is new under the sun. Only we're 6 months ahead on the language engine, and no LLMs will be in this build. Check out our live builds:

https://youtube.com/playlist?list=PLv8VwZDeWRWcyzLAi4IAw7rVnI-yGBheR&si=2hnSdGVyFqUZ_UJQ

Andrew S Klug // ASK

The generate/judge split is the right cut, and "danger is unioned, safety is intersected, silence is never evidence of safety" is the line — failing closed on un-analyzable output is exactly the discipline most LLM-judging-LLM setups skip. The open-ensemble argument for verifier *independence* is also the most convincing version of "open is safer" I've read, because it's really an argument about decorrelated failure, not eyeballs.
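The quoted rule can be written down as a small aggregation function. This is a minimal sketch, assuming each verifier returns one of two verdicts or `None` when it could not analyze the output; the `Verdict` enum and `release_allowed` name are hypothetical and do not come from the post.

```python
from enum import Enum
from typing import Iterable, Optional


class Verdict(Enum):
    SAFE = "safe"
    DANGEROUS = "dangerous"


def release_allowed(verdicts: Iterable[Optional[Verdict]]) -> bool:
    """Combine verifier verdicts; None means 'no answer / could not analyze'."""
    results = list(verdicts)
    if not results:
        return False  # no verifier ran: silence is never evidence of safety
    if any(v is Verdict.DANGEROUS for v in results):
        return False  # danger is unioned: one objection is enough to block
    # Safety is intersected: every verifier must affirmatively say SAFE,
    # so a single None (un-analyzable output) fails closed.
    return all(v is Verdict.SAFE for v in results)
```

For example, `release_allowed([Verdict.SAFE, None])` is `False`, while a plain majority vote over the same inputs would have let the output through.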

The one layer I'd add sits just above the gate. Your spine names who runs it, who audits it, who bypasses it, who repairs it — but not who *authors the standard it runs*. The PLN score, the capability IR, the union/intersect rule all execute a policy that says what counts as dangerous and who counts as allowed; none of them write that policy. "Is this user allowed to receive what it enables" is the whole game, and "allowed" is supplied from outside the pipeline. The gate is a rigorous authorization mechanism — discretion exercised inside a standard it doesn't set.

Which is why the bio carve-out is the tell. What flips cyber-safe to bio-unsafe isn't anything the verifier ensemble computes — it's a prior judgment about irreversibility and offense-dominance that someone made before the gate ever ran. The architecture can enforce that judgment beautifully. It can't originate it. Decentralize every operator of the gate and that authoring layer still has to sit somewhere nameable and answerable — which is the part no ensemble, however independent, supplies.
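One way to picture the point is a gate that takes its standard as an input and never defines it. The sketch below is illustrative only: the `Policy` fields, the domain names, and the numeric ceilings are assumptions made up for the example, not the post's actual architecture.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Policy:
    authored_by: str                   # who answers for this standard (hypothetical field)
    max_capability: Mapping[str, int]  # domain -> highest releasable capability level


def enforce(policy: Policy, domain: str, capability_level: int) -> bool:
    """Apply a policy the gate did not write; it only executes it."""
    if not policy.authored_by.strip():
        return False  # no nameable author: refuse to enforce an anonymous standard
    ceiling = policy.max_capability.get(domain)
    if ceiling is None:
        return False  # domain not covered by the policy: fail closed
    return capability_level <= ceiling


# The cyber/bio asymmetry lives entirely in the authored table, not in enforce().
example_policy = Policy(
    authored_by="example-review-board",
    max_capability={"cyber": 3, "bio": 0},
)
```

With this table, `enforce(example_policy, "cyber", 2)` returns `True` and `enforce(example_policy, "bio", 2)` returns `False`. The code path is identical; only the prior judgment encoded in `max_capability` differs.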

Oleg Alexandrov

The current models do need more infrastructure for ensuring safety and handling jailbreaks.

Using some neuro-symbolic logic for verification rather than generation makes sense. By now, it is clear enough that generation itself is a lot more powerful than what more rigorous methods could do. But the statistical guesswork alone is, of course, not reliable enough.

The issue remains that neuro-symbolic methods are still likely not powerful enough to offer guarantees or flexible enough to model all the processes that are encountered during actual work (even when working as verifiers alone).

So, just as in traditional systems engineering, various additional methods will be needed: containment, separation of concerns across submodules with checkpoints at their interfaces, kill switches, adversarial AIs policing each other, and human oversight. We should expect jailbreaks to happen regularly and plan for very frequent patching.

Eurykosmotron
