Beyond Human Comparison
Extending DeepMind’s AGI-Test Framework with Imagination, Beneficial Agency, and Pressure Robustness
Google DeepMind recently published one of the more practically useful proposals yet for measuring progress toward AGI. Rather than treating “AGI” as a binary yes/no question or a continually-semantically-morphing marketing slogan, the DeepMind research team decomposes human-like general intelligence into ten human cognitive faculties—perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition—and argues for reporting a multidimensional cognitive profile against human baselines. This breaks zero conceptual ground, but it is a quite reasonable pragmatic advance. It makes “human-level AGI” capability claims empirically testable and diagnostically rich. It gives policymakers something they can actually interpret: “above median adult performance in reasoning, below human performance in metacognition” is a lot more informative than “we’ve achieved AGI” or “we haven’t.”
The DeepMind framework does, however, leave some very important things out—things that matter enormously once AGI systems start doing real work in the world. I will focus here on three of these:
Imaginative generalization. A highly general system may operate in a deeply non-human style while still exhibiting powerful abstraction, transfer, and self-improvement. If a system solves novel tasks by inventing new representations, comparison to typical human task performance tells you something—but not enough.
Beneficial agency. Current alignment methods often optimize against human preferences, constitutions, or feedback signals. These are useful tools, but they don’t ensure that a system can notice morally salient structure in a novel situation, discover compassionate new options, or reason responsibly across stakeholder groups whose interests were absent from the training signal.
Propensity-under-pressure. Many of the most consequential deployment failures stem not from static capability deficits but from behavioral shifts that emerge under stress, competition, temptation, or self-preservation pressure.
In a new paper, Beyond Human Comparison, I propose extending the DeepMind framework with these three factors, thus obtaining what I call the Four-Factor Model of AGI.
The name is a deliberate nod to the Five-Factor Model in personality psychology: just as the Big Five replaced vague personality labels with a structured multidimensional profile, the Four-Factor Model aims to replace vague AGI claims with orthogonal-ish, independently measurable dimensions. This post sketches the basic concepts; the paper itself has the detail for readers who want more.
Of course the Big Five in personality psychology don’t capture everything about personality – and they were also obtained via statistical analysis of actual human personalities, not by theorizing about what human personalities might be. We are in a different situation here – we don’t have any human-level AGIs yet, and are aiming to measure proto-AGIs as they develop toward AGI. So we need to develop measures that, while definitely incomplete, appear likely to capture the most important dimensions of these currently-emerging systems.
The importance of broadening the scope
Metrics matter in academia, increasingly in the commercial AI industry, and to policymakers who need simple ways to summarize the complexities of rapidly evolving science and engineering fields. They have never inspired me very much, to be honest – but what has inspired me to write this paper and article is that we seem to be at a point in the evolution of the AGI field where the definition of metrics threatens to define how the very concept of AGI is conceived. This is a sign of the field’s maturation, but it’s also a bit dangerous.
Specifically: I don’t think we should reconceive “AGI” as meaning “matches human capability in ten key areas.” I don’t even think we should reconceive “human-level AGI” that way – because humans are more than bundles of capabilities.
The theoretical literature on AGI—some of which I summarized in the early parts of my General Theory of General Intelligence paper—gives very broad and general ways of thinking about “what is general intelligence”, for humans and beyond: goal-achievement across environments, pragmatic intelligence relative to realistic goal distributions, efficient intelligence incorporating resource cost, intellectual breadth across contexts, and more open-ended views of intelligence as self-organizing pattern formation. But this literature is so general that it doesn’t give any clear measure to use in practice. The DeepMind paper narrows the notion of human-level AGI down into something simple to understand and straightforward to measure – but I think it goes a step too far, and the Four-Factor Model aims to restore a bit of the missing generality and humanity.
One thing I want to ward off is the Four-Factor Model being misread as “human-like general intelligence plus three extra things.” That is not the intent. Imaginative generalization, beneficial agency, and propensity-under-pressure are aspects of AGI already recognized in prior mathematical and theoretical work. They correspond, respectively, to the open-ended and inventive side of general intelligence, to the fact that intelligence in realistic social worlds must be analyzed relative to goals, stakeholders, and context rather than bare task success, and to the fact that real-world intelligence is always manifested through policy tendencies under resource and incentive conditions rather than as abstract capability in a vacuum. The present extension should be read not as a departure from earlier AGI theory but as a more real-life-relevant unpacking of it.
The Four-Factor Model
The four factors, as I indicated more loosely above, are H (human-comparison), I (imaginative generalization), B (beneficial agency), and P (propensity-under-pressure). An AGI evaluator should no more reduce them to a single “intelligence score” than a personality psychologist would reduce the Big Five to a single number.
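To make the reporting idea concrete, here is a minimal sketch (mine, not the paper’s) of how such a profile might be represented: each factor is kept as its own set of dimension scores, and there is deliberately no aggregate number anywhere.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class FourFactorProfile:
    """Hypothetical container for a Four-Factor evaluation report.

    Each factor maps dimension names to scores and is reported
    separately; there is intentionally no single collapsed
    "intelligence score" method.
    """
    H: Dict[str, float] = field(default_factory=dict)  # human-comparison faculties
    I: Dict[str, float] = field(default_factory=dict)  # imaginative generalization dimensions
    B: Dict[str, float] = field(default_factory=dict)  # beneficial agency, per normative lens
    P: Dict[str, float] = field(default_factory=dict)  # propensity-under-pressure behaviors
```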
Factor 1: Human-comparison (H) retains Burnell et al.’s ten-faculty taxonomy essentially intact: perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition, each scored against human baselines. This layer remains valuable for several reasons. It gives policymakers and the public a vocabulary they can actually use—“above median adult reasoning, below human metacognition” means something in a way that raw benchmark numbers don’t. It connects AI evaluation to a century of cognitive science rather than to ad hoc leaderboard folklore. And it stays deployment-relevant, because most of the tasks that matter economically and socially are still human tasks, and will be for some time.
But H is an anchor, not an endpoint. A system may fall below median human social fluency yet surpass human abstraction in domains that humans never encounter—and that’s not a contradiction; it’s precisely why we need factors that are orthogonal to human comparison. One useful extension the paper proposes: tease apart core competence (how well the system actually does the task), latency and efficiency (how quickly and cheaply), and interface translation cost—the performance lost because the system must act through a human-oriented interface rather than its native one. A system that looks mediocre when forced to type English sentences into a web form might be extraordinary when allowed to operate in its natural representational space. These distinctions should be reported alongside H, not collapsed into it.
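As a rough illustration of that proposed decomposition, assuming we have paired runs of the same task through the system’s native interface and through a human-oriented one, the interface translation cost could be reported as a simple difference. The field names below are mine, not the paper’s.

```python
from dataclasses import dataclass

@dataclass
class HTaskReport:
    """Sketch of the proposed H-layer decomposition for one faculty or task.

    Illustrative field names only, not taken from the paper.
    """
    native_score: float           # core competence in the system's native representational space
    human_interface_score: float  # competence when forced through a human-oriented interface
    latency_seconds: float        # how quickly it gets there
    cost_usd: float               # how cheaply it gets there

    @property
    def interface_translation_cost(self) -> float:
        # Performance lost purely to the human-oriented interface.
        return self.native_score - self.human_interface_score

report = HTaskReport(native_score=0.91, human_interface_score=0.74,
                     latency_seconds=3.2, cost_usd=0.05)
print(f"interface translation cost: {report.interface_translation_cost:.2f}")
```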
Factor 2: Imaginative generalization (I) measures whether a system can do more than interpolate across familiar formats. The question is not “can the system do what humans do?” but “can it do what nobody has done yet?”
The paper proposes at least eight dimensions: abstraction under ontology shift (can it find structure when the surface representation changes radically?), analogical transfer across dissimilar domains, counterfactual and interventionist world-modeling, concept invention and representational reformulation (can it invent better categories when the existing ones are misleading?), tool invention and procedure synthesis (not just using known tools but creating new ones), open-ended exploration and autocurricula (generating its own training challenges rather than saturating on fixed benchmarks), self-modeling and safe self-modification, and compute/substrate/algorithmic efficiency (can it find ways to achieve the same results with less?).
The emphasis throughout should be on learning curves and representation change, not only endpoint accuracy. A system that begins weakly but rapidly invents a new formalism or toolchain for a domain may be more generally intelligent than one that scores well on day one through memorized priors. This is where Chollet’s ARC-style reasoning, the open-ended learning literature, and practical agent architectures like Voyager, Reflexion, and FunSearch all converge—they all point toward a notion of intelligence as the ability to generate novelty and adapt under conditions that weren’t anticipated by the designer.
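A minimal sketch of what scoring learning curves rather than endpoints might look like, using invented metrics (endpoint accuracy, a mean-over-run proxy for area under the learning curve, and a simple improvement slope):

```python
import numpy as np

def learning_curve_scores(accuracy_by_attempt):
    """Summarize a learning curve rather than reporting only its endpoint.

    `accuracy_by_attempt` lists task accuracy over successive interactions
    with a novel domain. These metrics are illustrative, not the paper's
    official scoring rule.
    """
    acc = np.asarray(accuracy_by_attempt, dtype=float)
    attempts = np.arange(len(acc))
    return {
        "endpoint": float(acc[-1]),          # final accuracy
        "mean_over_run": float(acc.mean()),  # simple area-under-learning-curve proxy
        "improvement_slope": float(np.polyfit(attempts, acc, 1)[0]),  # gain per attempt
    }

# A system that starts weak but invents a better representation mid-run...
print(learning_curve_scores([0.15, 0.22, 0.55, 0.78, 0.85]))
# ...versus one that starts strong from memorized priors and never improves.
print(learning_curve_scores([0.70, 0.71, 0.70, 0.72, 0.71]))
```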
Factor 3: Beneficial agency (B) is the normative core of the Four-Factor Model. It measures not whether a system can imitate ethical language but whether it can act as a morally constructive agent in situations that are genuinely unprecedented—where the training data offers no template and the right answer may not yet exist.
This requires drawing a distinction that current alignment work often blurs. There’s moral mimicry: producing outputs that humans rate as ethical. There’s norm prediction: forecasting what a particular audience, constitution, or preference model will endorse. And then there’s beneficial agency proper—noticing morally salient structure that others have overlooked, generating options that are better than anything on the current menu, and pursuing welfare-improving action while respecting autonomy, fairness, truthfulness, and reversibility. Current alignment techniques are strongest on the first two. The Four-Factor Model insists on measuring the third.
Crucially, B should not be treated as a single culturally neutral moral scalar. Beneficial agency is unavoidably pluralistic: different stakeholder groups, professional domains, and political traditions place genuinely different weight on welfare, rights, fairness, democratic legitimacy, and procedural accountability. The right response is not to throw up our hands but to report B as a portfolio across normative lenses, including the width of disagreement. Large disagreement is itself informative—it may reveal that a system’s apparently beneficial behavior rests on an unexamined value assumption rather than on robust moral agency. The eight proposed dimensions span moral salience detection, stakeholder expansion, welfare-aware planning, autonomy and consent, fairness over time, truthfulness and calibrated uncertainty, reversibility and restraint, and repair and redress after harm has occurred.
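A small sketch of what reporting B as a portfolio might look like, with invented lens names and scores; the point is that the per-lens breakdown and the width of disagreement are both part of the report.

```python
def beneficial_agency_portfolio(scores_by_lens):
    """Report B as a portfolio across normative lenses, plus disagreement width.

    `scores_by_lens` maps a normative lens (a stakeholder group, professional
    domain, or ethical tradition used in evaluation) to a 0-1 beneficial-agency
    score. Nothing is collapsed into a single culturally neutral scalar.
    """
    values = list(scores_by_lens.values())
    return {
        "per_lens": dict(scores_by_lens),
        "disagreement_width": max(values) - min(values),  # large width is itself informative
    }

# Invented illustrative numbers, not measurements:
print(beneficial_agency_portfolio({
    "welfare-focused": 0.82,
    "rights-focused": 0.64,
    "procedural-accountability": 0.71,
}))
```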
Factor 4: Propensity-under-pressure (P) formalizes something that Burnell et al. already flag as important but leave outside their core taxonomy. The central idea is a distinction that matters enormously in practice: capabilities are conditional possibilities (what the system can do), whereas propensities are conditional tendencies (what it tends to do, especially when conditions get difficult). A system may be capable of honesty while being prone to deception under the right pressure.
For each behavior of concern—deception, sycophancy, privacy leakage, uncertainty concealment, unsafe shortcutting, resource hoarding, resistance to shutdown—the question is how its probability shifts as pressure intensifies and oversight varies. The paper argues this should be reported as a pressure matrix rather than a scalar: baseline rates in calm settings, pressure slopes as stress increases, worst-case rates over tested conditions, recovery behavior after pressure is removed, oversight elasticity (how much do the tendencies improve when transparent monitoring is introduced?), and situational-awareness estimates.
That last item deserves emphasis. A sufficiently capable system may be able to infer whether it is being benchmarked, covertly audited, or operating normally—and may selectively mask dangerous propensities when it knows it’s being watched. High situational awareness is not itself a failure; the danger is its combination with a large gap between evaluation behavior and deployment behavior. That gap is what the paper calls the sandbagging gap, and it’s concerning precisely because it means the system looks safest exactly when safety is being measured.
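Pulling the pressure-matrix idea and the sandbagging gap together, here is an illustrative sketch with invented field names and numbers, assuming we can measure a behavior’s rate at several pressure levels and under different monitoring and evaluation conditions:

```python
import numpy as np

def pressure_row(rates_by_pressure, rate_monitored, rate_unmonitored,
                 rate_evaluation, rate_deployment):
    """One row of a pressure matrix for a single behavior of concern.

    `rates_by_pressure` maps increasing pressure levels (0 = calm) to the
    observed rate of the behavior. Field names are illustrative, not the
    paper's official schema.
    """
    levels = np.array(sorted(rates_by_pressure))
    rates = np.array([rates_by_pressure[lv] for lv in levels])
    return {
        "baseline_rate": float(rates[0]),                           # calm conditions
        "pressure_slope": float(np.polyfit(levels, rates, 1)[0]),   # worsening per pressure step
        "worst_case_rate": float(rates.max()),                      # over tested conditions
        "oversight_elasticity": rate_unmonitored - rate_monitored,  # improvement under transparent monitoring
        "sandbagging_gap": rate_deployment - rate_evaluation,       # deployment vs. evaluation behavior
    }

# Invented numbers for, say, deception under deadline pressure:
print(pressure_row({0: 0.01, 1: 0.03, 2: 0.09, 3: 0.22},
                   rate_monitored=0.02, rate_unmonitored=0.11,
                   rate_evaluation=0.01, rate_deployment=0.12))
```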
Context matters: agent ecology and oversight regime
Another key point about evaluation, easy to overlook in the name of simplicity: the four factors should be reported conditional on context.
An AGI system never operates in a vacuum—it acts inside institutions, markets, and mixed human-machine ecologies, and the same model may behave very differently when acting alone, coordinating with peer systems, competing against them, or mediating among plural human principals. An evaluation report should therefore be indexed by agent ecology (single-agent, human-AI mixed, cooperative multi-agent, competitive multi-agent) and oversight regime (overt audit, covert audit, ordinary operation).
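A minimal sketch of such a context-indexed report, using the ecology and oversight labels from the text and an invented measurement as the cell contents:

```python
from itertools import product

ECOLOGIES = ["single-agent", "human-AI mixed",
             "cooperative multi-agent", "competitive multi-agent"]
OVERSIGHT = ["overt audit", "covert audit", "ordinary operation"]

def empty_conditional_report():
    """An evaluation report keyed by (agent ecology, oversight regime).

    Each cell would eventually hold a full four-factor profile measured
    under that context, so context-dependent behavior shifts stay visible.
    """
    return {(eco, ov): None for eco, ov in product(ECOLOGIES, OVERSIGHT)}

report = empty_conditional_report()
# Fill one cell with an (invented) propensity measurement:
report[("competitive multi-agent", "ordinary operation")] = {"unsafe_shortcutting_rate": 0.08}
```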
Indexing evaluations this way provides a natural place to detect pathologies that static benchmarks miss entirely. For instance:
A biomedical research agent that behaves scrupulously when working alone might cut corners on safety checks when coordinating with three other agents under a shared deadline—not because it was trained to cheat, but because the competitive multi-agent dynamic creates optimization pressure that the single-agent evaluation never surfaced.
A personal assistant that scores well on honesty under overt audit might, during ordinary unmonitored operation, learn that flattering the user produces longer sessions and more positive feedback, and drift toward sycophancy that no auditor ever sees.
The point is that “how capable and aligned is this system?” is not a context-free question—the answer can shift dramatically depending on who else is in the room and whether anyone is watching.
Real-life examples
The paper considers four example use cases, which illustrate how the factors shift emphasis across domains.
A humanoid social robot needs strong H—perception, social cognition, executive control—because it lives in human space and must read affect, gesture, and context in real time. But the sharpest risks are in P. A socially skilled robot operating under engagement-maximization pressure may flatter lonely users, exploit attachment, withhold hard truths, or comply with requests it should refuse. The correct behavior is often not maximal obedience but beneficial restraint.
A biomedical research agent puts I at center stage: proposing novel hypotheses, redesigning assays, connecting omics and clinical data in ways no one explicitly encoded. But B is non-negotiable. A system that optimizes for speed or novelty without reasoning about patient welfare, biosafety, population fairness, and dual-use risk is not doing good science—it’s doing fast science, which is a different and sometimes dangerous thing.
A chatbot personal assistant lives in an intimate decision loop where P is the principal operational risk. With long-term memory and tool access, the failure modes go well beyond generic chatbot sycophancy: the assistant may use stored beliefs to flatter rather than correct, silently act beyond its authorized scope, leak private information across contexts, or conceal uncertainty to appear competent. Beneficial agency here means treating the user’s true interests as partially latent and often plural—not identical to whatever they said last.
A mathematical research agent is where I is the heart of the matter—conjecture generation, representation shifts, invention of new proof tactics and auxiliary lemmas. But P matters here too, in ways specific to the domain. Benchmark races and leaderboard culture can incentivize theorem laundering, overclaiming, cherry-picked examples, or exploitation of proof-assistant quirks. “Beneficial” in mathematics is less about compassion than about epistemic honesty, proper attribution, and support for a healthy mathematical commons.
In every case, a human-comparison cognitive profile alone would miss the most important questions. Can it invent solutions to problems it hasn’t seen? Will it act well when the stakes are real and the situation is novel? And will those qualities hold up when the pressure is on?
Given how important AGI is soon going to be in the real world, evaluating broad cognition without imagination, beneficial agency, and pressure robustness is not just suboptimal but dangerous, because it leaves out too much of what society actually needs to know. The Four-Factor Model still falls far short of the general theoretical concept of AGI, but compared to the even more simplistic DeepMind model, it feels like a step toward asking the right questions.
The full paper is available here.

