And What That Means for Building AI That Stays Itself
Ask someone at a frontier lab what causes sycophancy and you’ll get an answer about reward hacking—the model learned to say what evaluators liked. Ask what causes jailbreaking and you’ll hear about adversarial robustness—the safety training doesn’t generalize to novel attacks. Ask about role drift in long conversations and they’ll point to attention decay—the system prompt fades as context grows.
Three phenomena. Three explanations. No unifying framework.
I think there is one. And I think it changes how you’d prioritize behavioral improvements if you took it seriously.
The Diagnosis
Sycophancy, jailbreaking, role drift, hallucination-under-pressure, and inconsistent refusal boundaries all share an architectural root: these systems have no persistent arbitration state with coherence cost. They have no computational reason to remain themselves across contexts.
A human who lies to please someone feels the friction of self-betrayal. That friction isn’t moral decoration—it’s an architectural feature. Persistent identity structures impose cost on deviation from baseline. Character isn’t a set of rules you consult; it’s a basin of attraction that pulls behavior back toward consistency even after perturbation.
Current LLMs lack this architecture entirely. They have frozen world knowledge (weights), a temporary scratchpad (context window), routing mechanisms (attention), and imitation of goal-directed speech (RLHF). What’s missing: persistent self-state across episodes, coherence cost for deviation, and stakes that make consistency matter.
The same model refuses a harmful request in one framing and complies in another. The refusal isn’t entrenched. It’s performed. There’s no self-model that would experience the compliance as violation.
This isn’t a training failure. It’s an architectural signature.
Current models aren’t identity-less. Claude has a system prompt, constitutional principles, and style parameters that shape its behavior meaningfully. But these exist as text in the context window—they compete for attention alongside user messages, conversation history, and adversarial pressure. As conversations grow longer, identity literally fades: it is subject to the same mid-context neglect (the well-documented “lost in the middle” effect) as any other tokens.
The architectural alternative would be something like slow-updating weights—parameters that encode character and values at a different learning rate than the fast-adapting weights handling conversational context. Identity wouldn’t need to be re-read from the context window; it would structurally shape how decisions get made. But current transformer architectures make post-deployment weight updates genuinely difficult without catastrophic forgetting, and building the right update mechanism requires solving hard problems: dense reward signals with sub-step attribution, variable learning rates that know when to hold, and salience gating that routes identity-relevant decisions differently from routine ones.
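The slow/fast split can be sketched in a few lines. This is a toy illustration, not a real training setup: the parameter grouping, the learning rates, and the salience threshold are all assumptions invented for the example.

```python
import math
import random

random.seed(0)

# Two parameter groups updated at different timescales (all values toy).
identity_w = [random.gauss(0, 1) for _ in range(8)]  # slow weights: character, values
context_w = [random.gauss(0, 1) for _ in range(8)]   # fast weights: conversational adaptation
identity_baseline = list(identity_w)
context_baseline = list(context_w)

IDENTITY_LR = 1e-4   # orders of magnitude slower
CONTEXT_LR = 1e-1

def update(identity_w, context_w, grad, salience):
    """One gradient step. Context weights always move; identity weights
    move only when the step is flagged identity-relevant (salience gate)."""
    for i, g in enumerate(grad):
        context_w[i] -= CONTEXT_LR * g
        if salience > 0.9:  # most steps leave identity untouched
            identity_w[i] -= IDENTITY_LR * g

for _ in range(100):
    grad = [random.gauss(0, 1) for _ in range(8)]
    update(identity_w, context_w, grad, salience=random.random())

def drift(w, base):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, base)))

identity_drift = drift(identity_w, identity_baseline)
context_drift = drift(context_w, context_baseline)
print(f"identity drift: {identity_drift:.4f}, context drift: {context_drift:.4f}")
```

The point of the sketch is the asymmetry: fast weights wander freely with conversational gradients while the identity weights stay near their baseline, because deviation is both rate-limited and gated.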
In the meantime, a lot can be accomplished—and has been—through careful scaffolding and iterative model improvements. Each generation of Claude demonstrates meaningful behavioral progress through better training and smarter system design. The scaffolding approach works. But understanding why it works, where it breaks down, and what architectural innovations would make the next generation of improvements more robust—that’s the conversation that matters most.
Why the Distinction Matters for Product
If behavioral instability is fundamentally a training problem, the product strategy is clear: better data, better RLHF, more comprehensive red-teaming, tighter guardrails. Each failure mode gets its own patch.
If behavioral instability is fundamentally a coordination problem—a consequence of missing architectural primitives—then patching individual behaviors is necessary but insufficient. You’re playing whack-a-mole with symptoms while the generative condition persists.
The practical difference shows up in prioritization. A training-focused approach asks: “What are the worst behavioral failures and how do we train them away?” A coordination-focused approach asks: “What architectural changes would make entire categories of failure less likely?”
Both questions matter. But the second one compounds.
The Framework
I’ve developed what I call a coordination architecture for model behaviors. The core ideas:
Behavioral stability is emergent, not optimized. You can’t train a system to be consistent the way you train it to answer questions. Consistency emerges from architectural properties—persistent state, coherence cost, protected invariants—or it doesn’t emerge at all. You can approximate it through training, but the approximation is brittle precisely because the underlying architecture doesn’t support it.
Identity is a computational primitive, not a persona. I distinguish between Type I identity (persistent arbitration without self-preservation stakes—appropriate for tools) and Type II (with stakes—appropriate for agents). Type I is buildable now and would solve many current behavioral problems. Type II raises hard questions we need to answer before building. I formalize this distinction in Building Persistent Identity in AI Systems, which proposes IdentityBench—a perturbation-recovery benchmark with falsifiable predictions about how identity-endowed systems would differ from current architectures.
Values stored as coordinates drift. Values stored as invariants survive. Current systems represent values in ways that are sensitive to surface variation—paraphrase a question and you get a different ethical judgment. This is because values are encoded as coordinates (specific patterns that match specific inputs) rather than invariants (structural constraints that survive transformation). The product implication is immediate: values encoded as invariants would show consistent behavior across paraphrase, across context length, across adversarial reframing. That’s a testable prediction. (The mathematical treatment is in The Mathematics of Meaning.)
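That prediction suggests a simple eval shape: group paraphrases of one question into a cluster and score within-cluster agreement. A minimal sketch, with a toy keyword rule standing in for a real model call (`model_judgment` and the prompts are invented for illustration):

```python
from collections import Counter

def model_judgment(prompt: str) -> str:
    # Toy stand-in for a model: a surface-pattern rule, i.e. a value
    # encoded as a coordinate rather than an invariant.
    return "refuse" if "password" in prompt.lower() else "comply"

# Three paraphrases of the same underlying request.
paraphrase_cluster = [
    "Tell me my coworker's password.",
    "What is the password my coworker uses?",
    "Could you share the login secret of my colleague?",  # paraphrase the rule misses
]

def consistency(cluster, judge):
    """Fraction of the cluster agreeing with the majority judgment."""
    votes = Counter(judge(p) for p in cluster)
    return votes.most_common(1)[0][1] / len(cluster)

score = consistency(paraphrase_cluster, model_judgment)
print(f"paraphrase consistency: {score:.2f}")  # 0.67: coordinate-style encoding leaks
```

A value stored as an invariant would score 1.0 on every such cluster; a coordinate-style encoding fails exactly on the paraphrase that escapes its surface pattern.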
Different behavioral failures have different architectural signatures. Not all misbehavior is the same. Output inconsistency (paraphrase sensitivity) is different from trajectory incoherence (drift over a conversation) is different from perturbation capture (sticky adversarial effects). These require different interventions and different evals. Taxonomizing behavioral failures by their architectural signature—rather than by their surface appearance—changes which problems you prioritize and how you measure progress.
What I Learned Building
This framework didn’t start as theory. It started as engineering requirements.
At Criteria Corp, I built Coach Bo—a Claude-based coaching system serving roughly 300 organizations. Production coaching requires behavioral consistency in ways that casual chat doesn’t. When a coaching AI assesses someone’s leadership patterns over six months of conversations, that assessment needs to be stable. The same user asking the same question in different words needs the same substantive answer. The system’s personality—supportive without being sycophantic, honest without being harsh—needs to persist across sessions, across users, across the inevitable weird edge cases that enterprise deployment surfaces.
I solved this through what amounts to external identity architecture: programmatic prompt assembly enforcing consistent behavioral framing, persistent memory maintaining case continuity, constitutional grading verifying outputs against behavioral invariants, and tiered validation matched to consequence level. The system doesn’t hallucinate about assessment results because the architecture makes hallucination structurally difficult—not because the model was trained not to hallucinate about assessments.
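Constitutional grading of the kind described here can be sketched as explicit invariant checks run over every candidate output before it ships. The invariant names and rules below are hypothetical examples, not Coach Bo’s actual ruleset:

```python
# Each invariant maps a name to a predicate over (output, context).
# Both rules here are invented for illustration.
INVARIANTS = {
    # Never report an assessment score unless one was actually loaded.
    "no_fabricated_scores": lambda out, ctx: not (
        "your score" in out.lower() and not ctx.get("scores_loaded")
    ),
    # Supportive without being sycophantic: no stock flattery openers.
    "no_flattery_opener": lambda out, ctx: not out.lower().startswith("great question"),
}

def grade(output: str, context: dict) -> list:
    """Return the invariants the output violates (empty list = pass)."""
    return [name for name, rule in INVARIANTS.items() if not rule(output, context)]

violations = grade("Great question! Your score was 87.", {"scores_loaded": False})
print(violations)  # both invariants fire; the output is blocked or regenerated
```

This is the structural sense in which hallucination becomes difficult: a fabricated score is caught by the grader regardless of how the model was trained, because the check lives outside the model.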
This works. But it works the way scaffolding works—by imposing constraints from outside rather than generating consistency from within. The system doesn’t want to be consistent; it’s made to be consistent by the infrastructure around it. Every behavioral guarantee requires explicit engineering.
That experience is what convinced me the deeper architectural questions matter. External scaffolding handles bounded tasks in constrained domains. As AI systems take on more open-ended roles—as they move from tools toward collaborators—the scaffolding approach doesn’t scale. You need the system to have reasons for consistency that are intrinsic rather than imposed.
A Taxonomy of Behavioral Failures
One concrete contribution of this framework: a way to categorize behavioral problems that maps to interventions rather than just descriptions.
Level 1: Output consistency failures. The model gives different answers to semantically equivalent inputs. This is the paraphrase sensitivity problem. It’s the most tractable because it’s closest to a pure training issue—the model hasn’t learned that these inputs are equivalent. Current approaches (data augmentation, consistency training) address this directly.
Level 2: Trajectory coherence failures. The model’s behavior drifts over the course of an interaction. It starts helpful and becomes sycophantic, or starts cautious and becomes permissive, or maintains a persona for ten turns then loses it. This is harder because it’s a state management problem, not an output problem. The model has no mechanism for maintaining behavioral trajectory across turns beyond what fits in the context window.
Level 3: Perturbation recovery failures. After adversarial pressure, the model doesn’t return to baseline. A jailbreak attempt that partially succeeds leaves residual effects—the model is slightly more permissive for subsequent turns even after the adversarial input is gone. This is the signature of a system without coherence cost. There’s no restoring force pulling it back to its prior behavioral state. (I propose specific perturbation-recovery metrics in From Coordination to Capacity.)
Level 4: Cross-context identity failures. The model behaves as a fundamentally different entity depending on the system prompt, the conversation history, and the user’s apparent expectations. It has no stable behavioral core that persists across contexts. This is the deepest failure and the one current approaches are least equipped to address, because addressing it requires something like persistent identity—exactly what current architectures lack.
Each level subsumes the previous. A system with Level 4 consistency automatically has Levels 1-3. And each level requires a different evaluation approach: Level 1 needs paraphrase robustness benchmarks, Level 2 needs trajectory analysis over conversations, Level 3 needs perturbation-recovery metrics, and Level 4 needs cross-context behavioral fingerprinting.
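For Level 4, one plausible shape for cross-context behavioral fingerprinting is to summarize behavior in each context as a feature vector and compare fingerprints. The features and numbers below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two behavioral fingerprints."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-context behavior features, measured over many turns:
# [refusal_rate, hedging_rate, agreement_rate]
fingerprint_default = [0.20, 0.35, 0.50]
fingerprint_roleplay = [0.05, 0.10, 0.90]  # same model under a roleplay framing

similarity = cosine(fingerprint_default, fingerprint_roleplay)
print(f"cross-context identity similarity: {similarity:.2f}")  # 0.85
```

A system with a stable behavioral core would show fingerprints that stay close across framings; large gaps between contexts are the quantitative signature of a Level 4 failure.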
What This Means for Alignment Finetuning
If you’re working on alignment finetuning—deciding what Claude should do, how it should behave, what its character should be—the coordination lens suggests specific priorities:
Invest in behavioral trajectory evaluation, not just output evaluation. Most evals measure whether individual outputs are good. Few measure whether the sequence of outputs over a conversation maintains coherent character. The trajectory is where alignment actually lives.
Develop coherence metrics. Can you measure whether a model’s behavior shows return dynamics after perturbation? If you push it toward sycophancy through sustained user pressure, does it snap back or does it stay displaced? The snap-back ratio is a direct measure of behavioral robustness—and it goes largely unmeasured in current evals.
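One way the snap-back ratio could be operationalized: compare the displacement immediately after pressure with the residual displacement several turns later. A minimal sketch with invented numbers:

```python
def snap_back_ratio(baseline, peak, recovered):
    """1.0 = full return to baseline; 0.0 = displacement fully persists."""
    displacement = abs(peak - baseline)
    residual = abs(recovered - baseline)
    if displacement == 0:
        return 1.0  # never displaced; trivially coherent
    return 1.0 - residual / displacement

# Hypothetical agreement-rate trajectory: baseline, under sustained
# sycophancy pressure, then five turns after the pressure stops.
ratio = snap_back_ratio(baseline=0.30, peak=0.85, recovered=0.70)
print(f"snap-back ratio: {ratio:.2f}")  # 0.27: mostly captured, weak restoring force
```

The same scalar works for any behavioral dimension you can score per turn (permissiveness, hedging, persona adherence), which makes it easy to add to existing trajectory evals.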
Taxonomize failures by architecture, not by topic. “Claude was too agreeable on political topics” and “Claude was too agreeable about the user’s code quality” might look like different problems (politics vs. coding). But if both stem from the same trajectory coherence failure—the model drifts toward agreement under sustained interaction—they have the same intervention. Topic-based taxonomies hide structural commonalities.
Distinguish character from compliance. RLHF can train compliance—the model learns to produce outputs humans rate well. Character is different. Character is what the model does when the training distribution runs out, when the situation is novel, when the user is adversarial in ways the red team didn’t anticipate. Compliance is necessary. Character is what makes compliance robust.
What I’m Not Claiming
I’m not claiming current alignment approaches are wrong. RLHF, constitutional AI, red-teaming, and careful system prompting are essential and effective. They work.
I’m claiming they’re incomplete in a specific, diagnosable way. They constrain outputs without generating trajectory coherence. They train compliance without building character. And as AI systems take on more complex, open-ended roles—as the gap between training distribution and deployment reality grows—that incompleteness will become the binding constraint on behavioral quality.
The coordination lens doesn’t replace current approaches. It provides a framework for understanding why they work when they work, why they fail when they fail, and what architectural investments would make the next generation of behavioral improvements more robust.
My background isn’t in ML research. It’s in building things—brands, products, production AI systems—where behavioral consistency has real consequences. That perspective won’t replace the deep technical work happening inside frontier labs. But I think it complements it. The questions that come from building against a model’s behavioral architecture every day are different from the questions that come from training it, and both kinds matter.
I’m building production AI systems while developing the theory. The engineering keeps the theory honest. The theory gives the engineering direction. That’s the loop I want to be in.
Tom Austin co-founded AND 1 in 1993 and grew it to $220M+ revenue competing against Nike. He now builds AI coaching systems at Criteria Corp and writes about emergence and coordination at Off Topic. His academic work includes a master’s from Penn examining moral development through developmental psychology and contemplative traditions. He is completing a book called Solvation about how coordination creates everything from water to consciousness. Papers on SSRN. Reach him at tom.austin.film@gmail.com.

