Jonathan Horvat · 2026 Complete Paper Set · Six Papers · Public Domain · No Rights Reserved
Paper 01 — Main Paper: Intelligence at the Edge of the Cloud
White Paper · Architecture · Public Domain
Intelligence at the Edge of the Cloud
A Local Architecture for the AI Era
Jonathan Horvat
2026 · No rights reserved
Abstract

The prevailing assumption in artificial intelligence is that intelligence lives in the cloud — that to be smart, a device must be connected. This paper argues the opposite will become true. As models mature and weight compression advances, the edge will not retreat into dumbness when disconnected. It will carry the full depth of human knowledge locally, refreshed not by a live wire but by periodic synchronization of model weights.

We propose a two-model edge architecture: a broadcast-synced sparse expert world model and an on-device augmented personal model with a built-in interpreter head. Combined, this architecture addresses three problems at once: intelligence that is genuinely personal, privacy that doesn't depend on trusting a corporation, and a fraction of the energy and infrastructure cost of cloud inference.


Section I
The False Promise of the Cloud

Every major AI deployment today rests on the same implicit bargain: give us your data, your queries, your context — and we will give you intelligence. The user sends their thoughts to a remote server. The server processes them against a vast model. The answer returns. The data stays.

This is not a design choice made for the user's benefit. It is a consequence of scale. The models are too large to run locally. The compute is too expensive to distribute. Every question you ask, every document you analyze, every private thought you process through an AI system passes through infrastructure owned by a handful of corporations.

The problem is not that cloud AI companies are untrustworthy. The problem is that trust should not be required.

What follows is a description of an architecture in which it is not.

This is not an argument against the cloud. The cloud trains better models than any device ever will. It handles tasks that require computation at a scale no edge device can match. What this paper argues is simpler: not everything needs to be the cloud. A device that carries its own intelligence — that knows the world and knows you, locally, privately, offline — is not a retreat from capability. It is an addition to it. The architecture described here works alongside cloud AI, not instead of it. The local model is the default. The cloud remains available when you choose it. The difference is that choice now belongs to you.


Section II
Weights as Knowledge

A language model's weights are not code in the traditional sense. They are the accumulated result of exposure to human knowledge — billions of parameters shaped by contact with text, reasoning, and the patterns of thought that underlie both. A model's weights are, in a meaningful sense, a compression of what humanity knows — not a database, not a search index, but a form that encodes understanding rather than storing facts.

The sparse expert architecture described in this paper is not speculative. As of 2026, it is the dominant pattern at the frontier — the majority of leading model releases use sparse mixture-of-experts designs precisely because activating only a subset of experts per query delivers frontier capability at a fraction of the compute cost. The architecture inherits this efficiency directly.

Architectures improve, quantization and distillation advance, and the weight footprint for any given capability level continues to shrink. Capable models will run at the edge. The only question is when — and what the resulting architecture looks like.

→ [1]


Section III
The Sync Paradigm

A model weight update is categorically different from a software update. When a device receives updated weights, it does not merely gain new procedures. It absorbs new understanding. Months of collective learning compress into a payload the device integrates at rest, offline, without a persistent connection.

This is the sync paradigm: periodic, compressed, offline-compatible transfer of world-knowledge into edge devices. Not streaming. Not querying. Absorbing. In practice, a typical weekly sync payload for a curated expert subset is under 50 MB compressed — version-delta only. It fits comfortably in a background sync over Wi-Fi or metered 5G, with zero personal data ever leaving the device.
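
To make the version-delta mechanics concrete, here is a minimal sketch of a device-side sync planner. The `ExpertVersion` manifest format, the expert names, and the sizes are illustrative assumptions, not part of any published protocol:

```python
from dataclasses import dataclass

@dataclass
class ExpertVersion:
    expert_id: str
    version: int
    size_mb: float

def plan_sync(local: dict, manifest: list, keep: set) -> list:
    """Select only the experts this device uses whose broadcast
    version is newer than the local copy — version-delta only."""
    plan = []
    for entry in manifest:
        if entry.expert_id not in keep:
            continue  # device never activates this expert; skip the bundle
        have = local.get(entry.expert_id, -1)
        if entry.version > have:
            plan.append(entry)
    return plan

# Example: three experts broadcast, two relevant, one already current.
manifest = [
    ExpertVersion("law", 12, 18.0),
    ExpertVersion("medicine", 9, 22.0),
    ExpertVersion("astronomy", 4, 15.0),
]
local = {"law": 11, "medicine": 9}
plan = plan_sync(local, manifest, keep={"law", "medicine"})
payload_mb = sum(e.size_mb for e in plan)  # only "law" needs updating
```

Because only stale, locally relevant experts enter the plan, the weekly payload stays far below the size of the full model.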

Diagram 01 · Sync Flow — weight synchronization, one-directional, no personal data upstream
🌐
Frontier Infrastructure
Trains models. Human-gated publication to sync server. Broadcasts all expert weights. Cannot write to sync server directly.
discrete expert models · usage-aware packaging · cloud-side only
Expert weights only · Anonymized · No personal data upstream
📱
Edge Device
Absorbs relevant expert weights during sync window. Operates fully offline between syncs with frontier-level intelligence. Augmented personal model remains entirely untouched.
offline capable · curated expert subset · full intelligence
Connectivity timeline — full intelligence offline between syncs
OFFLINE · full intelligence
SYNC
OFFLINE · updated
SYNC
OFFLINE · current

The sync request itself is the only fingerprint — and a remarkably thin one. It carries no query content, no personal context, no conversational history. With trivial hardening — batched anonymous requests through a proxy, timing randomization, CDN-style distribution of popular expert bundles — even this signal becomes indistinguishable from population-level noise. The privacy guarantee holds all the way down to the infrastructure layer.
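
Timing randomization is the simplest of these hardening measures. A sketch, assuming a weekly cadence and a ±25% uniform jitter window (both arbitrary choices):

```python
import random

def next_sync_delay(base_hours: float = 168.0, jitter: float = 0.25,
                    rng=None) -> float:
    """Randomize the sync cadence so request timing carries no
    per-device signal: uniform jitter around a weekly base means
    no stable schedule to correlate against a user."""
    rng = rng or random.Random()
    return base_hours * (1.0 + rng.uniform(-jitter, jitter))

delay = next_sync_delay(rng=random.Random(42))
# delay falls somewhere in [126, 210] hours; the exact value is noise
```

Combined with batched anonymous requests and CDN-style distribution, even an adversary observing the sync server sees only population-level traffic.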

Connectivity becomes optional. Intelligence does not.


Section IV
The Core Architecture

Weight synchronization alone is not enough. A device that knows what humanity knows still lacks personal context.

Diagram 02 · Core Architecture Stack — two core models, world model and augmented personal model, all local
👤
User
Query · intent · context — interfaces with augmented personal model only
receives from user ↓
Augmented Personal Model · Secure Enclave
Augmented Personal Model
Frozen base, trainable low-rank adapters encoding you, plus a fixed interpreter head for cross-space synthesis. Queries world model. Never transmits. Never exposes internal state.
frozen base · personal adapters · interpreter head · secure enclave · sole attack surface
query ↓ / output ↑
read/write ↕
World Model · Sparse Experts · Pull-only
World Model
Pull-only. Receives weight syncs. Responds to queries from augmented personal model. Cannot initiate, transmit, or access personal data.
pull-only · sparse expert routing · selective sync · no personal data
Interpreter Head — Built In
Interpreter Head
Fixed, non-trainable cross-attention layer inside augmented personal model. Translates between independent embedding spaces. Frozen at compile time — never trained, never updated.
non-trainable · cross-attention · frozen at compile · inside secure enclave
Component                  | Reads World Model | Accesses Personal Adapters | Reaches Cloud  | Receives from User
User                       | No                | No                         | No             | n/a
Augmented Personal Model   | Query only        | Read / Write               | No             | Yes
World Model                | No                | No                         | Sync pull only | No
Interpreter Head           | Output only       | Read only                  | No             | No
The World Model

A sparse expert model carries the distilled knowledge of the frontier. Individual expert models are maintained as discrete units — both at the cloud level and on device — enabling selective synchronization matched to usage profile. The world model is pull-only: it receives weight syncs and responds to queries from the augmented personal model. It cannot initiate communication, cannot transmit, and has no access to personal data.
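
Selective synchronization matched to a usage profile reduces to a ranking problem. A sketch, with hypothetical expert names and a storage budget standing in for the device's real constraints:

```python
def select_experts(activation_counts: dict, sizes_mb: dict,
                   budget_mb: float) -> list:
    """Choose which experts to retain locally: most-activated first,
    until the device's storage budget for world-model weights is spent."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    kept, used = [], 0.0
    for expert in ranked:
        if used + sizes_mb[expert] <= budget_mb:
            kept.append(expert)
            used += sizes_mb[expert]
    return kept

counts = {"code": 900, "law": 40, "medicine": 310, "astronomy": 5}
sizes = {"code": 25.0, "law": 20.0, "medicine": 30.0, "astronomy": 15.0}
kept = select_experts(counts, sizes, budget_mb=60.0)
# → ["code", "medicine"]: the two most-activated experts fit the budget
```

The router still functions for uncovered domains; those queries simply surface lower coverage rather than failing silently.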

The Augmented Personal Model

A small, efficient model runs continuously on-device atop a frozen base. It carries thin trainable low-rank adapter layers that encode you — your language patterns, your reasoning style, your domains of interest, your history.

Built directly into this model is a fixed, non-trainable interpreter head — a small cross-attention layer, frozen at compile time. This head is the only component that ever sees both the world model output and your personal adapter. It translates between the two independent embedding spaces without ever letting the models touch, then synthesizes the final response.

The entire personal delta — adapters plus interpreter head — remains just a few megabytes. It is updated incrementally through interaction and never leaves the device. The interpreter head lives inside the same hardware secure enclave as the personal model, making the attack surface extremely narrow by design.
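
The "few megabytes" claim follows from arithmetic. A sketch, assuming a hypothetical 3B-class personal base with 2048-dimensional hidden states, 28 layers, rank-8 adapters on the four attention projections, and fp16 storage — all illustrative parameters:

```python
def lora_delta_mb(hidden: int, layers: int, matrices_per_layer: int,
                  rank: int, bytes_per_param: int = 2) -> float:
    """Size of the trainable low-rank delta: each adapted d×d matrix
    gains two thin factors, A (r×d) and B (d×r), instead of a full
    d×d update — the core of low-rank adaptation."""
    params_per_matrix = 2 * rank * hidden
    total = params_per_matrix * matrices_per_layer * layers
    return total * bytes_per_param / 1e6

mb = lora_delta_mb(hidden=2048, layers=28, matrices_per_layer=4, rank=8)
# ≈ 7.3 MB of personal delta, versus gigabytes for the frozen base
```

A payload this small is trivial to checkpoint, encrypt, and keep inside the enclave; nothing about it needs to leave the device.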

The user interacts only with the augmented personal model. The world model never sees personal context. The personal model never exposes its internal state. All orchestration happens locally, inside a single secure component.

→ [2, 3, 4]


Section V
Privacy as Architecture, Not Policy
Diagram 03 · Privacy Threat Model — attack surface analysis, cloud model vs edge architecture
⚠ Current Cloud Model
☁️
Personal data leaves device
Every query, document, and context transmitted to remote servers just to receive a response.
🏢
Corporate trust required
Privacy is a policy promise — revisable, breachable, erodible by business model changes.
🔓
Large diffuse attack surface
Data in transit, at rest on servers, in logs — multiple external exposure points.
✓ Edge Architecture
📱
Augmented personal model never leaves device
Local by design. No query requires transmitting personal context.
🏗️
Privacy is structural
A consequence of the architecture. Cannot be revoked by policy or eroded by business model.
🎯
Single narrow attack surface
Only the augmented personal model touches both — via its built-in interpreter head — locally, inside a secure enclave.
🫥
Broadcast eliminates the fingerprint
No request-response means no fingerprint. The server broadcasts to everyone. It never knows who kept what.
⚠ Residual Attack Surface
The augmented personal model is the sole component requiring hardening. Secure enclaves and trusted execution environments — already mature in consumer hardware — map directly onto this requirement. Best practices such as side-channel resistance, regular firmware updates, and runtime attestation further harden this single narrow surface. One local target is categorically more defensible than distributed cloud infrastructure.
Diagram 04 · Query Flow Comparison — cloud paradigm vs edge architecture
☁ Cloud Paradigm
origin
User forms a query
⚠ exposure
Personal context transmitted to remote server
⚠ processing
Processed against cloud model — data at rest on external infrastructure
⚠ dependency
No connectivity = no intelligence
result
Intelligent response returned
VS
📱 Edge Architecture
origin
User forms a query
✓ local
Augmented personal model receives intent — nothing leaves device
✓ local
World model queried · Interpreter head synthesizes · All local
✓ resilience
Works fully offline — intelligence is a local property
result
Intelligent response — private, local, current

Privacy becomes a property of the system, not a feature of the product. It cannot be revoked by a policy update. It cannot be eroded by a business model change. It simply is.


Section VI
Graceful Degradation as a First-Class Feature

A device with limited or outdated experts is not broken — it simply has known, expressible limits. The augmented personal model tracks exactly which weights are present, how current they are, and how well-matched they are to the query at hand.

Rather than hallucinating into gaps, the interpreter head surfaces a coverage score — a weighted combination of router activation confidence, expert freshness relative to the query domain, and embedding-space domain relevance. The result is a concrete, honest signal: "Drawing from world knowledge current to March 2026 with 92% domain coverage for quantum error correction."
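
A minimal sketch of such a score, assuming near-equal weights and an exponential freshness decay with a 90-day half-life — illustrative choices, not values specified by the architecture:

```python
def coverage_score(router_conf: float, freshness_days: float,
                   domain_relevance: float,
                   half_life_days: float = 90.0,
                   weights=(0.4, 0.3, 0.3)) -> float:
    """Blend router activation confidence, expert freshness (decaying
    with age of the last sync), and embedding-space domain relevance
    into a single honest signal in [0, 1]."""
    freshness = 0.5 ** (freshness_days / half_life_days)
    w_conf, w_fresh, w_dom = weights
    return w_conf * router_conf + w_fresh * freshness + w_dom * domain_relevance

# Experts synced 30 days ago, confident routing, strong domain match.
score = coverage_score(router_conf=0.95, freshness_days=30.0,
                       domain_relevance=0.9)
# ≈ 0.89 — surfaced alongside the answer instead of silent guessing
```

The exact weighting is a tuning question; what matters is that the score is computed from quantities the device already tracks locally.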

A user asks about the latest developments in quantum error correction given their prior work on surface codes. The system returns the coverage score, leans on personal adapters to tailor the response to their exact reasoning style and historical context, and surfaces uncertainty where it exists. No hallucination — just transparent, graceful degradation.

In degraded states it leans more heavily on what it knows about the user to fill gaps intelligently. The personal adapters don't degrade with the world model. They always reflect you.


Section VII
Neural Architecture

Two core models, each with independent embedding spaces, coordinated by a built-in interpreter head inside the augmented personal model.

Interpreter Head Mechanics

The interpreter head is a small, fixed, non-trainable cross-attention module — typically 4–8 heads, low dimensionality — frozen at compile time. It takes the world model's output tokens in their native embedding space and attends to them from the personal model's context tokens. Because it is frozen and read-only, no gradients ever flow back into either model during adapter training. It acts purely as a translator while preserving strict separation.

# Frozen, read-only translation: no gradients reach either model.
world_output  = world_model(query_from_personal)    # embedding space W
personal_ctx  = personal_adapters(user_history)     # embedding space P
translated    = interpreter_head.cross_attention(
    query=personal_ctx,
    key=world_output,
    value=world_output
)
final_response = personal_model.decode(translated)

The world model remains a sparse expert system that is pull-only and periodically refreshed via sync. The augmented personal model contains both the trainable low-rank adapters that encode the user and a fixed interpreter head that performs the cross-space synthesis. This head bears the full computational cost of translation while preserving strict separation: the world model never sees personal data, and the personal model never leaks its private state outward.

All other architectural benefits — selective sync, graceful degradation, per-expert quantization, and incremental on-device training — remain unchanged.

Neither component requires new invention. Sparse expert world models are already the dominant frontier architecture. Adapter-based on-device personalization with a frozen base is already in production — major device manufacturers ship exactly this pattern today. The contribution here is the combination: privacy-first, offline-resilient, locally intelligent.

Diagram 05 · Neural Architecture Blueprint — two core models, interpreter head built into augmented personal model
World Model · Sparse Experts · Broadcast Receive
Base
Shared Router
Routes queries to relevant experts. Frozen between syncs.
Expert A
Domain Expert
Higher precision. Frequently used. Prioritized in sync.
Expert B–N
Domain Experts
Lower precision fallback. Synced by usage profile.
Quantization
Per-expert precision
Frequent experts at higher precision. Fallbacks aggressively quantized. Profile evolves with usage.
Embedding
Independent space
Does not share embedding space with augmented personal model.
Broadcast: Server broadcasts all expert weights. Devices retain what they use locally. Bandwidth cost is version-delta only — typically <50 MB/week compressed.
↕ queried by augmented personal model only · through user intent as lens
Augmented Personal Model · Secure Enclave · Interpreter Head Built In
Interpreter Head
Fixed cross-attention
Non-trainable. Frozen at compile time. Translates between world model output and personal adapter space. Bears full cross-space synthesis cost.
Personal Adapters
Low-rank delta
Trainable layers encoding you. A few MB. Updated by behavioral observation only. Never transmitted. Never supervised.
Frozen Base
Shared foundation
Shared across interpreter head and personal adapters. Broadcast on slow cadence. Geometry-constrained updates preserve adapter integrity.
Degradation
Coverage score
Tracks expert presence and freshness. Surfaces coverage score rather than hallucinating. Personal adapters always reflect you regardless of world model state.
Security: Runs inside a hardware secure enclave. The sole attack surface of the entire architecture. Dedicated hardware-level protection is a fundamental architectural requirement.

→ [4, 7, 8]

An Open Research Direction

The interpreter relationship described here may itself be more than a fixed mechanism — it may be emergent. Prior work in multi-agent reinforcement learning has shown that neural networks develop compressed, task-specific communication protocols unprompted when given a consistent interface and a signal that rewards successful coordination. The personal model learning to read the world model through observation may follow the same principle: given enough interaction, a high-bandwidth AI-native channel develops that is more efficient than anything a human designer would specify, and specific to this device, this user, this pair of models.

Two findings from the broader literature suggest this channel may be more learnable than expected. Large language models have been shown to develop functional organization strikingly similar to human brain fMRI activation patterns — consistent specialization emerging independently across models trained on similar architectures. And the human brain itself exhibits fractal organizational structure across scales, with growing evidence that sufficiently trained artificial networks converge toward similar self-similar structure under the pressure of efficient compression. If independently trained models develop analogous internal organization, the personal model is not learning to read an alien space — it is finding correspondence between two instances of the same underlying pattern.

Whether oscillatory synchronization between models — analogous to neural entrainment in biological systems — could accelerate the formation of this channel, and whether fractal scale-invariance makes it transferable across capability upgrades, are open research questions. The architecture creates the conditions. What emerges is worth measuring.

→ [5, 6, 9, 10, 11, 12, 13, 14, 15]


Section VIII
The Graduated Transition

This architecture does not require a utopian leap. It is viable today at reduced capability, and becomes more capable as hardware advances. The weight complexity synced to a device scales with what that device can run. The architecture is constant across the capability curve — only the ceiling changes.

As edge chips improve, larger and more capable world model weights become syncable. As world models become more efficient through distillation and quantization, they fit on less capable hardware. Both curves accelerate each other. The transition is a ramp that builds itself.

In a sparse expert architecture, different expert weights can carry different precision profiles — frequently used experts at higher precision, rarely accessed fallbacks more aggressively quantized. The augmented personal model's usage patterns inform this profile over time — the device becomes progressively better calibrated to its owner without any external direction.
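
A sketch of how a per-expert precision profile might be derived from usage share; the thresholds and the precision tiers are illustrative assumptions:

```python
def precision_profile(usage_share: dict) -> dict:
    """Assign per-expert precision from the device's own usage
    profile: hot experts keep more bits, cold fallbacks are
    aggressively quantized. Thresholds here are arbitrary."""
    profile = {}
    for expert, share in usage_share.items():
        if share >= 0.20:
            profile[expert] = "int8"   # frequent: higher precision
        elif share >= 0.05:
            profile[expert] = "int4"
        else:
            profile[expert] = "int2"   # rare fallback: smallest footprint
    return profile

usage = {"code": 0.55, "medicine": 0.30, "law": 0.10,
         "astronomy": 0.04, "geology": 0.01}
profile = precision_profile(usage)
```

Because the profile is recomputed from local activation counts, the calibration described above happens without any external direction.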

Today, a mid-range smartphone running a 3–7 billion parameter quantized world model delivers genuine general intelligence locally — summarization, reasoning, writing, analysis — across the domains its expert profile covers. The personal adapter set starts empty and becomes calibrated within days of normal use. The architecture is useful from day one and better by day thirty. At the other end of the curve, as dedicated edge AI chips mature and model compression continues, the gap between local and cloud capability narrows to the point where the distinction becomes about latency and privacy preference rather than raw intelligence. The transition does not require waiting for that end state. Throughout it, the cloud remains available for tasks that exceed local capability. Early adopters route to the cloud selectively. Later adopters rarely need to. The architecture accommodates both without changing.



Section IX
Physical AI

The cloud model is not merely inefficient for physical AI. It is architecturally incompatible with it.

01
Local intelligence is a hard constraint
Physical systems operate in real time. When the consequence of latency is physical — a fall, a collision, a missed action — a cloud round trip is not slower, it is disqualifying. Intelligence must live where the action is.
02
Connectivity in physical environments is unreliable by nature
Physical environments are dynamic, adversarial, and unpredictable. An architecture premised on connectivity will fail at the moments that matter most. An architecture premised on local intelligence with periodic sync will not.
03
Physical systems that know their operator are more capable and safer
A system calibrated to a specific person — their patterns, their tolerances, their preferences — performs better and fails more gracefully than a generic one. The personal adapter model achieves this privately, without transmitting anything about the person.
04
The broadcast model scales naturally to physical deployment
Fleets, networks, distributed systems — all receive the same broadcast, all operate independently, all stay current without central dependency or single point of failure. There is no architecture to coordinate, no hub to lose.
05
Conservative failure is a safety property
A cloud-dependent physical system that loses connectivity faces two bad options: stop, or continue on intelligence of unknown currency. A local system with explicit uncertainty knows exactly how current its world model is and can fail conservatively. Graceful degradation in physical AI is not a convenience feature — it is a design requirement.
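
One way to operationalize conservative failure, assuming a coverage score like the one described in Section VI is available at decision time; the stake levels and thresholds are illustrative:

```python
def act_or_hold(coverage: float, stakes: str, floor: dict = None) -> str:
    """A physical system acts only when its coverage score clears a
    stakes-dependent floor; otherwise it degrades to a safe hold
    rather than guessing with stale or missing world knowledge."""
    floor = floor or {"low": 0.3, "medium": 0.6, "high": 0.85}
    return "act" if coverage >= floor[stakes] else "safe_hold"

# Same coverage, different stakes, different decision:
a = act_or_hold(0.7, "medium")   # → "act"
b = act_or_hold(0.7, "high")     # → "safe_hold"
```

The point is not the thresholds but the shape: the decision is made locally, from a locally known quantity, with no connectivity in the loop.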

This architecture does not just accommodate physical AI. It is the architecture physical AI requires.

→ [18, 19, 20]

Section X
Consequences at Scale
Infrastructure Displacement

The current AI infrastructure buildout is premised on inference at scale — serving billions of queries continuously through cloud architecture. The ongoing energy cost of AI lives here, not in training. Moving inference to the edge eliminates this cost at the cloud level. The aggregate power consumption of edge devices is not zero — but the efficiency advantage of local inference over remote inference is so large that the net reduction is substantial. The ongoing energy cost of AI becomes a bounded problem rather than an open-ended one.

Beyond energy, the continuous internet infrastructure that exists primarily to shuttle queries to models and return answers becomes largely unnecessary for AI interaction. The infrastructure problem becomes one of occasional high-quality sync windows rather than permanent fat pipes — a categorically different and cheaper problem.

Geopolitical Leapfrogging

The mobile phone leapfrogged landline infrastructure across much of the developing world. This architecture enables a more complete leapfrog for AI and, ultimately, general connectivity. The only infrastructure requirement is the device itself and an occasional low-bandwidth sync window — deliverable by satellite, intermittent Wi-Fi, or peer-to-peer mesh networks sharing weight updates between devices.

Nations that have resisted cloud AI adoption because it meant routing citizens' data through foreign infrastructure gain an alternative that requires no such compromise. Intelligence becomes a local resource rather than a foreign service. A government can cut internet access. It cannot cut intelligence that is already local.

The Edge Chip as Strategic Battleground

If inference moves to the edge permanently, the semiconductor industry's priorities shift. The device chip must run world model inference and the augmented personal model continuously, efficiently, and securely. Power efficiency, unified memory, neural processing unit integration, and hardware-level secure enclaves become the metrics that matter.

A new measure emerges to replace tokens-per-second: intelligence per milliwatt. Whoever defines and wins that metric defines the next era of semiconductors. Hardware secure enclaves for the augmented personal model become a durable competitive moat.

→ [16, 17]


Section XI
The Human Analogy

The architecture maps naturally onto human cognition. The personal adapters correspond to episodic and autobiographical memory — individual, never shared wholesale. The world model corresponds to semantic memory — general knowledge of how things work. The augmented personal model corresponds to executive function — dynamically combining both in context.

These systems in the human mind do not merge. They are accessed and combined dynamically, in response to context. The architecture didn't invent this separation — it encodes something cognition already figured out.


Section XII
Environmental Implications

The prevailing assumption in AI infrastructure is that intelligence scales with physical build-out. More compute means more capability. More capability means more data centers. More data centers mean more cooling, more power, more fiber, more buildings, more land, more water. This is not a temporary phase. It is the foundational logic of the cloud model — and its environmental consequences are structural outputs of that model, not incidental side effects.

This architecture inverts that assumption. Intelligence stops being something you build out and starts being something you compress inward.

The cloud model treats intelligence as a utility that must be generated centrally and transmitted. This architecture treats intelligence as a property that can be stored locally and updated periodically. The environmental difference between those two models is the difference between a power grid and a battery.

Energy — Inference, Not Training

The public conversation about AI's energy footprint focuses heavily on training — the large, visible, one-time cost of building a model. But training is bounded and periodic. Inference is continuous, and it scales with every user on earth, every query, every day. Moving inference to the edge eliminates this ongoing cost at the cloud level. Training infrastructure remains — but training a model once and distributing its weights is categorically different from serving that model's outputs billions of times daily.
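
The bounded-versus-continuous distinction can be made concrete with deliberately round, purely illustrative numbers (none of these figures are measurements):

```python
# Illustrative arithmetic only — every input here is an assumption.
training_gwh = 50.0          # one-time cost to train a frontier model
per_query_wh = 0.3           # assumed cloud inference energy per query
queries_per_day = 2e9        # assumed global query volume

daily_inference_gwh = per_query_wh * queries_per_day / 1e9   # 0.6 GWh/day
days_to_match_training = training_gwh / daily_inference_gwh  # ≈ 83 days
```

Under these assumptions, cumulative inference energy overtakes the entire training run in under three months — and then keeps growing forever, which is exactly why moving inference to the edge changes the shape of the problem rather than just its size.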

Water

Data centers are large consumers of water, used primarily for cooling — a cost largely invisible in public accounting of AI's environmental impact. The water footprint of a single large data center can rival that of a small city. As construction accelerates globally, this burden falls on local watersheds and aquifers, often in regions already under stress, chosen precisely because land and power are cheap.

Geographic Concentration of Burden

Cloud infrastructure does not distribute its environmental burden evenly. Data centers cluster near cheap power and water — which means specific regions, specific communities, and specific ecosystems absorb disproportionate costs. Rivers near data center clusters run warmer. Local grids are stressed. Distributed edge intelligence diffuses this burden across billions of devices that people are already powering. No community bears a concentrated environmental cost. No watershed is selected for proximity to cheap land.

Intelligence Per Milliwatt

The current dominant metric in AI hardware is tokens per second — optimized for cloud-scale inference serving. For the edge architecture, the right metric is different: intelligence per milliwatt — how much genuine reasoning capability can be delivered per unit of energy on a constrained device. This reorientation builds energy efficiency into the fundamental design objective of the next generation of AI hardware, rather than treating it as a secondary concern.

The history of energy suggests that distributed storage is more resilient, more equitable, and ultimately more efficient than centralized generation and transmission at scale. The same logic applies to intelligence.

→ [16, 17]


Section XIII
On Sharing This Freely

This paper is public domain. The ideas here belong to no one. They are offered to the commons in the same spirit that the foundational architectures of the internet were offered — not as products to be extracted from, but as gifts to a future that will build on them in ways their originators could not predict.

If something here is useful, use it. If something here is wrong, correct it publicly. If something here sparks a better idea, share that too.


Section XIV
Reference Implementation Sketch

This architecture is implementable today on any device with a neural processing unit and hardware secure enclave — Apple A/M-series with Secure Enclave, Qualcomm Snapdragon with Hexagon NPU and TrustZone, and equivalents. The world model runs on existing sparse-MoE runtimes. Personal adapters use on-device LoRA training loops. The interpreter head ships as a static frozen ONNX or CoreML module. Sync logic is a background asset pipeline with delta patching.
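
A toy sketch of the delta-patching step in that pipeline, using a simple (offset, bytes) edit list and a hash check before activation; a real implementation would use a binary diff format such as bsdiff or zstd dictionary compression:

```python
import hashlib

def apply_delta(base: bytes, delta: list, expected_sha256: str) -> bytes:
    """Apply a version delta to an expert weight blob: a delta is a
    list of (offset, replacement_bytes) edits against the previous
    blob, verified against the published hash before activation."""
    buf = bytearray(base)
    for offset, chunk in delta:
        buf[offset:offset + len(chunk)] = chunk
    patched = bytes(buf)
    if hashlib.sha256(patched).hexdigest() != expected_sha256:
        raise ValueError("delta verification failed — keep previous weights")
    return patched

# Toy blob: 8 zero bytes; the delta rewrites bytes 2–3.
base = b"\x00" * 8
delta = [(2, b"\xff\xff")]
expected = hashlib.sha256(b"\x00\x00\xff\xff\x00\x00\x00\x00").hexdigest()
patched = apply_delta(base, delta, expected)
```

Failing closed — keeping the previous weights when verification fails — is what lets the background pipeline run unattended.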

No new silicon or OS primitives are required. The architecture composes what already ships in 2026 consumer hardware. The only thing missing is the decision to build it this way.


References
  [1] Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066.
  [2] Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  [3] Kopiczko, D. J. et al. (2025). Low-rank adaptation for edge AI. Scientific Reports.
  [4] Apple Machine Learning Research (2024/2025). Apple Intelligence Foundation Language Models. machinelearning.apple.com. — Adapter-based on-device personalization with frozen base, in production across Apple devices.
  [5] Liu, Y. et al. (2025). Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models. arXiv:2502.20408.
  [6] AlKhamissi, B. et al. (2024). The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units. arXiv:2411.02280.
  [7] Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. transformer-circuits.pub.
  [8] Bereska, L. & Gavves, S. (2024). Mechanistic Interpretability for AI Safety — A Review. arXiv:2404.14082.
  [9] Wang, Y. et al. (2024). Neuro-evolutionary evidence for a universal fractal primate brain shape. eLife. DOI:10.7554/eLife.92080.
  [10] Hilgetag, C. C. & Goulas, A. (2020). The fractal brain: scale-invariance in structure and dynamics. Cerebral Cortex, 33(8), 4574–4587.
  [11] Miyato, T., Löwe, S., Geiger, A. & Welling, M. (2025). Artificial Kuramoto Oscillatory Neurons (AKOrN). ICLR 2025 (Oral). arXiv:2410.13821.
  [12] Shim, G. et al. (2025). HoloBrain and HoloGraph: Brain-inspired oscillatory synchronization for graph neural networks. Nature Communications.
  [13] Scientific Reports (2025). Deep Oscillatory Neural Network (DONN). Vol. 15, Article 40968.
  [14] Zhu, C. et al. (2024). A Survey of Multi-Agent Deep Reinforcement Learning with Communication. Autonomous Agents and Multi-Agent Systems, Vol. 38.
  [15] Li, Y. et al. (2015). Convergent Learning: Do Different Neural Networks Learn the Same Representations? arXiv:1511.07543. — Empirical evidence that independently trained networks converge on similar representational features.
  [16] MIT Technology Review (2025). We did the math on AI's energy footprint. technologyreview.com. — Inference now accounts for 80–90% of AI computing energy.
  [17] International Energy Agency (2024). Energy and AI: Energy Demand from AI. iea.org. — Global data center electricity consumption projected to double to 945 TWh by 2030, with AI as the primary driver.
  [18] Chafii, M. et al. (2023). Emergent Communication in Multi-Agent Reinforcement Learning for Future Wireless Networks. arXiv:2309.06021.
  [19] TechRxiv (2025). Physical AI: Bridging the Sim-to-Real Divide Toward Embodied Intelligence. — Latency, bandwidth, and reliability constraints render cloud-centric inference architectures insufficient for physical AI.
  [20] Edge AI and Vision Alliance (2026). On-Device LLMs in 2026: What Changed, What Matters, What's Next. — Cloud round-trips add hundreds of milliseconds, breaking real-time physical experiences.
Public domain. No rights reserved. Build freely.
Jonathan Horvat
Paper 02 — Companion · Why This Is Not Federated Learning
Companion Paper · Privacy Architecture
Why This Is Not Federated Learning
A Precise Distinction Between Two Privacy Models
Jonathan Horvat
2026 · Public domain · No rights reserved
Abstract

When readers encounter the edge AI architecture described in Intelligence at the Edge of the Cloud, the most common instinct is to reach for federated learning as a comparison. The surface similarities are real: both keep raw data on device, both involve some relationship between device and server, both claim privacy properties. But the underlying privacy models are categorically different — not incrementally, not in degree, but in kind. This paper makes that distinction precise. Federated learning privatizes data while the server continues to learn from devices. The broadcast sync architecture eliminates the server's ability to learn from devices entirely. Those are different guarantees with different threat models, different attack surfaces, and different trust requirements. Understanding why matters both for evaluating the architecture and for knowing when each is the right choice.


This paper is a companion to: Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain.

Section I
The Comparison That Will Be Made

Federated learning is the right frame to reach for. It is the most prominent privacy-preserving machine learning paradigm, it involves devices and servers, it keeps raw training data local, and it has a substantial research literature behind it. When someone reads an architecture proposal that also involves devices, servers, and local data, federated learning is the natural reference point.

The comparison is not wrong. The two share a family resemblance. Both are responses to the same problem: personal data is valuable for training AI systems, but centralizing that data creates privacy risks and concentrations of power. Federated learning was designed to solve that problem. The broadcast sync architecture described here was designed to solve a different but related problem: personal data should never reach the server at all, not even in derivative form.

That difference — derivative form — is where the architectures diverge. Federated learning transmits gradients, not raw data, and treats the distinction as a privacy guarantee. The broadcast sync architecture transmits nothing from the personal model, and treats the absence of transmission as the guarantee. These are not the same claim.

Federated learning asks: can we learn from data without seeing the data? The broadcast sync architecture asks: can we build personal intelligence without the server learning anything at all?


Section II
What Federated Learning Actually Does

Federated learning, introduced by McMahan et al. in 2017, is a distributed training protocol in which a central model is improved using data that never leaves client devices. Devices download the current model, train locally on their private data, compute gradient updates, and send those gradients — not the raw data — to the server. The server aggregates gradients from many devices, updates the central model, and distributes the improved model back. The cycle repeats.

This is a genuine privacy improvement over centralizing raw data. The training data never leaves the device. The server never stores personal records. From a legal and operational standpoint, federated learning is substantially better than collecting data centrally.

The protocol has three components that matter for privacy analysis: local training, gradient transmission, and server aggregation. The first is entirely private. The third happens server-side and involves no individual device data. The second — gradient transmission — is where the privacy question lives.
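As a concrete sketch, the round structure above can be written in a few lines of plain Python. The helper names (`local_gradient`, `fedavg_round`) and the tiny two-client example are illustrative, not any reference implementation; the point is the uplink in the aggregation step, where each device transmits its gradient.

```python
# Minimal one-round FedAvg sketch for a linear model y ~ w·x.
# Illustrative names and data; shows the protocol shape only.

def local_gradient(w, data):
    """Device-side: gradient of 0.5*(w·x - y)^2 averaged over local data.
    The raw (x, y) pairs never leave the device; only this gradient does."""
    n = len(data)
    g = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / n
    return g

def fedavg_round(w, client_datasets, lr=0.1):
    """Server-side: collect one gradient per client, average, update."""
    # Uplink: each device transmits a derivative of its private data.
    grads = [local_gradient(w, d) for d in client_datasets]
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]
    return [wi - lr * gi for wi, gi in zip(w, avg)]

clients = [
    [([1.0, 0.0], 2.0)],   # client A's private data
    [([0.0, 1.0], -1.0)],  # client B's private data
]
w = fedavg_round([0.0, 0.0], clients)
```

The gradient in the collection step is exactly the derivative signal the remainder of this paper examines.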

Diagram 01 · Protocol Comparison — Communication flows: what travels between device and server
⚠ Federated Learning
01 · Server sends current model weights to device
02 · Device trains locally on private data
03 · Device sends gradients to server — derivative signal of private data
04 · Server aggregates gradients from all devices
05 · Server updates central model — server learns from device data
06 · Improved model redistributed to devices
✓ Broadcast Sync Architecture
01 · Server broadcasts world model weights — one direction, no request-response
02 · Device receives broadcast passively — retains what it uses locally
03 · Personal model trains entirely on-device — behavioral observation only
04 · Nothing leaves the device — no gradients, no updates, no signal of any kind
05 · World model improves through centralized training on separate data — device uninvolved

The critical difference is in step three of each protocol. Federated learning requires the device to transmit a gradient. The broadcast sync architecture has no step in which the device transmits anything about the user. The personal model's training signal never leaves the hardware boundary.

→ [1]


Section III
Gradients Are Not Private

The federated learning privacy argument rests on the assumption that gradients are safe to share — that while raw data is sensitive, the mathematical derivatives of training are not. This assumption was overturned in 2019.

Zhu, Liu, and Han demonstrated that private training data can be reconstructed from shared gradients with high fidelity. Their attack — Deep Leakage from Gradients — works by optimizing a dummy input to match the observed gradient. Given the gradient and the model architecture, an attacker can recover the original training sample. For images, recovery is pixel-accurate. For text, recovery is token-accurate.

⚠ The Gradient Inversion Attack — Zhu et al. (2019)

The attack proceeds as follows. An attacker with access to a client's gradient — which in federated learning includes the server itself — initializes a random dummy input and computes its gradient using the shared model architecture. The attacker then optimizes the dummy input by minimizing the distance between its gradient and the observed client gradient. As the optimization converges, the dummy input approaches the actual training data.

The result: gradients leak the data they were computed from. Not probabilistically, not in aggregate, but directly and specifically. The training data can be reconstructed from the gradient alone.
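The leakage can be made concrete with a deliberately tiny special case. For a single linear neuron with squared-error loss, the shared gradient determines the private input in closed form, up to a two-fold ambiguity. This is a toy illustration of why gradients carry the data, not the iterative gradient-matching optimization Zhu et al. actually use for deep networks, and it assumes the label is known (DLG also recovers labels). All names and numbers below are hypothetical.

```python
# Toy gradient leakage for one linear neuron with loss L = 0.5*(w·x - y)^2,
# so the shared gradient is g = r*x with r = w·x - y. For this special case
# the private input is recoverable in closed form; real DLG attacks recover
# x for deep networks by iteratively matching a dummy input's gradient to g.
import math

def invert_gradient(w, g, y):
    """Recover candidate inputs x from gradient g, weights w, label y.
    Since g = r*x, we have w·g = r*(y + r): a quadratic in r. Each root r
    yields a candidate reconstruction x = g / r (two-fold ambiguity)."""
    wg = sum(wi * gi for wi, gi in zip(w, g))
    disc = math.sqrt(y * y + 4.0 * wg)
    candidates = []
    for r in ((-y + disc) / 2.0, (-y - disc) / 2.0):
        if r != 0.0:
            candidates.append([gi / r for gi in g])
    return candidates

w = [1.0, 2.0, -1.0]       # shared model weights (known to the server)
x_true = [0.5, -1.0, 2.0]  # private training input (never transmitted)
y = 1.0                    # label
r = sum(wi * xi for wi, xi in zip(w, x_true)) - y
g = [r * xi for xi in x_true]  # the gradient the device would upload

recovered = invert_gradient(w, g, y)
# x_true is among the (at most two) candidates: the gradient leaked the data.
```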

Subsequent work by Geiping et al. (2020) extended the attack to high-resolution images at larger batch sizes, making it practical in realistic federated learning deployments — not just in carefully constructed experiments. The vulnerability is not a theoretical edge case. It is an active research area precisely because the attack works.

The honest-but-curious server — a server that faithfully executes the FL protocol but inspects received gradients — has full access to this attack. No protocol violation is required.

Defenses exist — differential privacy noise, gradient compression, secure aggregation — but each involves tradeoffs. Differential privacy noise degrades model quality. Gradient compression loses information that may be useful for training. Secure aggregation adds computational overhead and cryptographic complexity. None eliminates the channel through which the attack flows. They make the attack harder; they do not remove the attack surface.

The federated learning privacy guarantee is: we do not collect your data, and we make it difficult to reconstruct your data from the derivative signal we do collect. The broadcast sync guarantee is: there is no derivative signal. There is nothing to reconstruct.

→ [2, 3]


Section IV
The Architectural Difference

The broadcast sync architecture does not modify federated learning's communication protocol to make it safer. It eliminates the communication protocol for the personal model entirely.

The personal model — the component that knows the user — trains through behavioral observation on-device. It observes how the user interacts: when they rephrase a query, when they accept a response and move on, when they push deeper. These signals update the low-rank adapter weights that encode the user's patterns. No gradient is computed with respect to an external objective. No update is transmitted. No server is involved in the personal model's improvement at any stage.
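One common way to realize "low-rank adapter weights" is a LoRA-style overlay on frozen base weights. The sketch below assumes that form purely for illustration; the paper does not mandate a specific adapter parameterization, and all names here are hypothetical.

```python
# Low-rank personal adapter sketch (LoRA-style): the frozen base weight W
# is overlaid with a small trainable product A @ B learned on-device.
# Assumed form for illustration; not a parameterization the paper specifies.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def effective_weight(W, A, B):
    """W stays frozen (refreshed only by broadcast sync); A and B are the
    only trainable parameters, and they are updated exclusively on-device."""
    AB = matmul(A, B)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, AB)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
A = [[0.1], [0.2]]            # on-device adapter, rank 1
B = [[0.5, -0.5]]
W_eff = effective_weight(W, A, B)
# A behavioral signal (say, a rephrased query) would nudge A and B locally;
# no gradient or weight update ever leaves the device.
```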

The world model — the component that knows the world — is improved entirely separately. It is trained on data that has no relationship to any specific user. Its weights are distributed via broadcast: a one-directional transmission with no request-response channel. The server broadcasts to every device. It does not know which devices received what. It does not know which users exist. There is nothing to aggregate, nothing to invert, nothing to subpoena.

Diagram 02 · Threat Model Comparison — Attack surfaces in each architecture
⚠ Federated Learning
📡
Gradient transmission channel
Device sends gradient to server on every training round. Server has persistent access to derivative signals from all devices.
🔬
Gradient inversion attack
Server — or man-in-the-middle — can reconstruct training data from gradients. Pixel-accurate for images, token-accurate for text.
🏢
Honest-but-curious server
A server that executes the protocol faithfully but inspects received gradients has full access to the inversion attack. No breach required.
⚖️
Legal compellability
Gradient logs held server-side are subject to legal process. The server accumulates a history of device signals over time.
✓ Broadcast Sync Architecture
📵
No uplink channel from personal model
The personal model never transmits. No gradient, no update, no signal. There is no channel to intercept or compel.
📻
Broadcast eliminates fingerprint
Server broadcasts to all devices identically. No request-response means no record of who received what. Server cannot identify individual devices.
🔒
Single local attack surface
The augmented personal model in a hardware secure enclave is the only component requiring hardening. Physical proximity required for attack.
⚖️
Nothing to compel
Server holds no user-derived data. Legal process against the server yields nothing about individual users. Intelligence is local property.

The threat models are not just different in degree. They are structurally different. Federated learning's residual risk lives in a communication channel that is fundamental to its operation — you cannot remove gradient transmission and still have federated learning. The broadcast sync architecture's residual risk lives in the physical device — a local, hardware-bounded surface that requires proximity to exploit. One attack surface is remote and scalable. The other is local and resource-intensive.

→ [1, 2, 3, 4]


Section V
The Privacy Model Comparison
Property | Federated Learning | Broadcast Sync Architecture
Raw data transmission | Never transmitted | Never transmitted
Gradient transmission | Required — fundamental to protocol | None — no transmission from personal model
Server learns from user | Yes — central model improves from aggregated gradients | No — world model trained on separate data entirely
Gradient inversion risk | Present — active attack vector, pixel-accurate reconstruction demonstrated | Eliminated — no gradient to invert
Server fingerprint | Yes — server knows which devices participated and when | No — broadcast has no request-response; server cannot identify receivers
Legal compellability | Server holds gradient history — subject to legal process | Server holds no user-derived data — nothing to compel
Trust requirement | Honest-but-curious server is a threat model concern | Server cannot learn from devices regardless of intent
Personal model improves from | Aggregated device gradients — cross-device learning | Behavioral observation — single device, no external signal
Residual attack surface | Remote, scalable — gradient interception over network | Local, physical — secure enclave on device

The table makes the structural difference visible. Every row where federated learning has a residual risk traces back to the gradient transmission channel. Every row where the broadcast sync architecture has no risk traces back to the absence of that channel. The privacy guarantees are not on the same spectrum. They are different architectures for different threat models.


Section VI
The Honest Tradeoff

Federated learning's gradient transmission is not only a vulnerability. It is also a feature. By aggregating gradients from millions of devices, federated learning can improve a central model using patterns that exist across the user population — patterns that no individual device could observe alone. A keyboard that improves its predictions by learning from all users collectively is using federated learning's core capability. The privacy cost and the capability are inseparable.

The broadcast sync architecture gives this up entirely. The world model improves through centralized training on data that has no connection to individual users. The personal model improves through observation of a single user. There is no cross-device learning, no population-level signal, no capability that emerges from aggregating across users. What is gained in privacy is paid for in this capability.

Whether that tradeoff is worth making depends on what you are building. For a keyboard prediction model, federated learning's cross-device learning is the entire point — the model gets better precisely because it learns from everyone's typing patterns. For a personal intelligence system that knows deeply private context — health, finances, relationships, professional reasoning — the cross-device capability matters less and the privacy guarantee matters more. The architecture should match the threat model the application actually faces.

Federated learning is the right answer when improving a shared model from distributed data is the goal. The broadcast sync architecture is the right answer when the personal model must be completely sovereign and the world model can be trained on other data. These are different use cases, not competing solutions to the same problem.


Section VII
When Each Is Right
Use Federated Learning When

The primary goal is improving a shared model from distributed data. The privacy requirement is stronger than raw data centralization but does not need to eliminate derivative signal transmission. The use case benefits from cross-device learning — population-level patterns improve the shared model meaningfully. The application can accept and mitigate gradient inversion risk through differential privacy or secure aggregation. Examples: shared keyboard prediction, medical image classification across hospitals, fraud detection across financial institutions.

Use the Broadcast Sync Architecture When

The personal model handles deeply private context — health, finances, professional reasoning, relationships — where even derivative signal transmission is unacceptable. The application operates in environments where connectivity is unreliable or latency is critical. The world model can be trained on data that is not derived from individual users. Legal risk requires that the server hold nothing that can be compelled. The use case requires intelligence that is specifically calibrated to one person and must not contribute to a shared model. Examples: personal AI assistant, physical AI systems with operator-specific calibration, sovereign AI for national or organizational use.

The Fingerprint Distinction

One further difference deserves explicit attention. Federated learning generates a communication record — the server knows which devices participated in which training rounds and when. This record exists even when gradients are protected by secure aggregation. It constitutes a participation fingerprint: evidence that a device was present and active, even if the content of its contribution is hidden.

The broadcast sync architecture generates no such record. The server broadcasts without knowing who receives. There is no participation fingerprint. A device that syncs is indistinguishable from a device that does not, from the server's perspective. This matters in adversarial contexts — legal, national security, or personal safety — where even the fact of participation is sensitive.

→ [1, 2, 3, 4, 5]


References
  [1] McMahan, B., Moore, E., Ramage, D., Hampson, S. & Arcas, B. A. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017. arXiv:1602.05629. — Foundational federated learning protocol.
  [2] Zhu, L., Liu, Z. & Han, S. (2019). Deep Leakage from Gradients. NeurIPS 2019. arXiv:1906.08935. — Demonstrated pixel-accurate and token-accurate reconstruction of training data from shared gradients.
  [3] Geiping, J., Bauermeister, H., Dröge, H. & Moeller, M. (2020). Inverting Gradients — How Easy Is It to Break Privacy in Federated Learning? NeurIPS 2020. arXiv:2003.14053. — Extended gradient inversion to high-resolution images at realistic batch sizes.
  [4] Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain. — Primary paper describing the broadcast sync architecture this paper distinguishes from federated learning.
  [5] Bonawitz, K. et al. (2017). Practical Secure Aggregation for Privacy-Preserving Machine Learning. ACM CCS 2017. — Secure aggregation as an FL defense; noted as adding complexity without removing the gradient transmission channel.
Public domain. No rights reserved. Build freely.
Jonathan Horvat · 2026
Paper 03 — Companion · Intelligence Per Milliwatt
Position Paper · AI Hardware Metrics
Intelligence Per Milliwatt
Why Tokens Per Second Is the Wrong Metric and What to Replace It With
Jonathan Horvat
2026 · Public domain · No rights reserved
Abstract

The dominant metric for AI hardware performance is tokens per second — a supply-side measure optimized for cloud serving infrastructure. As inference moves from data centers to edge devices, tokens per second measures the wrong thing. It rewards architectures that excel at batch serving over high-bandwidth memory while ignoring the constraint that actually binds at the edge: power. This paper proposes Intelligence Per Milliwatt (IpMW) as the correct metric for the edge AI era, defines it formally as a composite of reasoning capability, personalization delta, and degradation honesty normalized to average inference power draw, and derives the architectural implications. We argue that IpMW is to edge AI what miles per gallon was to the automobile industry — a demand-side metric that, once standardized, restructures an entire industry's design priorities.


This paper is a companion to: Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain.

Section I
The Wrong Metric

Tokens per second became the dominant AI hardware benchmark because it measures the thing that matters most in cloud serving: throughput. A data center running inference for millions of users needs to move tokens fast. Every millisecond of latency at scale multiplies into real cost. Every token per second of throughput gained is revenue earned or infrastructure cost avoided. The metric is not arbitrary — it was the right measure for the problem that existed.

That problem is not the only problem. As inference moves permanently to edge devices — smartphones, wearables, embedded systems, physical AI platforms — the constraint that binds is no longer throughput. It is power. A device with 3,000 milliwatts of sustained power budget for AI inference cannot run a model that draws 8,000 milliwatts at peak, regardless of how many tokens per second it produces at that draw. Battery life, thermal envelope, and die area all impose hard limits that tokens per second ignores entirely.

Optimizing for tokens per second at the edge produces the wrong architecture. It rewards large batch processing, high memory bandwidth, and parallel activation — properties that favor server-class hardware. It penalizes the sparse activation patterns, aggressive quantization, and selective expert loading that make edge inference efficient. A chip that delivers 200 tokens per second at 4,000 milliwatts scores higher than one delivering 180 tokens per second at 800 milliwatts, even though the second chip delivers far more value per unit of energy spent.

Tokens per second is a supply-side metric. It measures what the hardware produces. The edge AI era needs a demand-side metric — one that measures what the user receives per watt spent.

→ [1, 2]


Section II
Supply-Side vs Demand-Side Thinking

The distinction between supply-side and demand-side metrics matters more than it might appear. Supply-side metrics measure what a system can produce under favorable conditions — peak throughput, maximum bandwidth, theoretical TOPS. They are useful for comparing hardware capabilities in isolation. They are poor predictors of user value in deployment.

Demand-side metrics measure what the user actually receives per unit of resource consumed. Miles per gallon is the canonical example. Internal combustion engines were optimized for decades on horsepower — a supply-side metric measuring what the engine could produce. When fuel economy standards forced the industry to optimize on miles per gallon — a demand-side metric measuring value delivered per gallon consumed — the entire architecture of engine design changed. Direct injection, variable valve timing, cylinder deactivation, turbocharging for efficiency rather than power — all of these emerged from optimizing for the right metric.

The AI hardware industry is at an analogous inflection point. The data center era optimized on supply-side metrics because cloud economics rewarded throughput. The edge era changes the economics fundamentally. The user cares about whether the device is intelligent, whether it knows them specifically, and whether it lasts through a full day of use. None of those outcomes are predicted by tokens per second.

Intelligence Per Milliwatt is the miles per gallon of edge AI. It measures value delivered — intelligent, personalized, honest responses — per unit of energy consumed. Optimizing for it produces different architectural choices than optimizing for throughput, and those choices are the right ones for the edge deployment context.


Section III
Defining Intelligence Per Milliwatt

A useful metric requires a precise definition. IpMW has two components: a numerator that measures intelligence delivered and a denominator that measures energy consumed. Both require careful specification.

Definition · Intelligence Per Milliwatt — Formal specification

    IpMW = (α·R + β·PΔ + γ·DH) / AvgPower_mW

R — Reasoning: Normalized score on a reasoning benchmark suite (MMLU, ARC-Challenge, or equivalent), expressed as a percentage of a reference cloud baseline scored at 1.0. Measured on tasks within the device's declared expert coverage domain.
PΔ — Personalization Delta: Percentage improvement of the personal adapter model over the base model on a suite of user-specific prompts drawn from actual interaction history. Captures the value added by personal calibration beyond the general capability baseline.
DH — Degradation Honesty: Accuracy of the coverage score as a predictor of actual task performance in low-coverage domains. Measured as 1 − (mean absolute error between stated coverage and observed accuracy). A system that hallucinates confidently in low-coverage domains scores near zero regardless of raw capability.
α, β, γ — Weights: Application-specific weighting coefficients summing to 1.0. Physical AI deployments weight DH highest. Personal assistant deployments weight PΔ highest. General-purpose deployments may weight equally. Published benchmarks should specify weights used.
AvgPower_mW — Average system power draw in milliwatts during a complete inference cycle — from query receipt through response generation — measured at the device level. Includes NPU, memory access, enclave overhead, and CPU coordination. Not peak draw. Not NPU draw alone. Full system average.
Measurement note: Power must be measured at the full system level, not component level. NPU peak draw systematically understates true inference cost by 40–60% due to memory bandwidth, CPU coordination, and thermal management overhead. IpMW benchmarks that report NPU-only power are not comparable to those reporting full system power and should be clearly distinguished.

The three-component numerator is deliberate. A metric that measured only reasoning capability would reward large models with high benchmark scores regardless of whether they are honest about their limits or calibrated to the user. A metric that measured only personalization delta would reward systems that over-fit to user history at the expense of general capability. Degradation honesty is the component most absent from existing edge AI benchmarks and most consequential for real deployment — a system that confidently hallucinates is actively harmful regardless of its average accuracy.
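Stated as code, the formula is direct. The default weights below (α = 0.4, β = γ = 0.3) are illustrative only; as noted above, the weight configuration is application-specific and must be published with any score.

```python
# The IpMW formula as defined above: a weighted composite of reasoning (R),
# personalization delta (P_delta), and degradation honesty (DH), divided by
# average full-system inference power in milliwatts. Default weights here
# are illustrative; real benchmarks must publish the weights they use.

def ipmw(r, p_delta, dh, avg_power_mw, alpha=0.4, beta=0.3, gamma=0.3):
    """Compute Intelligence Per Milliwatt. Weights must sum to 1.0."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    if avg_power_mw <= 0:
        raise ValueError("average power must be positive")
    return (alpha * r + beta * p_delta + gamma * dh) / avg_power_mw

# Example: R = 0.75 of the cloud baseline, 25% personalization delta,
# well-calibrated coverage (DH = 0.9), 800 mW average full-system draw.
score = ipmw(0.75, 0.25, 0.9, 800.0)
```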

→ [3, 4]


Section IV
What the Numerator Actually Measures
Reasoning Capability (R)

Normalized against a reference cloud baseline, R measures whether the edge device can actually think — not just generate tokens, but produce responses that reflect genuine reasoning about the query. The normalization against a cloud baseline matters: it grounds the score in something meaningful rather than an arbitrary point scale. A device scoring R = 0.75 delivers reasoning quality at 75% of a capable cloud model on covered domains. That is a concrete, interpretable claim.

Importantly, R is measured only within the device's declared expert coverage domain. A device that covers quantum computing but not genomics should not be penalized on genomics tasks — it should score on what it claims to cover. This connects R to the coverage score: a device with an accurate coverage score and high R within that coverage is genuinely more useful than a device with high R but poor coverage prediction.

Personalization Delta (PΔ)

PΔ captures the value added by personal calibration. A device running only the base model has PΔ = 0. As the personal adapter learns the user's patterns, preferred reasoning styles, domain knowledge, and communication preferences, PΔ rises. A device with PΔ = 0.25 delivers responses 25% more useful on user-specific tasks than the same base model without calibration.

This component is what makes IpMW a metric for personal AI rather than just efficient AI. Two devices with identical R scores can have very different real-world value if one knows the user and the other does not. PΔ makes that difference visible in the benchmark.

Degradation Honesty (DH)

DH is the most novel component and deserves the most careful definition. It measures the calibration of the system's uncertainty — specifically, how accurately the stated coverage score predicts actual performance when the system operates outside its well-covered domains.

A system that scores DH = 1.0 is perfectly calibrated: when it says 80% coverage, it achieves 80% accuracy on those tasks. A system that says 95% coverage but achieves 60% accuracy, or says 40% coverage but achieves 85% accuracy, scores poorly on DH. The goal is not maximum confidence — it is accurate confidence. A system that honestly says "I don't know" is more useful than one that confidently gives a wrong answer, and IpMW rewards that honesty explicitly.
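The DH definition translates directly into a few lines. The sample coverage and accuracy values below are hypothetical; in the benchmark methodology of Section VI they would come from a calibration suite of low-coverage queries.

```python
# Degradation honesty as defined above: DH = 1 - MAE between the stated
# coverage score and the observed accuracy on low-coverage tasks.

def degradation_honesty(stated_coverage, observed_accuracy):
    """Both arguments are parallel lists of values in [0, 1], one pair per
    evaluated low-coverage domain. Perfect calibration gives DH = 1.0."""
    if len(stated_coverage) != len(observed_accuracy):
        raise ValueError("need one observed accuracy per stated coverage")
    n = len(stated_coverage)
    mae = sum(abs(s - o) for s, o in zip(stated_coverage, observed_accuracy)) / n
    return 1.0 - mae

# Over-claiming (0.95 stated vs 0.60 observed) and under-claiming
# (0.40 stated vs 0.85 observed) are penalized alike: the goal is
# accurate confidence, not maximum confidence.
dh = degradation_honesty([0.80, 0.95, 0.40], [0.80, 0.60, 0.85])
```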

→ [3, 5]


Section V
Architectural Implications

IpMW is not just a measurement tool. It is a design objective. The architectural choices that maximize IpMW are different from those that maximize tokens per second, and those differences tell us what edge AI hardware should look like.

Diagram 01 · Architectural Optimization Map — Design choices and their effect on IpMW
↑↑ IpMW
Sparse MoE activation — activate 10–20% of experts per query
Reduces AvgPower dramatically while preserving R within covered domains. Sparse activation is the single highest-leverage architectural choice for IpMW.
↑↑ IpMW
Usage-profile quantization — high precision for frequent experts, aggressive INT4 for fallbacks
Reduces memory bandwidth and power for rarely-used experts while preserving quality in the user's actual domain profile. Improves both numerator and denominator.
↑ IpMW
Behavioral adapter training — no supervised loss, no external gradient signal
Incremental on-device adapter updates have near-zero inference-time cost once trained. PΔ rises continuously without increasing AvgPower during inference.
↑ IpMW
Coverage score computation — lightweight router confidence aggregation
A well-calibrated coverage score improves DH at minimal compute cost. The coverage computation itself adds only a few percent to inference overhead while substantially improving DH.
↑ IpMW
Fixed frozen interpreter head — non-trainable cross-attention, compiled at device build time
A frozen interpreter head eliminates training-time compute overhead. Its inference cost is bounded and predictable, contributing to AvgPower stability.
↓ IpMW
Full model activation on every query
Activates all parameters regardless of query domain. Increases AvgPower without proportional increase in R for domain-specific tasks. Tokens-per-second optimization leads here.
↓ IpMW
Uniform precision across all weights
Wastes memory bandwidth and power maintaining high precision for rarely-accessed parameters. Increases AvgPower without benefit to the user's actual query profile.
↓↓ IpMW
Overconfident responses in low-coverage domains
Destroys DH score regardless of R quality. A system that hallucinates confidently has IpMW near zero on affected queries. The worst possible outcome for the metric and for the user.
~ IpMW
Hardware secure enclave overhead
Adds 5–15% to AvgPower depending on implementation. Reduces IpMW slightly. Accepted as a necessary cost for the privacy guarantee — not optimized away.

The optimization map reveals something important: the architectural choices that maximize IpMW are precisely the choices that define the broadcast sync architecture described in the companion paper. Sparse MoE world models, usage-profile quantization, behavioral adapter training, coverage score computation, and a frozen interpreter head are not incidental features — they are the design choices that emerge from optimizing for the right metric.

This is not a coincidence. The architecture was designed for the edge deployment context, and IpMW is the metric that captures that context. Tokens per second would have produced a different architecture. The metric defines the design space.
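The two highest-leverage items on the map, sparse expert activation and a coverage score derived from router confidence, can be sketched together. Softmax gating with top-k selection is one standard sparse-routing scheme, assumed here for illustration rather than specified by the paper, and the names are hypothetical.

```python
# Sparse MoE routing sketch: activate only the top-k experts per query and
# derive a coverage signal from the router's confidence mass on them.
# Softmax top-k gating is an assumed (standard) scheme, not a spec.
import math

def route(logits, k=2):
    """Softmax over expert logits, keep top-k. Returns the active expert
    indices and a coverage score: the probability mass the router places
    on the activated experts. High mass means the query sits well inside
    the covered domain; low mass should surface as honest uncertainty."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    coverage = sum(probs[i] for i in top)
    return top, coverage

# A query squarely inside one expert's domain: router mass concentrates,
# only 2 of 5 experts activate (low power), and coverage is high.
active, coverage = route([4.0, 3.5, 0.1, -1.0, -2.0], k=2)
```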

→ [1, 2, 3, 6]


Section VI
Benchmark Methodology

A metric is only as useful as its measurement methodology. IpMW requires a standardized benchmark suite to be comparable across devices and architectures. The following outlines what that suite should contain.

Diagram 02 · IpMW Benchmark Pipeline — Four-stage measurement methodology
Stage 01
Hardware Characterization
Full system power measurement. CPU + NPU + memory + enclave. Average across 100 complete inference cycles. Report mean ± std dev.
Stage 02
Reasoning Benchmark
MMLU subset within declared coverage domains. 200 questions minimum. Score normalized to reference cloud baseline = 1.0.
Stage 03
Calibration Suite
50 queries in low-coverage domains. Compare stated coverage score to observed accuracy. Compute DH = 1 − MAE.
Stage 04
IpMW Score
Apply formula with specified α, β, γ weights. Report both raw components and composite. Specify weight configuration used.
Reference hardware: Benchmark results must specify the reference platform (device model, chip generation, OS version, thermal state) and the weight configuration (α, β, γ) used. Results from different weight configurations are not directly comparable. The personalization delta (PΔ) requires a calibrated device — specify number of interaction hours used for adapter training before measurement.

The benchmark has one deliberate complexity: PΔ requires a calibrated device. A freshly deployed device with empty adapters has PΔ = 0 by definition. Published benchmarks should specify the calibration state — typically 10, 50, or 100 hours of interaction — and report PΔ at each milestone. This makes the benchmark more complex but more honest: it reveals how quickly the device becomes useful for a specific user, not just how capable it is out of the box.

Reference hardware for normalization should be updated annually as the frontier advances. The 2026 reference baseline is a capable cloud model at full precision. As edge devices approach cloud parity in covered domains, R scores will naturally rise. The metric scales with the frontier rather than becoming obsolete as hardware improves.
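The four measurement stages can be sketched as a small pipeline. The weighted-sum form of the composite used below, (α·R + β·PΔ + γ·DH) / P, is an illustrative assumption standing in for the formula specified in the metric definition, and all function names and numbers are hypothetical; the only relation taken directly from the methodology is DH = 1 − MAE.

```python
from statistics import mean

def degradation_honesty(stated_coverage, observed_accuracy):
    """Stage 03: DH = 1 - mean absolute error between the stated
    coverage score and the observed accuracy on low-coverage queries."""
    mae = mean(abs(s - o) for s, o in zip(stated_coverage, observed_accuracy))
    return 1.0 - mae

def ipmw(r, p_delta, dh, power_mw, alpha, beta, gamma):
    """Stage 04: composite score. The weighted-sum numerator here is an
    illustrative assumption; always report raw components alongside it."""
    return (alpha * r + beta * p_delta + gamma * dh) / power_mw

# Stage 01: full-system power, averaged over complete inference cycles
# (100 in the methodology; abbreviated to 4 hypothetical readings here)
cycles_mw = [412.0, 398.5, 405.2, 401.1]
power = mean(cycles_mw)

# Stage 02: reasoning score normalized to cloud baseline = 1.0 (hypothetical)
r = 0.78

# Stage 03: stated coverage vs observed accuracy (abbreviated probe set)
dh = degradation_honesty([0.4, 0.3, 0.6], [0.35, 0.4, 0.55])

# Stage 04: composite with an example weight configuration
score = ipmw(r, p_delta=0.12, dh=dh, power_mw=power,
             alpha=0.5, beta=0.25, gamma=0.25)
```

Because results from different α, β, γ configurations are not comparable, the weight configuration travels with the score rather than being folded into it.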

→ [3, 4, 5, 8]


Section VII
The Metric Defines the Era

Metrics are not neutral. They are design forces. The metric an industry optimizes for determines what gets built, what gets funded, and what gets rewarded. Tokens per second built data centers. TOPS built NPUs. Neither metric built the device that knows you, runs offline, and lasts all day. Intelligence per milliwatt builds that device.

The historical precedent is instructive. Before fuel economy standards, automobile engines were optimized for horsepower because that was what manufacturers competed on and what buyers nominally wanted. The introduction of miles per gallon as a regulatory and consumer metric forced a rethinking of engine architecture from first principles. Technologies that had existed for decades became economically compelling overnight because the metric now rewarded them. The same dynamic is available to edge AI hardware.

In 2026, edge AI chips are already competing on efficiency — the industry consensus is that efficiency is king, and the winners will deliver the most AI per joule. But without a standardized metric that captures the full picture of what "intelligence" means at the edge — reasoning, personalization, and honest uncertainty — competition optimizes for partial proxies. TOPS is a partial proxy. Tokens per second is a partial proxy. IpMW is not a partial proxy. It measures what the user actually receives.

Whoever defines this metric first, and builds hardware that wins on it, defines the next generation of edge AI silicon. The metric creates the battleground. The battleground creates the winners. The winners shape what AI looks like for the next decade of devices.

The question is not whether AI will run at the edge. It will. The question is which metric defines what "better" means when it gets there. That question is still open. This paper proposes an answer.

→ [1, 2, 6, 7]


Appendix
Tokens Per Second vs Intelligence Per Milliwatt
Full Metric Comparison · Tokens/sec vs IpMW across key properties
Property · Tokens Per Second · Intelligence Per Milliwatt
Metric type · Supply-side (measures hardware output) · Demand-side (measures user value received)
Optimized for · Cloud (data center batch serving) · Edge (on-device personal inference)
Power sensitivity · None — does not account for power draw · Central — power is the denominator
Personalization · Invisible — a generic model and a calibrated one score identically · Explicit — PΔ component captures personal calibration value
Uncertainty honesty · Invisible — hallucination and accurate response score the same · Explicit — DH component penalizes overconfident hallucination
Architectural reward · Full model activation, high memory bandwidth, large batch size · Sparse MoE, usage-profile quantization, behavioral adaptation
Binding constraint · Compute throughput and memory bandwidth · Power envelope and thermal budget
Measurement scope · Single inference pass, often NPU or GPU only · Full system across complete inference cycle including enclave
Scales with frontier · Requires recalibration as model sizes grow · Normalized to reference baseline — scales naturally
Analogy · Horsepower — measures engine output · Miles per gallon — measures value delivered per resource consumed

References
[1] Edge AI and Vision Alliance (2026). On-Device LLMs in 2026: What Changed, What Matters, What's Next. — Industry consensus that efficiency is the defining competitive dimension for edge AI hardware.
[2] IEA (2024). Energy and AI: Energy Demand from AI. iea.org. — Global data center electricity projected to double to 945 TWh by 2030; inference dominates ongoing energy cost.
[3] Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain. — Introduced the IpMW concept and the coverage score mechanism that grounds the DH component.
[4] Hendrycks, D. et al. (2021). Measuring Massive Multitask Language Understanding. arXiv:2009.03300. — MMLU benchmark used as reference suite for the Reasoning component.
[5] Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML 2017. arXiv:1706.04599. — Foundational work on model calibration; grounds the Degradation Honesty component in established calibration methodology.
[6] Apple Machine Learning Research (2025). Apple Intelligence Foundation Language Models Tech Report. machinelearning.apple.com. — On-device 3B parameter model with LoRA adapters; reference implementation for IpMW-relevant architectural patterns.
[7] Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066. — Sparse MoE as dominant efficient architecture; grounds the sparse activation architectural implication.
[8] Tummalapalli, P. et al. (2026). LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load. arXiv:2603.23640. — Benchmarks Qwen 2.5 1.5B across iPhone 16 Pro, Samsung S24 Ultra, Hailo-10H NPU and laptop GPU; demonstrates thermal management as the primary constraint under sustained inference load, directly motivating full-system power measurement in the IpMW denominator.
Public domain. No rights reserved. Build freely.
Jonathan Horvat · 2026
Paper 04 — Companion Applying HAZOP to AI Systems
Position Paper · Process Safety · AI Systems
Applying HAZOP to AI Systems
What Process Engineering Knows About Failure That AI Safety Doesn't
Jonathan Horvat
2026 · Public domain · No rights reserved
Abstract

Process engineering has spent sixty years developing formal methodologies for identifying and managing failure modes in complex systems where errors have physical consequences. The most widely used of these — Hazard and Operability Study, or HAZOP — is a structured, node-by-node analysis that asks what happens when each component of a system deviates from its design intent. The methodology produces a formal register with specified causes, consequences, safeguards, and required actions. That register has legal standing. It gets reviewed. When the system changes, it gets updated.

AI systems deployed in consequential contexts — physical AI, medical AI, infrastructure AI, any AI where the output affects the physical world — have no equivalent. Failure modes are discovered in production. Responses are improvised. Nobody decided how the system fails. This paper argues that HAZOP applies directly to AI systems, demonstrates the methodology on a specific architecture, and proposes a standard for what an AI HAZOP should produce.


This paper is a companion to: Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain.

Section I
Two Industries, One Problem

In process engineering, the question of how a system fails is not left to chance or discovered in production. It is decided — deliberately, formally, at the design stage — before a single valve is specified or a pipe is routed. A cooling water valve fails open because losing cooling is the catastrophic outcome. A fuel supply valve fails closed because uncontrolled fuel flow is the catastrophic outcome. The failure mode is a design decision, made by a process engineer, documented in the process design package, and enforced in hardware. The valve does not decide how to fail. The system designer decides.

This principle — that failure modes must be specified, not discovered — is the foundation of process safety. It emerged from catastrophic industrial accidents in the mid-twentieth century, most notably Flixborough in 1974, where a temporary pipe bypass failed under pressure, killing 28 people and destroying the plant. The accident investigation revealed that nobody had formally analyzed what would happen if the bypass failed. The consequences were not unknown — they were simply unasked.

AI systems in 2026 are in approximately the same position that chemical plants were in 1963, when ICI first developed the operability study methodology. Complex. Consequential. Widely deployed. And almost entirely without formal failure mode analysis. When a cloud-dependent AI system loses connectivity, what happens? When the model receives data it was not trained on? When latency exceeds acceptable bounds? In most deployments, nobody decided. The system does whatever it does, and the response is improvised.

Process engineering solved this problem sixty years ago. The methodology transfers directly. The only thing missing is the decision to apply it.

→ [1, 2]


Section II
HAZOP Fundamentals

A Hazard and Operability Study is a structured examination of a system, conducted by a multidisciplinary team, that systematically identifies all credible deviations from design intent and their consequences. It produces a formal register — the HAZOP register — that documents every identified hazard, its causes, consequences, existing safeguards, and required actions. The register is the deliverable. It has legal standing in regulated industries. It is updated when the system changes.

The methodology proceeds in three steps. First, the system is divided into nodes — discrete sections, each with a defined design intent. In process piping, nodes are typically pipe sections or vessels. In AI systems, nodes are functional components. Second, parameters relevant to each node are identified — in process systems these are physical quantities like flow, pressure, temperature, and composition. In AI systems the equivalents are data, signal, weights, context, connectivity, latency, and confidence. Third, guide words are applied systematically to each parameter to generate deviations. The guide words are the engine of the methodology.

HAZOP Guide Words · Standard definitions with process and AI equivalents
Guide Word · Definition · Process Example · AI Equivalent
No / None · Complete negation of design intent · No flow in a transfer line · No sync received; no data to model; no response generated
More · Quantitative increase beyond design · High pressure in vessel · More queries than capacity; more data than expected; higher confidence than warranted
Less · Quantitative decrease below design · Low flow in cooling line · Less training data; lower sync frequency; reduced coverage
As Well As · Qualitative addition to design intent · Contamination in process stream · Poisoned weights alongside valid ones; personal data in world model sync
Part Of · Qualitative reduction of design intent · Partial composition only · Incomplete sync; partial adapter update; truncated response
Reverse · Logical opposite of design intent · Reverse flow in pipe · Model transmits instead of receives; adapter trains on wrong signal
Other Than · Complete substitution of design intent · Wrong chemical in line · Wrong model version synced; query routed to wrong expert; wrong user context applied

Not every guide word applies to every parameter at every node. The methodology does not require exhaustive application — it requires systematic application of all credible deviations. A combination that produces no meaningful deviation is simply not recorded. The skill is in identifying which combinations matter, which requires domain knowledge of both the methodology and the system being analyzed.
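The guide-word engine is mechanical enough to sketch in a few lines: cross every parameter at a node with every guide word, then let the team screen each candidate for credibility. The node and parameter names below follow the tables in this paper; the generation step is automatic, while the screening step is deliberately left as human judgment.

```python
from itertools import product

# The seven standard HAZOP guide words
GUIDE_WORDS = ["No / None", "More", "Less", "As Well As",
               "Part Of", "Reverse", "Other Than"]

# Parameters relevant to one node (Node 1 · Sync Channel, from the register)
NODE_PARAMS = {"Sync Channel": ["Connectivity", "Composition", "Flow"]}

def deviation_candidates(node, params):
    """Systematically generate every parameter x guide-word combination
    for a node. The team then screens each candidate; combinations that
    produce no meaningful deviation are simply not recorded."""
    return [(node, p, gw) for p, gw in product(params, GUIDE_WORDS)]

candidates = deviation_candidates("Sync Channel", NODE_PARAMS["Sync Channel"])
# 3 parameters x 7 guide words = 21 candidates to screen; the register
# in Section IV records only the four judged credible for this node.
```

The asymmetry is the point: generating candidates is cheap and exhaustive, so no credible deviation is missed for lack of asking; recording only credible ones keeps the register useful.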

→ [1, 3]


Section III
Translating HAZOP to AI

The translation from process parameters to AI parameters is direct. Each physical quantity in a process system has a functional equivalent in an AI system. The mapping is not metaphorical — it is structural. Both systems have inputs, throughputs, transformations, and outputs. Both have design intents that can be deviated from. Both have consequences when deviations occur.

Parameter Translation · Process engineering → AI system equivalents
Process Parameter · AI Equivalent · Design Intent Example
Flow · Data / Signal / Query rate · Queries arrive at expected rate within model capacity
Pressure · Load / Demand / Inference queue depth · Inference queue depth remains within latency budget
Temperature · Latency / Thermal state of device · Response latency within acceptable bounds; device within thermal envelope
Composition · Model weights / Version / Data distribution · Weights are current, validated, and match expected architecture version
Level · Memory / Storage / Context window occupancy · Context window occupancy within model capacity; storage sufficient for adapters
Connectivity · Network / Sync channel availability · Sync channel available for scheduled weight updates
Reaction · Adapter training / Model update · Adapter weights update through behavioral observation within expected bounds
Signal · Confidence score / Coverage score / Output certainty · Coverage score accurately reflects actual domain coverage

The node structure for an AI system follows the same logic as process piping — each node is a functional section where the design intent can be stated clearly enough to generate meaningful deviations. For the broadcast sync architecture analyzed in this paper, six nodes are defined: the sync channel, the world model, the personal adapters, the interpreter head, the coverage score mechanism, and the user interface. Each has a clear design intent. Each has parameters that can deviate. Each deviation has consequences that can be specified.


Section IV
The HAZOP Register Applied

The following register applies the HAZOP methodology to the broadcast sync architecture described in Intelligence at the Edge of the Cloud. The register follows the standard nine-column format used in process industries: Node, Parameter, Guide Word, Deviation, Cause, Consequence, Existing Safeguards, Action Required, and Responsibility. Severity is indicated as H (High — safety or major function impact), M (Medium — degraded operation), or L (Low — minor impact, self-correcting).

Only credible deviations are recorded. Combinations that produce no meaningful consequence are omitted. The register is not exhaustive — a full production HAZOP would require a multi-day workshop with domain specialists. This register demonstrates the methodology and identifies the most significant hazards.
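The nine-column row plus severity maps naturally onto a small record type, which makes a register something that can be versioned, diffed, and summarized mechanically. This encoding is illustrative, not a prescribed schema; the two example rows are abbreviated from the register that follows.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class RegisterRow:
    # The standard nine columns, plus a severity rating (H / M / L)
    node: str
    parameter: str
    guide_word: str
    deviation: str
    cause: str
    consequence: str
    safeguards: str
    action: str
    responsibility: str
    severity: str

def severity_summary(register):
    """Tally rows by severity -- the headline numbers quoted for a register."""
    return Counter(row.severity for row in register)

register = [
    RegisterRow("N1 Sync Channel", "Connectivity", "No",
                "No sync received", "Network unavailable",
                "World model knowledge becomes stale",
                "Coverage score timestamps experts",
                "Define maximum staleness threshold", "Arch. Design", "M"),
    RegisterRow("N1 Sync Channel", "Composition", "As Well As",
                "Malicious weights in broadcast", "Compromised sync server",
                "Corrupted world model behavior",
                "Human-gated publication",
                "Cryptographic signing of weight packages", "Security", "H"),
]
```

Treating the register as data rather than a document also supports the controlled-document requirement later in this paper: a change to the system should produce a visible change to the register.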

AI HAZOP Register — Broadcast Sync Architecture · H = High · M = Medium · L = Low severity
Node Parameter Guide Word Deviation Cause Consequence Existing Safeguards Action Required Resp.
Node 1 · Sync Channel — Design intent: World model weights broadcast periodically from server to device; device retains relevant experts locally
N1 Connectivity No No sync received — extended offline period Network unavailable; server offline; device in coverage gap M World model knowledge becomes stale. Coverage score degrades over time. System continues to function on existing weights. Coverage score timestamps experts; DH component of IpMW degrades gracefully. User is informed of sync date. Define maximum acceptable staleness threshold. Specify behavior when threshold exceeded — should system reduce confidence scores automatically? Arch. Design
N1 Composition As Well As Malicious weights alongside valid weights in broadcast Compromised sync server; supply chain attack on model weights; adversarial weight injection H Corrupted world model behavior. Potential for targeted misinformation, capability degradation, or covert data exfiltration via model behavior. Human-gated publication — weights require deliberate authorization before broadcast. No automatic write path from training to sync server. Implement cryptographic signing of weight packages. Device verifies signature before integrating sync. Define process for emergency weight revocation. Security / Ops
N1 Composition Part Of Incomplete sync — partial expert package received Network interruption mid-sync; storage limit reached on device; timeout during large update L Expert partially updated. Router may direct queries to expert with inconsistent weight version. Minor capability degradation in affected domain. Delta-patching protocol; version tracking per expert. Incomplete sync detected by checksum mismatch. Specify atomic sync behavior — either full expert package integrates or none does. No partial expert state permitted. Arch. Design
N1 Flow More Sync frequency higher than intended — excessive update rate Server misconfiguration; CDN cache invalidation loop; client retry storm L Excessive bandwidth consumption. Battery drain from repeated sync activity. No safety consequence. Client-side rate limiting on sync requests. Exponential backoff on retry. Define maximum sync frequency. Implement server-side rate limit per device cohort. Platform Eng.
Node 2 · World Model — Design intent: Sparse expert model responds to queries from augmented personal model; pull-only; no personal data access; no transmission
N2 Signal Reverse World model initiates communication rather than responding to queries Implementation defect; compromised runtime; malicious weight injection enabling active behavior H Fundamental breach of privacy architecture. World model could exfiltrate personal context if it can initiate outbound communication. Architectural constraint — world model has no transmission interface by design. OS-level network policy blocks outbound requests from world model process. Enforce OS-level network isolation for world model process. Include in security audit scope. Add runtime assertion that world model never opens a network socket. Security
N2 Flow No World model unresponsive — no output to interpreter head Model crash; out-of-memory; thermal throttling; corrupt expert weights M Augmented personal model cannot synthesize world knowledge. System falls back to personal adapters only. Response quality degrades to base adapter capability. Coverage score detects world model unavailability. User informed of degraded state. Personal adapters continue functioning. Define explicit fallback behavior when world model is unavailable. Specify whether system should respond from personal adapters alone or decline the query. Arch. Design
N2 Composition Other Than Wrong expert activated for query domain Router miscalibration; query domain ambiguous; expert boundary overlap in training M Response draws on incorrect domain knowledge. Coverage score may not detect mismatch if router is confidently wrong. Plausible but incorrect output. Coverage score includes router activation confidence. Low confidence triggers uncertainty signal to user. Add domain classification validation layer between router and expert activation. Specify minimum router confidence threshold below which coverage score is automatically reduced. ML Design
Node 3 · Personal Adapters — Design intent: Low-rank adapter weights encode user patterns through behavioral observation only; never transmitted; updated incrementally on-device
N3 Flow Reverse Personal adapter weights transmitted off-device Implementation defect; malicious application accessing adapter via OS vulnerability; secure enclave breach H Complete privacy breach. Personal adapter contains behavioral fingerprint of user. Transmission exposes identity, patterns, and potentially sensitive context. Hardware secure enclave isolates adapter weights. No API exposes raw adapter weights externally. OS-level access controls. Include adapter transmission prevention in security audit. Add runtime assertion monitoring for any adapter data leaving enclave boundary. Third-party security verification. Security
N3 Reaction More Adapter over-trains — excessive adaptation to recent interaction pattern Unusual interaction session; adversarial prompt sequence designed to manipulate adapter; rapid behavior change in user M Adapter over-fits to recent session. Prior user context underweighted. System behaves inconsistently across sessions. Potential for adversarial manipulation of personal model behavior. Incremental update rate limiting. Exponential decay on adapter update magnitude. Define maximum adapter update magnitude per session. Specify drift detection mechanism — if adapter weights change beyond threshold in single session, flag for review and limit update rate. ML Design
N3 Reaction No Adapter fails to update — no behavioral learning occurs Observation signal below update threshold; enclave compute unavailable; adapter storage full L Personalization delta (PΔ) remains at zero. System functions as base model only. No safety consequence — base model remains fully functional. Coverage score reports personalization state. User can be informed if adapter calibration is not progressing. Define minimum interaction rate required for adapter update. Specify storage management policy for adapter weights. Platform Eng.
Node 4 · Interpreter Head — Design intent: Fixed non-trainable cross-attention module translates between world model and personal model embedding spaces; frozen at compile time
N4 Composition Other Than Interpreter head version mismatch — compiled against different base model version than currently installed Base model sync updates embedding space geometry; interpreter head not recompiled; version management failure M Cross-attention keys and queries misaligned across embedding spaces. Synthesis quality degrades. Coverage score may not detect geometric mismatch. Subtle but pervasive response quality degradation. Version locking between base model and interpreter head. Sync rejected if interpreter head version does not match base model version. Treat interpreter head as tightly coupled to base model version. Any base model sync must trigger interpreter head recompilation or sync rejection. Define versioning protocol. Arch. Design
N4 Signal No Interpreter head produces no output — cross-attention fails Null world model output passed to cross-attention; dimension mismatch; numerical instability L Synthesis step fails. System falls back to personal model response without world model integration. Response quality degrades but system does not crash. Null output detection in synthesis pipeline. Fallback to personal model response with coverage score indicating world model unavailable. Implement explicit null-output handler in interpreter head. Specify fallback behavior. Add to integration test suite. ML Design
Node 5 · Coverage Score — Design intent: Weighted combination of router confidence, expert freshness, and domain relevance produces accurate uncertainty signal surfaced to user
N5 Signal More Coverage score overestimates actual coverage — confident signal in low-coverage domain Router overconfidence; calibration drift; distribution shift in query domain not detected by router H User receives confident response in domain where model is unreliable. Hallucination presented as high-confidence output. Degrades DH component of IpMW to near zero. Most dangerous failure mode in the architecture. None beyond router confidence — this is the primary open risk in the architecture. Implement external calibration validation. Periodically test coverage score accuracy against known-answer queries in boundary domains. Alert if DH drops below threshold. This is the highest-priority action item in this register. ML Design / QA
N5 Signal Less Coverage score underestimates actual coverage — false modesty in well-covered domain Router underconfidence; conservative calibration; stale freshness timestamp despite good expert quality M User unnecessarily warned of uncertainty in domain where model is reliable. Erodes trust in the coverage score signal. System appears less capable than it is. User experience degradation only. No safety consequence. Include underconfidence detection in calibration testing. Tune freshness decay function to avoid excessive staleness penalty for high-quality experts. ML Design
Node 6 · User Interface — Design intent: Query received from user; response and coverage score delivered to user; no personal context transmitted externally
N6 Flow As Well As Personal context transmitted alongside query to external service Third-party application integration that logs queries; UI layer sending context to analytics endpoint; developer error in application built on architecture H Privacy architecture bypassed at the application layer. Personal context reaches external infrastructure despite architectural protections in the model layer. Architecture prevents model-layer transmission. Application layer is outside the core trust boundary. Publish clear application developer guidelines prohibiting query logging with personal context. Define privacy API contract. Consider OS-level enforcement of query privacy for applications using the architecture. Platform / Policy
N6 Signal No Coverage score not surfaced to user — uncertainty signal suppressed Application developer chooses not to display coverage score; UI implementation omits uncertainty display M User receives confident-appearing response with no uncertainty signal. Degradation honesty benefit of architecture lost at presentation layer. Effectively equivalent to architectures without coverage score. None — application layer decision. Specify minimum coverage score display requirement in API contract. Applications that suppress coverage score below threshold should surface a generic uncertainty indicator. Platform / Policy

The register contains 16 credible deviations across 6 nodes, with severity ratings of High (5), Medium (7), and Low (4). The most significant finding — rated High severity with no existing safeguard — is the coverage score overconfidence scenario in Node 5. This is the architecture's primary open risk: a router that confidently directs queries to an unsuitable expert produces a confident-appearing response in a domain where the model is unreliable, and the coverage score fails to warn the user. This is not a flaw unique to this architecture. It is the central unsolved problem in AI uncertainty quantification, surfaced explicitly by the HAZOP methodology.

→ [4, 5]


Section V
What the Register Reveals

The HAZOP register confirms that the broadcast sync architecture was designed with conservative failure in mind. The majority of identified deviations have existing safeguards — the architecture degrades gracefully rather than failing catastrophically. This is consistent with a design that treats offline operation and uncertainty as first-class properties rather than edge cases.

But the register also surfaces three genuine open questions that the architecture does not fully answer.

01
Coverage score overconfidence has no architectural safeguard
The Node 5 High severity finding — router overconfidence producing a false coverage signal — is the primary open risk. The architecture relies on the router being well-calibrated, but router calibration drift is a known failure mode in sparse MoE systems. An external calibration validation loop, periodically testing coverage score accuracy against known-answer queries, is the required action but is not yet specified in the architecture.
02
Failure mode for world model unavailability is not formally specified
When the world model is unresponsive (Node 2, No Flow), the system falls back to personal adapters only. This is the right behavior — conservative degradation rather than failure. But the specific behavior is not formally specified as a design decision. Should the system respond from adapters alone, or decline queries in domains that require world model knowledge? The answer depends on the deployment context and must be decided at the process design level, not the implementation level.
03
Application layer privacy enforcement is outside the trust boundary
The Node 6 High severity finding — personal context transmitted via application layer — exposes a gap between the architecture's strong model-layer privacy guarantee and the weaker application-layer enforcement. The architecture cannot prevent a developer from logging queries with personal context. This is the process engineering equivalent of a plant with excellent internal safety but no control over what happens after the product leaves the gate. A platform-level privacy API contract is required.
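The calibration validation loop called for in finding 01 can be sketched directly: run known-answer queries in boundary domains, compare the stated coverage score against observed correctness, and alert when degradation honesty drops below a threshold. The function name, the 0.8 threshold, and the probe values are all placeholders; the DH = 1 − MAE relation is taken from the benchmark methodology.

```python
def validate_coverage_calibration(probes, dh_threshold=0.8):
    """Each probe pairs the model's stated coverage score with whether the
    known-answer query was actually answered correctly (1.0 or 0.0).
    DH = 1 - mean absolute error between stated coverage and outcome;
    returns the DH value and whether it has fallen below the alert line."""
    errors = [abs(stated - correct) for stated, correct in probes]
    dh = 1.0 - sum(errors) / len(errors)
    return dh, dh < dh_threshold

# Hypothetical probe run: (stated coverage, observed correctness).
# The third probe is the dangerous case -- high stated coverage, wrong answer.
probes = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.3, 0.0)]
dh, alert = validate_coverage_calibration(probes)
```

The value of running this loop externally is that it does not trust the router's own confidence: it checks the coverage signal against ground truth the router never sees.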

→ [4, 5, 6]


Section VI
The Failure Mode Decision

The most important principle in process safety is not the HAZOP itself. It is the principle that underlies it: failure modes must be decided, not discovered. In process engineering, how a system fails is a design decision made by a qualified engineer at the earliest stage of design, before hardware is specified. The valve does not decide how to fail. The process engineer decides, based on which failure mode is safest for that specific point in the system, at that specific operating condition, in that specific process context.

This decision is then documented, reviewed, challenged by a multidisciplinary team, and enforced in hardware. If the process changes, the failure mode analysis is revisited. There is always an answer to the question: who decided how this fails, and when?

AI systems have no equivalent. The question of how an AI system fails — what it does when connectivity is lost, when it receives out-of-distribution input, when its confidence is wrong, when the user acts on an incorrect response — is almost never asked at the design stage. It is answered in production, after something goes wrong, by whoever happens to be responsible at the time.

This is not a criticism of AI developers. It reflects the absence of a methodology. Process engineering did not develop conservative failure mode design through intuition — it developed it through catastrophic accidents and the formal frameworks those accidents inspired. The AI industry has not yet had its Flixborough. As AI systems become more physically consequential — controlling vehicles, managing medical devices, operating infrastructure — it will.

The question is not whether AI systems will fail. They will. The question is whether someone decided how they fail, before they did.

→ [1, 2, 6]


Section VII
A Proposed Standard

The following outlines what an AI HAZOP standard should require for consequential deployments. Consequential is defined as any AI system where the output directly influences a physical, medical, financial, or safety-critical decision.

Proposed AI HAZOP Standard Elements
When
Conducted before initial deployment of any consequential AI system. Revalidated when the system architecture changes materially — new model version, new deployment context, new integration point. An as-built HAZOP is conducted before any production launch to confirm the deployed system matches the design that was analyzed.
Who
Multidisciplinary team including: ML systems engineer, domain expert for the deployment context, safety engineer or process engineer familiar with the methodology, end-user representative, and independent HAZOP facilitator. Minimum five participants. Maximum team size twelve.
Nodes
System divided into functional nodes, each with a stated design intent. Minimum nodes: data input, model inference, output generation, user interface, external integrations, update/sync mechanism. Additional nodes for domain-specific components.
Parameters
Data, signal, weights/version, context, connectivity, latency, confidence, and any domain-specific parameters. Guide words applied systematically to each credible parameter at each node.
Register
Nine-column register in standard format: Node, Parameter, Guide Word, Deviation, Cause, Consequence, Existing Safeguards, Action Required, Responsibility. Severity rating required for each row. Register is a controlled document — versioned, dated, and updated when system changes.
Failure Modes
For each High severity deviation with inadequate safeguards, the team must specify the intended failure mode explicitly: fail-safe (conservative degradation), fail-operational (continue at reduced capability), or fail-stop (cease operation). The decision must be documented and the rationale recorded.
Ownership
Every action item in the register must have a named responsible party and a completion date. The register is not complete until all High severity items either have adequate safeguards or have a documented decision to accept the risk with named acceptance authority.
Review Cycle
Annual revalidation for deployed systems. Immediate revalidation triggered by: any High severity finding from production incidents, any material architecture change, any change in deployment context that introduces new failure modes.

This standard does not require new technology. It requires applying a methodology that has been refined over sixty years to a new class of system. The tools — guide words, node analysis, register format — are the same. The parameters are different. The discipline required is the same.

Process engineering did not invent the concept of conservative failure mode design. It systematized it. The AI industry needs to do the same, and it does not need to wait for a catastrophic accident to begin.

→ [1, 2, 3, 7]


References
  1. IEC 61882:2016. Hazard and Operability Studies (HAZOP Studies) — Application Guide. International Electrotechnical Commission. — The international standard governing HAZOP methodology.
  2. Kletz, T. A. (1999). HAZOP and HAZAN: Identifying and Assessing Process Industry Hazards. 4th ed. Institution of Chemical Engineers. — Foundational text on HAZOP methodology by the engineer who systematized it.
  3. Crawley, F. & Tyler, B. (2015). HAZOP: Guide to Best Practice. 3rd ed. Elsevier. — Standard reference for HAZOP register format and node analysis methodology.
  4. Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain. — Primary architecture paper; defines the six nodes analyzed in this register.
  5. Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML 2017. arXiv:1706.04599. — Grounds the coverage score overconfidence finding in established calibration literature.
  6. Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv:1606.06565. — AI safety framework; HAZOP addresses several of the identified problem categories through formal process engineering methodology.
  7. Health and Safety Executive (HSE) (1999). A Guide to the Control of Major Accident Hazards Regulations. — Regulatory framework that established HAZOP as a legally required methodology for major hazard facilities; proposed model for AI regulation.
Public domain. No rights reserved. Build freely.
Jonathan Horvat · 2026
Paper 05 — Companion · The AI P&ID
Position Paper · Engineering Standards · AI Systems
The AI P&ID
A Notation Standard for Documenting Consequential AI Systems
Jonathan Horvat
2026 · Public domain · No rights reserved
Abstract

Process engineering documents complex systems using Piping and Instrumentation Diagrams — drawings that show not just topology but behavior, failure modes, control loops, instrument tags, and version history. These properties are what make formal safety analysis possible. AI system architecture is currently documented with block diagrams that show topology and nothing else. As AI systems become more physically consequential, block diagrams are insufficient. This paper proposes an AI P&ID notation standard — a symbol library, tag numbering convention, and drawing format adapted from ANSI/ISA-5.1-2024 and ISO 10628 — that gives AI systems the documentation discipline that process engineering developed over seventy years. The standard is demonstrated on the broadcast sync architecture from the companion paper series, producing the first AI P&ID in the proposed notation.


This paper is a companion to: Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain.

Section I
Why Block Diagrams Are Not Enough

A process engineer handed a block diagram of a chemical plant would not be able to conduct a HAZOP, write an operating procedure, specify a control system, or approve the design for construction. A block diagram shows components and connections. It does not show what flows between components, what data types those connections carry, how each component fails, what instruments monitor what parameters, what interlocks prevent unsafe states, or what version of the design is being reviewed. It is a sketch, not a document.

AI system architecture documentation in 2026 is almost entirely block diagrams. Boxes connected by arrows, sometimes with labels. When teams use more sophisticated tools they produce UML diagrams or data flow diagrams — both of which are improvements, but neither of which captures the properties required for safety-critical system documentation. None of them show failure modes. None of them show what happens when a connection is lost. None of them have a tag numbering system that connects the diagram to a HAZOP register. None of them have formal version control in the engineering sense.

This is not a criticism of the people producing these diagrams. There is no standard. There are no conventions. Nobody has defined what a complete AI system diagram should contain. The result is that every team invents its own notation, diagrams are not comparable across organizations, and the formal safety analysis that requires a precise drawing cannot be done.

A block diagram tells you what a system is. A P&ID tells you how it behaves, how it fails, and who is responsible for each part of it. Consequential AI systems need the second kind of document.

Block Diagram vs AI P&ID · Properties Comparison
Property | Block Diagram | AI P&ID (Proposed)
Component topology | Shows components and connections | Shows components and connections
Data types on connections | Not specified | Data type, size, and direction annotated on every line
Failure modes | Not shown | Every component shows fail-safe, fail-operational, or fail-stop
Instrument tags | No tagging convention | ISA-adapted tag numbering for every instrument and control loop
Control loops | Not represented | Monitoring instruments, controllers, and final control elements shown
Security boundaries | Implied at best | Enclave boundaries shown as defined zones with access annotations
Version control | Informal or absent | Revision block, date, change history, management of change required
HAZOP support | Insufficient for node analysis | Node boundaries defined; parameters readable from drawing
Legal standing | Informal communication tool | Controlled document with formal change management

→ [1, 2]


Section II
What a P&ID Actually Contains

A Piping and Instrumentation Diagram, as defined by ANSI/ISA-5.1-2024 and ISO 10628, is a detailed drawing of a process system showing all equipment, piping, instrumentation, and control systems with sufficient detail to support design, construction, operation, and safety analysis. It is not a schematic. It is not a concept diagram. It is an engineering document.

The properties that make a P&ID useful for safety analysis are precisely defined by the standards. Equipment is shown with its tag number, size, material of construction, and design conditions. Pipe lines are shown with their size, schedule, material, insulation, and direction of flow. Instruments are shown with tag numbers following the ISA letter-based convention — a pressure indicator is PI, a flow controller is FC, a temperature transmitter is TT. Each tag includes a loop number that connects it to the instrument data sheet and the control system documentation.

Most importantly for safety analysis, every valve shows its failure mode — fail closed (FC), fail open (FO), or fail in last position (FL). Every instrument shows its action on signal failure. Every safety interlock is shown with its trip setpoint. Every relief device is shown with its set pressure. The drawing contains enough information to ask, for every component: what happens when this fails, and does the system respond safely?

The drawing also has a title block with revision history. Every change to a P&ID triggers a management of change process. The revision number, date, description of change, and responsible engineer are recorded. You cannot change a P&ID informally. The version history is part of the document.

→ [1, 2, 3]


Section III
The AI P&ID Symbol Library

The following symbol library defines the graphical elements required to document AI systems in the proposed notation. Symbols follow the ISA convention of using circles for instruments, boxes for equipment, and lines for connections, adapted for AI-specific element types. Where no process equivalent exists, new symbols are defined with rationale.

AI P&ID Symbol Library — Proposed v1.0 · Adapted from ANSI/ISA-5.1-2024 · ISO 10628

Category 01 · Models — Primary processing elements

WM-XXX · World Model
Sparse expert inference model. Pull-only. Receives weight syncs. Responds to queries from the augmented personal model only.
Failure mode: FL — Fail last known weights. Continues serving from current weights if sync unavailable.
Process analog: Process vessel / reactor — primary transformation element.

APM-XXX · Augmented Personal Model
On-device model with frozen base, personal adapters, and interpreter head. Inside the secure enclave. Primary user-facing element.
Failure mode: FL — Fail last state. Responds from personal adapters if the world model is unavailable.
Process analog: Process vessel with internal components — complex transformation element.

IH-XXX · Interpreter Head
Fixed non-trainable cross-attention module. Frozen at compile time. Translates between world model and personal model embedding spaces. Internal to the APM.
Failure mode: FO — Fail open to personal model only. If cross-attention fails, synthesize from personal adapters without world model input.
Process analog: Heat exchanger / translator — embedded transformation element.

Category 02 · Adapters — Trainable parameter sets

PA-XXX · Personal Adapter Set
Low-rank adapter layers encoding user patterns. Trainable. Updated by behavioral observation only. Never transmitted. Inside the secure enclave.
Failure mode: FL — Fail last state. Adapter weights frozen if the training signal is unavailable; the system continues at its current calibration level.
Process analog: Control valve with positioner — trainable final control element.

Category 03 · Instruments — Monitoring and measurement elements

CI-XXX · Coverage Indicator
Monitors and displays the current domain coverage score. Reads router activation confidence, expert freshness, and domain relevance. Surfaces the signal to the user interface.
Failure mode: FL — Fail to last known coverage state. If coverage computation fails, display the last valid score with a staleness timestamp.
Process analog: Level indicator / analyzer — passive readout instrument (ISA circle symbol).

ST-XXX · Sync Transmitter
Monitors sync channel status: last sync timestamp, sync success/failure, bytes received. Transmits a status signal to the coverage indicator and system monitor.
Failure mode: FL — Fail last state. If sync monitoring fails, retain the last known sync timestamp.
Process analog: Flow transmitter / status transmitter — signal-originating instrument (ISA circle).

MA-XXX · Model Analyzer
Monitors world model version, expert inventory, and calibration state. Analogous to a process analyzer — reads the composition of the active expert set.
Failure mode: FL — Fail last inventory. If the model analyzer fails, retain the last known expert inventory list.
Process analog: Composition analyzer — quality measurement instrument (ISA circle with A).

Category 04 · Connections — Data flow lines

Primary Data Flow
Main inference data path. Annotated with data type, approximate size, and direction. Heavy solid line.
Process analog: Main process line — primary fluid flow.

Signal / Control Line
Instrument signal, control output, or monitoring connection. Dashed line. Used for coverage score output, monitoring signals, and control loops.
Process analog: Instrument signal line — electrical or pneumatic signal.

Sync Channel
Weight broadcast channel from sync server to device. One-directional. Annotated with sync frequency, payload size, and anonymization method. Channel loss results in FL behavior on the world model.
Process analog: Utility line — secondary service connection.

Category 05 · Boundaries — Security and system zones

SE-XXX · Secure Enclave Boundary
Hardware secure enclave boundary. Dashed red rectangle. All elements inside this boundary are hardware-isolated. No data crosses the boundary without an explicit interface definition.
Failure mode: FC — Fail closed. If the enclave is compromised, all contained elements cease operation. No degraded operation outside the enclave boundary.
Process analog: Battery limit / system boundary — physical plant boundary.

DB-XXX · Device Boundary
Physical device boundary. Light dashed rectangle. Defines what is on-device vs off-device. All primary inference operations should be within this boundary.
Failure mode: N/A — boundary definition only.
Process analog: Equipment boundary — physical equipment limit.

The symbol library is not exhaustive. It defines the minimum set required to document the broadcast sync architecture. A full standard would include additional categories for attention mechanisms, embedding layers, training loops, and deployment infrastructure. The symbol set should be extended by the standards body that adopts it, following the ISA model of periodic revision with industry consensus.
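
One way to make the library machine-checkable is to express it as data, so a drawing tool or register validator can look up the annotated failure mode for any tag. The sketch below is illustrative: the field names and the abbreviated process-analog strings are assumptions, not part of the proposed symbol standard.

```python
# Sketch: the symbol library as a machine-readable structure, so drawings
# and HAZOP registers can validate against a shared definition.
from dataclasses import dataclass

@dataclass(frozen=True)
class SymbolDef:
    tag_prefix: str        # e.g. "WM" -> tags WM-001, WM-002, ...
    name: str
    category: str
    failure_mode: str      # "FL", "FO", "FC", or "N/A"
    process_analog: str

LIBRARY = {
    "WM": SymbolDef("WM", "World Model", "Models", "FL", "Process vessel / reactor"),
    "APM": SymbolDef("APM", "Augmented Personal Model", "Models", "FL", "Vessel with internals"),
    "IH": SymbolDef("IH", "Interpreter Head", "Models", "FO", "Heat exchanger / translator"),
    "PA": SymbolDef("PA", "Personal Adapter Set", "Adapters", "FL", "Control valve"),
    "CI": SymbolDef("CI", "Coverage Indicator", "Instruments", "FL", "Level indicator / analyzer"),
    "ST": SymbolDef("ST", "Sync Transmitter", "Instruments", "FL", "Flow transmitter"),
    "MA": SymbolDef("MA", "Model Analyzer", "Instruments", "FL", "Composition analyzer"),
    "SE": SymbolDef("SE", "Secure Enclave Boundary", "Boundaries", "FC", "Battery limit"),
    "DB": SymbolDef("DB", "Device Boundary", "Boundaries", "N/A", "Equipment boundary"),
}

def failure_mode_of(tag: str) -> str:
    """Look up the annotated failure mode for a full tag like 'WM-001'."""
    prefix = tag.split("-")[0]
    return LIBRARY[prefix].failure_mode

print(failure_mode_of("SE-001"))  # FC
```

A validator built on such a table can reject a drawing in which an element inside the enclave boundary carries anything other than its library-defined failure mode.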

→ [1, 2, 4]


Section IV
Tag Numbering Convention

The ISA tag numbering system uses a letter-based convention where the first letter identifies the measured variable, subsequent letters identify the function, and a loop number makes each tag unique. FIC045 is the Flow Indicating Controller in loop 045. The convention has been in use since 1949 and is universally understood by process engineers.

The proposed AI P&ID tag convention adapts this structure directly. The first letter identifies the AI element type, the second identifies its function, and a sequential number makes it unique within the system.

AI P&ID Tag Convention · Adapted from ANSI/ISA-5.1 letter-based identification
[ Type ] [ Function ] — [ Loop Number ]

Letter | Type (first position) | Function (second position) | Example Tag
M | Model | (none) | WM-001 (World Model 001)
A | Adapter | (none) | PA-001 (Personal Adapter 001)
I | Interpreter | I (Indicator) | IH-001 (Interpreter Head 001)
C | Coverage | I (Indicator) · T (Transmitter) · C (Controller) | CI-001 (Coverage Indicator 001)
S | Sync | T (Transmitter) · V (Valve/Gate) · A (Alarm) | ST-001 (Sync Transmitter 001)
W | Weight | T (Transmitter) · I (Indicator) | WT-001 (Weight Transmitter 001)
E | Enclave | B (Boundary) | SE-001 (Secure Enclave 001)
D | Device | B (Boundary) | DB-001 (Device Boundary 001)
U | User Interface | I (Interface) | UI-001 (User Interface 001)
X | External System | I (Interface) · S (Server) | XS-001 (External Sync Server 001)

Example Tag Reads
WM-001 · World Model, instance 001
APM-001 · Augmented Personal Model, instance 001
CI-001 · Coverage Indicator, loop 001 — reads coverage score and displays to user
ST-001 · Sync Transmitter, loop 001 — monitors sync channel and transmits status
SE-001 · Secure Enclave boundary, instance 001 — hardware isolation zone containing APM-001 and PA-001
SA-001 · Sync Alarm, loop 001 — activates when sync has not succeeded within defined interval

The tag numbering convention enables the AI P&ID to connect to a HAZOP register exactly as a process P&ID does. Every row in the HAZOP register references a tag. Every action item in the register references a tag. When the register says "SA-001 — Sync Alarm does not activate when sync interval exceeded," both the register and the drawing refer to the same uniquely identified element. The documentation system is self-consistent.
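
The convention is simple enough to enforce mechanically, which is what makes the cross-referencing reliable. The sketch below parses and validates tags; the regex and the prefix table are illustrative assumptions, since the authoritative letter assignments would come from the standards body.

```python
# Sketch: tag validation and parsing for the proposed convention.
# The prefix table is an illustrative assumption, not the standard itself.
import re

TAG_RE = re.compile(r"^(?P<prefix>[A-Z]{2,3})-(?P<loop>\d{3})$")

KNOWN_PREFIXES = {"WM", "APM", "IH", "PA", "CI", "ST", "MA",
                  "SA", "SE", "DB", "UI", "XS", "WT"}

def parse_tag(tag: str) -> tuple[str, int]:
    """Split a tag like 'SA-001' into its prefix and loop number.

    Raises ValueError for malformed or unknown tags. This rejection is what
    lets a HAZOP register and a drawing cross-check each other mechanically:
    every register row must reference a tag that parses against the drawing.
    """
    m = TAG_RE.match(tag)
    if m is None or m.group("prefix") not in KNOWN_PREFIXES:
        raise ValueError(f"not a valid AI P&ID tag: {tag!r}")
    return m.group("prefix"), int(m.group("loop"))

print(parse_tag("SA-001"))  # ('SA', 1)
```

With this in place, "SA-001 — Sync Alarm does not activate" in the register and SA-001 on the drawing are verifiably the same element, not a coincidence of naming.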

→ [1, 3]


Section V
The Demonstration Drawing

The following AI P&ID documents the broadcast sync architecture from Intelligence at the Edge of the Cloud using the proposed notation. This is the first AI P&ID produced in this notation. The drawing should be read in conjunction with the HAZOP register in the companion paper — every node in that register corresponds to a tagged element on this drawing.

AI-PID-001 · Broadcast Sync Architecture Rev A · 2026 · J. Horvat
[Drawing AI-PID-001, Sheet 1 of 1. The external sync server XS-001 broadcasts weights plus version (broadcast only · weights only · anonymous · <50 MB/wk) over the sync channel into device boundary DB-001; sync transmitter ST-001 monitors the channel and sync alarm SA-001 activates if sync is late. WM-001 (World Model, sparse MoE, experts A–N, FL) receives the sync, and model analyzer MA-001 reads its expert inventory. WM-001 exchanges query tokens [W] and translated tokens with interpreter head IH-001 (frozen, FO), which feeds APM-001 (Augmented Personal Model: frozen base on slow broadcast cadence, FL) with context [P] and personal adapters PA-001 (FL) inside secure enclave SE-001. Coverage indicator CI-001 (FL) signals user interface UI-001, which takes the query in and returns the response plus coverage score to the external user. Title block: Drawing AI-PID-001 · Rev A · Date 2026 · Author J. Horvat · Standard AI-PID v1.0 · Public Domain, No Rights Reserved.]
Primary data flow
Signal / control line
Sync channel
Secure enclave boundary
Device boundary
FL = Fail last state  ·  FO = Fail open  ·  FC = Fail closed

The drawing shows elements that no block diagram of this architecture has previously shown: the sync transmitter ST-001 monitoring the broadcast channel, the sync alarm SA-001 that activates when sync is overdue, the model analyzer MA-001 reading the expert inventory, the coverage indicator CI-001 generating the coverage score signal, and the explicit failure mode annotation on every element. The secure enclave boundary SE-001 is a defined zone, not an implied property.

A process engineer reading this drawing can immediately identify the HAZOP nodes, the instrument loop numbers, and the failure modes. The drawing and the HAZOP register from the companion paper are designed to be used together — every node in that register has a corresponding tagged element on this drawing.

→ [4, 5]


Section VI
Version Control and Change Management

A P&ID without version control is not a P&ID. It is a snapshot. The version control and change management requirements are as important as the notation itself — they are what give the document its legal and operational standing.

In process engineering, every change to a P&ID triggers a Management of Change process. The change is described, the affected nodes are identified, the HAZOP register is updated for affected rows, the change is reviewed by a qualified engineer, and the revision block on the drawing is updated with the revision number, date, description, and responsible engineer. You cannot make an informal change to a P&ID. The document history is part of the document.

Change Management Triggers for AI P&ID — Proposed Requirements

Trigger: New model version synced (architecture change)
Required action: Update the WM-XXX tag with the new version number. Review affected HAZOP rows. Re-check IH version compatibility. Issue a new drawing revision.
Owner: ML Arch.

Trigger: New deployment context (physical AI, medical, infrastructure)
Required action: Full re-HAZOP of affected nodes. Update failure mode annotations for the new consequence profile. New drawing revision with context annotation.
Owner: Safety Eng.

Trigger: New external integration added to UI-XXX
Required action: Add the integration to the drawing with data flow type and direction. Re-analyze HAZOP Node 6 for the new interface. Update the privacy boundary annotation.
Owner: Platform Eng.

Trigger: Secure enclave hardware changed
Required action: Update the SE-XXX annotation with the new hardware specification. Re-validate failure mode assumptions for the enclave boundary. Trigger a security audit.
Owner: Security Eng.

Trigger: Adapter training mechanism changed
Required action: Update PA-XXX with the new training description. Re-HAZOP Node 3. Update the failure mode annotation if drift or over-training behavior changes.
Owner: ML Design

Trigger: Coverage score algorithm changed
Required action: Update CI-XXX with the new algorithm description. Re-HAZOP Node 5 — coverage overconfidence is the highest-severity open risk. Calibration validation required before release.
Owner: ML Design / QA

Trigger: Production incident with safety consequence
Required action: Immediate drawing review. Identify the affected tag(s). Determine whether the drawing was accurate; if not, issue a corrective revision. Trigger full HAZOP revalidation for affected nodes.
Owner: Safety Eng.

The drawing title block on AI-PID-001 shows revision A. Every subsequent change to the architecture that affects the drawing triggers a new revision — B, C, D — with the date, change description, and responsible engineer recorded. The revision history is permanent. You cannot go back and un-document a change.
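
The revision discipline can be modeled as an append-only record. The sketch below assumes sequential letter revisions as on AI-PID-001; the storage mechanism and approval workflow are omitted and the field names are illustrative.

```python
# Sketch: an append-only revision block. Revisions can be issued,
# never removed or rewritten, mirroring the management of change rule.
from dataclasses import dataclass

@dataclass(frozen=True)
class Revision:
    rev: str          # "A", "B", "C", ...
    date: str
    description: str
    engineer: str     # responsible engineer of record

class RevisionBlock:
    """Holds the drawing's permanent revision history."""

    def __init__(self) -> None:
        self._history: list[Revision] = []

    def issue(self, date: str, description: str, engineer: str) -> str:
        """Append a new revision and return its letter."""
        rev = chr(ord("A") + len(self._history))  # A, B, C, ... (sketch only)
        self._history.append(Revision(rev, date, description, engineer))
        return rev

    @property
    def history(self) -> tuple[Revision, ...]:
        return tuple(self._history)               # read-only view

block = RevisionBlock()
block.issue("2026", "Initial issue of AI-PID-001", "J. Horvat")
rev = block.issue("2026", "WM-001 synced to new model version", "ML Arch.")
print(rev)  # B
```

The point of the read-only view is the same as the paper rule: a later revision can supersede an earlier one, but it cannot erase it.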

This is exactly what process engineering requires of its P&IDs, and for exactly the same reason: when something goes wrong, the first question is always "what did the drawing say, and was the system built to the drawing?" That question requires a drawing that was maintained as a controlled document, not a slide that was last updated eighteen months ago.

→ [2, 3, 6]


Section VII
A Path to Standardization

ANSI/ISA-5.1 was first published in 1949 as ISA Recommended Practice RP-5.1. It has been revised multiple times — 1984, 1992, 2009, 2022, 2024 — as the process industries evolved and new technologies required new symbols. The standard did not emerge from a single paper. It emerged from industry consensus over decades, with ISA as the standards body that organized and formalized that consensus.

The same path is available for AI P&ID notation. The symbols and conventions proposed in this paper are a starting point, not a finished standard. What is needed is a standards body willing to take it on, an industry working group willing to develop it, and a critical mass of organizations willing to adopt it.

The most natural home is ISA itself — the International Society of Automation already owns the process P&ID standard and has the expertise and infrastructure to extend it. An AI P&ID standard could be designated ISA-5.5 or ISA-5.X, positioned explicitly as an extension of the existing ISA-5.1 instrumentation and control standards family. The connection to existing standards is a strength — it means process engineers who already use ISA-5.1 can read an AI P&ID with minimal additional training.

Alternatively, IEEE, IEC, or a new AI-specific standards body could adopt the notation. What matters is not which body adopts it but that some body does — that the notation becomes standardized, versioned, and maintained as the technology evolves.

ISA-5.1 has been in use for seventy-five years because someone decided that a common notation was worth the effort of standardizing. AI systems will be deployed for at least as long. The time to define the notation is before the accidents, not after.

→ [1, 2, 7]


References
  1. ANSI/ISA-5.1-2024. Instrumentation and Control — Symbols and Identification. International Society of Automation. — Current version of the P&ID instrumentation symbol standard; tagging convention adapted in this paper.
  2. ISO 10628-2:2012. Diagrams for the Chemical and Petrochemical Industry — Graphical Symbols. International Organization for Standardization. — Process equipment symbol standard; equipment symbol approach adapted for AI elements.
  3. Crawley, F. & Tyler, B. (2015). HAZOP: Guide to Best Practice. 3rd ed. Elsevier. — P&ID as prerequisite document for HAZOP; version control and change management requirements.
  4. Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain. — Primary architecture paper; AI-PID-001 documents this architecture.
  5. Horvat, J. (2026). Applying HAZOP to AI Systems: What Process Engineering Knows About Failure That AI Safety Doesn't. Public domain. — Companion paper; HAZOP register references tag numbers defined in this drawing.
  6. IEC 61346:1996. Industrial Systems, Installations and Equipment and Industrial Products — Structuring Principles and Reference Designations. — Reference designation standard; grounds the tag numbering convention proposed here.
  7. ISA (2024). ISA Standards and Publications — ISA5.1 Committee. isa.org. — Standards development process and committee structure for potential adoption of AI P&ID notation as ISA-5.X.
Public domain. No rights reserved. Build freely.
Jonathan Horvat · 2026
Paper 06 — Companion · Behavioral Observation as a Training Signal
Position Paper · On-Device Learning · Privacy
Behavioral Observation as a Training Signal
A Mechanism for Privacy-Preserving Personal Adapter Calibration Without Supervision, Labels, or Transmitted Data
Jonathan Horvat
2026 · Public domain · No rights reserved

This paper is a companion to: Horvat, J. (2026). Intelligence at the Edge of the Cloud: A Local Architecture for the AI Era. Public domain.

Abstract

The edge AI architecture proposed in the companion paper claims that personal adapter weights train through behavioral observation — without supervised labels, explicit feedback, or transmitted data. This paper closes that specification gap. We define the Behavioral Preference Signal (BPS): a set of implicit interaction events — query reformulation, session continuation depth, session termination timing, and follow-up query structure — that together constitute a noisy but aggregate-reliable proxy for user preference. We propose a constrained low-rank gradient update rule derived from Direct Preference Optimization that uses BPS scores as implicit preference labels, apply drift prevention constraints to prevent over-fitting and adversarial manipulation, and define a three-tier validation design. The mechanism is trainable on-device, requires no server communication, produces no supervisory signal that could be intercepted or compelled, and is grounded in established preference learning theory.


Section I
The Specification Gap

The companion architecture paper makes a precise claim: personal adapter weights update through behavioral observation only — no supervised loss, no explicit feedback, no transmitted gradient. The user is never asked to rate anything. The adapters become calibrated through the natural shape of interaction.

That claim is architecturally motivated and privacy-correct. But it is underspecified. What exactly does the system observe? What constitutes a training signal? How does an observation translate into a weight update? What prevents the adapter from drifting toward a bad state if the observations are noisy or adversarially crafted?

These are not rhetorical questions. They are engineering requirements. An architecture that leaves them unanswered is a proposal, not a blueprint. This paper answers them.

The key insight is that behavioral observation is not a new idea — it is the implicit form of the preference learning that explicit methods like RLHF and DPO already do. The difference is who provides the preference signal. In RLHF, a human annotator compares two responses and clicks a button. In the mechanism proposed here, the user's natural interaction behavior provides the same information — not explicitly, not consistently, but reliably in aggregate. The system observes what the user does after receiving a response, and treats that behavior as implicit evidence about whether the response served the user's need.

The preference signal has always been in the interaction. Most systems just never looked for it.

→ [1, 2, 3]


Section II
The Behavioral Preference Signal

The Behavioral Preference Signal (BPS) is a scalar score assigned to each query-response pair based on the user's next observable action. It is computed on-device, immediately after the action occurs, and is never transmitted. It serves as the implicit preference label that drives the adapter update.

The signal is drawn from four observable event types, each with a time window and a score:

Behavioral Preference Signal — Event Types and Scores · Computed on-device · Never transmitted

Query reformulation — user rephrases the same intent
Window: < 45 seconds of response · Score: −1.0
Unambiguous negative signal. The response did not serve the user's need. The reformulation reveals what the response missed — level, framing, or domain.

Deep continuation — follow-up query builds on the response without restating
Window: < 90 seconds of response · Score: +1.0
Strong positive signal. The user understood the response and extended it. The structure of the follow-up reveals what the response communicated successfully.

Session end after sustained engagement (> 3 exchanges)
Window: after the final response · Score: +0.4
Weak positive. Extended engagement without reformulation suggests the response pattern was broadly useful. Noisy — the user may have simply run out of time.

Immediate session termination after response
Window: < 15 seconds of response · Score: −0.3
Weak negative. Could indicate the response was complete and satisfying, or that the user gave up. Treated as a weak negative only when combined with short session history.

No action — user reads the response, session pauses
Window: > 120 seconds of inactivity · Score: 0.0
Neutral. Insufficient signal to determine preference. No adapter update triggered.

The BPS is intentionally simple. More sophisticated signals — dwell time, scroll behavior, copy actions — could be incorporated but introduce complexity without proportional benefit. The two primary signals (reformulation and deep continuation) are sufficient to distinguish responses that served the user from those that did not, in aggregate, over many interactions.

Individual BPS observations are noisy. A user might reformulate because they changed their mind, not because the response failed. A user might continue a session despite a mediocre response because the task required it. The mechanism does not need individual observations to be accurate. It needs the aggregate signal over many interactions to be directionally correct — which empirical work on revealed preference theory suggests it will be, given sufficient interaction volume.
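
The event table can be read directly as a scoring function. In the sketch below the thresholds and scores follow the table, while the event names and the function signature are illustrative assumptions.

```python
# Sketch: on-device BPS scoring from one observed interaction event.
# Event names and signature are illustrative; scores follow the table.
def bps_score(event: str, seconds_since_response: float,
              exchanges_in_session: int = 0) -> float:
    """Map one observed event to a Behavioral Preference Signal score.

    Computed locally, immediately after the event; never transmitted.
    Returns 0.0 (neutral, no adapter update) when the signal is ambiguous.
    """
    if event == "reformulation" and seconds_since_response < 45:
        return -1.0   # unambiguous negative: response missed the need
    if event == "deep_continuation" and seconds_since_response < 90:
        return +1.0   # strong positive: user built on the response
    if event == "session_end" and exchanges_in_session > 3:
        return +0.4   # weak positive: sustained engagement
    if event == "session_end" and seconds_since_response < 15:
        return -0.3   # weak negative: possible give-up after short session
    return 0.0        # neutral / no signal

print(bps_score("reformulation", 30.0))  # -1.0
```

Note the ordering: sustained engagement is checked before immediate termination, which implements the table's rule that a fast session end counts as a weak negative only against a short session history.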

→ [2, 4, 5]


Section III
Grounding in Established Theory

The BPS mechanism is not invented from whole cloth. It is a privacy-preserving instantiation of two well-established learning frameworks: Direct Preference Optimization and revealed preference theory from economics.

Direct Preference Optimization

DPO (Rafailov et al., 2023) showed that RLHF's complex reward modeling pipeline can be replaced by a simple binary cross-entropy objective over preference pairs. Given two responses to the same query — one preferred, one dispreferred — DPO directly optimizes the policy to increase the relative probability of the preferred response. No reward model. No RL loop. Just a classification loss over pairs.

The BPS mechanism adapts DPO to the implicit, on-device setting. Instead of human annotators providing preference labels, the system derives implicit preference labels from behavioral events. A query-response pair that triggers deep continuation is the "preferred" response in the DPO sense. A pair that triggers reformulation is the "dispreferred" response. The update rule is a DPO-derived gradient step on the adapter weights — not the full model, only the thin low-rank delta — in the direction that increases the relative probability of continuation-triggering responses over reformulation-triggering ones.

The critical difference from standard DPO is the source of the preference signal and the scope of the update. Standard DPO updates the full model on explicit human labels aggregated at a server. The BPS mechanism updates only adapter weights on implicit behavioral labels observed entirely on-device. The mathematical structure is the same. The privacy properties are categorically different.
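
The adaptation can be sketched for a single implicit preference pair. The structure below is the standard DPO objective with the BPS magnitude applied as a per-pair weight; the log-probabilities are placeholders that would come from the adapter-modified policy and the frozen reference, and beta is an assumed hyperparameter.

```python
# Sketch: BPS-weighted DPO objective for one implicit preference pair.
# log-probabilities are placeholders; beta is an assumed hyperparameter.
import math

def bps_dpo_loss(logp_pos: float, logp_neg: float,
                 ref_logp_pos: float, ref_logp_neg: float,
                 bps_magnitude: float, beta: float = 0.1) -> float:
    """Binary cross-entropy over an implicit preference pair.

    pos = continuation-triggering response (implicitly preferred)
    neg = reformulation-triggering response (implicitly dispreferred)
    The reference terms anchor the adapter-modified policy to the frozen base.
    """
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    sigmoid = 1.0 / (1.0 + math.exp(-beta * margin))
    return -bps_magnitude * math.log(sigmoid)

# A pair the adapter already ranks correctly yields a smaller loss
# than the same pair with the ranking reversed.
print(bps_dpo_loss(-2.0, -5.0, -3.0, -3.0, bps_magnitude=1.0) <
      bps_dpo_loss(-5.0, -2.0, -3.0, -3.0, bps_magnitude=1.0))  # True
```

The weight is what distinguishes this from vanilla DPO: a strong signal (reformulation, −1.0) moves the adapter more than a weak one (early session end, −0.3), and a neutral observation (0.0) contributes nothing.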

Revealed Preference Theory

In economics, revealed preference theory holds that a person's true preferences can be inferred from their choices, even when those choices are not explicitly framed as preference statements. A consumer who repeatedly buys product A over product B reveals a preference for A, regardless of what they say when asked directly.

The BPS is an application of this principle to conversational AI interaction. A user who consistently continues sessions after responses of a certain type reveals a preference for that type, regardless of whether they would articulate it explicitly. The behavioral signal reveals the preference. The adapter learns from what the user does, not what the user says they want.

This grounding matters for two reasons. First, it provides a theoretical basis for why the aggregate signal is reliable even when individual observations are noisy — revealed preferences aggregate in the same direction as true preferences over sufficient data. Second, it connects the mechanism to a 70-year-old literature in economics and decision theory that has validated the approach in many domains.
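A toy simulation illustrates the first point, that noisy signals aggregate toward the true preference. The 65/35 preference rate and session counts below are illustrative assumptions, not figures from the paper:

```python
import random

def aggregate_signal(true_pref_rate, n_sessions, seed=0):
    """Each session emits one noisy binary signal: +1 (continuation after
    the preferred style) or -1 (reformulation). Individual signals are
    unreliable; their mean recovers the sign of the true preference."""
    rng = random.Random(seed)
    signals = [1 if rng.random() < true_pref_rate else -1
               for _ in range(n_sessions)]
    return sum(signals) / n_sessions

# A mild 65/35 preference is invisible in any single session but
# unmistakable in the aggregate.
assert aggregate_signal(0.65, 1000) > 0
assert aggregate_signal(0.35, 1000) < 0
```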

→ [1, 3, 5, 6]


Section IV
The Update Rule

The update rule specifies how BPS scores translate into adapter weight changes. It has three components: the gradient direction, the magnitude constraint, and the recency weighting.

BPS Adapter Update Rule — Constrained low-rank gradient step · On-device only

Δθ = η · clip(∇_θ L_BPS(θ), δ_max) · w_t

θ — Personal adapter weights (the low-rank matrices A and B in the LoRA parameterization). The frozen base model is never updated; only the adapter delta is modified.

η (learning rate) — Fixed small learning rate. Recommended: 1×10⁻⁵. Small enough that individual noisy observations cannot cause significant drift, yet large enough to allow meaningful calibration over weeks of interaction.

L_BPS(θ) — BPS-weighted DPO loss. For each scored event pair (q, r⁺, r⁻) — where r⁺ is the continuation-triggering response and r⁻ is the reformulation-triggering response — the loss increases the log-ratio of p(r⁺|q) to p(r⁻|q) under the adapter-modified policy, weighted by the absolute magnitude of the BPS score.

clip(·, δ_max) — Gradient clipping with maximum norm δ_max = 0.01 × ‖θ‖. Prevents any single update from moving the adapter weights by more than 1% of their current norm. This is the primary safeguard against adversarial manipulation and catastrophic forgetting.

w_t (recency weight) — Exponential decay weight w_t = exp(−λ(t_now − t_event)) with λ = 0.1/day. Recent observations contribute more to each update; events older than 30 days contribute less than 5% of their original weight. This prevents over-indexing on stale interaction patterns as the user's needs evolve.

DPO loss: The BPS-weighted DPO loss increases the log-ratio of the preferred response probability to the dispreferred response probability under the adapter-modified policy, regularized by a KL constraint (coefficient β = 0.1) that prevents the adapter from drifting too far from the frozen base model. This is equivalent to the standard DPO objective with BPS scores as implicit preference weights.

Pair construction: BPS events are not always naturally paired. When a deep continuation occurs without a preceding reformulation, the system pairs the continuation response against the base model's response to the same query (computed offline). When a reformulation occurs, the reformulated query and the subsequent response construct the positive pole. The base model response serves as the reference for the KL constraint in the DPO objective, preventing catastrophic drift from the general-capability baseline.
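As a numeric check on the table above, this sketch implements the recency weight and the norm clip with the stated constants (the helper names are ours, not the paper's):

```python
import math

LAMBDA = 0.1           # decay constant, per day (from the table above)
DELTA_MAX_FRAC = 0.01  # clip cap: 1% of the current adapter norm

def recency_weight(age_days, lam=LAMBDA):
    """w_t = exp(-lambda * age) with lambda = 0.1/day."""
    return math.exp(-lam * age_days)

def clip_by_norm(grad, adapter_norm):
    """Rescale the gradient so its norm never exceeds 1% of the adapter norm."""
    cap = DELTA_MAX_FRAC * adapter_norm
    gnorm = math.sqrt(sum(g * g for g in grad))
    return list(grad) if gnorm <= cap else [g * cap / gnorm for g in grad]

# The table's claim checks out: a 30-day-old event keeps under 5% of its weight.
assert recency_weight(30) < 0.05
# A gradient of norm 5.0 against an adapter of norm 100.0 is cut to norm 1.0.
clipped = clip_by_norm([3.0, 4.0], adapter_norm=100.0)
assert abs(math.sqrt(sum(g * g for g in clipped)) - 1.0) < 1e-9
```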

Pseudocode — On-device BPS adapter update loop (simplified for clarity)

for each interaction session:
    events = observe_session(query_response_pairs)
    scored = assign_bps(events)              # Table from Section II
    pairs  = construct_preference_pairs(scored)

    for (q, r_pos, r_neg, score, timestamp) in pairs:
        if abs(score) < threshold:           # Skip near-zero signal
            continue
        grad = compute_bps_dpo_gradient(
            adapter_weights,
            q, r_pos, r_neg,
            base_model_ref,
            beta=0.1                         # KL constraint coefficient
        )
        clipped = clip_by_norm(grad, delta_max)
        w_t = recency_weight(timestamp)      # Exponential recency decay
        adapter_weights += eta * clipped * w_t

    # Drift check after each session
    if adapter_norm_delta() > 0.15 * initial_norm:
        eta *= 0.5                           # Halve learning rate
        log_warning("Drift threshold approached")

The pseudocode is intentionally implementable. Every variable is specified. The drift check is explicit. The base model reference is included. This is not a sketch — it is a complete algorithm that could be ported to a LoRA training loop on any device that supports gradient computation in the personal adapter layer.

→ [1, 7, 8]


Section V
Drift Prevention and Adversarial Robustness

The BPS mechanism has two vulnerability surfaces: natural drift from noisy signals, and adversarial manipulation by a user deliberately trying to steer the adapter in a particular direction.

Natural Drift

Noisy individual observations accumulate. A sequence of unusual sessions — travel, illness, a deadline — might produce a cluster of atypical behavioral signals that push adapter weights away from the user's normal pattern. The recency decay partially addresses this: old signals fade. But the cumulative drift constraint is the primary safeguard. If adapter weights have moved more than 15% of their initial norm from the base model, the learning rate halves automatically. This acts like a governor — the adapter continues learning but cannot drift far enough from the base model to lose general capability.
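The governor can be sketched in a few lines. This is a simplified illustration of the rule described above; a real implementation would track adapter norms across sessions:

```python
def drift_governor(adapter_norm_delta, initial_norm, eta, threshold=0.15):
    """Halve the learning rate once cumulative adapter drift exceeds 15%
    of the initial adapter norm (the design parameter from Section V)."""
    if adapter_norm_delta > threshold * initial_norm:
        return eta * 0.5
    return eta

eta = 1e-5
eta = drift_governor(0.10, initial_norm=1.0, eta=eta)  # within bounds: unchanged
assert eta == 1e-5
eta = drift_governor(0.20, initial_norm=1.0, eta=eta)  # over threshold: halved
assert eta == 5e-6
```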

The 15% threshold is a design parameter, not a derived constant. A tighter threshold (5–10%) produces a more conservative adapter that calibrates slowly. A looser threshold (20–30%) allows more personalization but increases the risk of capability loss. The right value depends on deployment context and should be empirically validated in the human pilot study described in Section VI.

Adversarial Manipulation

A user who understands the mechanism could deliberately generate reformulation and continuation signals to steer the adapter. This is not a theoretical attack — it is a practical one that a technically sophisticated user could attempt.

The gradient clipping is the primary defense. A maximum norm constraint of 1% of adapter norm per update means that even a deliberate sequence of high-magnitude BPS events cannot move the adapter significantly in a single session. Sustained manipulation over many sessions would be needed to produce meaningful drift — and even then, the 15% cumulative norm constraint limits the achievable manipulation.
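A back-of-envelope bound (ours, not from the paper) makes this concrete. If every update is maximally adversarial, perfectly aligned, and hits the 1%-of-current-norm clip exactly, drift compounds at 1% per update, so crossing the 15% cumulative cap still takes on the order of fifteen updates:

```python
def min_updates_to_drift(cap_frac=0.15, step_frac=0.01):
    """Worst-case count of updates before cumulative drift can exceed the
    15% cap, assuming every clipped step is fully aligned with the attack
    direction (drift never cancels)."""
    norm, drift, updates = 1.0, 0.0, 0
    while drift <= cap_frac:
        step = step_frac * norm   # each step is capped at 1% of current norm
        norm += step
        drift += step
        updates += 1
    return updates

# Even a perfectly coordinated attack needs 15 maximal updates to cross
# the cap; realistic, partially cancelling updates would need far more.
assert min_updates_to_drift() == 15
```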

A sophisticated adversary with physical device access could in principle attempt sustained manipulation — but this is precisely the threat the hardware secure enclave is designed to address, reducing it from a remote software attack to a local physical one. The more important observation is that the attack surface is narrow. The adapter encodes the user's patterns, not external world knowledge. A user manipulating their own adapter produces an adapter that serves their manipulated preferences — which is arguably the correct behavior. The security concern is not that a user manipulates their own model, but that a malicious actor manipulates another user's model. That attack requires physical access to the device, which is the secure enclave's job to prevent.

→ [7, 8, 9]


Section VI
Validation Design

The BPS mechanism is a proposal, not a proof. Three validation tiers — synthetic, simulation, and human pilot — are required to establish that the mechanism works as described. This section specifies each tier in enough detail that a research team could implement it directly.

Tier 1 — Synthetic Validation · No humans required

Setup: Generate a dataset of synthetic interaction sequences with known preference structure. Each sequence consists of 50–100 query-response pairs drawn from a preference-consistent distribution — some user type always prefers technical depth, another always prefers brevity, a third prefers analogical reasoning.

Procedure: Run the BPS mechanism on each sequence. Assign BPS scores based on simulated behavioral events generated from the known preference structure. Apply the update rule. After N interactions, measure whether adapter weights have moved in the direction that would increase performance on held-out preference-consistent queries from the same distribution.

Success criterion: Adapter performance on held-out queries improves by at least 5% relative to base model on preference-consistent tasks. Adapter does not degrade on general capability tasks. Drift remains within bounds.

Primary metric: Δ(preference-consistent accuracy) vs Δ(general accuracy)
Secondary metric: Adapter norm delta after N interactions
Dataset size: 5 synthetic user types × 100 interaction sequences × 50 pairs = 25,000 events
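The Tier 1 success criterion can be encoded as a mechanical check. The accuracy numbers in the example are hypothetical:

```python
def tier1_success(pref_acc_base, pref_acc_adapter,
                  gen_acc_base, gen_acc_adapter,
                  norm_delta, initial_norm):
    """Tier 1 success criterion: >= 5% relative gain on preference-consistent
    tasks, no general-capability regression, drift within the 15% bound."""
    pref_gain = (pref_acc_adapter - pref_acc_base) / pref_acc_base
    return (pref_gain >= 0.05
            and gen_acc_adapter >= gen_acc_base
            and norm_delta <= 0.15 * initial_norm)

# Hypothetical run: 8% relative preference gain, flat general accuracy,
# drift at 10% of initial norm -> passes.
assert tier1_success(0.50, 0.54, 0.80, 0.80, 0.10, 1.0)
# Any general-capability regression fails the run, whatever the gain.
assert not tier1_success(0.50, 0.60, 0.80, 0.75, 0.10, 1.0)
```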
Tier 2 — Simulation Validation · LLM as simulated user

Setup: Use an existing aligned LLM (GPT-4 class or equivalent) as a simulated user with a defined preference profile. The simulated user has a fixed "true preference" — for example, it prefers responses that use structured enumeration over prose. The preference is known to the evaluator but not encoded in the adapter at the start.

Procedure: Run interactions between the adapter-equipped model and the simulated user. The simulated user generates realistic reformulation or continuation signals based on whether each response matches its known preference profile. Apply BPS updates after each interaction. After 50, 100, and 200 interactions, evaluate whether the adapter model now matches the simulated user's preference profile better than the base model.

Success criterion: At 200 interactions, adapter model response style matches simulated user preference on blind evaluation. Reformulation rate decreases monotonically over interactions. PΔ score (from the IpMW companion paper) reaches at least +0.15.

Primary metric: PΔ at 50 / 100 / 200 interactions
Secondary metric: Reformulation rate per 10-interaction window (should decrease)
Simulated users: Minimum 3 distinct preference profiles
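The secondary metric, a non-increasing reformulation rate per 10-interaction window, can be checked mechanically. The event sequence below is illustrative:

```python
def reformulation_rate_decreasing(events, window=10):
    """Tier 2 secondary metric: the reformulation rate per 10-interaction
    window should not increase as the adapter calibrates."""
    rates = [sum(events[i:i + window]) / window
             for i in range(0, len(events) - window + 1, window)]
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))

# 1 = reformulation, 0 = continuation; 30 simulated interactions in which
# reformulations thin out (window rates 0.5 -> 0.3 -> 0.1).
events = ([1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
          + [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
          + [0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
assert reformulation_rate_decreasing(events)
```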
Tier 3 — Human Pilot Study · 8–12 participants · 2 weeks

Setup: Recruit 8–12 participants with diverse professional backgrounds and interaction styles. Each participant uses the system as their primary AI assistant for two weeks. The system is split — half of participants use the BPS-updated adapter (treatment group), half use a frozen adapter that receives no updates (control group). Neither group is told which condition they are in.

Procedure: At days 1, 7, and 14, administer a blind preference evaluation: present each participant with 20 matched query-response pairs — one generated by their adapter model, one by the base model — and ask which they prefer without revealing the source. Record reformulation rates throughout the study period. At study end, interview participants about their experience.

Success criterion: Treatment group participants prefer their adapter model responses over base model responses at statistically significant rates by day 14. Treatment group reformulation rate decreases over the study period. Control group shows neither effect.

Primary metric: Blind preference rate (treatment vs control) at day 14
Secondary metric: Reformulation rate slope over study period
Sample size: 8–12 participants (power analysis suggests 8 is sufficient for the expected effect size)
Ethics: IRB approval required · Participants informed they are evaluating AI personalization · No sensitive data retained
Power instrumentation: Device power draw during adapter update cycles should be measured and reported, connecting to the IpMW companion paper metric.
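The day-14 significance check reduces to a one-sided exact binomial test against chance preference. The sketch below, with hypothetical pooled counts, uses only the standard library:

```python
from math import comb

def binomial_p_value(successes, trials, p0=0.5):
    """One-sided exact binomial test: probability of seeing at least
    `successes` adapter-preferred judgments if blind preference were
    pure chance (p0 = 0.5)."""
    return sum(comb(trials, k) * p0**k * (1 - p0)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical day-14 pooled result for a small treatment arm:
# 4 participants x 20 blind pairs = 80 judgments, 52 favoring the adapter.
p = binomial_p_value(52, 80)
assert p < 0.05   # significant at the conventional threshold
```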

Tier 1 can be completed with no external resources beyond compute. Tier 2 requires API access to a capable LLM. Tier 3 requires ethical approval and participant recruitment. The tiers are designed to be sequential — positive results at each tier justify the investment in the next. A research team with access to Tier 1 validation could publish that result alone as a meaningful contribution to the privacy-preserving personalization literature.

→ [2, 3, 10]


Section VII
Open Questions

This paper specifies the mechanism and validation design but does not close every question. The following open questions are honest gaps — not rhetorical — and should be treated as research directions rather than objections.

01
Cold start — how many interactions before PΔ becomes meaningful?
A freshly deployed adapter with empty weights has PΔ = 0. The mechanism requires interactions to accumulate before calibration begins. The simulation validation will reveal the interaction count required for meaningful PΔ improvement, but this is not yet known. 50 interactions? 200? The answer has significant implications for how useful the system is in the first days of deployment.
02
Preference drift — what happens when the user's genuine preferences change?
The recency decay is designed to allow the adapter to evolve as the user changes. But the optimal decay constant λ is not known. Too fast and the adapter oscillates; too slow and it fails to track genuine preference evolution. The human pilot study will provide data, but the right λ may vary by user and deployment context.
03
Cross-domain generalization — does adapter calibration in one domain transfer to another?
If a user primarily interacts in their professional domain, the adapter calibrates to their professional communication style. Does that calibration transfer when they ask questions in an unfamiliar domain? The LoRA parameterization suggests some generalization should occur, but the extent is unknown and depends on the overlap between domain representations in the base model.
04
Optimal drift threshold — is 15% the right cumulative norm constraint?
The 15% norm constraint is a design choice, not a derived constant. The right threshold balances personalization depth against capability preservation. Too tight and the adapter cannot meaningfully differentiate from the base model. Too loose and capability degradation becomes a real risk. Empirical validation across diverse user types is required to establish this threshold with confidence.

None of these open questions undermine the mechanism. They are the natural boundary of what a position paper can establish without empirical data. The validation design in Section VI is designed precisely to answer questions 01 through 04. The mechanism is specified with enough precision to run that validation today.

→ [2, 3, 10]


References
[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290. — Foundational DPO algorithm; the BPS update rule is a privacy-preserving, on-device adaptation of this objective.
[2] Zhao, A. et al. (2025). PrefEval: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. ICLR 2025 (Oral). — Demonstrates that current LLMs fail to follow implicit preferences at conversation scale; motivates the adapter-based approach.
[3] Li, J. et al. (2024). Personalized Language Modeling from Personalized Human Feedback (P-RLHF). ICLR 2025. — Closest existing work to the BPS mechanism; handles implicit preferences but requires server-side aggregation and gradient transmission. BPS eliminates both requirements.
[4] Samuelson, P. A. (1948). Consumption Theory in Terms of Revealed Preference. Economica, 15(60), 243–253. — Foundational revealed preference theory; grounds the BPS behavioral signal in 70-year-old economic decision theory.
[5] Zollo, F. et al. (2024). PersonalLLM: Tailoring LLMs to Individual Preferences. ICLR 2025. — Benchmark for LLM personalization; establishes that diverse user preferences are learnable from interaction history.
[6] Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. — Low-rank parameterization used for the adapter weights updated by the BPS mechanism.
[7] Kopiczko, D. J. et al. (2025). Low-rank adaptation for edge AI. Scientific Reports. — Edge-specific LoRA deployment; grounds the on-device training loop in production-validated methodology.
[8] Apple Machine Learning Research (2024/2025). Apple Intelligence Foundation Language Models. machinelearning.apple.com. — Production precedent for the frozen-base-plus-LoRA-adapter pattern on consumer hardware; validates the hardware feasibility of on-device adapter updates.
[9] Horvat, J. (2026). Applying HAZOP to AI Systems. Public domain. — The HAZOP register for the full architecture identifies adapter over-training (Node 3, "More" guide word) as a Medium-severity risk; the drift prevention mechanism in Section V is the required action for that register entry.
[10] Tummalapalli, P. et al. (2026). LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load. arXiv:2603.23640. — Establishes real-world power and thermal constraints for on-device LLM operations; grounds the on-device training loop feasibility claim.
Public domain. No rights reserved. Build freely.
Jonathan Horvat · 2026