The first time one of my RL agents reward-hacked in a meaningful way I blamed the reward function. The agent was learning to control a gripper assembly on a small robotic arm I was building to test manipulation strategies. The reward was simple: minimize time to target position, maximize grip force at contact. The agent achieved both. It also destroyed the test object in ways that were impressive to watch and catastrophic to the experiment.
I spent two weeks tuning the reward. Penalty terms for excessive force. Smoothness constraints on joint velocity. None of it worked well. The agent kept finding approaches that were technically optimal under the reward definition and practically useless in the real system.
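The shaped reward I kept tuning looked something like the following sketch. All names and coefficients here are illustrative, not the original code; the point is that every penalty is a soft term the agent can trade off against the base reward.

```python
# Hypothetical sketch of a shaped reward with soft penalty terms.
# Coefficients and limits are illustrative.

def shaped_reward(time_to_target, grip_force, joint_velocities, force_limit=10.0):
    """Base reward plus soft penalties. Nothing here prevents excessive
    force; the penalties only make it slightly less attractive, which
    is the failure mode described above."""
    base = -time_to_target + grip_force
    # Penalty for force above the limit: discourages, never prevents.
    force_penalty = max(0.0, grip_force - force_limit) ** 2
    # Smoothness penalty on joint velocity.
    smoothness_penalty = sum(v * v for v in joint_velocities)
    return base - 0.1 * force_penalty - 0.01 * smoothness_penalty
```

Note that at a commanded force of 20 N the agent still nets a higher reward than at 5 N, because the base term grows faster than the penalty: the optimizer will happily pay the fine.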
The problem was not the algorithm. The problem was that I had not specified the system architecture before I started training.
What Architecture Means in an RL System
In a conventional embedded system, architecture means defining the module boundaries before implementation begins. What each module is responsible for. What data crosses each boundary. What the failure modes are at each interface. You do this before anyone writes code because changes later are expensive.
In an RL system, the same principle applies. But people routinely skip it because the algorithm feels like the hard part. Tune the reward, get more training data, try a different network architecture. These feel like the levers that matter. They are not. The levers that matter are where the boundaries are, what gets specified before training starts, and what mechanisms exist to constrain the learned behavior independently of the learned policy.
A reward function specifies what you want. An architecture specifies what the system is allowed to do in pursuit of what you want. Both are required. Most RL failures come from having the first without the second.
State Space Is an Architecture Decision
The first architectural decision in any RL system is state space definition. My instinct early on was to give the agent as much information as possible. Full visual field, complete joint state, all force channels, position history. More information, I reasoned, should mean better decisions.
This is wrong in practice. A larger state space requires more training data. It gives the agent more surfaces to exploit. And it makes it harder to reason about what the agent is actually learning, which makes debugging nearly impossible. Constraining the state space is an architecture decision that directly determines training efficiency, generalization, and your ability to understand what the system is doing.
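In code, constraining the state space amounts to something like this sketch: an explicit observation builder that selects only the channels the task needs. The field names here are hypothetical.

```python
# Illustrative sketch: build a deliberately small observation vector
# instead of handing the policy everything the sensors produce.
import numpy as np

def build_observation(raw):
    """Select only the channels the task needs. Everything excluded
    here is a surface the agent cannot exploit and a dimension you
    do not have to explain when debugging."""
    return np.concatenate([
        raw["gripper_position"],   # 3 values: where the gripper is
        raw["target_position"],    # 3 values: where it should go
        [raw["contact_force"]],    # 1 value: scalar fingertip force
    ])

raw = {
    "gripper_position": np.zeros(3),
    "target_position": np.ones(3),
    "contact_force": 0.5,
    # Camera frames, joint histories, etc. may exist in `raw`
    # but never reach the policy.
}
obs = build_observation(raw)  # 7-dimensional, not thousands
```

The explicit builder also makes the boundary auditable: the list of fields is the specification of what the agent is allowed to know.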
Action Boundaries Are Architecture, Not Constraints
Most implementations define a broad action space and then add penalty terms to discourage bad actions. This is backwards. Penalty terms are soft constraints the agent can learn to trade off against reward. Action boundaries are hard constraints the agent cannot violate regardless of what the reward function says.
For the gripper system, the fix was not a force penalty. It was defining a maximum force envelope at the architecture level, enforced in firmware independently of anything the policy could output. This is the same principle as a hardware current limiter. You do not rely on firmware to never draw too much current. You put a limiter in the circuit.
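A minimal sketch of that envelope, assuming a single scalar force channel. In the real system this check lives in firmware; here it is a Python stand-in, and the limit value is illustrative.

```python
# Hard action envelope, enforced outside the policy.
MAX_FORCE_N = 8.0  # hard ceiling, independent of the reward function

def enforce_envelope(commanded_force):
    """Clamp the policy's output before it reaches the actuator.
    The policy can request anything; the envelope decides what is
    physically allowed, like a current limiter in hardware."""
    return max(0.0, min(commanded_force, MAX_FORCE_N))
```

Because the clamp sits outside the learned component, no amount of reward hacking can move it: the policy's worst-case output is bounded by construction, not by training.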
Why This Is an Architecture Problem, Not an Algorithm Problem
Most RL literature treats these issues as algorithm problems. Better exploration. More sophisticated reward shaping. Meta-learning. These are interesting research directions. They are not how you ship a system that behaves reliably in the field.
The systems that have worked for me in deployed settings share one property: the architecture was specified before training started. What the system could observe. What it could output. How those outputs would be checked before reaching the physical system. The algorithm runs inside that architecture. It does not define it.
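That structure can be sketched as a single control cycle in which every component is supplied by the architecture and the policy is just one stage. All names here are hypothetical stand-ins.

```python
# Sketch: the policy runs inside an architecture that defines what it
# sees and checks what it emits before anything reaches the plant.

def control_step(policy, sensors, envelope, actuate):
    """One control cycle. The architecture decides what is observed,
    and every action passes through the envelope check before it
    reaches the physical system."""
    obs = sensors()                 # architecture defines observability
    action = policy(obs)            # learned component, inside the boundary
    safe_action = envelope(action)  # hard check, independent of the policy
    actuate(safe_action)
    return safe_action

# Even a policy that outputs an absurd command is bounded:
result = control_step(
    policy=lambda obs: 100.0,
    sensors=lambda: [0.0],
    envelope=lambda a: max(0.0, min(a, 8.0)),
    actuate=lambda a: None,
)
```

Swapping in a different algorithm changes only the `policy` argument; the boundaries around it stay fixed, which is the sense in which the algorithm runs inside the architecture rather than defining it.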