Robotico
Physics First: Why FieldAI's Foundation Models Break the VLA Playbook

Physical Intelligence retrofits vision-language models. Skild AI scales imitation learning. FieldAI took a third road, one that starts from the physics of being wrong, and treats risk as a first-class citizen of the model.

Ben Knaus
5m read

The race to build a general-purpose robot brain has produced three major architectural bets. Physical Intelligence is adapting pre-trained vision-language models for robotic action. Skild AI is training one massive general model across different robot types using imitation learning. FieldAI chose neither path. That difference matters because the real question in robotics is not how smart the robot looks when everything goes right. It is what happens when things go wrong.

The VLA shortcut, and its cost

Most robotics foundation models today start with a vision-language model. The recipe is simple: take a transformer trained on internet images and text, add an action layer, and ask it to move the robot. That is the basic playbook behind vision-language-action (VLA) models such as Pi-Zero, OpenVLA, Google DeepMind’s RT-2, NVIDIA’s GR00T, and even Figure’s Helix. It helps teams move quickly, but it comes with a major problem. These models were never built for physics. They were trained to describe the world, not survive inside it.

They have no built-in understanding of risk, uncertainty, or physical consequence. A vision-language model hallucinating is annoying. A robot hallucinating is expensive: stepping into hydraulic fluid costs money, and stepping into a worker is a lawsuit. That is the gap.

FFMs start where physics starts

FieldAI’s Field Foundation Models (FFMs) were built physics-first. “We look at AI quite differently from what’s mainstream,” FieldAI founder and CEO Ali Agha told IEEE Spectrum. “We do very heavy probabilistic modeling.” In the FFM world, every perception is a distribution, every action is weighted against risk, and every decision carries an explicit measure of confidence.

The stack is decomposed into three tightly integrated components:

  • Dynamics Foundation Model (DFM): handles the physical behavior of the robot itself, such as how a step propagates through a quadruped’s joints, how a humanoid recovers from a near-fall, and how a wheeled base behaves on loose gravel. The DFM is what catches a slip before it becomes a crash.
  • Multi-Agent Foundation Model (MFM): coordinates multiple robots operating in the same space, covering fleet-level task allocation, collision avoidance, and shared world understanding. This is critical for construction sites and warehouses where several machines move simultaneously.
  • Safety and Risk Awareness layer: a probabilistic overlay that converts every perception and action into a risk-weighted distribution. This is where FieldAI’s Belief World Model lives: a predictive engine that reasons about uncertainty, distributes belief across possible futures, and selects behavior that remains safe under the worst plausible outcome.
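To make "safe under the worst plausible outcome" concrete, here is a minimal Python sketch of one way such a selection rule could look. It is not FieldAI's implementation: the function names, the cost numbers, and the `min_probability` cutoff that defines "plausible" are all assumptions made for illustration.

```python
def worst_plausible_cost(futures, min_probability=0.01):
    """Return the highest cost among futures considered plausible.

    futures: list of (probability, cost) pairs for one candidate action.
    min_probability: hypothetical cutoff below which a future is ignored.
    """
    plausible = [cost for probability, cost in futures
                 if probability >= min_probability]
    if not plausible:
        raise ValueError("no future meets the plausibility cutoff")
    return max(plausible)


def select_action(belief, min_probability=0.01):
    """Pick the action whose worst plausible future is least costly.

    belief: dict mapping an action name to its list of (probability, cost).
    """
    return min(
        belief,
        key=lambda action: worst_plausible_cost(belief[action], min_probability),
    )


# Illustrative numbers only: crossing is usually cheap but occasionally very
# costly; the detour is always a little slower but never disastrous.
belief = {
    "cross_wet_patch": [(0.90, 1.0), (0.10, 50.0)],
    "detour": [(0.97, 3.0), (0.03, 5.0)],
}

cautious_choice = select_action(belief)                   # 10% slip counts as plausible
lenient_choice = select_action(belief, min_probability=0.2)  # 10% slip is ignored
```

With the default cutoff the sketch prefers the detour, because a 10 percent chance of a costly slip still counts as plausible. Raising the cutoff to 0.2 makes it ignore that future and cross. Where that line is drawn is exactly the kind of design decision a safety layer has to make explicit.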

Inference runs on the edge at sub-100 ms latency, with no cloud dependency. Cloud-based analytics and federated learning exist only for long-term refinement; nothing the robot needs to take its next step has to round-trip through the internet.

Probability distributions, not fixed values

FieldAI’s first patent gives one of the clearest looks into how this works. Instead of deciding yes or no, the robot thinks in distributions. It does not simply decide “I can step here.” It calculates that there may be a 94 percent chance the step works, a 5 percent chance of slipping, and a 1 percent chance the surface is worse than it appears.

That becomes the decision.

If failure is recoverable, the robot moves forward. If failure is catastrophic, it stops.
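As a rough illustration of that rule, the Python sketch below takes the 94/5/1 split from the example above and turns it into a go or no-go decision. The `Outcome` type, the `decide_step` function, and the `max_catastrophic_risk` threshold are hypothetical; the patent's actual formulation is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    probability: float
    recoverable: bool  # can the robot recover if this outcome occurs?


def decide_step(outcomes, max_catastrophic_risk=0.001):
    """Return "proceed" or "stop" from a distribution over step outcomes.

    max_catastrophic_risk is an assumed tolerance for the total probability
    of unrecoverable outcomes; it is not a published FieldAI parameter.
    """
    total = sum(o.probability for o in outcomes)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("outcome probabilities must sum to 1")
    catastrophic = sum(o.probability for o in outcomes if not o.recoverable)
    return "proceed" if catastrophic <= max_catastrophic_risk else "stop"


# Same probabilities, different consequences.
flat_floor = [
    Outcome("step works", 0.94, recoverable=True),
    Outcome("slip", 0.05, recoverable=True),
    Outcome("surface worse than it appears", 0.01, recoverable=True),
]
ledge = [
    Outcome("step works", 0.94, recoverable=True),
    Outcome("slip", 0.05, recoverable=True),
    Outcome("surface worse than it appears", 0.01, recoverable=False),
]

flat_floor_decision = decide_step(flat_floor)  # every failure is recoverable
ledge_decision = decide_step(ledge)            # 1% unrecoverable exceeds tolerance
```

The two scenarios share identical probabilities. Only the consequence of the 1 percent case differs, and that alone flips the decision from proceeding to stopping.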

This is how aviation works. This is how medicine works. This is how serious systems work. FieldAI compressed that logic into something fast enough to run on a robot in real time.

The three practical capabilities

In deployment, FFMs unlock three behaviors that the VLA approach struggles with:

  1. GPS-denied, map-less navigation. The robot builds its world model as a byproduct of moving through it, not as a prerequisite. This is why Spot running the FieldAI Brain can walk onto a construction site that changes daily without anyone re-uploading a floor plan.
  2. Risk-aware decision-making. When confidence drops, the system slows, re-perceives, or backs off, rather than confabulating. A VLA faced with an ambiguous scene will often produce a confidently wrong action. An FFM faced with the same scene produces a low-confidence output, and the safety layer translates low confidence into conservative behavior.
  3. Cross-embodiment transfer. Because the model reasons about physics (forces, torques, traversability) rather than about a specific motor configuration, the same core runs on a quadruped, a humanoid, a forklift, or a passenger vehicle. FieldAI has publicly demonstrated all four.
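The second behavior, translating low confidence into conservative action, can be pictured as a simple mapping from a confidence score to a behavior mode. The sketch below is an assumption-laden illustration: the `select_behavior` function, the mode names, and the threshold values are invented here and do not describe FieldAI's actual safety layer.

```python
def select_behavior(confidence, proceed_at=0.9, slow_at=0.7, reperceive_at=0.4):
    """Map a confidence score in [0, 1] to a behavior mode.

    The three thresholds are hypothetical and would be tuned per platform
    and per environment in any real system.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= proceed_at:
        return "proceed"
    if confidence >= slow_at:
        return "slow_down"
    if confidence >= reperceive_at:
        return "re_perceive"
    return "back_off"


# Confidence falling as a scene becomes more ambiguous.
modes = [select_behavior(c) for c in (0.95, 0.80, 0.50, 0.10)]
```

The point of the sketch is the shape of the policy: as confidence falls, the behavior degrades gracefully through slowing and re-sensing before retreating, instead of committing to a single confident guess.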

What operators actually see

Agha explains the customer experience in the simplest possible way: “Our customers don’t need to train anything. They don’t need precise maps. They press one button, and the robot discovers every corner.”

That is what customers actually buy.

They are not buying foundation models or technical architecture. They are buying simplicity.

“Scan this floor.” “Inspect this pipeline.” “Check this jobsite.”

The robot handles the rest.

That is the operational unlock, and it removes weeks of setup that traditionally came before deploying even a single inspection robot.

The open question

The real question is whether a physics-first approach scales as well as brute-force imitation learning.

Skild AI has more capital. Physical Intelligence has more compute. If the winner is simply whoever trains on the most data, FieldAI could lose.

But if reliability becomes the bottleneck, especially in safety-critical environments, the equation changes.

Early signs suggest that shift is underway.

Because eventually robotics stops being a demo and becomes infrastructure. Infrastructure cannot hallucinate.

If the VLA approach is the GPT moment for robotics (powerful, creative, and sometimes confidently wrong), then FieldAI is making a different argument.

Robotics does not need more confidence.

It needs better judgment.
