Robot reasoning: why data is not enough


Robots aren’t just software. They’re AI in the physical world. And that changes everything.

In this episode of TechFirst, host John Koetsier sits down with Ali Farhadi, CEO of the Allen Institute for AI, to unpack one of the biggest debates in robotics today: is data enough, or do robots need structured reasoning to truly understand the world?

Ali explains why physical AI demands more than massive datasets, how concepts like reasoning in space and time differ from language-based chain-of-thought, and why transparency is essential for safety, trust, and human–robot collaboration. We dive deep into MolmoAct, an open model designed to make robot decision-making visible, steerable, and auditable, and talk about why open research may be the fastest path to scalable robotics.

Watch our conversation here:

This conversation also explores:

  • Why reasoning looks different in the physical world
  • How robots can project intent before acting
  • The limits of “data-only” approaches
  • Trust, safety, and transparency in real-world robotics
  • Edge vs cloud AI for physical systems
  • Why open-source models matter for global AI progress

Here’s the full transcript …

Transcript: robot reasoning, and why data is not enough

Note: this is a partially AI-generated transcript. It may not be 100% correct. Check the video for exact quotations.

John Koetsier

One of the biggest questions in robotics right now is Plato versus Aristotle. And no, I’m not even joking. Hello and welcome to TechFirst. My name is John Koetsier. Robots are AI in the real world — physical AI, if you will. To make that real you need some means of sensing, understanding the world, your relationship with it, how it moves, how you move, and how that all reacts.

It’s a complex problem — extremely complex. Most robotics companies try to solve it with data: more, more, more, more data. That’s Aristotelian empiricism, if you will.

One company likes data, but also wants to use structured reasoning to actually understand the world, map its way through it, and make that logic and those maps visible to us. That’s Platonic rationalism. Could this be the key to scalable, generalizable robotics models?

We’re going to find out. We’re chatting with the head of the Allen Institute for AI. The Allen Institute was founded by Microsoft co-founder Paul Allen in 2014 to fund science at scale, and AI is a major part of that. He’s a prominent computer vision researcher, a professor at the University of Washington, an Apple alum, and his name is Ali Farhadi. Welcome, Ali. How are you doing?

Ali Farhadi

Hello, hello. Great talking to you.

John Koetsier

Awesome. Super pumped to have you. How awful was my Plato and Aristotle analogy?

Ali Farhadi

You have your finger on the right spot. These are all the important questions that we need to start thinking about seriously. We do believe that AI in the physical world, in whatever shape and morphology it appears, is going to play a pivotal role in how we work, how we live, how we interact, and how we move around. All of these are great examples of what it means for AI to hit the physical world.

When it does, the kinds of questions that we ask might be a little different from the ones we ask about AI in text or images. Reasoning — a topic that all of us are fascinated with in AI and work a lot on — has a different manifestation when we start thinking about reasoning in the physical world.

We often think of reasoning as solving math problems, coding problems, or tests. In the physical world, reasoning comes in many different forms and requires subtle but crucial things to be considered as we act in the world. The subtle movement of a hand, a subtle movement of the eye, the raise of an eyebrow — all of these convey a significant amount of meaning in the physical world that we need to take into account as we participate in it.

Human beings are really good at projecting into the future. If a person in front of me extends their hand, I know what they’re doing. I mentally prepare for that. I even move my body and position myself accordingly. I approach someone differently if they open both of their arms. That’s a totally different reaction.

Something as simple as opening a door — we position ourselves so the door doesn’t hit us and our hand can reach the handle. Before approaching the door, I already know: what kind of door is this? How can I open it? What’s the implication of opening it? What’s going to happen after I open it?

So there’s a fair amount of understanding of what the world does, the dynamics of the world, other agents in the world, what they’re doing, understanding their perspective, understanding why they’re doing what they’re doing, and planning for that.

Many of these pieces have long been questions within robotics, computer vision, and embodied AI communities. We’ve moved the needle, but there’s still a lot to be covered. I do agree the next revolution will be AI in the physical world. We need to reason beyond the standard ways we reason in AI today.

Data is an absolute must-have. We have to have data to be able to act. But also structured reasoning — figuring out and understanding what it means to reason about space and time, understanding perspective-taking — all of these are components of this. Some will come from data. Some might need different forms. As a community, we’re still scratching our heads trying to understand all the pieces.

John Koetsier

It’s super interesting. I had to think of multiple intelligences as you were talking. Humans have intelligence around science and math, around emotional things, and around physical things.

You had an amazing football game in your city just last night — Seahawks and Rams — and physical intelligence was on high display, along with mental intelligence. You released MolmoAct this summer. It’s an action reasoning model for robots. You’re using training data and getting amazing results, but you’re adding this reasoning component. Talk about that. What is MolmoAct? What does it do? Why is it important?

Ali Farhadi

MolmoAct is part of a series of open-source models we’re deploying to empower communities working on multimodal worlds, embodied AI, and robotics. MolmoAct specifically addresses one problem: if I want to act in the physical world, I need a way to reason.

In text, we have chain-of-thought — laying out steps to solve a problem. In robotics, this gets complicated because language isn’t the right manifestation for describing the physical world. I can’t realistically say, “Now I want to move my hand from this X, Y, Z position to that one.”

What was missing was a physical equivalent of chain-of-thought. MolmoAct takes a step in that direction by producing trajectories — curves in state space — as a step-by-step reasoning process. These trajectories act like a chain-of-thought for physical actions.

Just like reasoning in text, you explore a path, realize it’s not right, backtrack, and try another. MolmoAct builds on our earlier Molmo work, where we introduced pointing as a way to ground reasoning. If you’re making inferences about an image, you should be able to point to where those inferences come from.

That idea became popular in the community, especially for grounded reasoning and robotics. MolmoAct extends pointing into trajectories in the real world — a sequence of actions a robot hand should take to achieve a goal.
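To make that concrete, here is a minimal sketch, in Python, of what a trajectory-as-chain-of-thought interface could look like. Everything in it (the `ReasoningTrace` type, `plan_then_act`, the field names) is illustrative rather than MolmoAct’s actual API; the point is simply that the plan is a visible object you can inspect before a single motor command runs.

```python
# Hypothetical sketch, not MolmoAct's real interface: the model emits an
# inspectable trajectory (a curve in state space) *before* anything moves.
from dataclasses import dataclass

import numpy as np


@dataclass
class ReasoningTrace:
    """A physical 'chain of thought': grounded points plus a waypoint curve."""
    grounding: list[tuple[int, int]]   # pixel coordinates the model points at
    waypoints: np.ndarray              # (N, 3) xyz positions in the robot frame
    gripper: np.ndarray                # (N,) gripper open/close commands


def plan_then_act(trace: ReasoningTrace, execute) -> None:
    """Expose the full plan for inspection, then execute it step by step."""
    # The trajectory is visible (and auditable) before any command is sent.
    print(f"Proposed plan: {len(trace.waypoints)} waypoints, "
          f"grounded at {trace.grounding}")
    for xyz, grip in zip(trace.waypoints, trace.gripper):
        execute(xyz, grip)  # send one low-level command per waypoint
```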

John Koetsier

That’s really important as robots start interacting with humans. We want them to plan actions without hurting people, and we want to rewind and understand why they did something. That ability to diagnose and repair faulty reasoning is critical, right?

Ali Farhadi

Absolutely. You read my mind.

Think about interacting with a child. You ask them to grab a mug, and there are two mugs on the table. They reach for one, and you realize it’s not the one you meant. You can correct them mid-action: “Not that one — the blue one.” You understand their intent before the action finishes.

That level of transparency is essential for robotics. As physical agents operate in the real world, humans need to understand what’s about to happen before it happens. That gives us assurance, understanding of intent, and the ability to plan.

Transparency is also essential for safety, auditability, and certification. These agents have mass and momentum. We need trust. End users — homes, factories, streets, airspace — all need transparency.

The third piece is steerability. As we interact more with robots, we want to dynamically guide them. To do that, we need to understand their intent and have a way to change it. With MolmoAct, you can see a trajectory and adjust it mid-way — on a tablet, for example — and the robot adapts. That’s the “not that one, the blue one” scenario.
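Here is a hedged sketch of that steerability loop: between waypoints, the executor checks whether a human has redrawn the remainder of the trajectory and splices the edit in. `get_user_edit` and `move_to` are hypothetical stand-ins, not a real robot API.

```python
# Illustrative only: check for a human edit between steps and splice it in.
import numpy as np


def execute_steerable(waypoints: np.ndarray, move_to, get_user_edit) -> None:
    i = 0
    while i < len(waypoints):
        edit = get_user_edit()            # None, or a replacement (M, 3) array
        if edit is not None:
            # "Not that mug, the blue one": keep the completed prefix,
            # replace the rest of the plan with the human's correction.
            waypoints = np.vstack([waypoints[:i], edit])
        move_to(waypoints[i])             # execute one step of the visible plan
        i += 1
```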

John Koetsier

I love that you used the word trust. Trust comes from predictability. When I can predict what a robot is going to do and those predictions come true, I’m not surprised by random motions.

What’s broken with a data-only approach? Data has given us a lot, especially with LLMs, but it’s not the whole story.

Ali Farhadi

I’m not a fan of framing this as data-only versus reasoning. Even reasoning relies on data. Data is foundational.

The difference is how data is ingested. One approach is end-to-end black boxes: input goes in, output comes out. Another introduces intermediate representations — what we might call reasoning — explicit structures, loss functions, or trajectories that expose inner workings.

I don’t think we’re at a point where we can say which approach will win, and I don’t think it’s an either-or. Data-driven approaches work. Scaling data works. We’re seeing it help in the physical world too — but it won’t solve everything.

We need to identify gaps and fill them. Explicit representations that enable transparency — and earn trust — are a promising middle ground.
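The distinction Ali is drawing can be shown in a few lines. Both functions below are illustrative stubs (`policy`, `reasoner`, and `controller` stand in for learned components, not any specific model), but they capture the difference between a black box and a pipeline with an inspectable middle.

```python
# Two ways to ingest the same data; both functions are conceptual stubs.

def end_to_end(obs, policy):
    """Black box: observation in, motor command out, nothing exposed between."""
    return policy(obs)


def with_intermediate_representation(obs, reasoner, controller):
    """Same data, but an explicit, inspectable trajectory sits in the middle."""
    trajectory = reasoner(obs)        # auditable: can be logged, shown, edited
    return controller(obs, trajectory)
```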

John Koetsier

This feels similar to the AGI debate. Data has taken us far, but people like Yann LeCun argue it’s not the full answer. You’re bringing that thinking to physical AI. You open-sourced the model — why?

Ali Farhadi

We’re where we are in AI because of one phenomenon: global collaboration. We build, share, critique, and build again. That incremental process has been the most effective way to innovate.

As models close up, we slow progress. Many trained experts are now sidelined because they don’t have tens of thousands of GPUs or millions of dollars. That hurts all of us.

We wanted to enable participation — to give the community building blocks to train, understand, interrogate, and extend these models. Early on, people said it was too expensive for anyone else to do this. That story changed. Then came the “AI is too dangerous” story. That changed too.

The fastest, most economical, and most trustworthy approach is openness. Siloed progress is always slower than collective progress. This even ties into the US versus China debate — open collaboration accelerates innovation.

John Koetsier

What’s next? What other blocks need to be built?

Ali Farhadi

Trust still needs to be earned. That comes from predictability, understanding, and auditability. Even if I can’t fully audit a system, a trusted third party can — and that adds value.

John Koetsier

What classes of robots can use this? Humanoids, quadrupeds, wheeled robots?

Ali Farhadi

Absolutely. MolmoAct applies across morphologies. We’ve also released Molmo 2, which extends these ideas into video and temporal reasoning — understanding what happened before, what happens next, and reasoning across space and time.

You can now point in time as well as space. Track objects, ask questions about long videos, and reason across multiple images. There’s still a lot to do in spatio-temporal reasoning, manipulation, and scaling robotics.

Robotics scaling is different from other AI models because of the physical world. Simulation plays a big role. Large data helps. Inference-time thinking — adapting dynamically in the real world — will be critical.

We’ll need lighter, faster, more efficient models.

John Koetsier

Perfect segue to edge versus cloud. Latency, power, and privacy matter. You don’t want a five-second delay asking a cloud model what happens if you move your arm. How do you see this playing out?

Ali Farhadi

Because of latency, power, and privacy constraints, many robots need to operate on the edge. Some tasks will still go to the cloud in a privacy-preserving, trusted way. Hybrid architectures will take many forms, and we’re still exploring what works best.

Chips will get more powerful and efficient. Algorithms will get lighter and faster. We’ll likely move toward specialized on-device models rather than large generic ones.
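One way that hybrid split could look, as a hedged sketch: a small on-device model owns the latency-critical control loop, and the cloud is consulted only when there is budget to spare, with private data scrubbed first. All names here are hypothetical stand-ins, not any particular system.

```python
# Hypothetical hybrid edge/cloud control step; every callback is a stand-in.
import time


def control_step(obs, edge_model, cloud_submit, scrub, deadline_ms: float = 50.0):
    """Always act locally within the deadline; offload replanning opportunistically."""
    start = time.monotonic()
    action = edge_model(obs)                       # fast, local, always available
    elapsed_ms = (time.monotonic() - start) * 1e3
    if elapsed_ms < deadline_ms / 2:
        # Spare budget: send a privacy-scrubbed snapshot to the cloud for
        # slower, long-horizon planning; the result informs a later tick.
        cloud_submit(scrub(obs))
    return action
```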

John Koetsier

That makes sense — kind of like LLMs versus SLMs. Small, efficient models that do exactly what they need to do.

This has been super interesting and informative. Thank you so much for your time.

Ali Farhadi

Absolutely. Great time with you.
