World models: LLMs are not enough


AI has mastered language, sort of. Mostly. With some occasional hiccups. But the real world is way messier. For the real world — and robotics — we need world models.

In this episode of TechFirst, John Koetsier sits down with Kirin Sinha, founder and CEO of Illumix, to explore what comes after large language models: world models, spatial intelligence, and physical AI.

They unpack why LLMs alone won’t get us to human-level intelligence, what it actually takes for machines to understand physical space, and how technologies born in augmented reality are now powering robotics, wearables, and real-world AI systems.

Check out our conversation here:

This conversation goes deep on:

  • What “world models” really are — and why everyone from Fei-Fei Li to Jeff Bezos is betting on them
  • Why continuous video and outward-facing cameras are so hard for AI
  • The perception stack behind robots and smart glasses
  • Edge vs cloud compute — and why latency and privacy matter more than ever
  • How AR laid the groundwork for the next generation of physical intelligence

If you’re building or betting on robotics, smart wearables, AR, or physical AI, this episode explains the infrastructure shift that’s already underway.

Transcript: world models and LLMs

Note: this is a partially AI-generated transcript. It may not be 100% correct. Check the video for exact quotations.

John Koetsier

We are totally raising the bar on what it means to call matter smart. Hello and welcome to TechFirst. My name is John Koetsier. I’ve been talking and writing about smart matter for a long time. Now, that stuff with chips, radios, sensors, motors—there’s more and more of it every single day. But what we used to call smart isn’t that impressive anymore. You can’t just slap a Wi-Fi radio or Bluetooth into something. That doesn’t cut it.

Now we’re talking about spatial intelligence. We’re talking about physical AI. It requires much more.

We’re talking world models. We’re talking not just standard LLMs, but AIs that can ingest continuous streaming video, which is crazy hard, and then do things intelligently in the real physical world where we also exist—the human world.

Our guest today has been a Wall Street quant. She has not one but two master’s degrees, one in machine learning. She’s the founder and CEO of Illumix, which makes augmented reality tech for companies like Disney and Six Flags, but is now also turning to robotics, wearables—everything physical. Welcome, Kirin. How are you doing?

Kirin Sinha

Hi, thanks so much for having me.

John Koetsier

Super pumped to have this conversation. It is so, so topical right now. I want to start off with a big question. We’ll go where we want to go after that. What’s a world model? Why does it matter?

Kirin Sinha

World models have been a big part of the conversation lately. But just to back up a second: obviously, LLMs have been the huge topic of the past two years. What they do incredibly well is understand and output text, at a level that feels close to human intelligence. And that is a really sharp departure from what we saw with earlier deep learning and AI models.

That’s led to all of the incredible applications we see today. And I think what we’re now running into is: where are the boundaries of how far that can really take us? We’ve seen very recently Yann LeCun, one of the godfathers of deep learning, come out and say LLMs are never going to get us to AGI. This is never going to be what human intelligence looks like. And that’s caused quite a stir.

We have some pretty major players—everything from Fei-Fei Li with World Labs all the way over to Jeff Bezos with Prometheus—who are looking at this next wave, which is really world models.

John Koetsier

Mm-hmm.

Kirin Sinha

World models are more focused on physical space—ingesting and simulating the real-world environment. You can see obvious applications for this. For example, with gaming, how can we create these highly expansive, complex worlds in a whole different way with a single continuous model? I think that’s really interesting.

Robotics is the other big one you tend to hear about. How can we simulate all of these different environments so that robots can learn more effectively how to operate in the real world?

So there’s this argument that moving toward physics, rather than language—which is where LLMs live—is the wave du jour in AI.

For what we do, which isn’t explicitly world models, I would say if we think about LLMs as language and world models as physical simulation, what we do at Illumix is almost like the runtime. How do we actually make these AI models operate in the real world?

That’s the last-mile question—moving away from simulation, away from the purely digital environments that LLMs and world models functionally live in, and into how we actually bring AI into the real world in a way that gets closer to human-level intelligence. We as humans operate and apply our intelligence to the real world all day, every day, but for machines those are really hard problems. That’s what Illumix has been focused on at the core stack level for the last eight and a half years.

John Koetsier

It’s kind of insane, actually. You’ve done a ton in AR, augmented reality, XR—all that stuff, right? Some of that is game-related, some of that is real-world related. And that was such a big deal—the metaverse, Meta investing billions and tens of billions.

And then the focus really shifted. I can tell you’re in San Francisco—I hear the siren. That brings me right back every time I’m in meetings there.

Meta invested huge there, and now the rage for the last two years has been AI in the digital realm, as you mentioned. But it’s amazing how what came before is coming around again, because if we want AI to have real-world impact—whether that’s robotics, humanoid robotics, whatever—anything in your space, in your workspace, in an office, in your home needs to understand a world.

And guess what? Augmented reality does that. Augmented reality understands that there’s a thing, and it’s this far in front of me, and there’s another thing, and it’s that far away. There’s a floor, a plane, a surface, a ceiling. All of those things so I can understand a space that I’m operating in and operate in it safely.

It’s kind of crazy and interesting that many of the investments we put into that space are actually bearing unexpected fruit now.

Kirin Sinha

Absolutely. I think it’s important to separate foundational infrastructure that understands space from the way we engage with it. AR is a way of engaging with that infrastructure—adding digital content to physical space.

Then there’s the wider VR and metaverse umbrella, which is really about living fully in a digital space. I’ve always felt that combining all of this under one “XR” or “metaverse” blanket is too broad, because they’re actually solving different problems with different outcomes.

The idea of the metaverse as a game universe, or even what VR is trying to solve—where we are the thing that’s real and everything else is digital—is a really different use case. It has different constraints and different opportunities than real-world applications.

Historically, that’s where AR has focused: understanding space and building the foundational blocks to add digital elements into our world. The common example is Pokémon Go—the Pokémon are in our world—but how do we take that beyond “I know my GPS coordinate” and actually understand the scene?

When we think about this, we break it down into three categories. First is spatial perception: what are the things around us? What’s the geometry of the space? What’s the depth? How do we understand the 3D elements?

Humans are incredibly good at nuance here. Lighting changes, a kid throws a blanket on a chair, a new chair appears. Humans instantly know it’s the same space. For computers, that’s actually really challenging.

Second is scene understanding. What’s actually happening in the space? What’s the story we can tell about it? It’s not just geometry. We’re in a library, books are strewn out, so someone was probably researching something. That kind of semantic understanding.

Third is contextual intelligence. Okay, we understand what’s happening—what should we do about it? What do we know about you? What should we store in memory? How do we personalize the right information for this scene?

You could be in that library for many reasons. You could be cleaning. You could be researching. How do we know the right action to take once we understand the space? That infrastructure layer is really where we’ve dug in.
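To make that three-layer split concrete, here is a minimal sketch of how such a stack could be wired together. The class names, fields, and logic are hypothetical illustrations, not Illumix’s actual architecture.

```python
# Illustrative sketch of the three layers described above.
# Names and logic are hypothetical, not Illumix's actual stack.

class SpatialPerception:
    """Layer 1: geometry of the space (depth, planes, objects)."""
    def process(self, frame):
        # A real system would run depth estimation, plane detection,
        # relocalization, etc. Here we return placeholder geometry.
        return {"planes": ["floor", "table"], "objects": ["chair", "books"]}

class SceneUnderstanding:
    """Layer 2: semantics, i.e. what story the geometry tells."""
    def interpret(self, geometry):
        if "books" in geometry["objects"]:
            return {"place": "library", "activity": "researching"}
        return {"place": "unknown", "activity": "unknown"}

class ContextualIntelligence:
    """Layer 3: decide what to do, given the scene and user memory."""
    def __init__(self, user_memory):
        self.user_memory = user_memory

    def decide(self, scene):
        if scene["activity"] == "researching":
            return "surface the user's related notes"
        return "stay quiet"

# Per-frame pipeline: perception -> understanding -> contextual action
perception = SpatialPerception()
semantics = SceneUnderstanding()
context = ContextualIntelligence(user_memory={"projects": ["world models"]})

scene = semantics.interpret(perception.process(frame=None))
print(context.decide(scene))  # -> "surface the user's related notes"
```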

John Koetsier

It’s amazing to me. One of the phrases you threw out as we were prepping was that “the camera has flipped,” and that continuous video input is hard for machines. It’s crazy because we do that naturally as humans. We know what to focus on, what’s important in a given context, what’s not.

We can focus our relatively small CPUs on what matters. But for a robot—or smart glasses—those are incredibly hard problems.

Kirin Sinha

They really are. A lot of big tech companies have historically focused on the selfie-facing camera because most of them are social media companies. That’s a much more contained problem. Faces have common features—eyes, nose, mouth—that you can anchor on.

Understanding the outward-facing world is much harder because it could literally be anything. That’s a big, unruly problem.

We’ve had much more research and deployment around the selfie-facing model, and now we’re seeing this camera flip where everyone is interested in the outward-facing camera. For physical AI, robotics, wearables—that’s now the most interesting problem.

Historically, there was a lot of reliance on hardware solutions. A few years ago it was, “Let’s add LiDAR.” Autonomous vehicles, mobile devices—add LiDAR and that will solve it. But we’re seeing that adding more hardware is always challenging, especially for lightweight wearables.

Tesla famously skipped LiDAR in favor of camera-based systems. Cameras give you more information at a more efficient rate than other modalities. That’s a big industry shift.

John Koetsier

I want to talk about the perception stack. Robots need a brain. Wearables need to understand the world. What’s in a perception stack? How much intelligence is local? How much is in the cloud?

You never want a device to say, “Hold on, I’m querying ChatGPT,” and then ten seconds later you get an answer. We hate waiting on screens. We hate waiting even more in the real world.

Kirin Sinha

That’s the most important architectural question today: where should compute happen?

Different companies have different opinions. One area where we’ve differentiated is deciding what absolutely must happen on device.

Take visual positioning systems—knowing where you are. Historically, that was done in the cloud. “Give me a few seconds, let me figure it out.” That causes a lot of the jerkiness you see in AR, where things float or drift because corrections are coming back from the cloud.

One of the first things we did was build that architecture to run at the edge—hyper-efficiently, but with robustness comparable to large cloud models. That set our philosophy: things that matter for real-time user experience should run on device.

What’s become popular—even with LLMs—is “bigger is better.” Huge models, massive compute, everything in the cloud. But I think we’ll see diminishing returns there, especially for physical devices.

Distribution comes down to what’s lighter, cheaper, and easier to ship. The more we can do on device with a small footprint, the more valuable the device becomes.
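As a rough sketch of that philosophy, anything latency-critical for the real-time experience stays pinned to the device, while heavier ambient work can tolerate a round trip to the cloud. The task names and latency budgets below are assumptions for illustration only.

```python
# Illustrative edge-vs-cloud routing for perception workloads.
# Task names and latency budgets are assumptions, not real numbers.

EDGE_BUDGET_MS = 33  # roughly one frame at 30 fps

TASKS = {
    # task: (latency_budget_ms, needed_for_realtime_ux)
    "visual_positioning":      (16, True),
    "plane_tracking":          (33, True),
    "semantic_scene_labeling": (500, False),
    "long_term_memory_update": (5000, False),
}

def placement(task):
    budget_ms, realtime = TASKS[task]
    # Real-time UX work runs on device; slow ambient work goes to the cloud.
    return "edge" if realtime or budget_ms <= EDGE_BUDGET_MS else "cloud"

for task in TASKS:
    print(f"{task:26s} -> {placement(task)}")
```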

John Koetsier

It’s similar to LLMs versus smaller language models. You see humanoid robots trying to put massive NVIDIA GPUs inside, but you can’t do that with wearables. You have to be hyper-efficient.

Ultimately, it’s all about user experience. Fast, smooth, immediate. So how do you do that? Do you selectively run parts of the stack?

Kirin Sinha

That’s the DNA of the company. Achieving equivalent accuracy and quality to large models, but doing it efficiently at the edge.

Part of it is knowing what to run and when. You don’t want every part of the perception stack running full blast all the time. When do you need mapping? When do you need semantics? When do you need memory or a multimodal model?

That orchestration—having things work together efficiently—is one of the hardest problems.
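One way to picture that orchestration is a small controller that gates each module on the current context instead of running everything at full blast every frame. The module names and trigger conditions here are invented for illustration.

```python
# Toy orchestrator: run only the perception modules the moment calls for.
# Module names and trigger conditions are invented for illustration.

def choose_modules(state):
    modules = ["tracking"]                  # pose tracking stays always-on
    if state.get("new_area"):
        modules.append("mapping")           # only map unfamiliar spaces
    if state.get("scene_changed"):
        modules.append("semantics")         # re-label when the scene shifts
    if state.get("user_asked_question"):
        modules.append("multimodal_model")  # heavy model only on demand
    return modules

print(choose_modules({"new_area": True}))
print(choose_modules({"scene_changed": True, "user_asked_question": True}))
```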

John Koetsier

I love that analogy. When I’m playing ice hockey, I’m not doing calculus. Different parts of the brain activate. What’s the controller that decides what matters right now?

Kirin Sinha

That’s exactly the infrastructure we’re building. Understanding user intent is incredibly hard.

We use both on-device and cloud compute. Everything on device has a real-time target. Other things—ambient intelligence, long-term memory—can run in the cloud.

For wearables, the question is: what does “always on” really mean? You don’t want continuous high-power processing on device. That makes no sense.

Instead, we send passive data to the cloud, build memory, understand relevance, and only pull things down to the device when they matter in real time.

It’s similar to human short-term and long-term memory. You’re not constantly processing everything you’ve ever learned—only what’s relevant now.
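A sketch of that short-term versus long-term split: passive observations accumulate as memory in the cloud, and only the entries relevant to the current scene get pulled down to the device. The memory entries and relevance scoring are invented for illustration.

```python
# Illustrative relevance-based recall: long-term memory lives in the cloud,
# and only scene-relevant entries are pulled down to the device.
# Entries and scoring are invented for illustration.

cloud_memory = [
    {"fact": "user is researching world models", "tags": {"library", "books"}},
    {"fact": "dentist appointment at 3pm",       "tags": {"calendar"}},
    {"fact": "usual coffee order: flat white",   "tags": {"cafe", "coffee"}},
]

def pull_relevant(scene_tags, limit=1):
    scored = [(len(entry["tags"] & scene_tags), entry) for entry in cloud_memory]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [entry["fact"] for _, entry in scored[:limit]]

# Standing in a library: only the research-related memory comes down.
print(pull_relevant({"library", "books"}))
```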

Humans are incredibly good at spatial intelligence. That’s what our brains evolved to do. There’s no clear path today for LLMs or even world models to fully replicate that.

As we talk about AGI, that last-mile problem will be about infrastructure and runtime—bringing intelligence into physical reality.

John Koetsier

Will we need new chips for humanoids? Apple-style chips with multiple cores, different power levels, software controlling what’s active?

Kirin Sinha

Yes. Chips will evolve toward that model. Different devices will prioritize different power distributions depending on use case.

We’ll see more custom chips specialized for physical AI. The architecture may be similar, but how power is allocated will vary by industry and application.

John Koetsier

I just realized I treat some smart devices like dogs—spelling out words so they don’t react.

This has been super interesting. Whether it’s smart glasses, wearables, or robots, this is a new paradigm. Cameras looking at the world, not just at us. Privacy concerns, latency concerns—things we’ve been discussing for over a decade.

Kirin Sinha

Exactly. If everything goes to the cloud, even ignoring latency, you have real privacy issues. If processing happens locally, you mitigate many of those concerns.

We need to think architecturally about where data goes, who touches it, and who sees it.

John Koetsier

Thank you so much for this time, Kirin. I really appreciate your insights.

Kirin Sinha

Absolutely. Thanks so much for having me, John.

John Koetsier

Cool.
