
Three ways AI is learning to understand the physical world


Large language models run into limitations in domains that require an understanding of the physical world, from robotics to autonomous driving to manufacturing. That gap has pushed investors toward world models, with AMI Labs raising $1.03 billion in seed funding shortly after World Labs raised $1 billion.

Large language models (LLMs) excel at processing ambiguous information by predicting the next token, but they lack causal grounding in reality: they cannot reliably predict the physical consequences of real-world actions.

AI researchers and thought leaders are increasingly talking about these limitations as the industry tries to move AI out of web browsers and into physical environments. In an interview with podcast host Dwarkesh Patel, Turing Award recipient Richard Sutton warned that LLMs simply imitate what people say instead of modeling the world, which limits their ability to learn from experience and adapt to changes in the world.

This is why models built on LLMs, including vision language models (VLMs), can appear robust yet break down when their inputs change only slightly.

Google DeepMind CEO Demis Hassabis echoed this view in another interview, stating that today’s AI models suffer from “jagged intelligence.” They can win math olympiads yet fail at basic physics because they lack an intuitive grasp of real-world dynamics.

To solve this problem, researchers are shifting focus to building world models that act as internal simulators, allowing AI systems to safely test hypotheses before taking physical action. However, “world models” is an umbrella term that covers several different architectural approaches.
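The “internal simulator” idea can be made concrete with a toy sketch (entirely illustrative, not any company’s implementation): an agent scores candidate actions inside a stand-in world model before committing to one in the real environment. The function names `world_model` and `plan` are hypothetical.

```python
def world_model(state, action):
    """Hypothetical learned simulator: predicts the next state.

    A stand-in for a trained model; here the 'dynamics' are just
    shifting a 1D position by the chosen action."""
    return state + action

def plan(state, candidate_actions, goal):
    """Pick the action whose *simulated* outcome lands closest to the goal,
    without ever acting in the real world."""
    return min(candidate_actions,
               key=lambda a: abs(goal - world_model(state, a)))

# Test three candidate actions mentally before acting.
best = plan(state=0, candidate_actions=[-1, 0, 1, 2], goal=2)
print(best)
```

The point of the pattern is that hypotheses are evaluated inside the model, so only the chosen action ever touches the physical environment.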

The result is three distinct architectural approaches, each with different tradeoffs.

JEPA: designed for real time

The first major approach focuses on learning hidden representations instead of trying to predict global dynamics at the pixel level. Backed by AMI Labs, this approach is largely based on the Joint Embedding Predictive Architecture (JEPA).

JEPA models try to mimic the way people understand the world. When we look at a scene, we do not memorize every pixel or insignificant detail. For example, if you watch a car moving down the road, you track its trajectory and speed; you don’t register the exact reflection of light off every leaf in the background.

V-JEPA architecture (source: Meta FAIR)

JEPA models reproduce this human shortcut. Instead of forcing the neural network to predict exactly what the next frame of a video will look like, the model learns a compact set of latent, or “hidden,” features. It discards irrelevant details and focuses entirely on the core rules of how the elements in the scene interact. This makes the model robust against background noise and the small input changes that break other models.
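The core JEPA idea can be sketched in a few lines (a toy illustration, not Meta’s implementation): encode the context and the target separately, predict the target’s embedding from the context’s embedding, and measure error in latent space rather than pixel space. All weights and dimensions below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Toy encoder: linear projection into a small latent space."""
    return np.tanh(W @ x)

dim_in, dim_latent = 16, 4
W_ctx = rng.normal(size=(dim_latent, dim_in))       # context encoder weights
W_tgt = rng.normal(size=(dim_latent, dim_in))       # target encoder weights
W_pred = rng.normal(size=(dim_latent, dim_latent))  # predictor weights

context = rng.normal(size=dim_in)  # stands in for frames seen so far
target = rng.normal(size=dim_in)   # stands in for a future frame

z_ctx = encoder(context, W_ctx)
z_tgt = encoder(target, W_tgt)
z_pred = W_pred @ z_ctx  # predict the target's *latent*, not its pixels

# Loss lives in the low-dimensional latent space, where irrelevant
# pixel-level detail has already been discarded.
loss = float(np.mean((z_pred - z_tgt) ** 2))
print(loss)
```

Because the loss is computed on 4-dimensional embeddings rather than 16-dimensional inputs, the model is never penalized for failing to reproduce detail it was built to ignore.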

This architecture is computationally and memory efficient. By ignoring irrelevant information, it needs fewer training examples and runs with very low latency. These properties make it well suited to applications where efficiency and real-time responsiveness are non-negotiable, such as robotics, self-driving cars, and high-value business workflows.

For example, AMI has partnered with healthcare company Nabla to use this architecture to simulate operational complexity and reduce mental workload in fast-moving healthcare settings.

In an interview with Newsweek, Yann LeCun, a pioneer of the JEPA architecture and founder of AMI, explained that JEPA-based world models are designed to be “controllable in the sense that you can give them goals, and by construction, the only thing they can do is accomplish those goals.”

Gaussian splats: built for space

The second approach relies on generative models to create complete 3D environments from scratch. Adopted by companies such as World Labs, this method takes an initial input (either an image or a text description) and uses a generative model to create a 3D Gaussian splat. Gaussian splatting is a technique for representing 3D scenes with millions of tiny mathematical particles that define geometry and light. Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines, such as Unreal Engine, where users and other AI agents can freely navigate and interact with them from any angle.
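To make the “millions of tiny mathematical particles” concrete, here is a minimal sketch of what one such particle carries: a position, an anisotropic shape (covariance), a color, and an opacity. This is a simplified illustration (the `Splat` class is hypothetical, and real renderers project and blend millions of these per frame); it just evaluates one splat’s Gaussian falloff at a query point.

```python
import numpy as np

class Splat:
    """One Gaussian particle of a splatted 3D scene (toy version)."""

    def __init__(self, position, scales, color, opacity):
        self.position = np.asarray(position, dtype=float)
        # Axis-aligned covariance for simplicity; real splats also rotate.
        self.cov = np.diag(np.asarray(scales, dtype=float) ** 2)
        self.color = color      # RGB contribution of this particle
        self.opacity = opacity  # blending weight during rendering

    def density(self, point):
        """Unnormalized Gaussian falloff of this splat at a 3D point."""
        d = np.asarray(point, dtype=float) - self.position
        return float(np.exp(-0.5 * d @ np.linalg.inv(self.cov) @ d))

s = Splat(position=[0, 0, 0], scales=[1, 1, 1], color=(255, 0, 0), opacity=0.8)
print(s.density([0, 0, 0]))  # 1.0 at the center, decaying with distance
```

A scene is just a large collection of such particles; because each one is an explicit geometric primitive rather than a frame of video, the whole set can be handed to a conventional 3D engine.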

The main benefit here is the significant reduction in the time and one-off production costs required to create interactive 3D environments. It directly addresses the problem identified by World Labs founder Fei-Fei Li, who observed that LLMs are ultimately like “wordsmiths in the dark,” eloquent with language but lacking spatial intelligence and physical knowledge. World Labs’ Marble model aims to give AI that missing spatial understanding.

Although this method is not designed for split-second, real-time use, it holds great promise for spatial computing, interactive entertainment, industrial design, and building static training environments for robots. The business value is evident in Autodesk’s backing of World Labs to integrate these models into its industrial design software.

End-to-end generation: built to scale

The third approach uses an end-to-end generative model to process user commands and actions, continuously generating the space, its behavior, and its responses on the fly. Rather than exporting a static 3D file to an external physics engine, the model itself acts as the engine: it ingests the user’s ongoing actions and generates the next frames of the environment in real time, computing physics, lighting, and object reactions natively.

DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category. These models provide a simple interface for generating unlimited interactive experiences and large volumes of synthetic data. DeepMind demonstrated this capability with Genie 3, showing how the model maintains object permanence and consistent physics at 24 frames per second without relying on a separate memory module.
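The loop these models run can be sketched abstractly (a toy stand-in, not Genie 3 or Cosmos): each step consumes the user’s latest action and emits the next “frame” of the world, carrying state forward autoregressively. The function `generate_next_frame` and its simple position/velocity state are hypothetical illustrations.

```python
def generate_next_frame(state, action):
    """Hypothetical stand-in for a learned generative world model.

    Here 'state' is just an object's position and velocity; a real
    model would emit pixels, computing physics, lighting, and object
    reactions natively at every step."""
    pos, vel = state
    vel += action  # the user's action nudges the object's velocity
    pos += vel     # simple integration stands in for learned dynamics
    return (pos, vel)

# The model IS the engine: feed it a stream of actions and it keeps
# producing the next state of the world, one frame at a time.
state = (0.0, 0.0)
for action in [1.0, 0.0, -0.5]:
    state = generate_next_frame(state, action)
print(state)
```

Contrast this with the Gaussian-splat approach: there the output is a static 3D asset handed to an external engine, while here every frame is generated fresh, which is what makes the method both flexible and computationally expensive.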

This approach translates directly to data-heavy manufacturing industries. Nvidia Cosmos uses this architecture to scale synthetic data generation and physical AI reasoning, allowing developers of autonomous vehicles and robots to simulate rare, dangerous situations without the cost or risk of physical testing. Waymo, an Alphabet subsidiary, built its world model on top of Genie 3, adapting it to train its self-driving cars.

The drawback of this end-to-end approach is the high computational cost of rendering physics and pixels simultaneously. Still, that investment serves the vision laid out by Hassabis, who argues that a deep, internal understanding of physical causality is necessary because current AI lacks critical skills to operate safely in the real world.

Next: hybrid architectures

LLMs will continue to serve as an interface for reasoning and communication, but world models are positioned as the underlying infrastructure for physical and spatial data pipelines. As these foundational models mature, expect to see hybrid architectures emerge that draw on the strengths of each approach.

For example, cybersecurity startup DeepTempo recently developed LogLM, a model that combines features from LLMs and JEPA to detect anomalies and cyber threats from security and network logs.
