Despite its powerful capabilities, OpenAI’s Sora still struggles to simulate complex physical phenomena, understand certain causal relationships, handle spatial details, and accurately depict events that unfold over time.
In videos generated by Sora, the overall picture is highly coherent, with excellent image quality, detail, lighting, and color. On closer examination, however, the characters’ legs are sometimes distorted, and their movement can be out of step with the rest of the scene.
In another generated video, the number of dogs keeps increasing. Although the transitions are very smooth, the result drifts away from what the prompt originally asked for.
(1) Inaccurate Simulation of Physical Interactions:
Sora is not yet accurate at simulating basic physical interactions, such as glass shattering. This may be because the training data contains too few examples of such physical events, or because the model cannot fully learn the underlying principles of these complex physical processes.
(2) Incorrect Representation of Object State Changes:
When simulating interactions that significantly change an object’s state, such as eating food, Sora does not always reflect the change accurately. This suggests limitations in the model’s ability to understand and predict how object states evolve.
(3) Inconsistency in Long Video Samples:
When generating long-duration video samples, Sora may produce inconsistent plots or details, possibly due to difficulties in maintaining contextual consistency over long time spans.
(4) Sudden Appearance of Objects:
Objects may appear inexplicably in the video, indicating that the model still needs improvement in understanding spatial and temporal continuity.
Introducing the concept of the “world model”
What is a world model? Let me give you an example.
In your “memory,” you know how heavy a cup of coffee is. So when you want to pick one up, your brain accurately “predicts” how much force to use, and the cup rises smoothly. You don’t even notice it happening. But what if the cup happens to be empty? You apply a large force to a very light cup, and your hand instantly senses that something is wrong. Your “memory” then adds a new rule: cups can also be empty. The next time you “predict,” you won’t get it wrong. The more you do, the more complex the world models your brain builds, and the more accurately it predicts how the world will react. This is how humans interact with the world: through world models.
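To make that loop concrete, here is a minimal Python sketch of the “predict, sense the error, update memory” cycle. Everything in it (the WorldModel class, the weights, the learning rate) is an illustrative assumption for this coffee-cup example, not Sora’s actual mechanism.

```python
class WorldModel:
    """Stores a belief about the world and refines it from prediction errors."""

    def __init__(self, expected_weight: float):
        # "Memory": how heavy we believe a cup of coffee is, in kg (assumed value).
        self.expected_weight = expected_weight

    def predict_force(self) -> float:
        # Predict the lifting force from memory (weight in kg -> force in N).
        return self.expected_weight * 9.8

    def update(self, actual_weight: float, learning_rate: float = 0.5) -> None:
        # The hand "senses something is wrong": nudge memory toward reality.
        error = actual_weight - self.expected_weight
        self.expected_weight += learning_rate * error


model = WorldModel(expected_weight=0.35)  # remembers a full cup: ~0.35 kg

# A full cup, then an unexpectedly empty one encountered twice.
for actual in [0.35, 0.05, 0.05]:
    predicted = model.predict_force()
    needed = actual * 9.8
    print(f"predicted force {predicted:.2f} N, actually needed {needed:.2f} N")
    model.update(actual)  # memory adds a new rule: cups can also be empty
```

On the first empty cup the prediction badly overshoots; after one update the belief shifts toward the new evidence, so the next prediction is much closer. That is the whole pattern: remember first, then predict, then correct.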
Sora’s videos don’t get this right every time either; recall the bite that leaves no mark on the food. It still “sometimes” makes mistakes. But what it can do is already remarkable, even a little frightening, because “remember first, then predict” is exactly how humans understand the world. This mode of thinking is called a world model.
Sora’s technical report contains this sentence:
Our results suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world.
In other words, OpenAI’s ultimate goal is not just a “text-to-video” tool, but a general-purpose “physical world simulator”: a model of the real world itself.