AI video generators can’t understand the laws of physics solely by watching videos, scientists have found.
Coming hot on the heels of chatbots and image generators, AI video generators like Sora and Runway have already been delivering impressive results. But a team of scientists from ByteDance Research, Tsinghua University, and Technion wanted to know whether such models could discover physical laws from visual data alone, without any additional human input.
In the real world, we describe physics through math. In video generation, an AI model that understands physics should be able to watch a sequence of frames and predict which ones come next, both for scenes it has seen before and for unfamiliar ones.
To find out whether this understanding exists, the scientists built a 2D simulation using simple shapes and movements and generated hundreds of thousands of mini videos to train and test their model on. They found that the models could 'mimic' physics but not understand it.
The three fundamental physical laws for simulation they chose to study were the uniform linear motion of a ball, the perfectly elastic collision between two balls, and the parabolic motion of a ball.
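For readers who want the three laws in concrete form, here is a minimal Python sketch of the textbook equations behind them. It is not the researchers' simulator: the function names, the one-dimensional collision, and the default gravity value are assumptions made for illustration.

```python
# Illustrative textbook formulas only; names and defaults are assumptions,
# not taken from the researchers' code.

def uniform_motion(x0, v, t):
    """Position of a ball moving at constant velocity v after time t."""
    return x0 + v * t

def elastic_collision_1d(m1, v1, m2, v2):
    """Velocities of two balls after a perfectly elastic head-on collision.

    Derived from conservation of momentum and kinetic energy.
    """
    total = m1 + m2
    v1_after = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    v2_after = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return v1_after, v2_after

def parabolic_motion(x0, y0, vx, vy, t, g=9.81):
    """Position of a ball under constant gravity g, ignoring air resistance."""
    return x0 + vx * t, y0 + vy * t - 0.5 * g * t * t
```

A model that had truly learned these rules would apply them to any ball size or speed, not just the combinations it saw in training.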
According to the team's preprint paper, the shapes behaved correctly in scenarios resembling the training data but failed to behave properly in new, unforeseen scenarios. At best, the models tried to mimic the closest training example they could find.
During their experiments, the scientists also observed that the video generator often changed one shape into another (e.g. a square randomly turns into a ball) or made other nonsensical adjustments. The model's priorities appeared to follow a clear hierarchy, with color holding the highest importance, followed by size, and then velocity. Shape received the least emphasis.
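The difference between applying a rule and copying the nearest training example can be shown with a toy Python sketch. It is a deliberately simplified illustration, not the researchers' model or evaluation. The training velocities and both function names are hypothetical.

```python
# Toy contrast between rule-based and example-based prediction.
# All names and numbers here are hypothetical, chosen for illustration.

def rule_prediction(x0, v, t):
    """Apply the law of uniform motion directly."""
    return x0 + v * t

def nearest_example_prediction(train_velocities, x0, v, t):
    """Imitate the training example whose velocity is closest to v."""
    nearest_v = min(train_velocities, key=lambda tv: abs(tv - v))
    return x0 + nearest_v * t

train_velocities = [1.0, 1.5, 2.0]

# A velocity seen in training: both approaches agree.
seen_rule = rule_prediction(0.0, 1.5, 3.0)
seen_mimic = nearest_example_prediction(train_velocities, 0.0, 1.5, 3.0)

# A velocity well outside the training range: the mimic falls back on
# the fastest example it knows and undershoots the true position.
unseen_rule = rule_prediction(0.0, 4.0, 3.0)
unseen_mimic = nearest_example_prediction(train_velocities, 0.0, 4.0, 3.0)
```

In this sketch the two predictors match on the familiar velocity and diverge on the unfamiliar one, which is the kind of gap the researchers looked for when testing on unseen scenarios.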
Have they found a solution?
“It is challenging to determine whether a video model has learned a law instead of merely memorizing the data,” the researchers said. They explained that since the model’s internal knowledge is inaccessible, they could only infer the model’s understanding by examining its predictions on unseen scenarios.
“Our in-depth analysis suggests that video model generalization relies more on referencing similar training examples rather than learning universal rules,” they said, adding that this happens regardless of the amount of data a model trains on.
Have they found a solution? Not yet, lead author Bingyi Kang wrote on X. “Actually, this is probably the mission of the whole AI community,” he added.