Artificial intelligence company OpenAI announced the launch of Sora, a large model for text-to-video generation.
In the early hours of February 16, Beijing time, OpenAI, a global leader in large artificial intelligence models, unveiled Sora, a model that generates short videos directly from text prompts. Sora’s debut has stunned the tech world: compared with earlier generative-AI animation, its sample videos show a whole new level of visual quality, with lighting effects and fine details striking enough that Sora has been dubbed the “AI version of Ma Liang’s magic brush.”
While Sora’s capabilities mark a significant advance, the picture may be less straightforward than it first appears. Several industry experts told Southern Plus reporters that the current version of Sora is still a work in progress and shows some “unreliable” behavior, though there is no doubt that the pace of AI development will keep accelerating.
From Text to Image
“A significant step for AI, but not yet a breakthrough.”
Hu Guoqing, head of the 5G project group at the Beijing-Shenzhen Research Institute and director of the Guangdong Provincial Frontier Technology Research Institute, believes that, judging from the sample clips officially released so far, Sora’s ability to generate footage directly from text prompts can indeed achieve a degree of realism. For artists, filmmakers, and others involved in video production this is a significant advance, and compared with earlier AI products it is undoubtedly a big leap forward.
That said, generating short video from text is something other models can already do, although they are typically limited to clips of just a few seconds, whereas Sora can produce videos up to 60 seconds long.
“It’s premature to call this a breakthrough. Judging from the videos OpenAI has released, the approach of generating images frame by frame from text prompts and then stitching them together into a video offers a useful reference for how other models can train from text to images,” Hu said.
At the same time, some of Sora’s “weaknesses” have been officially acknowledged. Yao Jun, a specialist engineer in Tencent’s Machine Learning Platform department, explained that because the model does not rely on an internal physical simulation engine, its generated videos often look “unreliable,” with motion that contradicts the laws of real-world physics. This is a problem inherent in the current large-scale, data-driven approach.
Yao believes the application scenarios for Sora are still relatively limited for now. “From a theoretical perspective, these models do not have a world model, a genuine core framework of knowledge. They rely only on the ‘law of large numbers’ reflected in the data, which overlaps with the real world to some extent but falls far short of the threshold of a ‘world model.’”
Could the Timeline to AGI Shrink to a Year?
“Be cautious, but the timeline may shorten significantly.”
Responding to the discussion around Sora, Zhou Hongyi, founder of 360, shared his views on social media, going so far as to suggest that Sora’s arrival means the timeline for realizing AGI (Artificial General Intelligence) may shrink from 10 years to one or two.
On Sora’s biggest advantage, Zhou Hongyi said that earlier text-to-video software only manipulated graphic elements on a 2D plane, treating video as a combination of multiple images and lacking any real understanding of the world. In Sora’s videos, by contrast, the model seems to understand, as humans do, that a tank has enormous impact force and can destroy a car, and it does not show a car destroying a tank. “Once artificial intelligence is connected to cameras and has watched every movie and every video on YouTube and TikTok, its understanding of the world will far exceed what it can learn from text. A picture is worth a thousand words, and video conveys far more information than a picture, so AGI is not far away. It’s not a matter of 10 or 20 years; it may be achieved within one or two.”
However, industry experts told Southern Plus reporters that while they expect the pace of AI development to accelerate, whether AGI can be achieved within a year should still be viewed cautiously. Hu Guoqing noted that although OpenAI’s president has spoken of focusing on AGI development in 2024, whether it can be achieved within a year remains unknown. “After Sora’s release, I believe companies like Google will quickly follow suit, and various companies can be expected to launch similar beta versions this year. The more these companies compete, the faster the field will mature.”
As for when the public will be able to use such products at scale, Yao Jun told reporters, “It should be soon.” Yao said that, in the absence of a formal paper and going only on his impressions, Sora appears to have absorbed a great deal of experience from large language models and image-generation models. It has partially eased the constraints on training data and reportedly used video data generated by game engines. In addition, the model is rumored not to be especially large, so its advances are likely to be put into application quickly.
One thing is certain, though: this kind of model will only be optimized faster and faster. Just as happened when text-to-image technology first appeared, with upgrades arriving every quarter, there will be significant changes every year.
Will the Film and Television Industry Be Impacted by AI?
Industry insider: AI generation is low-cost, but “somewhat fake”
Given Sora’s visual capabilities, people cannot help wondering whether AI will disrupt the film and television industry. Reporters contacted the head of a Beijing film and television company, who asked to be identified by the pseudonym Xin Yi. She said that while the image quality and content Sora delivers are stunning, she is not optimistic about its direct involvement in film and television production.
“Purely in terms of image quality, most of the videos Sora has shown are impressive in their clarity and visual detail. But compared with today’s mainstream film and television productions, there is still a considerable gap,” Xin Yi said.