At the start of the Year of the Dragon, hot on the heels of ChatGPT, OpenAI released its latest artificial intelligence (AI) model: Sora, a text-to-video program. From simple text descriptions, the tool can generate realistic and imaginative videos reminiscent of Hollywood films.
An article published on February 17 on the website of the British magazine “New Scientist” noted that Sora’s arrival may be “both loved and feared.” Many scientists welcome it, believing it will further advance the technology. Others, however, worry that malicious actors could use Sora to create deepfake videos, accelerating the spread of misinformation and disinformation.
Two technologies give rise to powerful capabilities
The name Sora, Japanese for “sky,” was chosen by the team behind it because it “evokes infinite creative potential.” The system is the latest example of generative AI, which can create text, images, and sound in real time.
At present, Sora can work from text prompts alone, or from text combined with images, to create videos of up to 60 seconds in a single continuous shot. In one demonstration, a video was generated from the following description: a fashionable woman strolls along a Tokyo street lined with city signage, neon lights flashing on either side and casting a warm glow. Sora not only renders details accurately but also gives its characters rich emotional expression.
The OpenAI website currently hosts 48 sample videos, including a dog frolicking in the snow, vehicles driving down a road, and more fantastical scenes such as sharks swimming between city skyscrapers. Some experts believe Sora outperforms similar models, marking a major leap forward in text-to-video technology.
To achieve a higher level of realism, Sora combines two different AI technologies. The first is akin to the diffusion models behind AI image generators such as DALL-E, which learn to transform random pixels into coherent images. The second is the transformer architecture, which processes sequential data in context; large language models, for example, use transformers to assemble scattered words into intelligible sentences. OpenAI decomposes video clips into visual “spacetime patches,” which Sora’s transformer architecture then processes.
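To make this combination concrete, here is a minimal, purely illustrative PyTorch sketch of the idea just described: frames are carved into fixed-size “spacetime patches,” a small transformer attends across all patches (across space and time at once), and a toy reverse-diffusion loop iteratively refines random noise. All names, shapes, and the fixed step size are assumptions for illustration; Sora’s actual architecture and noise schedule have not been published at this level of detail.

```python
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    """Tiny transformer that predicts the noise present in each spacetime patch."""
    def __init__(self, patch_dim, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.noise_head = nn.Linear(patch_dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_spacetime_patches, patch_dim)
        h = self.transformer(patches)  # self-attention spans space AND time
        return self.noise_head(h)      # per-patch noise estimate

def video_to_patches(video, patch=16):
    # video: (batch, frames, channels, height, width)
    b, t, c, h, w = video.shape
    # Carve every frame into patch x patch tiles; each tile at each
    # timestep becomes one flattened "spacetime patch" token.
    tiles = video.unfold(3, patch, patch).unfold(4, patch, patch)
    tiles = tiles.permute(0, 1, 3, 4, 2, 5, 6)
    return tiles.reshape(b, -1, c * patch * patch)

# Toy reverse-diffusion loop: start from pure noise and repeatedly
# subtract a fraction of the model's noise estimate. (Real samplers
# follow carefully derived noise schedules, not a fixed step size.)
model = PatchDenoiser(patch_dim=3 * 16 * 16)
video = torch.randn(1, 4, 3, 32, 32)   # 4 frames of 32x32 RGB noise
patches = video_to_patches(video)       # -> (1, 16 patches, 768)
with torch.no_grad():
    for _ in range(10):
        patches = patches - 0.1 * model(patches)
```

The key design point this sketch captures is that, because every patch from every frame sits in one token sequence, the transformer can relate what happens in one corner of one frame to any other place and moment in the clip.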
Jim Fan (Fan Linxi), a senior research scientist at NVIDIA, called Sora a “data-driven physics engine” capable of simulating the real world.
There is still much room for improvement
Although the videos generated by Sora are impressive, they are not without flaws.
OpenAI admits that the current Sora model has weaknesses. It may struggle to accurately simulate the physics of complex scenes, and it may not understand cause and effect. For example, the system generated a video of a person eating a cookie in which the cookie never got smaller, no matter how many bites were taken, and the bitten cookie miraculously bore no bite marks. The model may also confuse the spatial details of a text prompt and have difficulty accurately depicting events that unfold over time.
Arvind Narayanan of Princeton University pointed out that videos generated by Sora still exhibit odd glitches when depicting complex scenes with a lot of action.
May make it difficult to distinguish between truth and falsehood
Besides speeding up the work of experienced filmmakers, Sora could also be used to produce online disinformation quickly and cheaply, making it even harder for people to tell what is real.
OpenAI has not yet released the system to the public because it is still trying to understand Sora’s dangers. Instead, the company is sharing the technology with a small group of academics and other outside researchers, hoping to draw on their expertise to identify ways the system could be misused.
In OpenAI’s “red team” exercises for Sora, experts attempt to defeat the AI model’s safeguards in order to assess its potential for misuse. An OpenAI spokesperson said those currently testing Sora are “experts in areas such as misinformation, hate speech, and bias.”
Such testing is crucial because malicious actors could use Sora to generate fake videos to harass people or even sway political elections. Experts across academia, industry, and government worry that AI-generated “deepfake” content could lead to the widespread dissemination of false and misleading information.
Hany Farid of the University of California, Berkeley, argues that, as with other generative AI technologies, there is good reason to expect text-to-video technology to keep improving. Once Sora is combined with AI-driven voice cloning, it will hand malicious actors a new tool for creating realistic deepfakes, making it ever harder for people to distinguish truth from falsehood.
OpenAI adds watermarks to videos generated by the system to flag them as AI-generated, but the company acknowledges that these watermarks can be removed and may be difficult to detect.
An OpenAI spokesperson emphasized that the company is taking several important safety measures before incorporating Sora into OpenAI’s products. For example, it has implemented automated processes designed to prevent its commercial AI models from generating disinformation targeting politicians and celebrities.