TLDR:
OpenAI just showed off the capabilities of a new text-to-video tool that can generate videos of up to 60 seconds, and it is incredible: easily 10x better than what we had before.
Long Story:
As predicted, this year will be all about text-to-video. We've learned how to make high-quality pictures (Midjourney is just on another level), and now we're expanding our capabilities and pushing our hardware to the next level.
This week, OpenAI presented their fresh-out-of-the-oven tool called Sora, and frankly it's just mind-blowing 🤯. Why? Simply because the quality of the video it produces is unbelievable. The shapes, the lines, the motion, the fingers, those still go off the rails sometimes, but other than that it's really good.
It is currently in a closed alpha, where only some influencers have access to it. If you spread the word about me, I might be one of them at some point in life 🤞.
Anyways, the tech is still the same: take a text, make a picture, then generate another (60 seconds × 60 pictures per second) 3,600 pictures, put them one after another, and voilà, you have a video. The thing is, though, it's sooo much smoother than what Pika does (check this post for context). It has a story line, and it actually looks like a video, not just a weird cartoon. Either there are some clever algorithms behind the scenes, or AI has a budget of $7 trillion and can afford more computing power – regardless, it makes a massive difference.
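For the curious, here's roughly what that naive mental model looks like in code. This is just a sketch of the "generate 3,600 pictures and stitch them together" idea, not how Sora actually works internally; generate_frame is a hypothetical stand-in for whatever model produces each image.

```python
# A minimal sketch of the naive "text -> frames -> video" idea above.
# generate_frame() is a hypothetical placeholder, NOT OpenAI's actual API;
# real systems like Sora don't literally render one frame at a time.
import numpy as np
import imageio.v2 as imageio  # pip install imageio imageio-ffmpeg

DURATION_S = 60                  # clip length in seconds
FPS = 60                         # frames per second assumed in the post
TOTAL_FRAMES = DURATION_S * FPS  # 60 * 60 = 3600 pictures

def generate_frame(prompt: str, t: int) -> np.ndarray:
    """Hypothetical stand-in: returns a blank 200x200 RGB frame."""
    return np.zeros((200, 200, 3), dtype=np.uint8)

writer = imageio.get_writer("clip.mp4", fps=FPS)
for t in range(TOTAL_FRAMES):
    writer.append_data(generate_frame("a wolf howling at the moon", t))
writer.close()
```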
Based on the numbers alone, we can assume that video generation is going to be expensive and slow. I believe the workflow will revert to what we had about 50 years ago: first we make a 200x200px video and check it in low res to make sure we've got the right result, and then we upscale it like crazy to 4K.
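To make that concrete, here's a tiny sketch of the "check in low res, then upscale" step, using plain image resizing as a stand-in for a real AI upscaler; the upscale_frame helper and the file names are made up for illustration.

```python
# A rough sketch of the "preview small, then upscale" workflow above.
# Plain LANCZOS resizing stands in for a real learned super-resolution model.
from PIL import Image  # pip install pillow

def upscale_frame(frame_path: str, target=(3840, 2160)) -> Image.Image:
    """Blow a low-res preview frame up to 4K (hypothetical helper)."""
    small = Image.open(frame_path)  # e.g. a 200x200 preview frame
    return small.resize(target, Image.LANCZOS)

# Once the 200x200 preview looks right, upscale every frame to 4K.
upscale_frame("preview_frame_0001.png").save("final_frame_0001.png")
```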
Check out the demos from OpenAI's website.
Prompt: A beautiful silhouette animation shows a wolf howling at the moon, feeling lonely, until it finds its pack.
Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in snow.
Well, aren't they cute? Awww 🥰.
It still has its flaws: the physics are often all over the place, it sometimes lacks context, and it doesn't really know how to interact with objects. For example, if somebody bites a cookie in a generated video, the bite mark won't show up; the cookie will still look brand new. Just take a look at the examples below:
Prompt: Five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing.
Prompt: Basketball through hoop then explodes.
You can clearly see that it has problems with physics and with the way it interacts with objects. It's still essentially imitating existing videos rather than modelling reality.
It is clearly flawed, but this is just the beginning. It's only 1 year old. Just think what it's going to be able to do by the age of 5.