Details
Sora is finally out. We've been teased with it, hyped about it, drip-fed information about it, and now it's here. Anybody can try it for a price, and anybody can become a cinematographer; all you need is access to the internet.
Now, one thing that makes Sora stand out from the competition is that OpenAI went ahead and built a completely new UI for this product. The chat box is replaced with a storyboard, where you can create individual clips and connect them together into a timeline for your video.
As you can see from the picture above, the basic interface lets you arrange clips along a track; each clip has an input box where you can add some text, and together they form a continuous timeline.
For each box you can also add a base picture or video so the AI understands the starting point you have in mind. If you simply visit their homepage, you'll already see a lot of real-life examples generated by users.
I believe this is one of the slickest UIs we currently have in the industry for video generation. Note that there is no sound in the videos, so audio would seem like the next big thing for Sora to add at some point. That said, OpenAI has shown virtually no interest in doing this, yet if I were OpenAI, that's exactly the next thing I would do to make this tool complete.
Now, back to the UI: in the bottom bar we have a bunch of controls that allow us to experiment.
In the first input you can select the video aspect ratio, which comes in three options: horizontal, square, and vertical.
Next we can select the resolution; by default it's 480p, which isn't great but is good enough to tell whether this is the direction you want. Note that higher resolutions take considerably more time to generate.
You can also select the duration, the number of variations, and some Instagram-like presets that will make your video look better. Lastly, there's a helper that shows the cost incurred to generate the current video.
Here are a couple of generated examples:
Prompt: A realistic full red feathered owl stands majestically in the center of the frame, its vibrant red feathers glistening under the soft lighting. The owl's powerful muscles are visible as it stands with its closed open. The background is a subtle gradient to keep the focus on the owl.
Scene 1: A panoramic view of the mountains in the Philippines as a powerful typhoon rages. Dark, menacing clouds swirl above, casting a shadow over the lush, green peaks. Heavy rain pours down in torrents, creating waterfalls that cascade down the mountainsides. The wind howls through the valleys, bending the trees and whipping the vegetation fiercely. Occasional flashes of lightning illuminate the scene, adding to the dramatic atmosphere. The landscape is rugged and wild, emphasizing the raw power of nature in full force.
Scene 2: The rain intensifies, and the wind grows stronger, with trees bending further under the pressure.
Prompt: A man in a sleek black suit holding a bouquet walks along a cobblestone path at sunrise. The camera follows his silhouette closely from behind, capturing the warm golden light of the early morning with cinematic softness.
Now, the videos look really impressive, but you can see that the AI sometimes mixes up the depth and continuity of the movement. In the last example especially, you can clearly tell that something's off without quite being able to pinpoint it; pause at the right frame, though, and you can see that the legs get mixed up and somehow magically switch positions.
That leads us to the main weakness the AI still has: it doesn't perceive physics or depth. It generates images shallowly, by resemblance, without any real understanding behind them. To be fair, this is really hard to fix, because it would require building a simulation first rather than simply generating videos from images. In the end, what we're trying to achieve is a simulation of the real world, wrapped into a specific scenario.
Let's say we want a video of "an apple falling onto the ground". The AI will do what it has done so far: it learned unconsciously, by looking at millions of examples of what happens to an apple (or any object) when it falls, and it will reproduce the same action as a video. But if we want to add something out of the ordinary, the AI doesn't really know how to handle it, because it only recreates things it has seen instead of reasoning about what gravity or other external factors would do to the apple. To address that, we need to make the AI learn consciously. Instead of learning like a baby, simply absorbing information and replicating it, we need it to become a teenager that learns things in order to understand how they work.
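To make that distinction concrete, here's a minimal sketch in plain Python (an illustration of what "simulating" means, nothing to do with Sora's actual internals): the apple's trajectory is derived from an explicit model of gravity, so changing the conditions, like moving the scene to the Moon, changes the outcome for a reason rather than by pattern-matching footage.

```python
# Illustrative only (not Sora's method): an explicit physics model of a
# falling apple. The trajectory follows from gravity, so unusual scenarios
# (lower gravity, a sideways throw) stay consistent for free, instead of
# being guessed from lookalike training videos.

def simulate_fall(height_m: float, gravity: float = 9.81,
                  vx: float = 0.0, dt: float = 0.01):
    """Return (time, x, y) samples of an apple dropped from `height_m`."""
    t, x, y, vy = 0.0, 0.0, height_m, 0.0
    frames = [(t, x, y)]
    while y > 0.0:
        vy -= gravity * dt      # gravity accelerates the apple downward
        x += vx * dt            # optional horizontal throw
        y += vy * dt
        t += dt
        frames.append((t, x, max(y, 0.0)))
    return frames

# Same code, different world: Earth vs. Moon gravity.
earth = simulate_fall(height_m=2.0)
moon = simulate_fall(height_m=2.0, gravity=1.62)
print(f"Earth: hits the ground after {earth[-1][0]:.2f}s")
print(f"Moon:  hits the ground after {moon[-1][0]:.2f}s")
```

A video model trained purely on appearances has no such knob to turn; it can only imitate falls it has already seen.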
Now, this is really hard to do, as it essentially requires Artificial Consciousness. The alternative, which sounds more credible to me, is to give the AI a playground of simulations, where it can act out actions and their consequences in 3D worlds. I think Nvidia will have an enormous role here, because 3D worlds require tons of mathematical computation for the physics engines to work properly, and Nvidia currently produces the best hardware for that.
But I'm no genius, so there's probably a clever person out there who will figure out the smartest, most optimal way to achieve this.
To conclude, I want to say that where we are today is miles ahead of where we were a year ago. I remember that just around Christmas last year I was writing about Pika 1.0, which was already super cool; today, though, the results are simply incredible. Despite the drawbacks, the current version already lets you achieve impressive results, and you can already be a movie producer!
What a time to be alive! Welcome to the future.
P.S.: There are no Google examples, because Google didn't show anything; it just announced it. That's it. Thanks for that, Google.