Artificial Intelligence (AI), 17 March 2024

VideoPoet: A large language model for zero-shot video generation

Posted by Dan Kondratyuk and David Ross, Software Engineers, Google Research

A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is in the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.

To explore the application of language models in video generation, we introduce VideoPoet (website, research paper), a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video generation models are almost exclusively diffusion-based (for one example, see Imagen Video). On the other hand, LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities, including language, code, and audio (e.g., AudioPaLM). In contrast to alternative models in this space, our approach seamlessly integrates many video generation capabilities within a single LLM, rather than relying on separately trained components that specialize in each task.

Overview

The diagram below illustrates VideoPoet’s capabilities. Input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in a video representing the depth and optical flow, which represent the motion, and paints contents on top to produce the text-guided style.

An overview of VideoPoet, capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

Language models as video generators

One key advantage of using LLMs for training is that one can reuse many of the scalable efficiency improvements that have been introduced in existing LLM training infrastructure. However, LLMs operate on discrete tokens, which can make video generation challenging. Fortunately, there exist video and audio tokenizers, which serve to encode video and audio clips as sequences of discrete tokens (i.e., integer indices), and which can also be converted back into the original representation.

VideoPoet trains an autoregressive language model to learn across video, image, audio, and text modalities through the use of multiple tokenizers (MAGVIT V2 for video and image and SoundStream for audio). Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.
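The encode/decode round trip described above can be illustrated with a toy quantizer. This is only a minimal sketch: MAGVIT V2 and SoundStream are learned neural tokenizers, whereas the hand-written `CODEBOOK`, `encode`, and `decode` below are hypothetical names invented for this example. It shows how continuous values become integer indices and how those indices map back to an approximate reconstruction.

```python
# Toy illustration only: real tokenizers such as MAGVIT V2 and SoundStream
# are learned neural networks. This hand-written codebook is hypothetical.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]


def encode(samples):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
        for x in samples
    ]


def decode(tokens):
    """Map integer tokens back to codebook values (a lossy reconstruction)."""
    return [CODEBOOK[t] for t in tokens]


clip = [0.1, 0.3, 0.52, 0.9]      # stand-in for continuous signal values
tokens = encode(clip)             # discrete tokens (integer indices)
reconstruction = decode(tokens)   # approximate version of the original clip
```

In this toy version the reconstruction only approximates the input, because each value is snapped to its nearest codebook entry.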

A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks. Modalities are converted to and from tokens using tokenizer encoders and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.
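The sequence layout in the caption above (a task token plus boundary tokens around each modality) might be assembled along the following lines. The special-token spellings such as `<task:...>`, `<bo_...>`, and `<eo_...>`, and the helper names `wrap` and `build_sequence`, are assumptions made for this sketch; the actual vocabulary is not given here.

```python
# Hypothetical special-token names, chosen only for this illustration.
def wrap(modality, tokens):
    """Surround one modality's tokens with begin/end boundary tokens."""
    return [f"<bo_{modality}>", *tokens, f"<eo_{modality}>"]


def build_sequence(task, segments):
    """Prefix a task token, then append each boundary-wrapped segment."""
    sequence = [f"<task:{task}>"]
    for modality, tokens in segments:
        sequence.extend(wrap(modality, tokens))
    return sequence


# Conditioning context for a text-to-video request. The model would then
# continue this sequence with video tokens for the tokenizer decoder.
prompt = build_sequence(
    "text_to_video",
    [("text", ["two", "pandas", "playing", "cards"])],
)
```

Because every task shares this one sequence format, a single model can in principle switch tasks by changing only the task token and the conditioning segments.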

Examples generated by VideoPoet

Some examples generated by our model are shown below.

Videos generated by VideoPoet from various text prompts. For specific text prompts refer to the website.

For text-to-video, video outputs are variable length and can apply a range of motions and styles depending on the text content. To ensure responsible practices, we reference artworks and styles in the public domain, e.g., Van Gogh’s “Starry Night”.

Text inputs (one generated video per prompt):
“A Raccoon dancing in Times Square”
“A horse galloping through Van-Gogh’s ‘Starry Night’”
“Two pandas playing cards”
“A large blob of exploding splashing rainbow paint, with an apple emerging, 8k”

For image-to-video, VideoPoet can take the input image and animate it with a prompt.

An example of image-to-video with text prompts to guide the motion. Each video is paired with an image to its left. Left: “A ship navigating the rough seas, thunderstorm and lightning, animated oil on canvas”. Middle: “Flying through a nebula with many twinkling stars”. Right: “A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day”. Reference: Wikimedia Commons, public domain**.

For video stylization, we predict the optical flow and depth information before feeding into VideoPoet with some additional input text.

Examples of video stylization on top of VideoPoet text-to-video generated videos with text prompts, depth, and optical flow used as conditioning. The left video in each pair is the input video, the right is the stylized output. Left: “Wombat wearing sunglasses holding a beach ball on a sunny beach.” Middle: “Teddy bears ice skating on a crystal clear frozen lake.” Right: “A metal lion roaring in the light of a forge.”

VideoPoet is also capable of generating audio. Here we first generate 2-second clips from the model and then try to predict the audio without any text guidance. This enables generation of video and audio from a single model.

An example of video-to-audio, generating audio from a video example without any text input.

By default, the VideoPoet model generates videos in portrait orientation to tailor its output towards short-form content. To showcase its capabilities, we have produced a brief movie composed of many short clips generated by VideoPoet. For the script, we asked Bard to write a short story about a traveling raccoon with a scene-by-scene breakdown and a list of accompanying prompts. We then generated video clips for each prompt, and stitched together all resulting clips to produce the final video below.

When we developed VideoPoet, we noticed some nice properties of the model’s capabilities, which we highlight below.

Long video

We are able to generate longer videos simply by conditioning on the last 1 second of video and predicting the next 1 second. By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.

Here are two examples of VideoPoet generating long video from text input:

Text inputs (one generated video per prompt):
“An astronaut starts dancing on Mars. Colorful fireworks then explode in the background.”
“FPV footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.”
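The chaining procedure described in this section (condition on the last 1 second, predict the next 1 second, repeat) can be sketched as a simple loop. Everything here is a stand-in: `predict_next_second` is a hypothetical stub rather than the model, frames are represented as plain integers, and the frame rate `FPS` is an assumed value picked for the example.

```python
FPS = 8  # assumed frame rate, chosen only for this example


def predict_next_second(context_frames):
    """Stub standing in for the model: return FPS frames continuing the context.

    Frames are plain integers here so the chaining logic is easy to follow.
    """
    last = context_frames[-1]
    return [last + i + 1 for i in range(FPS)]


def extend_video(frames, extra_seconds):
    """Repeatedly condition on the last second and append the next second."""
    frames = list(frames)
    for _ in range(extra_seconds):
        context = frames[-FPS:]  # only the last 1 second is used as context
        frames.extend(predict_next_second(context))
    return frames


first_second = list(range(FPS))  # an initial 1-second clip
video = extend_video(first_second, extra_seconds=3)
```

Each iteration sees only the most recent second, so the length of the context stays fixed no matter how long the video grows.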

It is also possible to interactively edit existing video clips generated by VideoPoet. If we supply an input video, we can change the motion of objects to perform different actions. The object manipulation can be centered at the first frame or the middle frames, which allows for a high degree of editing control.

For example, we can randomly generate some clips from the input video and select the desired next clip.

An input video on the left is used as conditioning to generate four choices given the initial prompt: “Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass”. For the first three outputs we show what would happen for unprompted motions. For the last video in the list below, we add to the prompt, “powering up with smoke in the background” to guide the action.

Image to video control

Similarly, we can apply motion to an input image to edit its contents towards the desired state, conditioned on a text prompt.

Animating a painting with different prompts. Left: “A woman turning to look at the camera.” Right: “A woman yawning.” **

Camera motion

We can also accurately control camera movements by appending the type of desired camera motion to the text prompt. As an example, we generated an image with our model using the prompt, “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river”. The examples below append the given text suffix to apply the desired motion.

Prompts from left to right: “Zoom out”, “Dolly zoom”, “Pan left”, “Arc shot”, “Crane shot”, “FPV drone shot”.
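As a small illustration of the suffix approach, the prompts for the six camera motions above could be built as follows. The helper name `with_camera_motion` and the `", "` separator are assumptions for this sketch; the text only says the motion type is appended to the prompt, not how it is joined.

```python
BASE_PROMPT = (
    "Adventure game concept art of a sunrise over a snowy mountain "
    "by a crystal clear river"
)
CAMERA_MOTIONS = [
    "Zoom out", "Dolly zoom", "Pan left",
    "Arc shot", "Crane shot", "FPV drone shot",
]


def with_camera_motion(prompt, motion, separator=", "):
    """Append a camera-motion suffix to a prompt (separator is assumed)."""
    return f"{prompt}{separator}{motion}"


# One prompt per camera motion, all sharing the same base description.
prompts = [with_camera_motion(BASE_PROMPT, m) for m in CAMERA_MOTIONS]
```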

Evaluation results

We evaluate VideoPoet on text-to-video generation with a variety of benchmarks to compare the results to other approaches. To ensure a neutral evaluation, we ran all models on a wide variety of prompts without cherry-picking examples and asked people to rate their preferences. The figure below highlights the percentage of the time VideoPoet was chosen as the preferred option in green for the following questions.

Text fidelity

User preference ratings for text fidelity, i.e., what percentage of videos are preferred in terms of accurately following a prompt.

Motion interestingness

User preference ratings for motion interestingness, i.e., what percentage of videos are preferred in terms of producing interesting motion.

Based on the above, on average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model, vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion, vs. 11–21% for other models.
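Percentages of this kind can be tallied from side-by-side ratings as in the sketch below. It assumes each comparison yields one of three outcomes (VideoPoet preferred, the other model preferred, or no preference), which would explain why the two reported figures do not sum to 100%; that reading is an assumption. The votes are made up for illustration and are not the study's data.

```python
from collections import Counter


def preference_rates(votes):
    """Return the percentage of comparisons falling into each outcome."""
    counts = Counter(votes)
    total = len(votes)
    return {option: 100.0 * n / total for option, n in counts.items()}


# Made-up votes for illustration only; these are NOT the study's data.
votes = ["videopoet"] * 3 + ["other"] * 1 + ["no_preference"] * 6
rates = preference_rates(votes)
```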

Conclusion

Through VideoPoet, we have demonstrated LLMs’ highly competitive video generation quality across a wide variety of tasks, especially in producing interesting and high-quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning, among many others.

To view more examples in original quality, see the website demo.

Acknowledgements

This research has been supported by a large body of contributors, including Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, and Lu Jiang.

We give special thanks to Alex Siegman, Victor Gomes, and Brendan Jou for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, and John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

**

(a) The Storm on the Sea of Galilee, by Rembrandt, 1633, public domain.

(b) Pillars of Creation, by NASA, 2014, public domain.

(c) Wanderer above the Sea of Fog, by Caspar David Friedrich, 1818, public domain.

(d) Mona Lisa, by Leonardo da Vinci, 1503, public domain.
