Google's 10-second video generation model VideoPoet breaks the world record! An LLM ends the reign of the diffusion model, with results that crush top-tier Gen-2

2025-10-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

Google's new video generation model VideoPoet leads the world again! Its ten-second long-video generation crushes Gen-2, and it can also generate audio and restyle video. AI video generation may be the next fiercely competitive field in 2024.

Over the past few months, video generation models such as Runway's Gen-2, Pika Labs' Pika 1.0, and offerings from Chinese vendors have emerged in waves, iterating and upgrading constantly.

Just this morning, Runway announced that Gen-2 now supports text-to-speech, which can create a voiceover for a video.

Of course, Google did not want to fall behind in video generation. It first released W.A.L.T together with Fei-Fei Li's team at Stanford, whose realistic Transformer-generated videos attracted a lot of attention.

Today, the Google team released a new video generation model, VideoPoet, which can generate videos zero-shot, that is, without task-specific data.

Blog post: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html

The most amazing thing is that VideoPoet can generate a continuous 10-second video with large motions in one go, completely crushing Gen-2, which only generates videos with small motions.

In addition, unlike the leading models, VideoPoet is not based on a diffusion model. It is a multimodal large model with capabilities such as T2V (text-to-video) and V2A (video-to-audio), and it may become the mainstream approach to video generation in the future.

After watching the demos, netizens flooded the comments with "shocked".

Let's start with a look at what it can do.

In text-to-video generation, the output video has variable length and can show a variety of motions and styles depending on the text.

For example, pandas playing cards:

Two pandas playing cards

A pumpkin exploding:

A pumpkin exploding, slow motion

An astronaut on a galloping horse:

An astronaut riding a galloping horse

Image to video

VideoPoet can also animate an input image according to a given prompt.

Left: a ship sailing on a rough sea, surrounded by thunder and lightning, rendered as a dynamic oil painting. Middle: flying through a nebula full of twinkling stars. Right: a traveler with a walking stick stands on the edge of a cliff, gazing at the sea fog churning in the wind.

Video stylization

For video stylization, VideoPoet predicts optical flow and depth information and then feeds them into the model together with additional input text.

Left: a wombat wearing sunglasses, holding a beach ball on a sunny beach. Middle: a teddy bear skating on clear ice. Right: a metal lion roaring in the light of a furnace.

From left to right: realistic, digital art, pencil art, ink, double exposure, 360-degree panorama.

Video to audio

VideoPoet can also generate audio.

As shown below, the model first generates a 2-second animation clip and then tries to predict the audio without any text guidance. This makes it possible to generate both video and audio from a single model.

By default, VideoPoet generates video in portrait orientation to match the format of short-form video.

Google has also made a short film composed of many short clips generated by VideoPoet.

For the script, the researchers asked Bard to write a short story about a traveling raccoon, along with a scene-by-scene breakdown and a list of prompts. They then generated a video clip for each prompt and stitched all the clips together to produce the final video below.

Video storytelling

Visual stories can be created with prompts that change over time.

Input: a walking man made of water. Extended: a walking man made of water. Lightning flashes in the background, and purple smoke rises from the man.

Input: two raccoons riding motorcycles on a mountain road surrounded by pine trees, 8k. Extended: two raccoons riding motorcycles. A meteor shower falls behind the raccoons, hits the ground, and causes an explosion.

An LLM becomes a video generator in seconds

Currently, Gen-2 and Pika 1.0 deliver impressive video generation, but unfortunately they fall short when generating videos with coherent large motions.

Usually, when they produce large motions, the videos show noticeable artifacts.

In response, Google researchers proposed VideoPoet, which can perform a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio.

Compared with other models, Google's approach is to seamlessly integrate a variety of video generation functions into a single large language model, without relying on dedicated components trained for each task.

Specifically, VideoPoet mainly consists of the following components:

A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer convert images, video, and audio clips of variable length into sequences of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, which makes them easy to combine with other modalities such as text.

An autoregressive language model learns across the video, image, audio, and text modalities and autoregressively predicts the next video or audio token in the sequence.

A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. In addition, these tasks can be composed to achieve additional zero-shot capabilities (for example, text-to-audio).
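To illustrate what a "unified vocabulary" can mean in practice, here is a minimal sketch in which each modality's local code indices are shifted into their own non-overlapping range of one shared ID space. The vocabulary sizes and the function names are assumptions made for illustration; the article does not describe how VideoPoet actually lays out its vocabulary.

```python
# Hypothetical vocabulary sizes, chosen only for illustration.
VOCAB_SIZES = {"text": 32_000, "video": 262_144, "audio": 4_096}


def build_offsets(sizes):
    """Assign each modality a contiguous, non-overlapping ID range."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # `start` is now the total vocabulary size


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_unified(modality, local_ids):
    """Shift a tokenizer's local indices into the shared ID space."""
    size = VOCAB_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{i} is outside the {modality} vocabulary")
    return [OFFSETS[modality] + i for i in local_ids]


def from_unified(unified_id):
    """Recover (modality, local index) from a shared ID."""
    for name, size in VOCAB_SIZES.items():
        start = OFFSETS[name]
        if start <= unified_id < start + size:
            return name, unified_id - start
    raise ValueError(f"{unified_id} is outside the unified vocabulary")
```

With such a layout, a single language model can emit any ID, and the ID alone tells the decoding step which tokenizer should turn it back into pixels, audio, or text.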

VideoPoet can multitask over a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for the text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. A key advantage of using an LLM for training is that many of the scalable efficiency improvements already built into existing LLM training infrastructure can be reused.

However, LLMs operate on discrete tokens, which makes video generation challenging.

Fortunately, video and audio tokenizers can encode video and audio clips into sequences of discrete tokens (that is, integer indices) and can also convert them back to the original representation.
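The encode and decode round trip can be pictured with a toy quantizer. The sketch below is not MAGVIT V2 or SoundStream, which are learned neural tokenizers; it is a plain uniform quantizer over samples in [-1.0, 1.0], with a hypothetical `levels` parameter, used only to show continuous values becoming integer indices and coming back with a small error.

```python
def encode(samples, levels=256):
    """Map floats in [-1.0, 1.0] to integer indices in [0, levels - 1]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clamp out-of-range values
        tokens.append(round((x + 1.0) / 2.0 * (levels - 1)))
    return tokens


def decode(tokens, levels=256):
    """Map integer indices back to approximate floats in [-1.0, 1.0]."""
    return [t / (levels - 1) * 2.0 - 1.0 for t in tokens]


samples = [-1.0, -0.3, 0.0, 0.42, 1.0]
tokens = encode(samples)
restored = decode(tokens)
```

The restored values differ from the originals by at most half a quantization step, which is the price of moving to a discrete representation that a language model can predict.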

VideoPoet trains an autoregressive language model to learn across the video, image, audio, and text modalities by using multiple tokenizers (MAGVIT V2 for video and images, and SoundStream for audio).

Once the model has generated tokens conditioned on some context, the tokenizer decoders can convert them back into a viewable representation.
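One way to picture a single sequence that mixes modalities is to wrap each segment in begin and end markers and to put a marker for the task at the front. The sketch below builds such a sequence for a text-to-video example. All special token names and the ID layout are hypothetical, and the content IDs are assumed to already live in one shared vocabulary; the article does not list VideoPoet's actual special tokens.

```python
# Hypothetical special tokens: one task marker plus text/video boundaries.
SPECIAL = ["<task:text_to_video>", "<bot>", "<eot>", "<bov>", "<eov>"]
SPECIAL_IDS = {name: i for i, name in enumerate(SPECIAL)}
BASE = len(SPECIAL)  # content IDs are shifted past the special tokens


def wrap(begin, end, content_ids):
    """Surround one modality's tokens with its boundary tokens."""
    shifted = [BASE + t for t in content_ids]
    return [SPECIAL_IDS[begin]] + shifted + [SPECIAL_IDS[end]]


def build_text_to_video_sequence(text_ids, video_ids):
    """Return (prefix, target): the conditioning part and the part to predict."""
    prefix = [SPECIAL_IDS["<task:text_to_video>"]] + wrap("<bot>", "<eot>", text_ids)
    target = wrap("<bov>", "<eov>", video_ids)
    return prefix, target


prefix, target = build_text_to_video_sequence([7, 8], [100, 101, 102])
```

During training, a model of this kind would see `prefix + target` as one sequence and learn to predict the target tokens; at inference time only the prefix is given and the video tokens are sampled one by one.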

VideoPoet task design: the different modalities are converted to and from tokens by tokenizer encoders and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.

Three major advantages

To sum up, VideoPoet has the following three advantages over video generation models such as Gen-2.

Longer videos

VideoPoet can generate longer videos by conditioning on the last second of a video and predicting the next second.

By chaining this step repeatedly, VideoPoet not only extends the video well but also faithfully preserves the appearance of all objects, even over several iterations.

Here are two examples of VideoPoet generating long videos from text input:
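The extension loop described here can be sketched in a few lines. In this illustration a "frame" is just an integer, `fps` is a hypothetical frame rate, and `toy_predictor` is a stand-in for the model; only the control flow (condition on the last second, append the predicted next second, repeat) reflects the description in the text.

```python
def extend_video(frames, predict_next_second, fps, extra_seconds):
    """Repeatedly condition on the last second and append the next one."""
    frames = list(frames)  # do not mutate the caller's list
    for _ in range(extra_seconds):
        context = frames[-fps:]  # the last second of video so far
        frames.extend(predict_next_second(context))
    return frames


def toy_predictor(context):
    """Hypothetical stand-in for the model: keep counting upward."""
    last = context[-1]
    return [last + i + 1 for i in range(len(context))]


# Start from a 1-second clip (4 frames) and extend it by 2 more seconds.
clip = extend_video([0, 1, 2, 3], toy_predictor, fps=4, extra_seconds=2)
```

Because each step sees only the most recent second, the cost per step stays constant no matter how long the video grows; the trade-off is that consistency over longer spans depends on how well each step preserves what it was given.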

Left: an astronaut dancing on Mars with colorful fireworks in the background. Right: drone footage of a very sharp elven stone city in the jungle, with a blue river, waterfalls, and steep vertical cliffs.

Compared with other models that can only generate videos of 3 to 4 seconds, VideoPoet can generate up to 10 seconds of video at a time.

Autumn scenery of a castle, captured by drone

Precise control

A very important capability for video generation applications is how much control users have over the generated motion.

This will largely determine whether the model can be used to make complex and coherent long videos.

VideoPoet can not only animate an input image according to a text description but also adjust the content through text prompts to achieve the desired result.

Left: turning to look at the camera. Right: yawning.

Besides animating input images, video inputs can also be precisely controlled with text.

For the raccoon dance video on the far left, users can describe different dance moves in text to make the raccoon dance differently.

Generated, left: the robot dance. Middle: the Griddy dance. Right: a freestyle.

Likewise, existing video clips generated by VideoPoet can be edited interactively.

If we provide an input video, we can change the motion of the objects so that they perform different actions. The object manipulation can be centered on the first frame or on the middle frames, which allows a high degree of editing control.

For example, you can randomly generate several clips from the input video and then select the next clip you want.

For example, the leftmost video in the figure is used as the conditioning input, and four videos are generated from the initial prompt:

A close-up of a lovely rusty, worn-out steampunk robot covered with moss and new buds, surrounded by tall grass.

For the first three outputs, the motion is predicted autonomously with no additional prompt. For the last video, "powering up, with smoke in the background" is added to the prompt to guide the motion.

Camera motion

VideoPoet can also precisely control how the picture changes: the desired type of camera motion is simply appended to the text prompt.

For example, the researchers used the model to generate an image from the prompt "adventure game concept art, sunrise over a snowy mountain, clear river". The examples below append the given text suffix to apply the desired motion.

From left to right: zoom out, dolly zoom, pan left, arc shot, crane shot, drone aerial shot.

Evaluation results

Finally, how does VideoPoet perform in concrete experimental evaluation?

To keep the assessment objective, Google researchers ran all the models on a variety of prompts and asked human raters for their preferences.

The figures below show the percentage of the time VideoPoet was chosen as the preferred option (in green) for each of the following questions.

Text fidelity:

User preference ratings for text fidelity, that is, the percentage of videos preferred for accurately following the prompt.

Motion interestingness:

User preference ratings for motion interestingness, that is, the percentage of videos preferred for producing interesting motion.

Taken together, VideoPoet's examples were on average preferred for following the prompt 24-35% of the time, compared with only 8-11% for the competing models.

In addition, raters preferred the motion in VideoPoet's examples as more interesting 41% of the time, compared with only 11% for the other models.
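The percentages above do not add up to 100%, which is consistent with raters also being allowed to express no preference; that reading is an assumption, since the article does not spell out the rating options. Under that assumption, a side-by-side preference rate can be tallied as in the sketch below. The labels and the ten votes are made up purely for illustration and are not data from the study.

```python
from collections import Counter

LABELS = ("videopoet", "other", "no_preference")  # hypothetical label names


def preference_rates(votes):
    """Return the percentage of votes that went to each label."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: 100.0 * counts[label] / len(votes) for label in LABELS}


# Illustrative votes only, not results from the evaluation.
votes = ["videopoet"] * 3 + ["other"] * 1 + ["no_preference"] * 6
rates = preference_rates(votes)
```

In this toy tally, one model can be preferred three times as often as the other while most comparisons still end without a clear preference, so the two headline percentages should be read together rather than in isolation.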

As for future research, Google researchers said that the VideoPoet framework should enable "any-to-any" generation, for example by extending it to text-to-audio, audio-to-video, and video captioning.

Netizens can't help asking whether Runway and Pika can withstand the upcoming text-to-video innovations from Google and OpenAI.

Reference:

https://sites.research.google/videopoet/

https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
