AI turns film master: it can even understand Interstellar! New work from Jia Jiaya's team, a multimodal large model that takes on ultra-long 3-hour videos

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

What? AI can now watch blockbusters on its own?

The latest research from Jia Jiaya's team teaches a large model to process ultra-long videos directly.

Give it the sci-fi blockbuster Interstellar (2 hours and 49 minutes long):

After "watching" it, the model can not only review the film with ease, drawing on both the plot and the characters:

It can also accurately answer questions about details in the film:

For example: what is the wormhole's purpose, and who created it?

A: It was placed near Saturn by intelligent beings from the future to help humans travel long distances.

How does the protagonist Cooper transmit the information from the black hole to his daughter Murphy?

A: He transmits the data in Morse code through the watch.

Ah, it feels like the job of a movie blogger is going to be taken away by AI.

This is LLaMA-VID, the latest multimodal large model, which supports three types of input: single images, short videos and long videos.

In contrast, similar models, including GPT-4V, can only deal with images.

And the principle behind it is more interesting.

Reportedly, LLaMA-VID achieves these capabilities with one very simple method:

Compress the number of tokens representing each frame to just 2.

How well does it work, and how is it done? Let's take a look.

Character analysis and film details: all OK

For films, besides answering questions about details accurately, LLaMA-VID can also understand and analyze the characters very accurately.

Sticking with Interstellar, we asked it to analyze how fast time passes on Miller's planet relative to Earth, and why.

The result is absolutely correct:

LLaMA-VID says it is because Miller's planet is near a black hole, so one hour there equals seven years on Earth.

Next, we gave it Forrest Gump, which runs nearly two hours.

Asked "What does Jenny mean to Forrest Gump?", LLaMA-VID answers:

A childhood friend who later becomes Forrest Gump's wife, and the comfort of his life and soul.

It can also analyze what happens to Forrest Gump during the war and after he leaves the army, and the answer is again very good:

Lieutenant Dan blames Forrest Gump for saving him, because it robbed him of the chance to die on the battlefield.

Beyond watching films and acting as an all-knowing "movie companion", it can also grasp the intent of promotional videos and answer open-ended questions.

For example, give it the recently popular GTA6 trailer.

Asked "What attracts you most about this game?", it gives this view after "watching":

First, there is a wide variety of game scenes and settings (from racing and stunt driving to shooting); second, the visual effects are quite stunning.

Oh, by the way, LLaMA-VID can also infer from the scenes and features in the game that the trailer is a promotion from Rockstar Games:

It also recognizes that the game is set in Miami (based on the nightlife, beaches and other cues, and after the author hinted that the game is set in Florida).

Finally, beyond promotional videos and films 2 to 3 hours long, let's also look at LLaMA-VID's ability to understand the most basic image content.

Sure enough, it accurately identifies this as a piece of cloth with a hole in it:

It can even play Sherlock Holmes. Faced with a picture of a room interior like this:

From the many coats hanging on the door, it infers that the owner of the room may be busy or go out a lot.

Evidently, LLaMA-VID's accurate understanding of video builds on this image-level ability, but the key question is how it manages to process videos of such length.

A few lines of code for a 2-token-per-frame representation

The key innovation of LLaMA-VID is to compress the number of tokens per frame to a very low level, so that it can handle very long videos.

Many conventional multimodal large models encode a single image into too many tokens, so the number of tokens required rises sharply as the video gets longer, more than the model can handle.

For this reason, the research team redesigned the image encoding scheme, using a context token and an image content token to encode each frame of the video.

As a result, each frame is represented by 2 tokens.
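To get a feel for why the per-frame token count matters, here is a back-of-the-envelope sketch in Python. Only the 2 tokens per frame and the film's running time come from this article; the sampling rate of 1 frame per second and the 256-tokens-per-frame baseline are illustrative assumptions.

```python
def video_token_count(duration_seconds, tokens_per_frame, frames_per_second=1):
    """Visual tokens needed to represent a whole video.

    frames_per_second=1 is an assumed sampling rate, chosen for illustration.
    """
    n_frames = duration_seconds * frames_per_second
    return n_frames * tokens_per_frame


# Interstellar runs 2 hours 49 minutes.
INTERSTELLAR_SECONDS = (2 * 60 + 49) * 60

# 2 tokens per frame, as described in the article.
compact = video_token_count(INTERSTELLAR_SECONDS, tokens_per_frame=2)

# 256 tokens per frame is a hypothetical baseline for comparison.
baseline = video_token_count(INTERSTELLAR_SECONDS, tokens_per_frame=256)

print(f"frames sampled:       {INTERSTELLAR_SECONDS}")
print(f"2 tokens per frame:   {compact}")
print(f"256 tokens per frame: {baseline}")
```

Under these assumptions the film needs about 20 thousand visual tokens rather than roughly 2.6 million, which is the difference between fitting in a long-context language model and not.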

Specifically, look at the framework of LLaMA-VID.

It has only three parts:

An encoder and a decoder produce the visual embedding and the text-guided features.

The context token and the image content token are generated according to a specific token generation strategy.

Instruction tuning provides further optimization.

Given an instruction, LLaMA-VID takes a single image or video frames as input, and the large language model then generates the answer.

This process begins with a visual encoder that converts each input frame into a visual embedding.

The text decoder then generates a cross-modal index (the text query) related to the input instruction, based on the user input and the features extracted by the image encoder.

Next, an attention mechanism (context attention) aggregates the text-related visual cues in the visual embedding, that is, it samples and combines features to produce high-quality instruction-related features.

To improve efficiency, the model compresses the visual embedding to different token sizes, down to as little as a single token.

The context token is generated according to the question entered by the user, preserving as far as possible the visual features relevant to that question.

The image content token samples the image features directly according to the user's instruction; it pays more attention to the content of the image itself and complements the parts the context token does not attend to.
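The idea of compressing a frame's visual embedding to different token sizes can be sketched as follows. This is a simplified illustration, not the team's code: the function name `pool_tokens` is hypothetical, and averaging over contiguous groups of patch embeddings is an assumed pooling operator.

```python
def pool_tokens(visual_embed, n_tokens):
    """Compress a list of patch embeddings to n_tokens by group averaging.

    visual_embed: list of equal-length float lists, one per image patch.
    n_tokens: target token count; must divide the number of patches.
    Mean pooling over contiguous groups is an assumption for illustration.
    """
    n_patches = len(visual_embed)
    if n_tokens <= 0 or n_patches % n_tokens != 0:
        raise ValueError("n_tokens must be positive and divide the patch count")
    group = n_patches // n_tokens
    pooled = []
    for start in range(0, n_patches, group):
        chunk = visual_embed[start:start + group]
        # Average each embedding dimension across the patches in this group.
        pooled.append([sum(col) / group for col in zip(*chunk)])
    return pooled


# Toy frame: 4 patches with 2-dimensional embeddings.
frame = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

print(pool_tokens(frame, 2))  # a coarser, 2-token summary of the frame
print(pool_tokens(frame, 1))  # a single content token for the frame
```

A larger `n_tokens` keeps more spatial detail, which suits a single image; `n_tokens=1` is the extreme case that makes hour-long videos affordable.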

Together, the text-guided context token and the image content token represent each frame.

Finally, the large language model takes the user instruction and all the visual tokens as input and generates the answer.

And the method for generating these tokens is simple, taking only a few lines of code.
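As an illustration of the token generation described above, the following plain-Python sketch shows one way the two tokens per frame could be computed. It assumes softmax attention of the text query over the visual embedding, followed by averaging, for the context token, and mean pooling for the content token. All function names are hypothetical, and the projection into the language model's embedding space is omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_rows(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def context_token(text_query, visual_embed):
    """Aggregate the text-related visual cues into a single vector."""
    attended = []
    for q in text_query:
        # Similarity between this query vector and every patch embedding.
        scores = [sum(a * b for a, b in zip(q, x)) for x in visual_embed]
        weights = softmax(scores)
        # Attention-weighted sum of the patch embeddings.
        attended.append(
            [sum(w * x[c] for w, x in zip(weights, visual_embed))
             for c in range(len(visual_embed[0]))]
        )
    return mean_rows(attended)


def content_token(visual_embed):
    """Summarize the frame itself, independent of the question."""
    return mean_rows(visual_embed)


def frame_tokens(text_query, visual_embed):
    """Return the two tokens that represent one frame."""
    return [context_token(text_query, visual_embed), content_token(visual_embed)]


# Toy example: two patches, one query vector that matches the first patch.
visual = [[1.0, 0.0], [0.0, 1.0]]
query = [[10.0, 0.0]]
tokens = frame_tokens(query, visual)
print(len(tokens), tokens)
```

In the toy example the context token leans heavily toward the patch that matches the query, while the content token stays an even summary of the whole frame, which is the complementary behavior the article describes.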

In terms of experimental results, LLaMA-VID achieves SOTA on multiple video Q&A and reasoning benchmarks.

With just one additional context token, LLaMA-VID also improves significantly on multiple image Q&A metrics.

LLaMA-VID achieves good results on 16 video and image understanding and reasoning datasets.

On GitHub, the team provides all the fine-tuned models for the different stages, as well as the pre-trained weights for the first stage.

Training consists of three stages: feature alignment, instruction fine-tuning, and long-video fine-tuning (see GitHub for the corresponding steps).

In addition, the LLaMA-VID team collected 400 films and generated a corpus of 9K long-video Q&A pairs, covering film reviews, character development and plot reasoning.

Combined with the long-text dataset LongAlpaca-12k (9k long-text Q&A pairs and 3k short-text Q&A pairs) previously released by Jia Jiaya's team, existing multimodal models can easily be extended to support long video input.

It is worth mentioning that back in August this year, Jia Jiaya's team released LISA, a multimodal large model focused on reasoning segmentation.

In October the team also released the open-source long-text large language model LongAlpaca (7 billion parameters) and the ultra-long-text extension method LongLoRA.

LongLoRA needs only two lines of code to extend the text length of a 7B model to 100k tokens and of a 70B model to 32k tokens.

Finally, the team also provides a demo where you can upload your own videos and chat with LLaMA-VID (it is deployed on a single 3090 GPU; those who need more can follow the code to deploy on a GPU with more video memory and chat about an entire film directly).

It seems that the next time you can't make sense of a Nolan film, you can ask AI to give it a try.

Paper address:

https://arxiv.org/abs/2311.17043

GitHub address:

https://github.com/dvlab-research/LLaMA-VID

Demo address:

http://103.170.5.190:7864/

This article is from the WeChat official account Quantum Bit (ID: QbitAI), author: Toshio Mingmin
