Computer vision's GPT moment: UC Berkeley's "Big Three" present the first pure vision model, and its reasoning hints at sparks of AGI


2025-07-06 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com) 12/24 Report:

UC Berkeley's computer vision "Big Three" have launched the first pure vision model trained without natural language, showing for the first time that a pure vision model is also scalable. Even more striking, LVM can correctly answer visual reasoning questions. Are sparks of AGI appearing again?

The GPT moment of computer vision is coming!

Recently, the "Big Three" of computer vision at UC Berkeley jointly launched the first pure vision model without natural language, the Large Vision Model (LVM), and showed for the first time that a pure vision model is itself scalable.

In addition, the researchers used a dataset of more than 420B tokens to enable the model to understand and perform downstream tasks through in-context learning, and unified almost all forms of visual data: image / video, supervised / unsupervised, synthetic / real, 2D / 3D / 4D, and so on.

Paper address: https://arxiv.org/abs/2312.00785

It is worth mentioning that LVM can often make correct inferences when given the non-verbal reasoning problems found in non-verbal IQ tests (Raven's Progressive Matrices).

The researchers were pleasantly surprised, saying this may mean that LVM also shows a "spark of AGI"!

The counterattack of the pure vision model

With the explosion of large language models, both academia and industry have begun trying to use "text" to scale up vision models.

SOTA models, including GPT-4V, train on vision and text together.

Take "apple" as an example: during training, this approach shows the model not only "pictures of apples" but also the accompanying text "this is an apple".

However, with more complex images, a text description easily misses a large amount of information.

For example, how should the Mona Lisa be described? A photo of a kitchen filled with all kinds of objects is also difficult to describe clearly.

In response, researchers from UC Berkeley and Johns Hopkins University have proposed a new "visual sequence" modeling approach that can train a Large Vision Model (LVM) without using any language data.

This general format, called a "visual sequence", can represent raw images and videos as well as annotated data sources such as semantic segmentation and depth reconstruction, without requiring any meta-knowledge beyond pixels.

Once such a wide range of visual data (420 billion tokens in total) is represented as sequences, the model can be trained to minimize the cross-entropy loss for next-token prediction.

The resulting LVM not only scales effectively and completes a variety of vision tasks, but also shows emergent abilities such as counting, reasoning, and solving intelligence-test problems.

Left: Alexei A. Efros; middle: Trevor Darrell; right: Jitendra Malik

Put simply, a large vision model can understand and process complex visual information just by looking at images, without relying on language data at all.

Previously, the value of using pre-trained models (such as an ImageNet-pre-trained AlexNet) was demonstrated in R-CNN as early as 2015.

Since then, it has become standard practice in computer vision.

Self-supervised pre-training was proposed as a way to greatly increase the amount of data available for pre-training.

Unfortunately, this approach was not very successful, probably because the CNN-based architectures of the time did not have enough capacity to absorb the data.

With the introduction of the Transformer, capacity became much higher, so researchers revisited self-supervised pre-training and found that Transformer-based masked image reconstruction methods, such as BEiT, MAE, and SimMIM, perform much better than their CNN-based counterparts.

Even so, current pre-trained pure vision models still run into difficulties when scaled to truly large datasets such as LAION.

How to build a "large vision model"?

What elements are needed to build a Large Vision Model (LVM)?

The animal world tells us that visual ability does not depend on language. Many experiments have shown that the visual world of non-human primates is very similar to that of humans.

Therefore, this paper takes a different direction from vision-language models such as LLaVA: how far can we go with pixels alone?

The researchers try to emulate two key features of LLMs in LVM: (1) scaling in the presence of big data, and (2) flexible specification of tasks through prompting (in-context learning).

To achieve this goal, three main components need to be identified:

Data: the researchers want to take full advantage of the remarkable diversity of visual data.

First come raw unannotated images and videos. Next, the researchers use the variety of annotated visual data sources produced over the past few decades, such as semantic segmentation, depth reconstruction, keypoints, multiple views of 3D objects, and so on.

To this end, they defined a general format called a "visual sequence" to represent these different annotations without requiring any meta-knowledge beyond the pixels themselves. The training dataset contains a total of 1.64 billion images / frames.

Architecture: the researchers used a large Transformer architecture with 3 billion parameters that was trained on visual data represented as token sequences.

Through a learned tokenizer, each image is mapped to a string of 256 vector-quantized (VQ) tokens.

Loss function: the researchers took inspiration from natural language processing, where masked token modeling has given way to sequential autoregressive prediction.

Once images, videos, and annotated images can all be represented as sequences, the model can be trained to minimize the cross-entropy loss for predicting the next token.
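As a minimal sketch of this objective (not the authors' training code), the snippet below computes the mean next-token cross-entropy for a toy token sequence. The function names, the 4-entry vocabulary, and the logits are illustrative assumptions; the real model predicts over a much larger VQ codebook.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_cross_entropy(logits_per_step, tokens):
    """Mean negative log-likelihood of tokens[t + 1] given the logits at step t.

    logits_per_step[t] holds the model's scores for the token that follows
    tokens[t], so there are len(tokens) - 1 predictions to score.
    """
    if len(logits_per_step) != len(tokens) - 1:
        raise ValueError("need exactly one logit vector per predicted token")
    total = 0.0
    for step_logits, target in zip(logits_per_step, tokens[1:]):
        total -= log_softmax(step_logits)[target]
    return total / (len(tokens) - 1)

# Toy example: hypothetical vocabulary of 4 visual tokens, sequence of 3 tokens.
tokens = [2, 0, 3]
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = next_token_cross_entropy(uniform_logits, tokens)  # uniform guess: log(4)
```

A model that guesses uniformly over the vocabulary scores log(4) here; training pushes the loss below that by putting more probability on the token that actually comes next.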

Through this minimalist design, the researchers made several novel findings:

- As model size and data size increase, the model shows appropriate scaling behavior.

- By designing suitable visual prompts at test time, a variety of vision tasks can be solved.

- Large amounts of unsupervised data significantly improve performance on various standard vision tasks.

- The model shows signs of general visual reasoning ability when handling out-of-distribution data and performing novel tasks, but further investigation is needed.

Data

"Data! Data! Data! I can't make bricks without clay!" (Sherlock Holmes)

The key to any large pre-trained model is that it must be trained on a large amount of data.

It is easy for language models to obtain very large and very diverse datasets.

For example, the popular CommonCrawl repository, which contains 250 billion web pages scanned across the web, is extremely diverse and includes "natural demonstrations" of tasks such as language translation and question answering.

However, in computer vision there is still a long way to go before data sources reach the same scale and diversity.

Therefore, one of the researchers' core contributions is to build such a dataset, the Unified Vision Dataset v1 (UVDv1).

To this end, the researchers used many different visual data sources: (1) unlabeled images, (2) images with visual annotations, (3) unlabeled videos, (4) videos with visual annotations, and (5) synthetic 3D objects.

Among them, unlabeled images account for more than 80% of the total data. They capture a large cross-section of the visual world and provide the required diversity, but at the cost of lower-quality data sources.

The distribution of annotated images is more limited, but generally of higher quality.

Video data is even more limited (generally to human-centric activities), but it is a valuable source of temporal data.

Renderings of synthetic 3D objects have the lowest diversity, but can provide valuable hints about the behavior of 3D structures.

Most importantly, UVDv1 is a purely visual dataset that does not contain non-visual metadata such as text.

All in all, UVDv1 contains 1.64 billion images.

Another important difference from LLMs is that language data has a natural, unified one-dimensional structure for all data: a stream of text.

Unfortunately, this is not the case with visual data, and different sources have different structures.

Therefore, in this work, the researchers propose the visual sequence as a unified unit of visual data, which enables them to train scalable models on data from diverse sources.

A visual sequence is just a sequence of one or more images, followed by an EOS token.

Figure 1 shows how various data sources are divided into visual sequences.

Single images

A single image by itself is the simplest form of visual sequence: {image, EOS}.

The researchers used a filtered subset of 1.49 billion images from the LAION 5B data set.

This is by far the largest portion of the data, accounting for 88.5%.

Image sequences

Image sequences are a natural form of visual sequence.

The researchers created such sequences by obtaining video data from a variety of existing data sets.

Visual sequences of 16 frames are formed by sampling each video at three different strides (10, 20, and 30).
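The stride-based sampling can be sketched as follows. This is an illustration only: the helper name `sample_frame_indices`, the fixed start frame, and the 500-frame clip are assumptions, and the article does not say how clips that are too short are handled (here they are simply skipped).

```python
def sample_frame_indices(num_frames, stride, seq_len=16, start=0):
    """Return seq_len frame indices spaced `stride` apart, or None if the clip is too short."""
    last = start + stride * (seq_len - 1)
    if last >= num_frames:
        return None
    return list(range(start, last + 1, stride))

clip_length = 500  # hypothetical number of frames in one video
sequences = {stride: sample_frame_indices(clip_length, stride) for stride in (10, 20, 30)}
```

Larger strides cover a longer stretch of the video with the same 16 frames, so one clip can yield sequences with different amounts of motion between consecutive frames.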

In addition, the researchers used synthetic 3D objects from the Objaverse dataset to generate object-centric multi-view sequences.

For each object, the researchers sampled a radius between 1.5 and 2.2 for the distance from the object center to the camera, and a constant elevation between -45 and 45 degrees. They then traversed different views of the object by changing the azimuth in 15-degree steps, rendering 24 views.

In this way, the researchers rendered a total of 42,000 such sequences for training and 8,000 for testing.
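A rough sketch of the camera placement for one multi-view sequence is shown below. The function name, the uniform sampling, and the coordinate convention (object at the origin, z pointing up) are assumptions made for illustration; the article only gives the ranges and the 15-degree azimuth step.

```python
import math
import random

def sample_camera_poses(rng, num_views=24, azimuth_step_deg=15.0):
    """One multi-view sequence: fixed radius and elevation, azimuth swept in equal steps."""
    radius = rng.uniform(1.5, 2.2)
    elevation = math.radians(rng.uniform(-45.0, 45.0))
    poses = []
    for i in range(num_views):
        azimuth = math.radians(i * azimuth_step_deg)
        # Spherical to Cartesian, with the object at the origin and z pointing up.
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)
        poses.append((x, y, z))
    return radius, poses

radius, poses = sample_camera_poses(random.Random(0))
```

Since 24 views at 15 degrees each add up to 360 degrees, every sequence is one full turn around the object at a constant distance and height.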

Finally, images belonging to the same semantic category can be represented as (part of) the sequence.

Using the ImageNet categories, groups of images (2, 4, 8, or 16) from the same category are concatenated into a long sequence of 16 images.

Annotated images

To handle different types of image annotations in a unified way, the researchers chose to represent all annotations as images.

Some data types, such as semantic segmentation maps, edge maps, depth images, and surface normal images, are already represented in this way.

For other data types, the researchers tailored a specific method to each annotation type:

1. Object detection: annotations are created by overlaying a color-coded bounding box around each object.

2. Human pose: human skeletons are rendered in pixel space with MMPose, following the OpenPose format.

3. Depth estimation, surface normals, and edge detection: for given ImageNet and COCO images, annotations are generated according to a specific protocol.

4. Style transfer, rain removal, denoising, low-light enhancement, and stereo datasets: these are all represented as image pairs (such as input / output).

5. Colorization: ImageNet images are converted to grayscale to produce image pairs.

6. Inpainting: black boxes are randomly added to images to simulate corruption, producing image pairs.

For all of the annotation types above, a visual sequence of 16 images is created by concatenating 8 image pairs of the same annotation type.

For datasets containing k different annotations of the same image, a different approach is used: for each set of 1 + k images (the input plus k annotations), m elements are randomly selected, where m ≤ k + 1 ≤ 16. These m-tuples are then concatenated to form visual sequences.

Annotated image sequences

Two complementary strategies are used when converting annotated video data (VIPSeg, Hand14K, AVA, JHMDB) into visual sequences.

The first strategy is similar to the handling of paired annotated image data: each visual sequence is built by following each frame with its annotation: {frame1, annot1, frame2, annot2, ...}.

The second strategy groups multiple frames, followed by their corresponding annotations: {frame1, frame2, annot1, annot2, ...}.
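The two layouts can be sketched in a few lines. The function names and the `group_size` parameter are hypothetical; the article only shows the pattern with two frames per group and does not state the group size actually used.

```python
def interleaved(frames, annotations):
    """Strategy 1: {frame1, annot1, frame2, annot2, ...}."""
    sequence = []
    for frame, annot in zip(frames, annotations):
        sequence += [frame, annot]
    return sequence

def grouped(frames, annotations, group_size=2):
    """Strategy 2: a group of frames followed by the annotations for that group."""
    sequence = []
    for i in range(0, len(frames), group_size):
        sequence += frames[i:i + group_size] + annotations[i:i + group_size]
    return sequence

# Strings stand in for images in this toy example.
frames = ["frame1", "frame2", "frame3", "frame4"]
annots = ["annot1", "annot2", "annot3", "annot4"]
pairwise = interleaved(frames, annots)
blockwise = grouped(frames, annots)
```

Both layouts contain the same images; only the order differs, so the model has to work out from context which images are inputs and which are annotations.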

Implementation approach

Unlike text data, which naturally has a discrete sequence structure, it is not obvious how to model image pixels as visual sequences. In this work, the researchers adopted a two-stage approach:

1. Train a large visual tokenizer (operating on a single image) to convert each image into a sequence of visual tokens.

2. Train an autoregressive Transformer model on visual sequences, each represented as a sequence of tokens.

Image tokenization

Although a visual sequence has sequential structure across consecutive images, there is no such natural sequence structure within a single image.

Therefore, to apply a Transformer model to images, previous work typically does one of the following: either divide the image into patches in scan-line order and treat them as a sequence, or use a pre-trained image tokenizer, such as VQVAE or VQGAN, to cluster image features into a grid of discrete tokens, and then convert these tokens into a sequence in scan-line order.

The researchers use the latter approach, because the discrete categorical output of the model naturally forms a probability distribution that can be easily sampled, which makes it possible to flexibly generate new images within a visual sequence.

Specifically, the researchers use semantic tokens generated by a VQGAN model. The framework consists of an encoding and a decoding mechanism, with a quantization layer that maps the input image to a sequence of discrete tokens from an established codebook.

The encoder and decoder are composed entirely of convolutional layers. The encoder has several downsampling modules to compress the spatial dimensions of the input, while the decoder has an equal number of upsampling modules to restore the image to its original size.
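The quantization step can be illustrated with a nearest-neighbor codebook lookup, which is the usual way a VQ layer turns a continuous feature vector into a token id. This is a toy sketch with an assumed 2-D codebook of 4 entries and made-up feature vectors, not the authors' VQGAN code.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

# Hypothetical 2-D codebook with 4 entries; real codebooks are far larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.7)]  # stand-ins for encoder outputs
token_ids = [quantize(f, codebook) for f in features]  # [0, 3, 2]
```

In a full tokenizer, the encoder would produce one such feature vector per cell of the downsampled grid, and the decoder would reconstruct the image from the selected codebook entries.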

For a given image, the researchers' VQGAN tokenizer produces 256 discrete tokens.

It should be noted that the researchers' tokenizer operates on a single image independently, rather than on the entire visual sequence at once.

This independence lets the researchers decouple tokenizer training from the downstream Transformer model, so the tokenizer can be trained on a dataset of single images, regardless of the distribution of visual sequences.

Implementation details: the researchers adopted an off-the-shelf VQGAN architecture, with a downsampling factor of f = 16 and a codebook of size 8192. This means that for an image of size 256 × 256, the VQGAN tokenizer produces 16 × 16 = 256 tokens, each of which can take 8192 different values.
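The arithmetic behind these numbers is easy to check. The helper name below is an assumption; the values (256 × 256 input, downsampling factor 16, codebook size 8192) come from the text above.

```python
import math

def tokens_per_image(image_size=256, downsample_factor=16):
    """Number of tokens in the latent grid produced for a square image."""
    side = image_size // downsample_factor
    return side * side

n_tokens = tokens_per_image()               # 16 * 16 = 256
bits_per_token = math.log2(8192)            # 13 bits select one of 8192 codebook entries
bits_per_image = n_tokens * bits_per_token  # 3328 bits for the whole token grid
```

Seen this way, each 256 × 256 image is compressed to 256 codebook indices of 13 bits each before it ever reaches the Transformer.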

The researchers found that a tokenizer pre-trained on ImageNet did not generalize well beyond ImageNet images, so they trained their own tokenizer on a 1.5B subset of the LAION 5B dataset.

Sequence modeling of visual sequences

After using VQGAN to convert images into discrete tokens, the researchers treat a visual sequence as a unified sequence by concatenating the discrete tokens of multiple images into a 1D sequence.

Importantly, all visual sequences are treated equally: the researchers do not use any special tokens to indicate particular tasks or formats.

The researchers train a causal Transformer model with a cross-entropy loss and a next-token prediction objective, similar to the standard approach for language models. Training the model in the same way on all visual sequences lets it infer the relationships between images from context rather than from task-specific or format-specific tokens. This gives the model the opportunity to generalize to other, unseen visual sequence structures.

Implementation details: the researchers tokenize each image in a visual sequence into 256 tokens and then concatenate them into a 1D token sequence.

Operating on sequences of visual tokens, the researchers' Transformer model is virtually the same as an autoregressive language model, so they adopt the LLaMA Transformer architecture.

The researchers used a context length of 4096 tokens, which fits 16 images under their VQGAN tokenizer.

As with language models, the researchers add a [BOS] (beginning of sequence) token at the start of each visual sequence and an [EOS] (end of sequence) token at the end, and use sequence concatenation during training to improve efficiency.
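A minimal sketch of this packing scheme follows. The special-token ids, the function name, and the choice to drop the incomplete final chunk are assumptions; the article does not specify these details.

```python
BOS, EOS = 8192, 8193  # hypothetical ids placed just past an 8192-entry codebook

def pack_sequences(sequences, context_length=4096):
    """Wrap each sequence in BOS/EOS, concatenate everything, and cut fixed-size chunks."""
    stream = []
    for seq in sequences:
        stream.append(BOS)
        stream.extend(seq)
        stream.append(EOS)
    full_chunks = len(stream) // context_length  # the incomplete tail is dropped here
    return [stream[i * context_length:(i + 1) * context_length] for i in range(full_chunks)]

# Three toy visual sequences of 16 images x 256 tokens each (every token id is 0).
toy_sequences = [[0] * (16 * 256) for _ in range(3)]
chunks = pack_sequences(toy_sequences)
```

Because 16 images already fill 4096 tokens, adding [BOS] and [EOS] means sequences straddle chunk boundaries; concatenation avoids padding and keeps every training example at the full context length.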

The researchers trained their models on the entire UVDv1 dataset (420 billion tokens) for one epoch (simple single-epoch training is standard for language models, to avoid potential overfitting).

The researchers trained four models with different number of parameters: 300 million, 600 million, 1 billion and 3 billion, following the same training configuration.

Inference through visual prompting

Because the autoregressive Transformer outputs a probability distribution over the next token conditioned on the previous tokens, the researchers can easily sample from this distribution to generate new visual tokens that complete a visual sequence.

To use the model for a downstream task, one builds a partial visual sequence that defines the task at test time and applies the model to generate the output. This is similar to in-context learning in language models or visual prompting in computer vision.
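The sampling loop can be sketched as below. `toy_model` is a deliberately trivial stand-in for the Transformer (an assumption made so the example runs on its own); only the loop structure reflects the procedure described above.

```python
import random

def sample_next(probs, rng):
    """Draw one token id from a categorical distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def complete(prompt_tokens, next_token_probs, num_new_tokens, rng):
    """Autoregressively extend the prompt, sampling one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        tokens.append(sample_next(next_token_probs(tokens), rng))
    return tokens[len(prompt_tokens):]

def toy_model(tokens, vocab_size=4):
    """Stand-in for the Transformer: always predicts (last token + 1) mod vocab_size."""
    probs = [0.0] * vocab_size
    probs[(tokens[-1] + 1) % vocab_size] = 1.0
    return probs

generated = complete([0, 1, 2], toy_model, num_new_tokens=5, rng=random.Random(0))
```

With the real model, the prompt would be the tokens of the example images, and generating one output image would mean sampling 256 tokens and passing them through the VQGAN decoder.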

Experimental results and analysis

Finally, the researchers assessed the scalability of the model and its ability to understand and answer various prompt tasks.

Scalability

The researchers studied the scaling behavior of their model in terms of training loss and downstream task performance as model size and the number of tokens seen during training increase.

Training loss. First, the researchers examined the training loss of LVMs with different parameter counts, as shown in the figure below.

Because all models are trained for only one epoch, the model sees each data sample just once, so the training loss at any point during training is very similar to the validation loss.

It can be observed that as the training goes on:

1. The training loss (perplexity) of models of every size keeps decreasing.

2. As model size (parameter count) increases, the loss decreases faster.

These observations show that LVM scales well with larger models and more data.

Although LVM's overall loss scales well during training, there is no guarantee that a better overall model will also perform better on specific downstream tasks.

Therefore, the researchers evaluated models of different sizes on four downstream tasks: semantic segmentation, depth estimation, surface normal estimation and edge detection. The researchers evaluated these tasks on the ImageNet validation set.

For each task, the prompt consists of five pairs of inputs with their corresponding ground-truth annotations, followed by a query image, and the researchers evaluate the model's perplexity on the ground-truth annotation over the next 256 tokens (one image).

In the figure below, the researchers show that larger models do achieve lower perplexity on all tasks, indicating that the scalable overall performance does carry over to a range of downstream tasks.
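The metric reported in these evaluations is perplexity, the exponential of the mean negative log-likelihood per token. The sketch below shows the computation on made-up log-probabilities; the function name and the numbers are illustrative and are not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the scored tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# If the model assigned probability 1/50 to each of the 256 target tokens,
# the perplexity over that image would be 50.
log_probs = [math.log(1 / 50)] * 256
ppl = perplexity(log_probs)
```

A perplexity of 50 can be read as the model being, on average, as uncertain as a uniform choice among 50 of the 8192 codebook entries; lower is better.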

Although LVM achieves better performance on larger models and more data, a natural question is whether each data component collected in UVDv1 is helpful.

To answer this question, the researchers conducted an ablation study, training several 3B models on subsets of their dataset and comparing their performance on downstream tasks.

The researchers used the same four downstream tasks and settings and showed the results in the following figure.

The researchers observed that each data component contributes positively to downstream tasks. LVM benefits not only from more data but also from greater diversity in the dataset, including both annotated and unsupervised image and video data.

Sequential prompting

The researchers begin with the simplest and most intuitive way to visually prompt LVM: sequential reasoning. Here the prompt construction is simple: the researchers show the model a sequence of seven images and ask it to predict the next image (256 tokens).

For sequential prompts, the most direct task is video prediction. The following figure shows several examples of next-frame prediction, prompted with sequences from the Kinetics-700 validation set.

In the top example, the 7-frame prompt (blue border) is followed by the predicted frame (red border). The researchers observed a degree of reasoning ability in spatial positioning, viewpoint, and object understanding. The prediction perplexity on the Kinetics validation set is 49.8.

The following example shows a prediction with a longer context (15 frames) and a longer prediction (4 frames).

The same type of simple sequential prompt can also be used in other ways. For example, the following figure shows how prompting the model with a 3D rotation sequence of a synthetic object around an arbitrary axis lets it predict further rotations.

Alternatively, the researchers can treat a list of items from a given category as a sequence and predict other items in that category, as shown in the following figure.

It is worth noting that although the system was trained on groups of images from the same ImageNet category, the prompts here include sketches, which do not appear in any of the annotated data.

Next, the researchers studied how much temporal context is needed to accurately predict subsequent frames.

The researchers assessed the model's frame-generation perplexity under context prompts of different lengths (1 to 15 frames). As shown in the following figure, on the Kinetics-700 validation set, perplexity improves markedly from 1 to 11 context frames and then stabilizes (from 62.1 to 48.4).

Analogy prompting

The researchers take their study further by evaluating a more complex prompt structure, which they call "analogy prompting". This approach challenges the model to understand analogies of arbitrary length and complexity, testing its more advanced interpretive abilities.

The following figure shows a sample of qualitative results using analogy prompts on multiple tasks. Each prompt consists of a sequence of 14 images giving examples of a task, followed by a 15th query image. For each prompt, the model predicts the next image.
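An analogy prompt of this shape can be assembled as in the sketch below. It assumes the 14 example images are seven (input, output) pairs, which the text above does not state explicitly, and it uses placeholder token ids instead of real VQGAN tokens; the function name is hypothetical.

```python
def build_analogy_prompt(example_pairs, query_tokens, tokens_per_image=256):
    """Concatenate (input, output) example images and a query image into one token prompt."""
    prompt = []
    for source, target in example_pairs:
        if len(source) != tokens_per_image or len(target) != tokens_per_image:
            raise ValueError("every image must contribute exactly tokens_per_image tokens")
        prompt.extend(source)
        prompt.extend(target)
    if len(query_tokens) != tokens_per_image:
        raise ValueError("the query image must contribute exactly tokens_per_image tokens")
    prompt.extend(query_tokens)
    return prompt

# Seven example pairs (14 images) plus the query image; token ids are placeholders.
pairs = [([i] * 256, [i + 100] * 256) for i in range(7)]
prompt = build_analogy_prompt(pairs, [7] * 256)
remaining = 4096 - len(prompt)  # room left in the context window for the answer
```

The prompt occupies 15 × 256 = 3840 tokens, leaving exactly 256 tokens of the 4096-token context for the predicted 16th image.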

The top of the figure shows several example prompts defining tasks that were part of the training set (although these actual images were never seen in training). The lower part of the figure shows generalization to tasks never shown in training.

The researchers report keypoint detection results on Pascal 3D+, using the standard percentage of correct keypoints (PCK) metric with a threshold of 0.1. Notably, LVM achieves a PCK of 81.2 without being trained on this dataset, showing impressive generalization.

For comparison, the researchers list some existing task-specific models: StackedHourglass scores 68.0 PCK and StarMap scores 78.6 PCK.
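For readers unfamiliar with PCK, the sketch below shows how the metric is typically computed: a predicted keypoint counts as correct if it lies within a threshold fraction of a reference size from the ground truth. Which reference size is used varies between benchmarks, so `reference_size` here is an assumption, as are the coordinates.

```python
import math

def pck(predicted, ground_truth, reference_size, threshold=0.1):
    """Percentage of keypoints within threshold * reference_size of the ground truth."""
    limit = threshold * reference_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return 100.0 * correct / len(ground_truth)

# Made-up keypoints in pixels; with reference_size=100 the limit is 10 pixels.
gt = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0), (30.0, 80.0)]
pred = [(12.0, 11.0), (50.0, 65.0), (88.0, 20.0), (30.0, 79.0)]
score = pck(pred, gt, reference_size=100.0)  # 3 of 4 keypoints are close enough: 75.0
```

For LVM, the predicted keypoints would first have to be extracted from the generated annotation image before a score like this can be computed.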

Comparison with visual prompting

The approach closest to the researchers' own, which also allows arbitrary tasks to be defined, is visual prompting. In the following table, the researchers compare the performance of several visual prompting models on few-shot segmentation, object detection, and colorization tasks. The researchers' sequential LVM surpasses previous approaches on almost all tasks.

Task composition

The following figure illustrates combining multiple tasks in a single prompt. The researchers present the rotation task together with the new keypoint correspondence task and ask the model to continue the pattern. The model successfully combines the two tasks at test time, showing a degree of compositionality.

Other types of prompts

The researchers try to see how far the model can go by giving it prompts it has never seen before.

The following figure shows some of these prompts, which work quite well.

The following figure shows some prompts that are not easy to describe in language: this is the type of task where LVM may eventually outperform LLMs.

Reference:

https://arxiv.org/abs/2312.00785
