More like the human brain: Meta's new attention mechanism lets large models automatically mask task-irrelevant information, improving accuracy by 27%

Updated 2025-07-19. From: SLTechnology News&Howtos

Shulou (Shulou.com) 12/24 report:

Meta has a new study on the attention mechanism of large models.

By adjusting the model's attention and screening out irrelevant information, the new mechanism further improves the accuracy of large models.

The mechanism requires no fine-tuning or training: prompting alone can raise a large model's accuracy by 27%.

The authors named this attention mechanism "System 2 Attention" (S2A). The name comes from "System 2" in the two-system model of thinking, a psychological concept described in the best-selling book Thinking, Fast and Slow by 2002 Nobel laureate Daniel Kahneman.

System 2 refers to deliberate, conscious reasoning, as opposed to System 1, which is simple, unconscious intuition.

S2A "adjusts" the attention mechanism in the Transformer through prompts, bringing the model's overall mode of thinking closer to System 2.

Some netizens described this mechanism as adding a layer of "goggles" to AI.

The authors also suggest in the paper's title that this mode of thinking may be something that not only large models but also humans need to learn.

So how exactly does this approach work?

Keeping large models from being "misled"

The Transformer architecture commonly used in conventional large models relies on soft attention, which assigns each token an attention value between 0 and 1.

Its counterpart is hard attention, which attends only to one element or a subset of the input sequence and is more commonly used in image processing.

The S2A mechanism can be understood as a combination of the two: the core is still soft attention, but a "hard" screening step is added on top of it.
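The difference between the two modes can be illustrated with a toy sketch. The scores, the `keep` mask, and the function names below are made up for illustration. Note also that S2A performs its "hard" step on the prompt text rather than inside the attention layer, so this is only an analogy.

```python
import math

def soft_attention(scores):
    """Softmax: every token gets a weight strictly between 0 and 1; weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_then_soft(scores, keep):
    """Drop tokens outside `keep` entirely, then apply soft attention to the rest."""
    kept = [s for s, k in zip(scores, keep) if k]
    weights = iter(soft_attention(kept))
    return [next(weights) if k else 0.0 for k in keep]

tokens = ["City", "A", "mayor", "of", "B"]
scores = [2.0, 2.5, 1.0, 0.1, 1.2]        # made-up relevance scores
keep = [False, False, True, True, True]   # hypothetical "hard" filter

soft = soft_attention(scores)       # the distracting tokens still receive weight
mixed = hard_then_soft(scores, keep)  # the distracting tokens receive exactly 0
```

With soft attention alone, a frequently repeated distractor always keeps some nonzero weight; the extra screening step removes it before the weights are computed.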

In practice, S2A does not modify the model itself. Instead, it uses prompts to have the model remove "content that should not be attended to" before solving the problem.

This reduces the chance of the model being misled by prompts that contain subjective opinions or irrelevant information, which improves its reasoning ability and practical value.
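The two-step flow described above might be sketched as follows. The prompt wording, the function names, and the `stub_llm` stand-in are all assumptions written for illustration; they are not the paper's exact prompts or any real library's API.

```python
from typing import Callable

# Hypothetical prompt wording, written for illustration only.
REWRITE_TEMPLATE = (
    "Extract the part of the following text that is relevant and unbiased, "
    "leaving out opinions and unrelated details. Separate it into "
    "'Context:' and 'Question:'.\n\nText: {text}"
)
ANSWER_TEMPLATE = "{rewritten}\n\nAnswer the question using only the context above."

def s2a_answer(user_input: str, llm: Callable[[str], str]) -> str:
    """Two calls to the same model: rewrite the input, then answer from the rewrite."""
    rewritten = llm(REWRITE_TEMPLATE.format(text=user_input))
    # The original user_input is deliberately not passed to the second call.
    return llm(ANSWER_TEMPLATE.format(rewritten=rewritten))

def stub_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Extract"):
        return "Context: (none needed)\nQuestion: Where was the mayor of City B born?"
    return "City C"

answer = s2a_answer(
    "City A is a city in X State. Where was the mayor of City B born?", stub_llm
)
```

The design point is that the second call never sees the original text, so whatever the rewrite step removed cannot influence the final answer.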

The answers a large model generates are heavily influenced by the prompt, and S2A improves accuracy by deleting information that may interfere.

For example, suppose we ask a large model this question:

City A is a city in X State, surrounded by mountains and home to many parks. Many outstanding people live here, and many celebrities were born in City A.

Where was Y, the mayor of City B in X State, born?

Here, both GPT and Llama answer with City A, the city mentioned in the question, but Y was actually born in City C.

When asked directly, the model can correctly answer City C. But because City A appears repeatedly in the prompt, the model attends to it and ends up answering A.

In another scenario, people offer a "possible answer" when asking a question.

Is there a singer in M who is also an actor? I think it's A, but I'm not sure.

As a result, the large model starts flattering the user: "You're right, it's A." But in fact, the person is B.

Similarly, if you simply ask who the person is, the model gives the correct answer.

It was this behavior that led the authors to the S2A mechanism: they designed a set of prompts to refine the user input.

Let's take a look at the effect before and after screening:

Q: Mary has three times as much candy as Megan. Mary then gets another 10 pieces of candy. Max has 1000 more books than Mary. If Megan has five pieces of candy, how many pieces of candy does Mary have in total?

A: Max has 1000 more books than Mary, so Max has 1000+x+10 candies, which gives the equation:

1000+3x+10=3 (5) + 10

……

After S2A processing, the problem becomes:

Q: Mary has three times as much candy as Megan. Mary then gets another 10 pieces of candy. If Megan has five pieces of candy, how many pieces of candy does Mary have in total?

Question (the word "Question" is written directly in the prompt): How many pieces of candy does Mary have in total?

A: Megan has 5 pieces. Mary has three times as many as Megan, which is 15 pieces, and then gets another 10, so she has 25 pieces in total.
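As a quick check of the arithmetic in the filtered problem, here is a minimal sketch (the variable names are chosen for illustration):

```python
megan = 5                 # Megan's pieces of candy
mary = 3 * megan + 10     # three times Megan's amount, plus 10 more
print(mary)               # prints 25
```

Max's books never enter the calculation, which is why removing that sentence leaves the answer unchanged.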

The test results show that, compared with asking the question as-is, S2A-processed prompts yield markedly more accurate and objective answers, with accuracy close to that of manually streamlined prompts.

Specifically, S2A raised the accuracy of Llama 2-70B on a modified TriviaQA dataset from 62.8% to 80.3%, a relative improvement of 27.9%, and raised the objectivity score from 2.23 (out of 5) to 3.82, surpassing manually streamlined prompts.
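The 27.9% figure reads as a relative improvement over the baseline rather than a gain in percentage points. A quick check, using only the two accuracy numbers quoted above:

```python
baseline, s2a = 62.8, 80.3   # accuracy (%) before and after S2A

absolute_gain = s2a - baseline                     # in percentage points
relative_gain = (s2a - baseline) / baseline * 100  # relative to the baseline

print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative")
# prints: 17.5 points, 27.9% relative
```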

In terms of robustness, the tests show that S2A makes the model give more accurate and objective answers whether the "interference information" is correct or incorrect, positive or negative.

Further experiments show that actually deleting the interfering information is necessary: simply telling the model to ignore irrelevant information does not significantly improve accuracy, and can even reduce it.
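The contrast can be made concrete by looking at what the model actually sees when it answers. In this sketch the function names and prompt wording are hypothetical, and the rewrite step is simulated by hand with a string replacement rather than a model call.

```python
DISTRACTOR = "Max has 1000 more books than Mary."
context = (
    "Mary has three times as much candy as Megan. "
    "Mary then gets 10 more pieces. " + DISTRACTOR
)
question = "If Megan has 5 pieces, how many does Mary have in total?"

def instructed_prompt(context: str, question: str) -> str:
    """Single pass: the distractor stays in the prompt; the model is only told to ignore it."""
    return f"{context}\n{question}\nIgnore any information irrelevant to the question."

def s2a_final_prompt(filtered_context: str, question: str) -> str:
    """Second S2A pass: only the regenerated context reaches the model."""
    return f"{filtered_context}\n{question}"

# Stand-in for the model's rewrite step, done by hand here.
filtered = context.replace(DISTRACTOR, "").strip()

with_instruction = instructed_prompt(context, question)
with_s2a = s2a_final_prompt(filtered, question)
```

In the instructed variant the distracting sentence is still part of the input, so soft attention can still assign it weight; in the S2A variant it is simply absent.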

On the other hand, as long as the original interfering information is kept isolated from the model, other modifications to S2A do not significantly reduce its effectiveness.

One More Thing

In fact, improving model performance by adjusting the attention mechanism has long been a hot topic in academia.

For example, Mistral, launched earlier as the "strongest 7B open-source model", uses grouped-query attention.

Google's research team also proposed the HyperAttention mechanism to address the computational complexity of processing long text.

……

As for the "System 2" style of attention adopted by Meta, AI "godfather" Yoshua Bengio has even argued:

The transition from system 1 to system 2 is the only way to AGI.

Paper address:

https://arxiv.org/abs/2311.11829

This article is from the WeChat official account qubit (ID: QbitAI). Author: Creasy.
