GPT-4 can read scrambled text; other large models cannot


2025-08-16 Update, from SLTechnology News&Howtos (Shulou.com)


Research shows that the order of Chinese characters does not necessarily affect reading (in English, the equivalent is the order of the letters within each word).

Now, an experiment at the University of Tokyo in Japan has found that this "theorem" also applies to GPT-4.

For example, take this piece of gibberish, in which the letters of almost every word have been scrambled:

OJn amRh wno het 2023 Meatsrs ermtnoTuna no duySan taatgsuAu ntaaNloi Gflo bClu, gnelcinhi ish ifsrt nereg ecatkjnad ncedos raecer jroam .

But GPT-4 perfectly restored the original sentence.

It turned out to be a sentence about Jon Rahm winning the 2023 Masters Tournament.

And if you ask GPT-4 a question directly about the scrambled text, it first understands the text and then gives the correct answer, as if the scrambling had no effect on its reading at all:

In response, the researchers were very surprised:

In theory, scrambled words should seriously interfere with the model's tokenization, so it is somewhat counterintuitive that GPT-4, like humans, is barely affected.

It is worth mentioning that the experiment also tested other large models, but they all failed the challenge. Only GPT-4 succeeded.

What exactly did the researchers find?

Text order does not affect GPT-4's reading

To test how well large models cope with scrambled text, the authors built a dedicated benchmark: Scrambled Bench.

It consists of two types of tasks:

One is Scrambled Sentence Recovery (ScrRec), which tests the ability of large models to recover scrambled sentences.

Its metrics include the recovery rate (RR), which can be understood simply as the proportion of scrambled words that the model recovers.
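The "proportion of words recovered" reading of RR can be sketched in a few lines of Python. This is only a simplified illustration: it assumes the original, scrambled and recovered sentences have the same number of whitespace-separated words and compares them position by position. The function name `word_recovery_rate` is hypothetical, and the exact definition of RR used in the paper may differ.

```python
def word_recovery_rate(original, scrambled, recovered):
    """Fraction of scrambled words that were restored to the original word.

    Simplified, position-based sketch (an assumption, not the paper's formula).
    """
    orig = original.split()
    scr = scrambled.split()
    rec = recovered.split()
    # Only words that scrambling actually changed are counted.
    changed = [i for i, (o, s) in enumerate(zip(orig, scr)) if o != s]
    if not changed:
        return 1.0  # nothing was scrambled, so nothing needed recovering
    hits = sum(1 for i in changed if i < len(rec) and rec[i] == orig[i])
    return hits / len(changed)


print(word_recovery_rate(
    "Jon Rahm won the Masters",
    "oJn amRh nwo the Mtsraes",
    "Jon Rahm now the Masters",
))  # 3 of the 4 scrambled words are restored -> 0.75
```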

The other is Scrambled Question Answering (ScrQA), which measures whether a large model can still understand the context and answer questions correctly when the words in that context have been scrambled.

Because the models differ in underlying ability, raw accuracy is not a fair way to compare them on this task, so the authors use a metric called relative performance gain (RPG).
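The article does not spell out how RPG is computed. One plausible way to normalize away differences in model ability, assumed here purely for illustration, is to measure where the scrambled-text accuracy falls between a floor (the model's accuracy when the scrambled words carry no usable information) and the model's own accuracy on the unscrambled text. The function and parameter names below are hypothetical.

```python
def relative_performance_gain(acc_original, acc_scrambled, acc_floor):
    """Position of the scrambled-text accuracy between a floor and the clean accuracy.

    Assumed normalization (not confirmed by the article):
    1.0 means scrambling cost nothing, 0.0 means the scrambled text was
    no more useful than text with the information removed.
    """
    span = acc_original - acc_floor
    if span <= 0:
        raise ValueError("clean accuracy must exceed the floor accuracy")
    return (acc_scrambled - acc_floor) / span


# A model at 90% on clean text, 80% on scrambled text and 40% at the floor:
rpg = relative_performance_gain(0.90, 0.80, 0.40)
print(f"{rpg:.0%}")
```

Normalizing per model in this way lets a weaker model and a stronger model be compared on how much of their own ability survives scrambling, rather than on absolute accuracy.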

The test material is drawn from three datasets:

The first is RealtimeQA, which is updated weekly with recent news that current LLMs are unlikely to know.

The second is DREAM (Sun et al., 2019), a dialogue-based multiple-choice reading comprehension dataset.

The last is AQuA-RAT, a dataset of math problems that require multi-step reasoning to solve.

For each dataset, the authors select questions and apply scrambling of different types and degrees, including:

1. Random scrambling (RS): for each sentence, randomly select a certain proportion of words (20%, 50%, 100%) and scramble all the letters in those words (digits are left unchanged).

2. Keep first (KF): keep the first letter of each word unchanged and randomly scramble the rest.

3. Keep first and last (KFL): keep the first and last letters of each word unchanged and randomly scramble the rest.
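The three scrambling types can be sketched in Python as follows. This is a minimal illustration rather than the benchmark's actual code: the function names, the `rate` and `seed` parameters, and the choice to treat only runs of ASCII letters as words are all assumptions made for the example.

```python
import random
import re

# Assumption: a "word" is a run of ASCII letters; digits and punctuation stay put.
WORD = re.compile(r"[A-Za-z]+")


def shuffle_letters(letters, rng):
    """Return the characters of `letters` in random order."""
    chars = list(letters)
    rng.shuffle(chars)
    return "".join(chars)


def scramble_word(word, mode, rng):
    """Scramble one word: RS (all letters), KF (keep first), KFL (keep first and last)."""
    if mode == "RS":
        return shuffle_letters(word, rng)
    if mode == "KF":
        return word[:1] + shuffle_letters(word[1:], rng)
    if mode == "KFL":
        if len(word) < 4:  # fewer than two inner letters: nothing to reorder
            return word
        return word[0] + shuffle_letters(word[1:-1], rng) + word[-1]
    raise ValueError(f"unknown mode: {mode}")


def scramble_sentence(sentence, mode="RS", rate=1.0, seed=0):
    """Scramble a `rate` fraction of the words in `sentence` using `mode`."""
    rng = random.Random(seed)
    matches = list(WORD.finditer(sentence))
    chosen = set(rng.sample(range(len(matches)), round(rate * len(matches))))
    pieces, last = [], 0
    for i, m in enumerate(matches):
        pieces.append(sentence[last:m.start()])  # text between words, unchanged
        word = m.group()
        pieces.append(scramble_word(word, mode, rng) if i in chosen else word)
        last = m.end()
    pieces.append(sentence[last:])
    return "".join(pieces)


print(scramble_sentence("Jon Rahm won the 2023 Masters Tournament", mode="KFL"))
```

A shuffle can occasionally return a word unchanged, especially for short words, so a 100% scrambling rate here means every word was passed through the shuffle, not that every word necessarily differs from the original.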

Many models were tested; the main body of the paper reports the following:

text-davinci-003, GPT-3.5-turbo, GPT-4, Falcon-180b and Llama-2-70b.

First of all, let's look at the effects of different types of interference.

As shown in the following figure:

In the KFL setting (first and last letters kept), the performance gap between models is small on both the scrambled sentence recovery and the scrambled question answering tasks.

However, as the scrambling became harder (KF, then RS), the performance of every model dropped significantly, with the exception of GPT-4.

Specifically, in the scrambled sentence recovery (ScrRec) task, the recovery rate of GPT-4 is always above 95%, and in the scrambled question answering (ScrQA) task, the relative accuracy of GPT-4 stays at around 85%.

By contrast, some of the other models fell below 20%.

Next, the effect of different scrambling rates.

As shown in the following figure, in the scrambled sentence recovery (ScrRec) task, as the proportion of scrambled words in a sentence rises all the way to 100%, only GPT-3.5-turbo and GPT-4 show no significant change in performance. Even so, GPT-4 still leads GPT-3.5-turbo by a wide margin.

In the scrambled question answering (ScrQA) task, as more and more words in the sentence are scrambled, the performance of all models drops significantly, and the gap between them grows wider and wider.

Even so, GPT-4 remains far ahead with a score of 87.8%, and its decline is the smallest.

So it can be summarized as follows:

Most models can handle a certain proportion of scrambled text, but at the extreme (for example, when every word is scrambled), only GPT-4 still performs well, and only GPT-4 is almost unaffected by fully scrambled text.

GPT-4 is also good at word segmentation

At the end of the paper, the authors point out:

Beyond scrambling the order of letters within words, one could also study the effects of inserting letters, replacing letters, and so on.

The only problem is that, since GPT-4 is closed source, it is difficult to investigate why it is unaffected by scrambled text.

Some netizens have found that, in addition to the cases demonstrated in the paper, GPT-4 is also very good at handling the following English passage, in which all the words are run together:

UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRA

NQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROV

EOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKN

OWITHOLDSTHEKEYSTOUNTOLDPOWER

It separates the words correctly:

Underneath the gaze of Orion's belt, where the Sea of Tranquility meets the edge of twilight, lies a hidden trove of wisdom, forgotten by many, coveted by those in the know. It holds the keys to untold power.

In theory, this kind of word segmentation is a troublesome problem that usually requires dynamic programming or similar techniques.
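For reference, a classic dynamic-programming approach to this kind of segmentation looks roughly like the sketch below. It is illustrative only: the tiny vocabulary is made up for the example, a real system would use a large dictionary or a language model, and preferring the segmentation with the fewest words is just one possible scoring rule.

```python
def segment(text, vocab):
    """Split `text` into words from `vocab`, preferring the fewest words.

    Returns a list of words, or None if no segmentation exists.
    `vocab` must be a non-empty set of strings.
    """
    n = len(text)
    max_len = max(map(len, vocab))
    # best[i] holds the best segmentation found for text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is None or piece not in vocab:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary, including distractors such as "under" and "he".
vocab = {"under", "neath", "underneath", "the", "gaze", "of",
         "orions", "belt", "he", "a", "at"}
print(segment("UNDERNEATHTHEGAZEOFORIONSBELT".lower(), vocab))
# ['underneath', 'the', 'gaze', 'of', 'orions', 'belt']
```

Each prefix of the text is solved once and reused, so the search does not have to try every possible way of cutting the string.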

The ability GPT-4 showed here once again surprised this netizen.

He also put the text into OpenAI's official tokenizer tool and found that the tokens GPT-4 actually sees look like this:

UNDER NE AT HT HE GA Z EOF OR ION SB EL TW HER ET HE SEA OF TRA

With the exception of "UNDER", "SEA" and "OF", almost all the remaining tokens look "illogical", which makes the result even more puzzling.

What do you think of this?

Reference link:

[1] https://arxiv.org/abs/2311.18805

[2] https://news.ycombinator.com/item?id=38506140

This article is from the WeChat official account Quantum Bit (ID: QbitAI), author: Fengcai
