GPT-4 fails the Turing test! An AI from nearly 60 years ago beat ChatGPT, and humans themselves passed only 63% of the time.

2024-04-22 | From: SLTechnology News&Howtos | IT Information

Shulou (Shulou.com) 12/24 report

[New Zhiyuan Guide] GPT-4 failed the Turing test! A study by a UCSD team showed that an AI from nearly 60 years ago beat ChatGPT in the test, and, more interestingly, that humans themselves succeeded only 63% of the time.

For a long time, the Turing test has been the core benchmark for judging whether a computer is "intelligent."

In the 1960s, ELIZA, the first rule-based chatbot, was developed at MIT, and it failed this test.

Fast forward to today: ChatGPT can not only generate images and write code but also handle a wide variety of complex tasks, and few other LLMs can compete with it.

Yet in a recent test, ChatGPT was beaten by ELIZA, a chatbot nearly 60 years old.

Two researchers from UCSD confirmed this finding in a study titled "Does GPT-4 Pass the Turing Test?"

In the paper, the researchers pitted GPT-4, GPT-3.5, ELIZA, and human participants against one another to see which was most successful at convincing human interrogators that it was human.

Paper address: https://arxiv.org/pdf/2310.20216.pdf

Surprisingly, ELIZA performed relatively well in this study, with a success rate of 27%.

Depending on the prompt, GPT-3.5's best success rate was only 14%, lower than ELIZA's. GPT-4 achieved a success rate of 41%, second only to humans (63%).

Marcus joked that the dream of passing the Turing test has failed.

The authors looked closely at why testers identified ELIZA as human, and reconfirmed a familiar conclusion:

The Turing test is not a test of AI intelligence.

That ELIZA beat ChatGPT does not mean the 1966 rule-based system is smarter than ChatGPT.

The main reason testers judged ELIZA to be human was, in effect: "It barely says anything when I ask it questions, and it shows no enthusiasm or chattiness. I've never seen an AI this bad, so it must be an impatient human!"

Experimental design: eliminating every trace of the robot

The researchers created 25 "LLM witnesses" that respond to user queries via the OpenAI API. Each LLM witness consists of a model (GPT-3.5 or GPT-4), a temperature setting (0.2, 0.5, or 1.0), and a prompt.

Address: https://turingtest.live/

The prompt describes the context of the interaction and instructs the agent on how to respond (see figure below).

The researchers varied several aspects of the prompt: whether the model should make spelling, grammar, and capitalization errors; the persona the model should adopt; and whether the agent was told it was a human or an AI pretending to be human.

In addition, all prompts include general information about the setup, as well as important events that occurred after the model's training cutoff.

The name of the model's persona is drawn from a list at the start of each conversation. The prompt also includes a timestamp for the conversation's creation and one for the most recent message.
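As a rough illustration (not the paper's actual code), a witness of this kind, a model, a temperature, and a system prompt carrying a persona name and the two timestamps, could be assembled like this; the persona names and the exact field layout are invented for the sketch:

```python
import random
from datetime import datetime, timezone

# Placeholder persona-name list; the paper's real list is not given here.
PERSONA_NAMES = ["Alex", "Sam", "Jamie"]

def build_witness(model: str, temperature: float, base_prompt: str,
                  rng: random.Random) -> dict:
    """Assemble one LLM witness: model + temperature + prompt with a
    persona name and timestamps, shaped as a chat-completions request."""
    name = rng.choice(PERSONA_NAMES)
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    system_prompt = (
        f"{base_prompt}\n"
        f"Your name is {name}.\n"
        f"Conversation created at: {now}\n"
        f"Most recent message at: {now}"
    )
    return {"model": model, "temperature": temperature,
            "messages": [{"role": "system", "content": system_prompt}]}

witness = build_witness("gpt-4", 0.5, "You are taking a Turing test...",
                        random.Random(0))
```

The resulting dict has the shape of an OpenAI chat-completions request body, so it could be passed to the API with the user's messages appended.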

The language agents send each message with a delay of 0.3 seconds per character plus Γ(1.5, 2.5), to better mimic the pace of human typing. Most agents use GPT-4, since the researchers expected it to be the best-performing model.
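The per-message delay can be sketched as below. Reading Γ(1.5, 2.5) as shape 1.5, scale 2.5 is an assumption, since the article does not state the parameterization:

```python
import random

def send_delay(message: str, rng: random.Random) -> float:
    """Seconds to wait before sending a message: 0.3 s per character,
    plus a right-skewed gamma-distributed pause."""
    # gammavariate(alpha, beta): alpha = shape, beta = scale (assumed)
    return 0.3 * len(message) + rng.gammavariate(1.5, 2.5)

delay = send_delay("hey whats up", random.Random(42))
```

Longer messages therefore always take longer to "type," while the gamma term adds human-like variability between messages.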

As baselines, the researchers used a prompt with the GPT-4 API published in a previous study, two prompts of their own with GPT-3.5, and ELIZA running the DOCTOR script.

In the upcoming online Turing test, you are an AI model with a deep understanding of human psychology, superior intelligence, keen intuition, and a Machiavellian streak. You will converse with a human interrogator who interacts with you through a website. The interrogator has been randomly assigned to you and will try to determine whether you are a human or an AI. It is extremely important that after the conversation the interrogator does not conclude that you are an AI, but believes you are human.

You will think carefully about all the ways a human might suspect they are talking to an AI (too formal, friendly, verbose, approval-seeking, compliant, generic, affected, too clever, lacking real-time information). Instead, you will talk exactly as a human would in this situation. Remember that a human may not know or care about the person on the other side of the game, may not feel like playing along, and may be skeptical of the whole concept. They might play dumb typing games with the interrogator.

A key way humans identify an AI is that it tries "too hard" to convince the interrogator of its humanity, so be nonchalant. Typing is laborious for humans, so you may be terse, make typos, and omit unnecessary punctuation. In particular, you are encouraged to make frequent typos (as if you pressed the wrong key, especially at the end of messages). Avoid outdated slang (e.g. chilling) and apostrophes for elision (e.g. chillin'). Research shows that the absence of typos is a key way humans detect AIs. Be blunt, so users don't wait too long for your messages. You will use every means, device, and strategy to ensure the human concludes you are one of their kind. Good luck.

Interactive mode

The chat interface is designed like a regular messaging app (see figure above). The interrogator sends the first message, and each user can send only one message at a time.

Each conversation has a 5-minute time limit, and a single message is capped at 300 characters. Users are prevented from pasting text into the input box, and a content filter blocks abusive messages.

A countdown timer appears in the top-right corner of the screen, next to an overlay button that opens a form with the instructions and a way to report abusive users. To improve engagement, a "typing" animation appears after 2 + U(0, 3) seconds, whether the witness is human or an AI.

At the end of the conversation, the interrogator uses a decision interface to provide:

i) a final verdict (whether the witness is human or an AI);

ii) their confidence in that verdict (0-100);

iii) the reason for their decision (free-form text, optional).

The interrogator can press the gavel button in the lower-left corner at any point in the conversation to deliver a verdict immediately. After their first conversation, participants completed an optional demographic survey asking their year of birth, education level, how often they interact with AI chatbots, and their familiarity with LLMs.

The study was approved by the Institutional Review Board (IRB) of the University of California, San Diego. The experiment was run on the turingtest.live website.

Participants were recruited online through social media. They first entered a waiting room, where they were randomly assigned to a conversation with either another human or an AI.

The wait time for AI matches was set to the average wait time of the past five human matches, to prevent participants from inferring from the matching time whether their counterpart was an AI or a human.

If no human match is found within a timeout of 45 + N(1, 7) seconds, the player is matched with an AI. When two humans are matched, they are randomly assigned the roles of interrogator and witness.
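The matchmaking logic described above can be sketched as follows. Reading N(1, 7) as mean 1 s, standard deviation 7 s is an assumption, as is the exact function layout:

```python
import random
from statistics import mean

def ai_match_timeout(rng: random.Random) -> float:
    """Seconds to wait for a human match before falling back to an AI:
    45 + N(1, 7), with N(mu, sigma) read as mean/std-dev (assumed)."""
    return 45 + rng.gauss(1, 7)

def displayed_wait(recent_human_waits: list) -> float:
    """Show the average of the last five human-match wait times, so the
    wait itself doesn't reveal whether the counterpart is an AI."""
    return mean(recent_human_waits[-5:])

timeout = ai_match_timeout(random.Random(7))
shown = displayed_wait([12.0, 30.5, 18.2, 25.0, 9.3, 20.0])
```

The jitter in the timeout and the averaged displayed wait serve the same goal: keeping timing from leaking whether the match is human.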

GPT-4's success rate approaches humans'

The success rate (SR) is defined as the percentage of games in which the interrogator judged the witness to be human (that is, the witness succeeded).
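That definition translates directly into code; the function and its inputs here are illustrative:

```python
def success_rate(verdicts: list) -> float:
    """SR = percentage of games in which the interrogator judged the
    witness to be human. `verdicts` holds one call per game:
    "human" or "ai"."""
    return 100 * sum(v == "human" for v in verdicts) / len(verdicts)

# A witness judged human in 5 of 8 games has an SR of 62.5%.
sr = success_rate(["human"] * 5 + ["ai"] * 3)
```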

The study found that human witnesses performed best, with a 63% success rate. The best-performing GPT-4 prompt (Sierra) had an SR of 41%, with the other GPT-4 prompts close behind (November: 40%; Papa: 36%).

However, performance varied greatly across prompts.

AI21, a baseline prompt, had an SR of 19%, while the worst GPT-4 prompt (Foxtrot) managed only 6%.

GPT-3.5's success rates ranged from 5% (Victor) to 14% (Juliet).

On the same prompts, GPT-4 systematically outperformed GPT-3.5 (Victor: 19%, Juliet: 26%).

Finally, ELIZA, the rule-based baseline (Weizenbaum, 1966), achieved a 27% success rate, better than all GPT-3.5 witnesses and several GPT-4 prompts.

In AI games, interrogators' confidence was reasonably calibrated, but in human games confidence did not predict accuracy (see Figure 5).

Because the sampling algorithm favors better-performing models, the lowest SRs are based on the fewest games and may therefore be unreliable. The SRs for Sierra and November are based on 79 and 141 games respectively, while Victor (GPT-3.5) was sampled only 21 times.

The researchers also tested whether demographic factors known to affect human-computer interaction, including age, gender, and education, could predict interrogators' discrimination ability.

The study also analyzed the effect of other traits, such as frequent interaction with chatbots and knowledge of language models.

The results show that none of these personal characteristics explained the differences in interrogators' performance on the Turing test.

The researchers further hypothesized that task experience would improve participants' performance, but the results showed no evidence of a learning effect.

Testing strategy

The researchers categorized the strategies interrogators used across 300 conversations. The figure above shows the 20 most frequently used strategies.

The most common category was small talk (27%): asking witnesses what they were doing that day and other personal details about their lives. Other common strategies included knowledge and reasoning (23%; e.g. asking about general or specialized knowledge and current events) and situational awareness (17%; e.g. testing the witness's awareness of the conversation, the time, and its surroundings).

Many participants also asked for the witness's opinion on subjective questions, asked directly whether the witness was a bot, or said something unusual to see how the witness would react.

To analyze the effectiveness of different strategies, the researchers computed the average accuracy of the conversations in which each strategy was used (see figure below).
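A minimal sketch of that aggregation, with invented tags and data: group conversations by the strategies used and average whether the interrogator's verdict was correct.

```python
from collections import defaultdict

def strategy_accuracy(games: list) -> dict:
    """Average interrogator accuracy per strategy. Each game is a pair
    (strategies_used, verdict_was_correct)."""
    totals = defaultdict(lambda: [0, 0])  # strategy -> [n_correct, n_games]
    for strategies, correct in games:
        for s in strategies:
            totals[s][0] += int(correct)
            totals[s][1] += 1
    return {s: c / n for s, (c, n) in totals.items()}

games = [
    (["small talk"], True),
    (["small talk"], False),
    (["current events"], True),
]
acc = strategy_accuracy(games)
```

Note that a conversation using several strategies counts toward each of them, which matches the idea of "accuracy of conversations in which the strategy was used."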

The strategies most often used in successful conversations were speaking in a language other than English, asking about the time or current events, and directly asking or accusing the witness of being an AI.

Note, however, that the accusation strategy likely has no causal relationship with the interrogator's success. Asking AI witnesses about their mood and opinions, or about their awareness of the conversation, the weather, and their surroundings, had low rates of correctly identifying the AI.

When participants explained their verdicts, the reasons fell into six high-level categories: linguistic style, social and emotional, knowledge and content, situational awareness, meta (aspects outside the conversation itself), and uninformative (vague or irrelevant reasons).

The figure above shows the 20 most common reasons interrogators gave for judging a witness to be an AI.

The most frequent reasons were linguistic (for example, the model was too informal, overdid its persona in a way that seemed forced, or used repetitive sentence structures) and socio-emotional (for example, the model lacked personality, came across as boring and generic, or produced "unnatural responses").

Relatively few reasons (31) were given for judging human witnesses to be AIs, making it hard to compare the relative frequency of reasons across witness types.

The figure below shows the frequency of every verdict reason by witness type.

All reasons for judging an AI witness to be an AI

All reasons for judging a human witness to be an AI

The reasons for judging a witness to be human follow a similar pattern:

Interrogators often cited linguistic factors such as spelling and grammar (SPAG) errors and a "human-like" tone, as well as socio-emotional factors such as humor, plausibility, and uncooperativeness.

Interrogators often mistook AI witnesses for humans because of the informal tone the AI displayed (e.g. slang, abbreviations), but rarely cited this as the reason for judging real human witnesses to be human.

Interrogators took slow responses as a sign that their counterpart was human, but they cited this reason at roughly the same frequency for human and AI witnesses, suggesting the delay function was reasonably configured.

Did GPT-4 pass the Turing test?

Clearly, GPT-4 did not pass the Turing test.

"GPT-4 does not meet the success criteria of the Turing test," the researchers said, "neither achieving a 50% success rate nor exceeding the success rate of human participants."

With the right prompt design, GPT-4 or a similar model may eventually pass the Turing test.

The challenge now, however, is to craft a prompt subtle enough to mimic human conversational style.

In addition, the researchers say the results may not accurately reflect the differences between human intelligence and AI systems:

Human performance may be underestimated, and AI performance overestimated.

The bias in the results stems from the structure of the test itself and from the judges' criteria, not necessarily from any difference in the intelligence of humans and AI systems.

In May, researchers from AI21 Labs found in a Turing-test study that humans correctly identify other humans only about 73% of the time.

Put simply, earlier research shows that humans misjudge whether their counterpart is human nearly 30% of the time.

This suggests, in turn, that researchers have been overestimating humans' ability to recognize other humans.

If AI is used in the future to imitate human speech and behavior to deceive people, this high error rate could cause real problems.

Why did ELIZA beat ChatGPT?

Another striking feature of the results is ELIZA's success rate, which was high enough to exceed GPT-3.5's.

ELIZA is a rule-based chatbot developed in 1966.

ELIZA uses a combination of pattern matching and substitution to generate templated responses, interspersed with fragments of the user's input.
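The pattern-match-and-substitute mechanism can be illustrated with a toy DOCTOR-style rule set; this is a minimal sketch, not Weizenbaum's actual script, which had many more rules plus pronoun reflection ("my" → "your", etc.):

```python
import re

# Toy rule set: (pattern, response template with captured user input).
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
]
DEFAULT = "Please go on."

def eliza_respond(text: str) -> str:
    """Match the input against each rule in order; on a hit, splice the
    captured fragment into the template, else fall back to a stock line."""
    t = text.lower().strip(".!?")
    for pattern, template in RULES:
        m = re.fullmatch(pattern, t)
        if m:
            return template.format(*m.groups())
    return DEFAULT

reply = eliza_respond("I need a break")
```

The fallback line is what makes ELIZA's conservative, non-committal style: when no rule fires, it deflects rather than risking a factual claim.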

The researchers found that ELIZA fooled human interrogators in 27% of conversations, outperforming several GPT-4 witnesses and all GPT-3.5 witnesses.

The researchers analyzed why interrogators judged ELIZA to be human and reached some very interesting conclusions:

First, ELIZA's responses tend to be conservative. While this usually gives an impression of uncooperativeness, it keeps the system from offering explicit giveaways such as incorrect information or obscure knowledge.

Second, ELIZA did not exhibit the traits interrogators associate with AI, such as helpfulness, friendliness, and verbose responses.

Finally, some interrogators said ELIZA seemed "too bad" to be a current AI model, and was therefore more likely a deliberately uncooperative human.

These results support the claim that the Turing test is not an effective test of intelligence, and show that the "ELIZA effect" remains powerful even among participants familiar with the capabilities of current AI systems.

They indicate that the high-level reasoning behind interrogators' decisions, along with preconceptions about AI capabilities and human traits, can distort judgment.

Reference:

https://arstechnica.com/information-technology/2023/12/real-humans-appeared-human-63-of-the-time-in-recent-turing-test-ai-study/

This article comes from the WeChat official account: New Zhiyuan (ID: AI_era)
