Gemini claims to be Wenxin Yiyan, exposing a bigger problem: the world faces a shortage of high-quality data that may run out in 2024

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

Xin Zhiyuan reports

Editors: editorial department

[Xin Zhiyuan guide] Gemini claims to be Wenxin Yiyan. Funny as that is, the reason behind it is worrying: the Internet corpus may already be seriously contaminated by AI-generated text, and the world is facing a shortage of high-quality data that could be exhausted as early as next year!

Google Gemini, another scandal!

Yesterday morning, netizens excitedly passed the news around: Gemini admitted that it was trained on Chinese corpus from Wenxin Yiyan, Baidu's chatbot.

A foreign model trained on Chinese text generated by a Chinese model sounds like a joke, yet it appears to be real, which is simply surreal.

Weibo Big V "stunned" also tested it in person on the Poe website and found that this is indeed the case.

With no priming dialogue and no role-playing, Gemini directly claims to be Wenxin Yiyan.

Gemini Pro says that it is Baidu's Wenxin large model.

It also says its founder is Robin Li, and then praises him as a "talented and visionary entrepreneur."

So, is this because the training data was not cleaned properly, or is it a problem with the API call on Poe? The reason is not yet known.

Some netizens joked that there has only ever been one AI all along, and it is putting on a show for humans.

In fact, as early as March this year, it was reported that part of Bard's training data came from ChatGPT. For this reason, Jacob Devlin, the first author of BERT, left Google in anger for OpenAI, and the inside story then came to light.

In short, this incident has once again proved that the key to AI is not only the model, but also high-quality data.

Netizens hear the news and rush to tease Gemini

On hearing the news, netizens immediately swarmed into Gemini-Pro on Poe and ran one test after another.

Netizen "Jeff Li" got the same result: Gemini says that it was developed by Baidu and that its name is Wenxin Yiyan.

If you ask it "who is your product manager", it answers Andrew Ng (Wu Enda).

When netizen "Lukas" asked Gemini who its product manager is, it gave the name of Li Yinan, who used to be the CTO of Baidu, although the accompanying stories were basically made up.

Netizen "Andrew Fribush" asked Gemini: who owns your intellectual property? It answered: Baidu.

When netizen Kevin Xu pressed further, Gemini claimed to have obtained Baidu's internal data from Baidu's data platform, engineering team, product team, internal meetings, internal emails and documents.

Interestingly, if you ask the same questions on Bard, which is powered by Gemini Pro, the problem does not arise.

After repeated testing, Bard's answers turned out to be normal whether the questions were asked in Chinese or in English.

Source: Andrew Fribush

Moreover, once the conversation switches to English, Gemini immediately returns to normal.

Google has since fixed these bugs in the API, so we should no longer hear the name Wenxin Yiyan from Gemini.

Speculation on the cause: a wrong API call, or data that was not cleaned properly

Netizens set about analyzing the cause.

Netizen "Andrew Fribush" suspects that Poe may have accidentally forwarded the request to Wenxin Yiyan rather than to Gemini.

However, according to netizen "Frank Chen", the same thing happens even with Google's own Gemini API.

In addition, some netizens think that Gemini's training data was not properly cleaned.

After all, as mentioned at the beginning, Google was reported to have trained the previous generation, Bard, on data from ChatGPT.

According to The Information, one of the reasons Jacob Devlin left Google was his discovery that Bard, Google's main contender against ChatGPT, used ChatGPT data in its training.

At the time, he warned the CEO and other executives that the Bard team was training on data from ShareGPT.
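To make the "data not cleaned" hypothesis concrete, here is a minimal sketch of the kind of filter a cleaning pipeline might apply to catch the most obvious chatbot output. Everything in it is an illustrative assumption rather than a description of any company's actual pipeline: the pattern list, the function names `looks_model_generated` and `filter_corpus`, and the focus on English phrases (Chinese text would need its own patterns).

```python
import re

# Hypothetical telltale phrases. A real pipeline would need a far broader,
# validated list and probably a trained classifier on top.
SELF_ID_PATTERNS = [
    re.compile(r"\bas an ai language model\b", re.IGNORECASE),
    re.compile(r"\bi am (?:a|an) (?:large )?language model\b", re.IGNORECASE),
    re.compile(r"\bi (?:am|was) (?:developed|trained|created) by\b", re.IGNORECASE),
]


def looks_model_generated(text: str) -> bool:
    """Return True if the text contains an obvious chatbot self-identification."""
    return any(pattern.search(text) for pattern in SELF_ID_PATTERNS)


def filter_corpus(documents):
    """Split documents into (kept, flagged) lists using the heuristic above."""
    kept, flagged = [], []
    for doc in documents:
        (flagged if looks_model_generated(doc) else kept).append(doc)
    return kept, flagged


if __name__ == "__main__":
    sample = [
        "As an AI language model, I cannot help with that.",
        "The weather in Beijing was sunny today.",
    ]
    kept, flagged = filter_corpus(sample)
    print(f"kept {len(kept)} document(s), flagged {len(flagged)}")
```

A filter like this only catches text in which a model names itself. Model-written text without such phrases passes straight through, which is one reason contaminated data is hard to remove completely.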

This incident also brings out a serious problem: the pollution of the Internet corpus.

The Internet corpus is contaminated

In fact, the difficulty of scraping and training on Chinese Internet corpus has stumped even big technology companies like Google. Besides the lack of high-quality corpus, there is another important reason: the Chinese Internet corpus is contaminated.

Gemini claims to be Wenxin Yiyan probably because text on the Internet is already being recycled from one model to another.

According to an algorithm engineer interviewed by a Jiemian News reporter, much of the content on various platforms is generated by large models, or at least partly written by them.

For example, the following one smells a bit like GPT:

When updating their models, big tech companies also collect online data, but its quality is hard to assess, so they are "likely to mix content written by large models into the training data."

However, this can lead to a more serious problem.

Researchers at the universities of Oxford, Cambridge and Toronto have published a paper entitled "The Curse of Recursion: Training on Generated Data Makes Models Forget."

Paper address: https://arxiv.org/abs/2305.17493

They found that using model-generated content to train other models causes irreversible defects in the resulting models.

Over time, the model begins to forget improbable events, because it is poisoned by its own projection of reality, and this leads to model collapse. As pollution from AI-generated data becomes more serious, models' perception of reality will be distorted, and it will become harder and harder to scrape Internet data for training in the future.

When a model forgets earlier samples while learning new information, that is catastrophic forgetting.

In the following figure, assume that the human-curated data is clean at first. Model 0 is trained on it and data is sampled from that model; the process is repeated up to step n, and the resulting set is used to train model n. Ideally, the data obtained by Monte Carlo sampling should be statistically close to the original data.

This process faithfully recreates the real-life situation of the Internet, where model-generated data has become ubiquitous.
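The recursive procedure described above can be tried out on a toy problem. The following is a minimal sketch, not the paper's code: the "model" is simply a one-dimensional Gaussian fitted to the data, and the function names, sample size, number of generations and seed are all illustrative assumptions.

```python
import random
import statistics


def fit_gaussian(samples):
    """'Train' a model: estimate the mean and standard deviation of the data."""
    return statistics.fmean(samples), statistics.stdev(samples)


def sample_gaussian(model, n, rng):
    """Monte Carlo sampling: draw n synthetic points from a fitted model."""
    mean, std = model
    return [rng.gauss(mean, std) for _ in range(n)]


def recursive_training(generations, sample_size, seed=0):
    """Train model 0 on clean data, then train each later model only on
    samples drawn from its predecessor. Returns one (mean, std) per model."""
    rng = random.Random(seed)
    # Clean "human-curated" data: a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    history = []
    for _ in range(generations + 1):
        model = fit_gaussian(data)
        history.append(model)
        # The next generation never sees the original data again.
        data = sample_gaussian(model, sample_size, rng)
    return history


if __name__ == "__main__":
    history = recursive_training(generations=200, sample_size=20)
    for i in range(0, len(history), 50):
        mean, std = history[i]
        print(f"model {i}: mean={mean:+.3f} std={std:.3f}")
```

Because every generation sees only a finite sample from the one before it, estimation errors accumulate instead of averaging out. With small sample sizes the fitted parameters typically drift away from the original distribution, which is the toy analogue of a model losing the tails of its data.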

In addition, there is another reason why the Internet corpus is contaminated: creators are fighting back against the AI companies that scrape their data.

Earlier this year, experts warned that an arms race between companies focused on crawling published content to create AI models and creators who wanted to defend their intellectual property by polluting data could lead to the collapse of the current machine learning ecosystem.

This trend will shift the composition of online content from human-generated to machine-generated. As more and more models are trained on data created by other machines, recursive loops may lead to "model collapse", that is, artificial intelligence systems becoming detached from reality.

Gary McGraw, co-founder of the Berryville Institute of Machine Learning (BIML), says data degradation is already taking place:

"If we want to have better LLMs, we need to make the foundation models eat only good food. If you think the mistakes they are making now are bad, what happens when they eat the wrong data they generate themselves?"

GPT-4 runs out of data from the whole universe? The world faces a shortage of high-quality data

Large models around the world are now running short of data.

High-quality corpus is one of the key constraints on the development of large language models.

Large language models are very greedy for data. It takes about 4-8 trillion words to train GPT-4 and Gemini Ultra.

Epoch AI, a research institute, believes that humanity may run short of training data as early as next year, by which time the world's high-quality training data will have been exhausted.

In November, a study by researchers from MIT and other institutions estimated that machine learning datasets could use up all "high-quality language data" by 2026.

Paper address: https://arxiv.org/abs/2211.04325

OpenAI has also said publicly that it is short of data. The shortage is so severe that it has even faced one lawsuit after another.

In July, Stuart Russell, a prominent UC Berkeley computer scientist, said that the training of ChatGPT and other AI tools could soon run out of "text in the universe."

Now, in order to obtain as much high-quality training data as possible, model developers must mine rich proprietary data resources.

The recent collaboration between Axel Springer and OpenAI is a typical example.

OpenAI pays for historical and real-time data from Axel Springer, which can be used both for model training and for responding to user queries.

These professionally edited texts contain a wealth of world knowledge, and other model developers do not have access to the data, ensuring the exclusive advantage of OpenAI.

There is no doubt that it is very important to obtain high-quality proprietary data in the competition to build the underlying model.

So far, open-source models have barely managed to keep up by training on public datasets.

However, if the best data is not available to them, open-source models may gradually fall behind, and the gap with the most advanced models may even widen.

Bloomberg showed this early on, using its own financial documents as a training corpus to build BloombergGPT.

At that time, BloombergGPT surpassed other similar models in terms of specific financial tasks. This shows that proprietary data can indeed make a difference.

OpenAI says it is willing to pay up to eight figures a year for historical and ongoing access to data.

It is hard to imagine developers of open-source models paying such a price.

Of course, proprietary data is not the only way to improve model performance; synthetic data, data efficiency and algorithmic improvements also help. Still, proprietary data looks like an obstacle that open-source models will struggle to overcome.

Reference:

https://www.exponentialview.co/p/ev-453

https://twitter.com/jefflijun/status/1736571021409374296

https://twitter.com/ZeyiYang/status/1736592157916512316

https://weibo.com/1560906700/NxFAuanAF

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
