Behind large models "fleecing" one another: a routine industry practice that makes standardization imperative

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

Recently, the controversy over ByteDance's use of the OpenAI API to train large models, together with reports that Google's Gemini model was trained on Chinese-language text from Baidu's Wenxin model, has drawn wide attention and discussion in the industry.

While onlookers who do not know the full story enjoy the spectacle, they also marvel at the odd sight of AI vendors "fleecing" one another, something few would have expected.

Looking past the surface, however, CTOnews.com believes these high-profile incidents may be an opportunity to push the industry to standardize how copyrighted data is used in AI model training.

Data copyright in AI is an industry-wide problem

Regarding the "dispute" between ByteDance and OpenAI, both sides have responded. OpenAI said it needs to investigate further whether ByteDance committed a violation. ByteDance said it used OpenAI's API only during early exploration and stopped doing so in April.

Shortly after foreign media reported on the ByteDance and OpenAI incident, Google's Gemini model was reported to have been trained on Chinese-language text produced by Wenxin.

Many users found that when they asked Google's Gemini-Pro model "Who are you?" on the Poe platform, it replied directly, "I am Baidu's Wenxin large model." Asked "Who is your founder?", it answered "Robin Li."

Likewise, when the Chinese media outlet QbitAI tested Gemini-Pro through Gemini's official development environment, the model directly claimed that Baidu's Wenxin had been used in training on its Chinese data.

As of publication, Google has not responded to the matter.

Still, it is clear that data copyright infringement has long been a common problem in the AI industry, and one that is hard to avoid in the early stage of large-model development.

For example, in March this year Google was reported to have trained its Bard chatbot on users' conversations with ChatGPT collected from the ShareGPT website.

Besides Google, fellow tech giant Meta has recently been caught up in a copyright dispute over large-model training data. According to Reuters, comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon, and other well-known writers jointly filed a lawsuit this summer, accusing Meta of using their books without permission to train its Llama language model.

Meta released the first version of its Llama large language model in February this year and published a list of the datasets used for training, including the "Books3" section of "The Pile" dataset. According to the lawsuit, the dataset's creator said it contains 196,640 books, and Meta went ahead knowing the legal risk of training its AI model on thousands of pirated books.

OpenAI, the "victim" in the ByteDance incident, is in a similar position. In September this year, 17 well-known American writers, including George R. R. Martin, author of the novels behind Game of Thrones, accused OpenAI of using their copyrighted works without permission to train large models such as ChatGPT, and of generating content similar to their works.

In November, a group of non-fiction writers sued OpenAI and Microsoft, accusing the two companies of using their books and academic journal articles without permission or compensation to train their large language models.

These cases show that data infringement during model training is an industry-wide problem in the early development of AI models. How data may be used in AI training remains highly contested, and the rules need further refinement.

What exactly is the "asexual reproduction" of large models?

The basic principle of an AI language model is to output the most likely next token (roughly, a word or word fragment) given the preceding text. So how do we make sure the output is what we want? The answer is training.
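
The next-token principle can be illustrated with a deliberately tiny sketch. It is not how a large model works internally: it assumes a bigram counter in place of a neural network, and the corpus and function names are made up for illustration. It only shows the idea of picking the most likely continuation from what was seen in training.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token(counts, context):
    """Return the most frequent follower of the last token in the context."""
    last = context.split()[-1]
    followers = counts.get(last)
    if not followers:
        return None  # the toy model has never seen this token
    return followers.most_common(1)[0][0]

# Hypothetical three-sentence "training set".
corpus = [
    "water boils at 100 degrees",
    "water boils at sea level",
    "water boils at 100 degrees celsius",
]
model = train_bigram(corpus)
print(next_token(model, "water boils at"))
```

Here "100" follows "at" twice and "sea" once, so the toy model continues the prompt with "100". What the model outputs depends entirely on the data it was trained on, which is why the source of training data matters so much.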

Here is a brief introduction to the main stages of large language model training: pre-training, supervised fine-tuning, and learning from human feedback (commonly reinforcement learning from human feedback, RLHF).

The pre-training stage needs no human intervention. Given enough data, the model acquires strong general language ability through training.

Next, supervised fine-tuning addresses the problem of getting the large model to output the results we want.

For example, when we ask "What is the boiling point of water?", the model may consider many kinds of reply plausible, such as "I'd like to know, too." For a human, though, the most reasonable reply is "100 degrees."

So humans need to guide the model toward the standard answers we consider reasonable. In this process, we feed the model a large number of questions paired with standard answers in order to fine-tune its parameters, which is why it is called supervised learning. There are many similar cases. For example, we do not want large models to output content that conflicts with human values. All of these require fine-tuning, in other words, labeling the data we want.
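
As a rough illustration of what "feeding standard answers to adjust parameters" means, the sketch below treats the model as nothing more than three scores (logits) over three candidate replies. The candidates, starting scores, learning rate, and step count are all assumptions chosen for the example; a real model adjusts billions of parameters, but the cross-entropy update has the same shape.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(logits, label, lr=0.5):
    """One gradient-descent step on the cross-entropy loss -log p[label]."""
    probs = softmax(logits)
    # d(loss)/d(logit_i) is p_i - 1 for the labeled answer and p_i otherwise.
    return [z - lr * (p - (1.0 if i == label else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]

candidates = ["I'd like to know, too", "100 degrees", "It depends"]
logits = [1.0, 0.5, 0.2]                  # hypothetical pre-trained preferences
label = candidates.index("100 degrees")   # the human-provided standard answer

before = softmax(logits)
for _ in range(20):
    logits = sft_step(logits, label)
after = softmax(logits)

print(candidates[before.index(max(before))])  # preferred reply before tuning
print(candidates[after.index(max(after))])    # preferred reply after tuning
```

Before tuning, the highest-scoring reply is "I'd like to know, too"; after twenty updates toward the labeled answer it is "100 degrees". Each labeled example pulls probability toward the answer a human marked as correct, which is why the amount and quality of labeled data is the bottleneck.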

Data labeling is clearly a massive undertaking that takes a great deal of manpower and time. In a business race against the clock, companies that entered the large-model field late cannot afford to repeat all of this work on their own. As a result, it is an open secret that many large models use GPT to generate labeled data.

For example, some GPT mirror sites in China are completely free: the operator pays out of its own pocket to call the OpenAI API and, in effect, uses its users as labor to generate training data.

Likewise, the well-known open-source Alpaca dataset was generated with an OpenAI GPT model (text-davinci-003 in the original Stanford release). Training a smaller model on GPT-labeled data in this way is also called "distillation."
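
The data side of distillation can be sketched as follows. This is a minimal illustration under stated assumptions: `teacher_answer` is a hypothetical stand-in for a call to a stronger model, implemented as a lookup table so the example runs offline, and the `instruction` / `input` / `output` record layout is assumed to mirror the Alpaca-style format.

```python
import json

def teacher_answer(instruction):
    """Stand-in for a query to a stronger 'teacher' model (hypothetical).

    A real pipeline would call a hosted model here; this stub uses a
    lookup table so the sketch is self-contained.
    """
    canned = {
        "What is the boiling point of water?":
            "100 degrees Celsius at standard atmospheric pressure.",
        "Name the three main stages of LLM training.":
            "Pre-training, supervised fine-tuning, and learning from human feedback.",
    }
    return canned.get(instruction, "")

def build_distillation_set(instructions):
    """Pair each instruction with the teacher's output, skipping empty answers."""
    records = []
    for instruction in instructions:
        output = teacher_answer(instruction)
        if output:
            records.append({"instruction": instruction, "input": "", "output": output})
    return records

instructions = [
    "What is the boiling point of water?",
    "Name the three main stages of LLM training.",
    "An instruction the teacher cannot answer.",
]
dataset = build_distillation_set(instructions)

# One JSON object per line, a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in dataset)
print(jsonl)
```

The resulting records play the role of the human-written standard answers in supervised fine-tuning, except that the labels come from another model rather than from annotators.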

After ChatGPT took off, many companies were able to follow quickly with their own AI models, mainly along two paths.

The first is to build on Llama, Meta's open-source large language model.

The second is to distill data from ChatGPT and then train their own large model by combining it with open-source datasets and data they crawled themselves.

So although OpenAI's API terms of service stipulate that "you can't use Output to develop a model that competes with OpenAI", this policy has always been controversial.

Supporters argue that OpenAI invested heavily up front in training its models, so taking shortcuts through its services is wrong. Opponents argue that OpenAI's own early training benefited from the unguarded external environment of AI's early days and has itself drawn infringement complaints. Later entrants can hardly obtain training data of the same magnitude and scale, so barring other companies from using its models goes against the spirit of "Open."

With that background, here is ByteDance's response:

At the beginning of this year, when the technical team first began exploring large models, some engineers used GPT's API service in an experimental research project on a smaller model. The model was for testing only, was never planned for launch, and was never used externally. The practice stopped after the company introduced compliance checks on GPT API calls in April.

As early as April this year, the ByteDance large-model team issued a clear internal requirement that data generated by GPT models must not be added to the training datasets of ByteDance's large models, and trained its engineers to abide by the terms of service when using GPT.

In September, the company carried out another round of internal checks and took further measures to ensure that API calls to GPT complied with the rules. For example, model training data was sampled in batches and checked for similarity to GPT output, to prevent data-labeling staff from using GPT privately.

In the next few days, we will conduct another comprehensive inspection to ensure strict compliance with the terms of use of the relevant services.
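
The response does not say how the batch-sampling similarity check was done. One plausible shape for such a check, offered only as a sketch, is to draw a random batch of training texts and compare each against known GPT outputs. The character-trigram Jaccard measure, the 0.8 threshold, the sample texts, and all function names below are assumptions for illustration, not ByteDance's actual method.

```python
import random

def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def flag_similar(training_samples, reference_outputs, batch_size=2, threshold=0.8, seed=0):
    """Sample a batch of training texts and flag those close to any reference output."""
    rng = random.Random(seed)
    batch = rng.sample(training_samples, min(batch_size, len(training_samples)))
    refs = [char_ngrams(r) for r in reference_outputs]
    flagged = []
    for text in batch:
        grams = char_ngrams(text)
        score = max((jaccard(grams, r) for r in refs), default=0.0)
        if score >= threshold:
            flagged.append((text, round(score, 2)))
    return flagged

# Hypothetical data: one training record is a verbatim copy of a model reply.
training = [
    "As an AI language model, I cannot help with that request.",
    "The boiling point of water is 100 degrees at sea level.",
    "The team reviewed the labeling guidelines in September.",
]
reference = ["As an AI language model, I cannot help with that request."]

flagged = flag_similar(training, reference, batch_size=3)
print(flagged)
```

With the whole set sampled, only the verbatim copy is flagged, with a score of 1.0. A production check would more likely use embeddings or a trained classifier, but the workflow of sampling, scoring, and flagging would be similar.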

The editor would highlight two key points in ByteDance's response. First, during early exploration some engineers used GPT's API service for experimental research on a smaller model, and such an experimental project does not violate the terms of service. Microsoft, for example, has also used synthetic data from OpenAI models for fine-tuning, training the 13-billion-parameter model Orca, which is said to reach the level of ChatGPT (GPT-3.5). As with ByteDance, this was experimental and research use, and the model has not been put into commercial use.

Second, ByteDance stated clearly in its response that it has repeatedly set internal rules and restrictions against using GPT-generated data to train its models. This is not only a matter of abiding by the terms of service but also a necessity of technical development: relying on the output of OpenAI's models looks like a shortcut, but it effectively caps the capability of your own large model. ByteDance must understand better than anyone that a model built this way, whether in the model itself, its training data, or its output style, would be nothing more than a continuation of GPT.

Core copyright issues in AI model training urgently need standardization

In fact, any emerging industry goes through disorder and non-compliance in its early stage. Development is always a process, and standards and norms tend to arrive at a suitable moment, once the industry's patterns of development have become clear.

Therefore, regarding the ByteDance and OpenAI incident and the Google Gemini and Wenxin incident, the editor believes that rather than dwelling on who is right or wrong, the more important question is whether it is time to further standardize and improve the industry's norms for data use in AI.

According to recent data from the CCID Research Institute under the Ministry of Industry and Information Technology, China's generative AI market is expected to exceed 10 trillion yuan this year. Experts predict that by 2035 generative AI will contribute nearly 90 trillion yuan in economic value globally, of which China will exceed 30 trillion yuan, accounting for more than 40%.

On the one hand, generative AI is developing at full speed. On the other hand, large-model training sits at the very start of the generative AI life cycle. If it cannot be regulated at the source as soon as possible, the research and development of AIGC large models will remain in a state of infringement and uncertainty, which is clearly bad for the industry.

At the same time, generative AI training raises many hard questions about who the relevant parties are, what conditions apply, and what is feasible. For example, AIGC training uses enormous amounts of data from many different sources. Under a prior-authorization approach, it is hard to isolate specific works within that mass of data, and once copyright determination, payment, and similar steps are added, the task becomes almost impossible. In other words, data infringement in the AI era challenges existing copyright law and norms themselves. Much has to be built from scratch, but it cannot be left unbuilt, so a system of standards must be advanced as soon as possible.

The good news is that the industry is paying attention. In June this year, for example, 26 organizations, including Chinese Online, Tongfang CNKI, and China Workers Publishing House, jointly issued China's first proposal on the copyright of AIGC training data, with recommendations aimed at guiding the fair use of AI-generated content, raising awareness of copyright protection, and improving content licensing channels.

We also hope that the ByteDance and OpenAI incident and the Gemini and Wenxin incident will become an opportunity to move standardization of the core copyright issues around generative AI training data from "proposal" to actual implementation.

Only then can generative AI better serve people and every industry.
