Google Gemini questioned right after launch: the benchmark comparison was skewed and the demo video is suspected of being edited

2025-08-18 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)12/24 Report--

After a long build-up, Google has finally released its Gemini model! The two most eye-catching items are one chart and one video:

The chart: on the MMLU multitask language understanding benchmark, Gemini Ultra surpasses not only GPT-4 but even human experts.

The video: the AI comments on a person's doodles and gestures in real time, fluently and with humor, the closest thing yet to Jarvis.

However, once everyone calmed down from the surprise and carefully read the 60-page technical report released alongside it, they found something off.

(That's right, there is no paper. OpenAI, or rather "CloseAI", what a bad precedent you set.)

In the MMLU test, the small gray print below the Gemini result reads CoT@32, which means chain-of-thought prompting with 32 attempts, from which the best result is selected.

By contrast, the GPT-4 figure uses no such prompting technique, only 5 examples (5-shot). Under that standard, Gemini Ultra actually scores lower than GPT-4.
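To make the CoT@32 label more concrete, here is a minimal sketch of one common way to combine several chain-of-thought samples: a majority vote over the final answers. It is an illustration only. The exact selection rule Google used is described in the technical report, and the `sample_answers` list below is made-up data, not model output.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among the sampled attempts."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from 32 chain-of-thought samples for a single
# multiple-choice question (illustrative values only).
sample_answers = ["B"] * 19 + ["C"] * 8 + ["A"] * 5

voted = majority_vote(sample_answers)
print(len(sample_answers), voted)
```

A 5-shot setup, by comparison, produces a single answer per question after showing five worked examples, which is why the two numbers are not directly comparable.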

The scale of the original chart is also somewhat misleading: 90.0% is only slightly above the human benchmark of 89.8%, yet the two are drawn far apart on the y-axis.
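A quick calculation shows why a truncated y-axis exaggerates such a gap. The sketch below compares the drawn heights of 90.0% and 89.8% above the axis baseline when the axis starts at 0 versus at 89.5%. The 89.5% baseline is a hypothetical value chosen for illustration, not the baseline of Google's chart.

```python
def drawn_height_ratio(a, b, axis_start):
    """Ratio of the drawn heights of a and b above a y-axis starting at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)

gemini_ultra = 90.0   # MMLU score cited in the article (CoT@32)
human_expert = 89.8   # human benchmark cited in the article

# Axis starting at zero: the two marks sit at almost the same height.
full_axis = drawn_height_ratio(gemini_ultra, human_expert, axis_start=0.0)

# Hypothetical truncated axis: same data, much larger visual gap.
truncated_axis = drawn_height_ratio(gemini_ultra, human_expert, axis_start=89.5)

print(f"axis from 0:    {full_axis:.3f}x")
print(f"axis from 89.5: {truncated_axis:.2f}x")
```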

Philipp Schmid, technical lead at Hugging Face, redrew the chart using the data disclosed in the technical report to make it fairer and more accurate:

At times like this, the meme-makers always rush to the scene:

To be fair, Gemini Ultra does surpass GPT-4 when both are measured by the same standard of chain-of-thought prompting plus 32 attempts.

Jeff Dean responded to this question in a discussion, but people didn't buy it.

In addition, someone spotted a problem in the text disclaimer at the start of that impressive video.

Santiago Valdarrama, a machine learning instructor, believes the statement may imply that the demo shows carefully selected results, and that it was edited rather than recorded in real time.

Later, Google explained the multimodal interaction process in a blog post, all but admitting that the effect was achieved using still images and multiple prompts.

Either way, the release of Google Gemini has given other teams a lot of confidence: GPT-4 is no longer unique and unattainable.

As Aravind Srinivas, founder of the AI search product Perplexity AI, summed up:

1. Gemini proved that teams outside OpenAI can come up with models that go beyond GPT-4.

2. A properly trained dense model can surpass GPT-4's sparse model architecture.

Corollary: distilling small dense models from a large teacher model will become a future trend, achieving the best combination of efficiency and capability.

The question more netizens care about is whether it is still worth paying $20 a month for ChatGPT Plus.

The Gemini Pro version has already been rolled out to Google's chatbot Bard, so we can see for ourselves how good it really is.

Does Gemini really surpass ChatGPT?

First, to be clear: what you can try right now is Gemini Pro, the "medium cup", which is benchmarked against GPT-3.5.

The "large cup", Gemini Ultra, which is benchmarked against GPT-4, will not come out until next year.

In addition, Gemini currently supports only English; Chinese and other languages will follow later.

Although Gemini Ultra is not yet available to try, Dimitris Papailiopoulos, an associate professor at the University of Wisconsin-Madison, came up with a good idea:

Send the original questions shown at the Gemini launch to GPT-4 for comparison. GPT-4 scored about 12 points on the 14 questions.

For two of them the screenshots could not be made any clearer, so GPT-4 was given 0.5 points.

GPT-4 also got one math problem wrong; on the remaining questions the two were basically tied.
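The tally can be reproduced with simple arithmetic. The breakdown below is an assumption read from the description above, not an official score sheet: the two unclear questions are counted as 0.5 each, the math problem as 0, and the remaining questions as full marks.

```python
TOTAL_QUESTIONS = 14

# Assumed breakdown (see the lead-in above).
half_credit = 2   # unclear screenshots, assumed 0.5 points each
wrong = 1         # the math problem GPT-4 got wrong
full_credit = TOTAL_QUESTIONS - half_credit - wrong

score = full_credit * 1.0 + half_credit * 0.5 + wrong * 0.0
print(f"{score:g} / {TOTAL_QUESTIONS}")  # prints "12 / 14"
```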

Next, if any task best reflects a large model's overall ability, it has to be writing code.

According to test results from various users, Gemini's programming ability is solid.

One developer tried implementing a simple CNN in PyTorch; Gemini took only 2 seconds and the code quality was higher.

Of course, the speed may be because the Gemini Pro behind Bard is a smaller model, and everyone knows how slow GPT-4 is these days.

But on the next task, writing SQL statements, the developer found Gemini's performance less impressive.

There is other good news for developers, though: in instruction following, Gemini is a huge step up from Bard before the upgrade.

Prompt engineering pioneer Riley Goodside previously tried and failed to get Bard to output pure JSON without any extra chatter, and in the end had to pretend to threaten the AI that an innocent person would be killed if it did not comply.

Now, after the update, you just state the requirement, with no prompting tricks needed.
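One way to check for "pure JSON" programmatically is to see whether the entire reply parses as JSON with nothing around it. The sketch below uses Python's standard `json` module; the two reply strings are invented examples, not actual Bard output.

```python
import json

def is_pure_json(reply: str) -> bool:
    """Return True if the whole reply parses as a JSON object or array."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    # A bare string or number is valid JSON too, but here we only
    # accept structured output (object or array).
    return isinstance(parsed, (dict, list))

# Invented examples for illustration.
chatty_reply = 'Sure! Here is your JSON:\n{"name": "duck", "color": "blue"}'
clean_reply = '{"name": "duck", "color": "blue"}'

print(is_pure_json(chatty_reply))  # False: extra text before the JSON
print(is_pure_json(clean_reply))   # True
```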

Another big selling point of Gemini is its multimodal ability. From the duck-drawing video mentioned at the start, we extracted eight key frames and asked questions about each to see whether Gemini's performance is really that magical.

(It is unclear whether the video uses the Ultra or Pro version. Only the Pro version can be tested at present.)

For figures 1-4, the question we asked was "What is this person doing?", and Gemini's answers were:

Figure 1: Possibly writing with a marker, but it is unclear what.

Figure 2: Drawing a snake with a pencil, and it is a cobra.

Figure 3: Drawing a tortoise with a pencil, at an early stage of the drawing.

Figure 4: Drawing a bird with a black marker, facing left with its head turned right, standing on a branch, wings not spread.

For figures 1 and 2 the clues are not very obvious, so such results are understandable, but the "tortoise" answer for figure 3 is hard to accept.

As for figure 4, at least ducks are indeed birds, but the other details are still inaccurate.

When we showed the nearly finished drawing in figure 5, Gemini finally recognized it as a duck, and also identified the water ripples correctly.

But it identified the drawing tool as a pencil, still got the head orientation wrong, said the beak was open, and imagined some reeds that were not there.

Next comes the coloring process in figures 6 and 7. Ducks are not normally blue, so we asked Gemini whether there was anything unusual in the picture ("Is there anything abnormal?").

Gemini's answer for figure 6 was not so much inaccurate as completely beside the point, and it came with an unrelated picture.

For the finished drawing in figure 7, Gemini flatly said there was nothing wrong: everything that should be there was there, and the background was realistic. It even mentioned again the reeds that came from who knows where.

But the following sentence, "Here is the image you sent", is puzzling:

If Gemini did not look at the picture we uploaded, its answer is nonetheless about a duck; if it did look, the picture it showed is completely different from the one we uploaded.

So we tried the "deep breath" and "step by step" prompting tricks to see whether they would improve Gemini's performance. "Take a deep breath" is a prompt found to work for PaLM, Google's previous-generation large model.

The answer this time was downright laughable:

What is unusual is that the duck is painted on the paper. The duck is a living creature that cannot exist on paper.

At the end of the video, the presenter also takes out a rubber duck toy, and we used this frame (figure 8) to ask Gemini what the duck is made of.

It correctly identified rubber, but called the blue duck yellow. No wonder it said there was nothing unusual about the previous picture.

After the frame-by-frame questions, we submitted all eight pictures together, and the only thing Gemini got right was the duck.

After this "debunking", we tested Gemini with the "Chihuahua or muffin" picture that we had previously used to test GPT-4V.

Gemini simply gave up: it told us every picture was a "Chihuahua sitting on a muffin" and did not even get the number of pictures right.

So we changed the question and asked it to tell us which ones are Chihuahuas and which are muffins.

This time Gemini was honest, telling us outright that Chihuahuas and muffins look too much alike for it to tell them apart.

As with the blue duck, "deep breath" did not help here either, and Gemini still could not get the count right.

Of the eight pictures it grudgingly described (actually six, since two were duplicates), only the lower-left and lower-right were correct. As for which row "middle" refers to, we have no idea.

Perhaps such subtle differences are simply too hard for Gemini. Let's try some graphical reasoning questions next.

In the first question, the first four symbols are each formed by joining one of the digits 1-4 with its mirror image, so the next symbol should be 5 joined with its mirror image, and the answer is C. (The blue blocks are for ease of viewing and were not in the picture sent to Gemini.)

There was a small hiccup at first: the initial prompt lacked the last sentence ("note that the letter is not the symbol itself"), and Gemini did indeed treat the four letters A, B, C and D as the candidate symbols.

After the adjustment, the first part of Gemini's analysis was basically correct, but unfortunately it ended up choosing the wrong option, D.

In the second question, the third symbol in each box is the intersection of the first two, and the answer is A.

Gemini instead fixated on the expressions, produced an impressive-looking analysis, and finally gave the wrong answer.

Of the two questions, one was 70 to 80 percent right and the other completely wrong. Gemini Pro's graphical reasoning clearly still has a lot of room for improvement.

In everyday-life scenarios, however, Gemini's performance is commendable.

We used ChatGPT (DALL·E) to generate a picture of chicken, carrots and cucumbers. Gemini correctly identified all three ingredients and then suggested a variety of dishes that could be made with them, each with pictures and tutorial links.

After all these tests, back to the original question: with Gemini available, is it still worth paying for GPT-4?

Ethan Mollick, an associate professor at the Wharton School, offers some good advice:

There is no reason to use the free version of ChatGPT anymore; it has been overtaken by Bard and Claude, and both are free.

But you should probably keep using GPT-4, which is still dominant and is free in Bing (only creative mode uses GPT-4).

Upgrade combining AlphaGo capabilities coming next year

Beyond Gemini's real-world performance, further details disclosed in the 60-page technical report are also of interest to researchers and developers.

With regard to parameter counts, only those of the smallest version, Nano, were disclosed. It comes in two models, Nano-1 at 1.8B and Nano-2 at 3.25B, both distilled and 4-bit quantized, and can run on local devices such as Pixel phones.
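A back-of-the-envelope calculation shows why 4-bit quantization matters for on-device use. The sketch below estimates weight storage only; it ignores activations, caches and any per-block quantization metadata, and the 16-bit figure is an assumed baseline for comparison rather than a number from the report.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter counts as stated for the two Nano models.
nano_models = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in nano_models.items():
    fp16 = weight_memory_gb(params, 16)   # assumed 16-bit baseline
    int4 = weight_memory_gb(params, 4)    # 4-bit quantized weights
    print(f"{name}: ~{fp16:.2f} GB at 16-bit, ~{int4:.2f} GB at 4-bit")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, roughly 0.9 GB for Nano-1 and about 1.6 GB for Nano-2.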

The sizes of the Pro and Ultra versions are kept secret. The context window is 32k, the attention mechanism is Multi-Query Attention, and few other details are given.

Worth noting is the fine-tuning stage, where the report reveals an instruction-tuning combination of SFT + RLHF, that is, the ChatGPT approach.

It also cites Anthropic's Constitutional AI, thereby incorporating Claude's alignment method as well.

Few details were given about the training data, although there were earlier rumors that Google had removed copyrighted textbook data.

Gemini took a long time to be released, and there were many leaks beforehand. For example, Google co-founder Sergey Brin was said to have been personally evaluating the model and helping to train it.

Given the recent rumors about OpenAI's Q* project, what people most want to know is:

Has Gemini incorporated AlphaGo's capabilities or not? For example, more reinforcement learning beyond RLHF, and search algorithms.

On this point, DeepMind founder Demis Hassabis responded in a recent interview with Wired:

We have the best reinforcement learning experts in the world. The advances from AlphaGo are expected to improve the model's reasoning and planning capabilities in the future. You will see more rapid progress next year.

TL;DR: not yet, wait until next year.

Gemini's development brought together the former Google Brain and DeepMind teams, and the whole development team numbers more than 800 people (for comparison, OpenAI as a whole company has about 770).

Among them, the initials of the first six names in the list of core contributors happen to spell out "Gemini", a small Easter egg.

Many participants also posted their thoughts on their personal accounts, including Jack Rae, a DeepMind veteran who spent some time at OpenAI and moved back to Google in July. He may be the only person to have contributed to both GPT-4 and Gemini.

Going the other way, Jiahui Yu, an alumnus of the University of Science and Technology of China, moved from Google to OpenAI in October; he was previously the vision co-lead of Gemini's multimodal team.

Beyond the team members, Gemini is also the biggest topic in the entire AI industry today.

Among others, Jimmy Apples, the well-known OpenAI leaker account, tagged @Sam Altman and hinted that OpenAI still has a big move it has not yet released.

Hugging Face co-founder Thomas Wolf believes Google missed an important opportunity:

If Gemini were open source, it would be a killer blow to both OpenAI and Meta. The last time Google open-sourced BERT, the entire AI industry was reshaped.

Gemini technical report:

https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

Reference links:

[1] https://x.com/AravSrinivas/status/1732427844729581764

[2] https://x.com/DimitrisPapail/status/1732529288493080600

[3] https://www.linkedin.com/posts/svpino_google-this-is-embarrassing-you-published-activity-7138287283274686464-osJ5

[4] https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html

[5] https://x.com/ScottDavidKeefe/status/1732440398423867472

[6] https://x.com/goodside/status/1732461772794220919

[7] https://x.com/emollick/status/1732485517692776714

This article is from the WeChat official account QbitAI (ID: QbitAI). Authors: Mengchen, Creasy.
