How strong is Google Gemini, really? Carnegie Mellon University has carried out a professional, objective third-party comparison. To ensure fairness, all models were given the same prompts and generation parameters, and the team released reproducible code and fully transparent results.
Unlike Google's official launch material, it does not compare 5-shot results against CoT@32 results.
Bottom line: Gemini Pro is close to but slightly behind GPT-3.5 Turbo, and GPT-4 is still far ahead.
The in-depth analysis also turned up some odd quirks of Gemini, such as a tendency to choose D on multiple-choice questions...
Many researchers remarked on how fast the field is moving: such a detailed evaluation appeared only a few days after Gemini's release.
Six major tasks tested in depth
The test compares the models on six major tasks, each with corresponding datasets:
Knowledge Q&A: MMLU
Reasoning: BIG-Bench Hard
Mathematics: GSM8k, SVAMP, ASDIV, MAWPS
Code: HumanEval, ODEX
Translation: FLORES
Web navigation: WebArena
Knowledge Q&A: a preference for D
As the results show, chain-of-thought prompting does not necessarily bring an improvement on this kind of task.
The MMLU dataset consists entirely of multiple-choice questions, and further analysis of the results revealed an odd pattern: Gemini prefers to choose D.
The GPT series spreads its answers much more evenly across the four options, and the team suggests this may be because Gemini has not been heavily instruction-tuned on multiple-choice formats.
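A bias like the preference for D can be checked by counting how often each option letter is picked. The following is a minimal sketch, not the CMU team's code: the function name `option_distribution` and the sample predictions are hypothetical and made up purely for illustration.

```python
from collections import Counter


def option_distribution(predictions, options=("A", "B", "C", "D")):
    """Return the fraction of predictions that fall on each answer option.

    Predictions that are not one of the listed options (for example an
    empty or blocked response) are ignored.
    """
    counts = Counter(p for p in predictions if p in options)
    total = sum(counts.values())
    if total == 0:
        return {opt: 0.0 for opt in options}
    return {opt: counts[opt] / total for opt in options}


# Hypothetical predictions, invented for illustration (not study data).
sample_predictions = ["D", "A", "D", "C", "D", "B", "D", "D"]
distribution = option_distribution(sample_predictions)
print(distribution)
```

A model with no positional bias on a balanced test set would land near 0.25 for every option, so a share far above that for one letter is worth a closer look.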
In addition, Gemini's safety filtering is fairly aggressive: it answered only 85% of the questions on morality and only 28% of the questions related to human sexuality.
The two subjects in which Gemini Pro outperformed GPT-3.5 were security studies and high school microeconomics, but the gaps were small, and the team said there was nothing notable about them.
Reasoning: weak on long problems
Gemini Pro does not perform well on longer and more complex questions, while the GPT series is more robust.
This is especially true of GPT-4 Turbo, which shows little performance degradation even on longer problems, indicating a strong ability to understand complex questions.
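One way to run this kind of length analysis is to group questions into length buckets and compute accuracy per bucket. This is a rough sketch under stated assumptions: length is measured in characters, the bucket width of 100 is arbitrary, and the function name `accuracy_by_length` and the records are hypothetical. The study's own bucketing may differ.

```python
from collections import defaultdict


def accuracy_by_length(records, bucket_size=100):
    """Group (question, is_correct) records by question length.

    Returns a dict mapping the lower bound of each length bucket
    (in characters) to the accuracy within that bucket.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question, is_correct in records:
        bucket = (len(question) // bucket_size) * bucket_size
        totals[bucket] += 1
        correct[bucket] += int(is_correct)
    return {bucket: correct[bucket] / totals[bucket] for bucket in sorted(totals)}


# Hypothetical records, invented for illustration (not study data).
records = [
    ("x" * 50, True),
    ("x" * 80, True),
    ("x" * 150, True),
    ("x" * 180, False),
    ("x" * 250, False),
]
length_accuracy = accuracy_by_length(records)
print(length_accuracy)
```

A curve that falls steeply as the bucket grows corresponds to the behaviour described for Gemini Pro, while a flat curve corresponds to what is reported for GPT-4 Turbo.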
By problem type, Gemini is particularly bad at tasks such as "tracking_shuffled_objects", in which people swap items and the model has to work out who ends up with what.
Gemini does well at sports understanding, which requires world knowledge, as well as at manipulating symbol stacks, sorting words alphabetically, and parsing tables.
Mathematics: ahead on the most complex tasks
Here, when the question itself is very long, the performance of Gemini Pro and GPT-3.5 declines together, and only GPT-4 maintains a consistent level.
However, when the chain of thought is at its longest, Gemini surpasses GPT-3.5.
Code: good at matplotlib
On coding problems, Gemini does poorly on questions with long reference solutions.
Broken down by the libraries called, the GPT series is stronger in most categories, with matplotlib the exception.
Translation: high quality whenever it answers
On the translation task, there are 12 types of translation for which Gemini refuses to answer, but whenever it does answer the quality is high, and its overall performance is higher than that of GPT-4.
The types of translation that Gemini refuses mainly involve Latin and Arabic.
Web navigation: good at cross-site tasks
WebArena simulates an Internet environment for the AI, including e-commerce sites, social forums, GitLab collaborative development, content management systems, and online maps, and requires the AI to find information or complete tasks across sites.
Gemini underperforms GPT-3.5 Turbo overall, but performs slightly better in tasks that span multiple sites.
Netizens: but it's free
Finally, Graham Neubig, an associate professor at CMU, acknowledged some limitations of the study:
The behavior of API-based models may change at any time.
Only a limited number of prompts were tried, and the prompts that work best may differ from model to model.
There is no way to control for whether the test sets have leaked.
Zhou Dengyong, head of Google's large model reasoning team, pointed out that setting Gemini's temperature to 0 can improve its results on reasoning tasks by 5-10 percentage points.
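For readers unfamiliar with the temperature parameter, the sketch below shows how it generally works in sampling from a language model: logits are divided by the temperature before the softmax, and a temperature of 0 is conventionally treated as greedy decoding that always picks the highest-scoring token. This is a generic illustration, not Gemini's implementation, and `sample_token` is a hypothetical helper.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw logits at the given temperature.

    A temperature of 0 is treated as greedy decoding (argmax), which
    makes the output deterministic.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]
# With temperature 0 every call returns the same index.
greedy_picks = {sample_token(logits, temperature=0) for _ in range(20)}
print(greedy_picks)
```

With a positive temperature the same call can return different indices from run to run, which is one reason a fixed decoding setting matters when comparing models on reasoning benchmarks.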
In addition to the Gemini and GPT series, the test also includes Mixtral, the open-source MoE model that has attracted a lot of attention recently.
However, reinforcement learning expert Noam Brown believes the Mixtral results can be disregarded, because a third-party API was used rather than the official implementation.
The founder of Mistral AI also stepped in to offer the team access to the official API, expecting it to give better results.
All in all, Gemini Pro is not quite as good as GPT-3.5, but it is free to use at up to 60 calls per minute.
So quite a few individual developers have switched over anyway.
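A developer who wants to stay within a 60-calls-per-minute quota can throttle requests on the client side. The sketch below is one possible approach, a sliding-window limiter. The class name `SlidingWindowLimiter` is hypothetical, and how the service itself counts calls is not described in the article, so treat the window logic as an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls calls in any window of `period` seconds."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self):
        """Record the call and return True if it fits in the window, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_calls=60, period=60.0, clock=lambda: fake_now[0])
accepted = sum(limiter.try_acquire() for _ in range(61))
fake_now[0] = 60.0  # one full period later
accepted_after_wait = limiter.try_acquire()
print(accepted, accepted_after_wait)
```

In real use the caller would check `try_acquire()` before each request and sleep briefly or queue the request when it returns False.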
Gemini's top-tier version, Ultra, has not yet been released, and the CMU team is interested in continuing the research once it is. Do you think Gemini Ultra can reach the level of GPT-4?