Microsoft proposes the Chameleon framework, which lets models "power up" with their own toolbox and reaches 98% accuracy on a mathematical reasoning task.

2025-08-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

Teaching large models to invoke tools has become one of the hottest topics in AI circles. Now another study on the subject has been accepted at NeurIPS 2023.

It is a framework called Chameleon, from Microsoft and the University of California, Los Angeles (UCLA), billed as a toolkit that turns large language models directly into wizards.

On the one hand, compared with other approaches, Chameleon can use a wider variety of tools, including large language models, visual recognition models, web search engines, Python functions and rule-based modules.

On the other hand, it performs better.

On the scientific question answering task ScienceQA and the tabular mathematical reasoning task TabMWP, Chameleon reached accuracies of 86.54% and 98.78% respectively, significantly surpassing the best existing models on both benchmarks.

In fact, the Chameleon reasoning framework had become a focus of the technology community even before it was accepted at the top conference. In just half a year, its GitHub project has earned nearly 1,000 stars, and nearly 100 academic papers have cited the framework.

Among the many AI papers published, Chameleon stood out and was named paper of the week by AlphaSignal out of 1,682 papers.

In addition, an in-depth video explainer of Chameleon on YouTube has attracted more than 10,000 views.

Let's take a look at what kind of framework this is.

The inspiration for Chameleon

In practical applications, we often face many tools of different types and from different fields, such as open-source models from Hugging Face and GitHub, web search services like Google and Bing, knowledge bases such as Wikipedia, generative AI models, Python functions, language translation, image generation, and so on.

A compelling question is:

How can these various tools be combined with large language models to solve complex tasks?

The answer lies in tool-augmented large language models, also known as large language model agents (LLM agents).

By planning over multiple tools and resources and integrating them into a large language model framework, it is possible to build a more versatile and powerful system that handles complex tasks in many areas.

To this end, researchers from Microsoft and UCLA proposed Chameleon, the "chameleon" reasoning framework.

Chameleon is inspired by chameleons in nature. Just as a chameleon adapts to its surroundings by changing the color of its skin, the Chameleon model combines and uses different tools to complete complex reasoning according to the input problem.

For example, when solving the multimodal ScienceQA task, the Chameleon model generates a different program for each problem, flexibly combining various tools and executing them in a certain order to get the answer. This flexibility and adaptability make Chameleon a powerful tool for solving complex tasks.

Comparison of Chameleon with related work

Compared with related work, the Chameleon model has significant advantages in tool diversity and invocation flexibility.

First, Chameleon supports LLMs, vision models, web search engines, Python functions and rule-based modules, and these tools communicate with one another through natural language.

In contrast, existing work such as Toolformer supports only a small number of tools, such as question answering, a calculator, machine translation, WikiSearch and calendar queries, while HuggingGPT is suited only to models related to visual processing.

Second, the Chameleon model generates combinations of tool calls in a form similar to natural language, without the need to design programs in complex formats.

Existing work such as ViperGPT, by contrast, must generate carefully designed Python code that conforms to a specific format, which is unfriendly to users with limited programming skills.

An LLM-based tool planner

What distinguishes the Chameleon model from previous methods is that it can synthesize combinations of various tools to suit different types of reasoning problems.

The model consists of two main components: a toolbox (Module Inventory) and an LLM planner (LLM Planner). The toolbox contains a variety of tools that give the Chameleon model diverse, multi-dimensional reasoning capabilities.

The LLM planner is built on a large language model and generates programs in natural-language form for different input problems, combining and invoking the tools in the toolbox.

The implementation of the LLM planner is simple and efficient, making full use of the prompt learning (Prompt Learning) and in-context learning (In-Context Learning) capabilities of large language models.

The input prompt for the LLM planner describes the scenario, in which a sequence combining different tools must be generated, and defines all the tools in the toolbox.

The prompt also provides a few in-context examples that show the large language model how to generate a correct program from the input information.

Based on these descriptions and examples, large language models such as ChatGPT and GPT-4 can learn to generate an appropriate program for a new input problem, combining and invoking different tools in the toolbox to solve problems that involve complex reasoning.
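To make the planner prompt more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the tool names are adapted from the tools mentioned in this article, while the descriptions, the example question, the JSON list format for programs and the function names build_planner_prompt and parse_program are all hypothetical.

```python
import json

# Hypothetical tool descriptions; the names follow tools mentioned in the article.
TOOLS = {
    "Knowledge_Retrieval": "Retrieve background knowledge relevant to the question.",
    "Bing_Search": "Search the web for relevant information.",
    "Program_Generator": "Write a Python program that solves the question.",
    "Solution_Generator": "Write a detailed multi-step solution.",
}

# Hypothetical in-context examples: (question, program) pairs.
EXAMPLES = [
    ("What is the median of the numbers in the table?",
     ["Knowledge_Retrieval", "Program_Generator"]),
]


def build_planner_prompt(question: str) -> str:
    """Assemble the instruction, tool definitions, examples and the new question."""
    lines = ["Compose a sequence of tools that answers the question.", "", "Tools:"]
    lines += [f"- {name}: {desc}" for name, desc in TOOLS.items()]
    for example_question, program in EXAMPLES:
        lines += ["", f"Question: {example_question}", f"Program: {json.dumps(program)}"]
    lines += ["", f"Question: {question}", "Program:"]
    return "\n".join(lines)


def parse_program(llm_output: str) -> list[str]:
    """Parse the planner's reply and keep only tools that exist in the toolbox."""
    program = json.loads(llm_output)
    return [name for name in program if name in TOOLS]


prompt = build_planner_prompt("Which animal is adapted to cold places?")
print(prompt)
```

The prompt would be sent to a model such as ChatGPT or GPT-4, and the reply passed through parse_program before execution. Filtering out unknown tool names is a design choice of this sketch that guards against the model inventing tools.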

One advantage of the Chameleon model is the flexibility it gives users. Given only language descriptions, the large language model can work together with external tools covering many types and skill dimensions.

It is also plug and play, allowing users to update the underlying large language model, add new tools, and adapt to new tasks seamlessly.
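The plug-and-play idea can be sketched as a simple registry in Python. This is only an illustration: the shared context dictionary, the register and run_program functions and the two stand-in tools are assumptions made for the sketch, and real tools would call an LLM, a search service and so on.

```python
from typing import Callable, Dict, List

# Each tool reads and extends a shared context dictionary (an assumption of this sketch).
Tool = Callable[[Dict[str, str]], Dict[str, str]]

toolbox: Dict[str, Tool] = {}


def register(name: str, tool: Tool) -> None:
    """Add or replace a tool without touching the rest of the system."""
    toolbox[name] = tool


def run_program(program: List[str], question: str) -> Dict[str, str]:
    """Execute the planned tools in order, passing results along in the context."""
    context = {"question": question}
    for name in program:
        context = toolbox[name](context)
    return context


# Two stand-in tools with canned behavior, for illustration only.
register("Knowledge_Retrieval",
         lambda ctx: {**ctx, "knowledge": "The median is the middle value of a sorted list."})
register("Solution_Generator",
         lambda ctx: {**ctx, "answer": f"Using: {ctx.get('knowledge', 'no extra knowledge')}"})

result = run_program(["Knowledge_Retrieval", "Solution_Generator"], "What is the median?")
print(result["answer"])
```

Adding a new tool is a single register call, and swapping the underlying model only changes what the individual tool functions do internally.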

Chameleon's toolbox has a variety of skills to meet a variety of reasoning needs. It contains tools with different skills, including image understanding, knowledge understanding, mathematical reasoning, tabular reasoning and question answering.

Implementation of LLM-based tools

It should be emphasized that Chameleon's toolbox includes tools that are themselves based on LLMs (large language models).

Take the Knowledge Retrieval tool as an example.

When helping the system solve complex problems, it is very important to retrieve additional knowledge. This tool module takes advantage of the powerful generation ability of large language models to acquire domain-specific knowledge.

This is particularly useful when dealing with questions in specialized domains, such as science and mathematics.

For example, if the problem involves understanding a tax table, this module can generate tax-related background knowledge, which is essential for the subsequent reasoning steps.

Recent studies have shown that program-aided methods can improve the logical and mathematical reasoning ability of large language models.

The toolbox therefore also includes a Program Generator tool, which uses the in-context learning and code generation capabilities of large language models to generate, from the input problem, Python programs that can effectively solve it.
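A generated program still has to be executed to produce an answer. The following Python sketch shows one simple way to do that; it is an assumption for illustration and not the paper's mechanism. The helper run_generated_program, the convention that the program stores its result in a variable named ans, and the sample program are all made up here.

```python
def run_generated_program(code: str, result_name: str = "ans"):
    """Execute generated Python code and return the variable holding the answer.

    The variable name `ans` is a convention assumed for this sketch.
    exec() runs arbitrary code, so real use would need proper sandboxing.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace.get(result_name)


# A program of the kind an LLM might generate for a tabular question (made up here).
generated = """
prices = {"pen": 2.5, "notebook": 4.0}
ans = prices["pen"] * 2 + prices["notebook"]
"""

print(run_generated_program(generated))  # 9.0
```

Delegating the arithmetic to the Python interpreter is what makes program-aided answers exact, rather than relying on the language model to compute numbers itself.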

There is also a Solution Generator tool, which guides the large language model to make full use of the input question, the context information and the intermediate results of previously executed tools to generate a detailed, multi-step answer.

Evaluation of the Chameleon model

The Chameleon model was evaluated experimentally on two complex multimodal reasoning tasks, ScienceQA and TabMWP.

ScienceQA (Science Question Answering) is a multimodal question answering benchmark covering a wide range of scientific topics.

As the example in the figure below shows, answering questions in ScienceQA requires a variety of knowledge, tools and skills, such as image captioning, text detection, knowledge retrieval, online resource search and visual reasoning.

This requires the model to have compositional abilities that span visual and linguistic reasoning.

The LLM planner in the Chameleon model can synthesize programs that invoke different combinations of tools to answer different types of questions in ScienceQA.

For example, in the first example shown in the following figure, the Chameleon model recognizes that the input image contains advertising text, so it invokes the Text Detector tool to read the text in the image.

The model then invokes the Knowledge Retrieval tool to retrieve background knowledge about the term "persuasive appeal" that appears in the question.

Finally, the model derives the final answer from the input question and the intermediate results of the previously executed tools.

The second question involves identifying the animal in an image and answering a question about environmental adaptation.

The Chameleon model calls the Image Captioner tool to understand the animal in the image, and obtains the relevant background knowledge by calling Bing Search. The final answer makes full use of this information.

The detailed evaluation results also demonstrate the effectiveness of the Chameleon model on the ScienceQA task.

The Chameleon model also shows excellent flexibility and effectiveness on the tabular reasoning task TabMWP.

TabMWP is a mathematical reasoning task based on table context, which requires models to understand tables in various forms and perform accurate numerical calculations.

The first example in the following figure involves mathematical reasoning over a counting table. The Chameleon model invokes the Knowledge Retrieval tool to understand how to calculate the median of a list. It then relies on program-aided tools for accurate calculation.
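As a small worked example of the median calculation mentioned above, the Python snippet below uses the standard library's statistics.median. The numbers are made up for illustration and are not taken from TabMWP.

```python
from statistics import median

counts = [3, 7, 4, 9, 5]  # made-up values, not from the benchmark

# With an odd number of values, the median is the middle value after sorting.
print(sorted(counts))  # [3, 4, 5, 7, 9]
print(median(counts))  # 5

# With an even number of values, it is the mean of the two middle values.
print(median([3, 4, 7, 9]))  # 5.5
```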

The second example requires locating a cell in a larger table. To do this, the Chameleon model calls the Row Lookup tool in the toolbox to locate the relevant rows in the table accurately.

The Chameleon model then only needs to understand the simplified table and generate the final answer in natural language, without generating Python code to enhance mathematical reasoning.
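The effect of a row lookup step can be sketched in Python as follows. This is a deliberately simplified stand-in that matches rows by keyword; how the actual Row Lookup tool selects rows is not described in this article, and the function row_lookup, its parameters and the sample table are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def row_lookup(table: List[Row], question: str, key_column: str) -> List[Row]:
    """Keep only rows whose key-column value is mentioned in the question.

    Falls back to the full table when nothing matches, so no information is lost.
    """
    lowered = question.lower()
    matched = [row for row in table if row[key_column].lower() in lowered]
    return matched or table


# A made-up table for illustration.
table = [
    {"Item": "pen", "Price": "2.50"},
    {"Item": "notebook", "Price": "4.00"},
    {"Item": "eraser", "Price": "1.25"},
]

print(row_lookup(table, "How much does a notebook cost?", "Item"))
# [{'Item': 'notebook', 'Price': '4.00'}]
```

Shrinking the table this way leaves the later answer-generation step with less irrelevant context to read.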

Similarly, the Chameleon model shows strong reasoning capabilities in TabMWP tasks.

The following figure highlights the key baseline models on these two tasks.

On the ScienceQA task, Chameleon with GPT-4 achieves an accuracy of 86.5%, making it the best few-shot model at present.

Similarly, Chameleon achieves 98.8% accuracy on the TabMWP dataset, exceeding the previous state-of-the-art model by 17.0%.

Ablation experiments reveal the key modules of Chameleon

The researchers carried out ablation experiments, analyzing how much the accuracy of the Chameleon model drops when key modules in the generated programs are disabled.

The results show that the Knowledge Retrieval module plays an important role in both tasks.

For ScienceQA, domain-specific tools such as Bing Search and the vision-related tools play a key role, while for TabMWP the commonly used Program Generator module has a significant impact on final performance.

Tool planning ability of the Chameleon model

Proportion of use of different tools

By visualizing the proportion of different tools in the programs generated by the Chameleon model, we can observe that the LLM planner shows different planning behavior depending on the underlying language model.

In general, ChatGPT shows a strong preference for using, or not using, certain tools.

For example, when answering ScienceQA questions, ChatGPT calls Knowledge Retrieval 72% of the time but Bing Search only 3% of the time.

On the TabMWP task, ChatGPT relies more on the Row Lookup tool and calls Column Lookup less often.

GPT-4 is more objective and rational in its choice of tools.

For example, when answering scientific questions in ScienceQA, GPT-4 calls Knowledge Retrieval more frequently than ChatGPT and also calls Bing Search more often (11% vs. 3%).
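Usage proportions of this kind can be computed directly from the generated programs. The Python sketch below assumes that a tool's proportion means the fraction of programs that call it at least once, which may differ from the paper's exact definition. The function tool_usage is hypothetical and the programs are made up, not data from the paper.

```python
from collections import Counter
from typing import Dict, List


def tool_usage(programs: List[List[str]]) -> Dict[str, float]:
    """Fraction of programs that call each tool at least once."""
    counts = Counter(tool for program in programs for tool in set(program))
    return {tool: n / len(programs) for tool, n in counts.items()}


# Made-up programs, for illustration only.
programs = [
    ["Knowledge_Retrieval", "Solution_Generator"],
    ["Bing_Search", "Solution_Generator"],
    ["Knowledge_Retrieval", "Bing_Search", "Solution_Generator"],
    ["Solution_Generator"],
]

usage = tool_usage(programs)
for tool, share in sorted(usage.items()):
    print(f"{tool}: {share:.0%}")
```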

Transition diagram of tool invocation

By visualizing the state transition diagram of the tools in the programs generated by the Chameleon model, we can observe the patterns the LLM planner follows when invoking tools.

For example, on the ScienceQA task, the Chameleon model usually chooses either Knowledge Retrieval, to draw on the knowledge stored inside the large language model, or Bing Search, to obtain information from the Internet.

On the TabMWP task, the researchers observed two main tool invocation patterns:

The Chameleon model either completes the answer directly through natural language reasoning, or uses the program generation tools to strengthen logical and mathematical reasoning.
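The data behind a transition diagram is a count of which tool follows which. The Python sketch below shows one way to collect those counts; the START and END markers, the function transition_counts and the sample programs are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def transition_counts(programs: List[List[str]]) -> Counter:
    """Count how often one tool directly follows another across all programs."""
    transitions: Counter = Counter()
    for program in programs:
        # START and END markers are added so first and last tools are counted too.
        steps = ["START", *program, "END"]
        transitions.update(zip(steps, steps[1:]))
    return transitions


# Made-up programs, for illustration only.
programs = [
    ["Knowledge_Retrieval", "Program_Generator"],
    ["Row_Lookup", "Solution_Generator"],
    ["Program_Generator"],
]

counts = transition_counts(programs)
for (source, target), n in counts.most_common():
    print(f"{source} -> {target}: {n}")
```

Each counted pair corresponds to an edge in the diagram, and the count can be normalized per source tool to obtain transition probabilities.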

Further development of the Chameleon model

With its simple and efficient framework, the Chameleon model achieves efficient cooperation between large language models and a variety of external tools, significantly enhancing reasoning ability on complex tasks. In the area of tool augmentation for large language models, there are many potential directions for the future:

Expanding the toolbox: the toolbox can be extended with more tools, including domain-specific ones such as Wolfram. This would further broaden the applicability of the Chameleon model across tasks and fields and make it a more comprehensive, multi-purpose tool.

Improving the planner: more accurate planners could be proposed, for example planners that choose the next tool step by step and refine the plan based on feedback from execution results. This would help improve the efficiency and accuracy of the Chameleon model on complex tasks.

Lightweight substitution: in the future, the parts that rely on large language models could be replaced with lighter local models, to reduce the consumption of computing resources, improve response speed and lower deployment costs. This would make the Chameleon model better suited to practical application scenarios.

In short, the Chameleon model is expected to achieve further breakthroughs in tool augmentation, provide stronger support for solving complex problems, and expand its scope of application.

References

Paper link: https://arxiv.org/abs/2304.09842
