Microsoft turns GPT-4 into a medical expert with prompt engineering alone, surpassing many heavily fine-tuned models and exceeding 90% accuracy on a professional exam benchmark for the first time

Updated 2025-07-13. From: SLTechnology News&Howtos

Microsoft's latest research once again demonstrates the power of prompt engineering:

With no extra fine-tuning and no expert-crafted examples, prompting alone can turn GPT-4 into an "expert".

Using their latest prompting strategy, Medprompt, GPT-4 achieved the best results on all nine MultiMedQA benchmark datasets in the medical domain.

On the MedQA dataset, Medprompt pushed GPT-4's accuracy above 90% for the first time, surpassing many fine-tuned models such as BioGPT and Med-PaLM.

The researchers also say that Medprompt is general-purpose: it can be applied not only to medicine, but also to electrical engineering, machine learning, law, and other professional fields.

As soon as the study was shared on X (formerly Twitter), it attracted a great deal of attention.

Ethan Mollick, a professor at the Wharton School, and Carlos E. Perez, author of Artificial Intuition, both reposted it.

Carlos E. Perez commented that an excellent prompting strategy can leave a great deal of fine-tuning behind.

Some users said they had long had a hunch this was possible, and that seeing the result now is "so cool".

Others called it truly "radical":

"GPT-4 is an industry-changing technology, and we are far from reaching the limits of prompting or fine-tuning."

Combined prompting strategies turn GPT-4 into an "expert"

Medprompt is a combination of several prompting strategies, built on three key techniques:

Dynamic few-shot selection

Self-generated chain of thought

Choice shuffling ensemble

Let's introduce them one by one.

Dynamic few-shot selection

Few-shot learning is an effective way to make the model learn from context quickly. Put simply, a few examples are included in the prompt so that the model adapts to a specific domain and learns to follow the format of the task.

The few-shot examples in a task-specific prompt are usually fixed, which places high demands on how representative and broad they are.

The previous approach was to have domain experts write the examples by hand, but even then there is no guarantee that a fixed set of expert-curated examples will be representative for every task input.

Therefore, the Microsoft researchers proposed selecting few-shot examples dynamically.

The idea is that the task's training set can serve as the source of few-shot examples: if the training set is large enough, different examples can be selected for different task inputs.

Specifically, the researchers first use the text-embedding-ada-002 model to generate vector representations for each training sample and test sample. Then, for each test sample, the most similar k samples are selected from the training samples based on vector similarity.

Like fine-tuning, dynamic few-shot selection makes use of the training data, but it does not require updating the model's parameters.
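The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' code: `toy_embed` is a hypothetical bag-of-words stand-in for an embedding model such as text-embedding-ada-002, and the function names and sample questions are invented for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    dot = sum(a[token] * b.get(token, 0) for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_few_shot(test_question, training_set, k=2, embed=toy_embed):
    """Return the k training examples most similar to the test question."""
    query = embed(test_question)
    ranked = sorted(
        training_set,
        key=lambda example: cosine_similarity(query, embed(example["question"])),
        reverse=True,
    )
    return ranked[:k]


# Invented sample data, for illustration only.
training_set = [
    {"question": "Which drug treats bacterial pneumonia", "answer": "B"},
    {"question": "What is the capital of France", "answer": "A"},
    {"question": "Which drug treats viral pneumonia", "answer": "C"},
]
examples = select_few_shot("Which drug treats pneumonia in children", training_set, k=2)
```

Here the two pneumonia questions are selected and the unrelated one is left out. With a real embedding model, only the `embed` argument would change; the model itself is never updated.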

Self-generated chain of thought

Chain-of-thought (CoT) prompting makes the model think step by step and produce a series of intermediate reasoning steps.

The previous approach also relied on experts to hand-write a small number of examples with chains of thought.

Here, the researchers found that they could simply prompt GPT-4 to generate a chain of thought for each training example.

However, the researchers also point out that automatically generated chains of thought may contain incorrect reasoning steps, so they use a verification label as a filter: a generated chain whose final answer does not match the training example's known answer is discarded, which effectively reduces errors.

Compared with the chain-of-thought examples hand-crafted by experts for Med-PaLM 2, the rationales generated by GPT-4 are longer, and their step-by-step reasoning is more fine-grained.
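The filtering idea can be sketched as follows. This is an illustrative sketch only. It assumes the filter compares the model's final answer with the training example's known label. `generate` is a hypothetical callable that would wrap a call to the model, and `fake_generate` is a stub used here so the example runs on its own.

```python
def build_cot_examples(training_set, generate):
    """Keep only model-written rationales whose final answer matches the label.

    `generate` is a hypothetical callable that takes a question and returns
    a (rationale, answer) pair; in practice it would wrap a call to the model.
    """
    kept = []
    for example in training_set:
        rationale, answer = generate(example["question"])
        if answer == example["answer"]:  # verification against the known label
            kept.append({**example, "rationale": rationale})
    return kept


def fake_generate(question):
    """Stub model for illustration: answers correctly only for fever cases."""
    if "fever" in question:
        return ("Fever suggests infection, so choose B.", "B")
    return ("Unsure, guessing A.", "A")


# Invented sample data, for illustration only.
training_set = [
    {"question": "Patient with fever and cough?", "answer": "B"},
    {"question": "Patient with chest pain?", "answer": "C"},
]
cot_examples = build_cot_examples(training_set, fake_generate)
```

Only the first example survives, because the stub's answer for the second one disagrees with its label. A matching answer does not prove the reasoning is sound, so this filter reduces errors rather than eliminating them.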

Choice shuffling ensemble

In addition, GPT-4 may be biased when answering multiple-choice questions: regardless of what the options say, it may tend to always choose A, or always choose B. This is known as position bias.

To reduce this problem, the researchers shuffle the order of the original options. For example, the original order ABCD can be changed to BCDA or CDAB.

GPT-4 is then asked to make several rounds of predictions, each with a different option order. This "forces" GPT-4 to consider the content of the options.

Finally, a majority vote is taken over the predictions from all rounds, and the most consistently chosen option is selected as the answer.
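The voting procedure can be sketched as below. This is a minimal sketch, not the researchers' implementation: it uses cyclic rotations (ABCD, BCDA, CDAB, DABC) instead of random shuffles so that the example is deterministic, `predict` is a hypothetical callable standing in for a model call, and `flaky_predict` is a stub that simulates a position-dependent mistake.

```python
from collections import Counter


def rotations(options):
    """Yield every cyclic rotation of the option list (ABCD, BCDA, ...)."""
    for shift in range(len(options)):
        yield options[shift:] + options[:shift]


def ensemble_answer(question, options, predict):
    """Query once per option order, then majority-vote on option content.

    `predict` is a hypothetical callable taking (question, ordered_options)
    and returning the index of the option it picks in that order.
    """
    votes = Counter()
    for ordered in rotations(options):
        chosen_index = predict(question, ordered)
        votes[ordered[chosen_index]] += 1  # count by content, not by position
    winner = votes.most_common(1)[0][0]
    return winner, votes


def flaky_predict(question, ordered_options):
    """Stub model: finds 'aspirin' unless it is listed last, then picks option 0."""
    position = ordered_options.index("aspirin")
    if position == len(ordered_options) - 1:
        return 0
    return position


# Invented sample data, for illustration only.
options = ["ibuprofen", "aspirin", "warfarin", "heparin"]
winner, votes = ensemble_answer("Which drug is an antiplatelet agent?", options, flaky_predict)
```

The stub goes wrong in the one ordering where "aspirin" is listed last, yet the vote still settles on "aspirin" with three of four votes. Votes are counted by option content, so an answer that depends only on position gets spread across different options rather than reinforced.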

The combination of the above prompt strategies is Medprompt. Let's take a look at the test results.

Best results across multiple tests

For testing, the researchers used the MultiMedQA evaluation benchmark.

GPT-4 with the Medprompt strategy achieved the highest score on all nine MultiMedQA benchmark datasets, outperforming Flan-PaLM 540B and Med-PaLM 2.

In addition, the researchers examined how Medprompt performs on "eyes-off" data, that is, data never seen during training or optimization, to test whether the approach had overfitted the training data.

As a result, GPT-4 combined with Medprompt performed well on multiple medical benchmark datasets, with an average accuracy of 91.3%.

The researchers also ran ablation experiments on the MedQA dataset to explore the relative contribution of the three components to overall performance.

Of the three, the self-generated chain-of-thought step contributes the most to the performance gain.

Moreover, the chains of thought generated automatically by GPT-4 scored higher than the expert-crafted ones used in Med-PaLM 2.

Finally, the researchers also explored the cross-domain generalization ability of Medprompt, using six different data sets from the MMLU benchmark, covering electrical engineering, machine learning, philosophy, professional accounting, professional law, and professional psychology.

Two further datasets containing NCLEX (the US nursing licensing examination) questions were also added.

The results show that the effect of Medprompt on these data sets is similar to that on MultiMedQA medical data sets, with an average accuracy improvement of 7.3%.

Link to the paper: https://arxiv.org/pdf/2311.16452.pdf

Reference link:

[1] https://twitter.com/erichorvitz/status/1729854235443884385

[2] https://twitter.com/emollick/status/1729733749657473327

This article comes from the WeChat official account Quantum Bit (ID: QbitAI), author: Xifeng
