PromptCBLUE: converting CBLUE into a prompt-format, multi-task Chinese medical benchmark and reporting baselines
PromptCBLUE provides a practical Chinese-language testbed for medical LLM products. Its baselines show that inexpensive parameter-efficient fine-tuning (PEFT) of open 13B models can outperform few-shot use of commercial APIs, which suggests that targeted fine-tuning is a worthwhile investment for companies building medical features.
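To make "prompt-format" concrete, here is a minimal sketch of how one structured record might be rewritten as an instruction and a text target. Everything in it is an assumption for illustration: the function `to_prompt_pair`, the template wording, the entity types, and the English sample sentence are hypothetical and are not taken from PromptCBLUE itself, whose data is in Chinese and uses its own templates.

```python
def to_prompt_pair(text, entities, entity_types):
    """Turn one NER-style record into an (input, target) text pair.

    Hypothetical template: the real benchmark's prompt wording is not
    reproduced here.
    """
    instruction = (
        "Extract the medical entities from the sentence below. "
        f"Entity types: {', '.join(entity_types)}.\n"
        f"Sentence: {text}\nAnswer:"
    )
    # Serialize the structured labels as plain text; "none" if no entities.
    target = "; ".join(f"{etype}: {span}" for span, etype in entities) or "none"
    return {"input": instruction, "target": target}


# Made-up record, used only to show the shape of the conversion.
record = {
    "text": "The patient was given aspirin for chest pain.",
    "entities": [("aspirin", "drug"), ("chest pain", "symptom")],
}
pair = to_prompt_pair(record["text"], record["entities"], ["drug", "symptom"])
print(pair["input"])
print(pair["target"])
```

Casting every task as text in, text out is what lets a single generative model be fine-tuned and scored on all tasks at once.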
Key finding
Fine-tuned open-source 13B models outperform few-shot commercial APIs on PromptCBLUE.
Numbers: Baichuan-13B (LoRA fine-tuned) scores 0.71 overall vs. 0.518 for GPT-4 few-shot.
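The size of that gap can be worked out directly from the two overall scores quoted above. The short sketch below only does the arithmetic; the variable names are illustrative, and it assumes both scores are on the same 0 to 1 scale.

```python
finetuned = 0.71   # Baichuan-13B, LoRA fine-tuned (overall score quoted above)
few_shot = 0.518   # GPT-4, few-shot (overall score quoted above)

# Absolute gap in score points, rounded to avoid floating-point noise.
absolute_gap = round(finetuned - few_shot, 3)

# Relative improvement over the few-shot baseline, in percent.
relative_gap = round(absolute_gap / few_shot * 100, 1)

print(f"absolute gap: {absolute_gap}")
print(f"relative gap: {relative_gap}%")
```

That is a margin of 0.192 points, or roughly 37% over the few-shot baseline.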

