Use past benchmark results to learn cheap routers that pick the best LLM for a new task
You can often match or exceed task performance at lower inference cost by routing to smaller LLMs selected from past benchmark outputs. A few labeled examples are enough to make the selection more reliable.
Key finding
The OOD-aware (out-of-distribution-aware) score S3 selects models slightly better than the best-model-on-average (BMA) baseline on HELM
Numbers: S3 accuracy 0.694 vs. BMA (llama-2-70b) accuracy 0.688 (averages from Table 1)
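To make the contrast between per-task routing and the BMA baseline concrete, here is a minimal sketch. It is not the S3 score: it swaps in a simple similarity-weighted average of past accuracies as the routing rule. The model names, task names, accuracies, and similarity weights are all made-up toy values, and the functions `best_model_on_average` and `route` are hypothetical names introduced only for illustration.

```python
# Toy past-benchmark accuracies. All numbers, model names, and task names
# are hypothetical and chosen only to illustrate the idea.
PAST = {
    "large": {"qa": 0.80, "math": 0.70, "summarization": 0.75},
    "small": {"qa": 0.60, "math": 0.78, "summarization": 0.55},
}


def mean_accuracy(task_scores):
    """Unweighted mean accuracy of one model over its past tasks."""
    return sum(task_scores.values()) / len(task_scores)


def best_model_on_average(past):
    """BMA baseline: the model with the highest mean past accuracy."""
    return max(past, key=lambda model: mean_accuracy(past[model]))


def route(past, similarity):
    """Pick the model with the highest similarity-weighted past accuracy.

    `similarity` maps past task names to non-negative weights describing how
    close each past task is to the new task (an assumed input; how to
    estimate it is outside this sketch). If no past task has positive weight,
    the rule falls back to the plain mean, i.e. to BMA.
    """
    def weighted_accuracy(model):
        task_scores = past[model]
        total = sum(similarity.get(task, 0.0) for task in task_scores)
        if total == 0:
            return mean_accuracy(task_scores)
        weighted = sum(
            similarity.get(task, 0.0) * acc for task, acc in task_scores.items()
        )
        return weighted / total

    return max(past, key=weighted_accuracy)


if __name__ == "__main__":
    print(best_model_on_average(PAST))             # large
    print(route(PAST, {"math": 1.0, "qa": 0.1}))   # small
```

In this toy setup the larger model wins on average, but a new task that resembles the math benchmark is routed to the smaller model, which is the situation where routing can save inference cost without losing accuracy.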

