Use strong LLMs (e.g., GPT-4) as scalable judges of human preference, with checks for bias and math errors
Strong LLMs (e.g., GPT-4) can automate preference labeling and reach roughly 80–85% agreement with human votes, which sharply cuts the time and cost of human evaluation during product iteration. Because the judge can give a written explanation with each verdict, its decisions remain inspectable.
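One common way to check a pairwise judge for position bias is to ask it twice with the answer order swapped and only accept a winner when both orderings agree. The sketch below illustrates that idea under stated assumptions: the `judge` callable, its "first"/"second"/"tie" return values, and the two stub judges are hypothetical stand-ins, not a real API.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: Judge, question: str,
                            answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; inconsistent verdicts count as a tie."""
    original = judge(question, answer_a, answer_b)  # A shown first
    swapped = judge(question, answer_b, answer_a)   # B shown first

    # Translate position-based verdicts back to answer labels.
    verdict_original = {"first": "A", "second": "B", "tie": "tie"}[original]
    verdict_swapped = {"first": "B", "second": "A", "tie": "tie"}[swapped]

    # Only trust a winner that survives the order swap.
    return verdict_original if verdict_original == verdict_swapped else "tie"


# Stub judges for illustration only (assumptions, not real models).
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased


def prefers_longer(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_result = swap_consistent_verdict(
    always_first, "q", "short", "a longer answer")
stable_result = swap_consistent_verdict(
    prefers_longer, "q", "short", "a longer answer")
```

The position-biased stub is downgraded to a tie, while the stub that judges on content (here, crudely, length) keeps its verdict after the swap. Treating inconsistent verdicts as ties is a conservative choice; other policies are possible.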
Key finding
On MT-bench, GPT-4's pairwise judgments agree closely with human expert votes once tied votes are excluded.
Numbers: 85% agreement (MT-bench non-tie, Table 5)
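To make the "non-tie agreement" figure concrete, here is a minimal sketch of how such a rate could be computed from paired votes. It assumes each vote is one of "A", "B", or "tie", and that a pair is dropped when either side voted tie; the function name and the toy data are illustrative only and are not taken from the MT-bench release.

```python
from typing import Sequence

TIE = "tie"


def non_tie_agreement(judge_votes: Sequence[str],
                      human_votes: Sequence[str]) -> float:
    """Fraction of non-tied vote pairs on which judge and human agree."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")

    # Keep only pairs where neither side voted "tie" (assumed filtering rule).
    pairs = [(j, h) for j, h in zip(judge_votes, human_votes)
             if j != TIE and h != TIE]
    if not pairs:
        raise ValueError("no non-tied vote pairs to compare")

    matches = sum(j == h for j, h in pairs)
    return matches / len(pairs)


# Toy data, invented for illustration.
judge = ["A", "B", "A", "tie", "B", "A"]
human = ["A", "B", "B", "A", "tie", "A"]
rate = non_tie_agreement(judge, human)
```

In this toy data four pairs remain after dropping ties and three of them match, so `rate` is 0.75. A reported 85% would mean that, under the same kind of filtering, 85 of every 100 remaining pairs match.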

