Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang *, Fuyang Cui *, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang
NeurIPS 2024 SoLaR Workshop (Spotlight) · 2024
An automated framework for qualitatively evaluating LLMs on specialized, open-ended, and agentic tasks.