This dashboard presents model comparisons using an automatic qualitative evaluation approach that generates human-interpretable, natural-language summaries of model behavior for specific skills or topics. These summaries, called report cards, are designed to provide specific, faithful, and interpretable assessments of model capabilities.
Details can be found in the arXiv paper.
Help us improve the dashboard by sharing your thoughts! Please email us; our contact information can be found in the arXiv paper linked above.