We need meta-evaluation research to think beyond one-size-fits-all automatic evaluation, develop richer assessments for human evaluation, and iterate to adapt them to advances in model capabilities. 🔄
Despite their effectiveness, none of them are standard practice.
✔️ We’ve compiled a checklist to help incorporate them into model evaluations.
Current mLLM evaluations are nearly impossible to reproduce, due to the lack of transparency about evaluation configurations (incl. task formulation, as in the example below). We argue for open evaluation releases that include model outputs and their scores.
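As a rough sketch of what such an open release could contain, the snippet below writes the evaluation configuration plus per-example model outputs and scores to a JSONL file. The field names, model/task identifiers, and file layout are our own illustrative assumptions, not an established standard.

```python
import json

# Hypothetical open evaluation release: the configuration plus per-example
# model outputs and scores. All field names and values are illustrative.
config = {
    "model": "example-mllm-7b",            # placeholder model identifier
    "task": "extractive-qa",               # placeholder task name
    "task_formulation": "zero-shot, English instructions",
    "prompt_template": "Context: {context}\nQuestion: {question}\nAnswer:",
    "decoding": {"temperature": 0.0, "max_new_tokens": 64},
    "metric": "exact_match",
}

examples = [
    {
        "id": "qa-de-0001",
        "language": "de",
        "prompt": "Context: ...\nQuestion: ...\nAnswer:",
        "model_output": "Berlin",
        "reference": "Berlin",
        "score": 1.0,
    },
]

# One JSON object per line keeps the release easy to stream, diff, and re-score.
with open("evaluation_release.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"config": config}, ensure_ascii=False) + "\n")
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```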
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models, yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.
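As a sketch of that first step, the snippet below groups per-prompt scores from two models by prompt length and compares them per group; the same pattern works for categories. The toy data, field names, and length threshold are invented for illustration.

```python
from statistics import mean

# Toy per-prompt scores for two models; in practice these would come from
# released per-example outputs and scores.
results = [
    {"prompt": "Translate: Hello", "category": "translation", "score_a": 0.9, "score_b": 0.7},
    {"prompt": "Summarize the following article about renewable energy ...", "category": "summarization", "score_a": 0.4, "score_b": 0.6},
    {"prompt": "What is 2 + 2?", "category": "reasoning", "score_a": 1.0, "score_b": 1.0},
]

def length_bucket(prompt: str, threshold: int = 25) -> str:
    """Coarse split into short vs. long prompts (threshold chosen arbitrarily)."""
    return "short" if len(prompt) <= threshold else "long"

# Group results by length bucket (swap in r["category"] to group by category).
groups: dict[str, list[dict]] = {}
for r in results:
    groups.setdefault(length_bucket(r["prompt"]), []).append(r)

for bucket, rows in groups.items():
    diff = mean(r["score_a"] - r["score_b"] for r in rows)
    print(f"{bucket:>5}: n={len(rows)}, mean score difference (A - B) = {diff:+.2f}")
```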
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distributions (e.g., language or task support).
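A toy illustration of how the weighting scheme can change a comparison: all scores and weights below are invented, and the "coverage" weights just stand in for whatever prior (e.g. a model's training-data share per language) one might choose.

```python
# Per-language scores of two hypothetical models on the same benchmark.
scores_a = {"en": 0.90, "de": 0.80, "sw": 0.30}
scores_b = {"en": 0.85, "de": 0.75, "sw": 0.60}

# Uniform weighting treats every language the same ...
uniform = {lang: 1 / len(scores_a) for lang in scores_a}

# ... while an alternative weighting might reflect, e.g., how strongly each
# language is supported in training. These weights are invented for illustration.
coverage = {"en": 0.6, "de": 0.3, "sw": 0.1}

def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-language scores."""
    return sum(scores[lang] * weights[lang] for lang in scores)

for name, weights in [("uniform", uniform), ("coverage-weighted", coverage)]:
    print(f"{name}: A = {aggregate(scores_a, weights):.3f}, B = {aggregate(scores_b, weights):.3f}")
```

With these toy numbers, model B comes out ahead under uniform weighting while model A does under the coverage-based weights, which is why the aggregation scheme should be reported alongside the scores.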