What are Scorers?
Scorers are key components of the MLflow GenAI evaluation framework. They provide a unified interface to define evaluation criteria for your models, agents, and applications.
Scorers can be thought of as metrics in the traditional ML sense; however, they are more flexible and can return structured quality feedback rather than only the scalar values that metrics typically represent.
How Scorers Work
Scorers analyze inputs, outputs, and traces from your GenAI application and produce quality assessments. Here's the flow:
- You provide a dataset of inputs (and optionally other columns such as expectations).
- MLflow runs your predict_fn to generate outputs and traces for each row in the dataset. Alternatively, you can provide outputs and traces directly in the dataset and omit the predict function.
- Scorers receive the inputs, outputs, expectations, and traces (or a subset of them) and produce scores and metadata such as explanations and source information.
- MLflow aggregates the scorer results and saves them. You can analyze the results in the UI.
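Here is a minimal sketch of this flow, assuming MLflow 3.x with the `mlflow.genai` evaluation API and an LLM-judge backend already configured; `my_app` is a hypothetical application object standing in for your own code.

```python
import mlflow
from mlflow.genai.scorers import Safety  # one of the predefined scorers

# 1. A small evaluation dataset: each row has `inputs`
#    (and optionally other columns such as `expectations`).
eval_data = [
    {"inputs": {"question": "How do I reset my password?"}},
    {"inputs": {"question": "What is your refund policy?"}},
]

# 2. The prediction function MLflow calls for each row; the keys of the
#    `inputs` dict are passed as keyword arguments.
def predict_fn(question: str) -> str:
    return my_app.respond(question)  # hypothetical application entry point

# 3. Scorers receive the inputs, outputs, and traces and produce assessments.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Safety()],
)

# 4. Aggregated results are logged to an MLflow run; open the MLflow UI to
#    inspect per-row scores and explanations.
```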
Which Scorers Should You Use?
MLflow provides different types of scorers to address different evaluation needs:
I want to try evaluation and get some results quickly.
→ Use Predefined Scorers to get started.
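For example, here is a sketch of running a few predefined scorers; the class names follow recent MLflow 3.x releases (treat the exact set as version-dependent), and `my_app` is again a hypothetical application object.

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        # Correctness compares the output against these expectations.
        "expectations": {
            "expected_facts": ["MLflow Tracing captures the execution of GenAI apps."]
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=lambda question: my_app.respond(question),  # hypothetical app
    scorers=[
        Correctness(),       # checks the output against the expectations column
        RelevanceToQuery(),  # checks that the answer addresses the question
        Safety(),            # flags harmful or toxic content
    ],
)
```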
I want to evaluate my application against simple natural-language criteria, such as "The response must be polite".
→ Use Guidelines-based Scorers.
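A guidelines-based scorer takes the criteria as plain natural language. A minimal sketch, assuming the MLflow 3.x `Guidelines` predefined scorer:

```python
from mlflow.genai.scorers import Guidelines

politeness = Guidelines(
    name="politeness",
    guidelines="The response must be polite and must not use a condescending tone.",
)

# Pass it to evaluate like any other scorer:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[politeness])
```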
I want to use a more advanced prompt for evaluating my application.
→ Use Prompt-based Scorers.
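A hedged sketch of a prompt-based scorer: recent MLflow releases expose `make_judge` in `mlflow.genai.judges`, but treat the exact helper, template syntax, and the judge model URI below as version-dependent assumptions.

```python
from mlflow.genai.judges import make_judge

tone_judge = make_judge(
    name="tone",
    instructions=(
        "Evaluate whether the response in {{ outputs }} answers the question in "
        "{{ inputs }} using a professional, empathetic tone. "
        "Answer 'yes' or 'no' and explain your reasoning."
    ),
    model="openai:/gpt-4o",  # assumption: an OpenAI-style judge model URI
)

# The returned judge is a scorer and can be passed to mlflow.genai.evaluate(...).
```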
I want to pass the entire trace to the scorer and get detailed insights from it.
→ Use Agent-as-a-Judge Scorers.
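A hedged sketch of an Agent-as-a-Judge scorer: a judge whose instructions reference the whole trace so it can inspect every span (tool calls, retrievals, intermediate steps). This assumes a recent MLflow 3.x release where `make_judge` accepts trace-based instructions; the template variable and model URI are assumptions.

```python
from mlflow.genai.judges import make_judge

trace_judge = make_judge(
    name="tool_usage",
    instructions=(
        "Inspect the {{ trace }} and judge whether the agent called the right "
        "tools in a sensible order and recovered from any tool errors. "
        "Rate the behavior as 'good', 'acceptable', or 'poor' and explain why."
    ),
    model="openai:/gpt-4o",  # assumption: judge model URI
)
```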
I want to write my own code for evaluating my application. Other scorers don't fit my advanced needs.
→ Use Code-based Scorers to implement your own evaluation logic with Python.
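A minimal sketch of a code-based scorer using the `@scorer` decorator from MLflow 3.x; declare only the arguments you need (`inputs`, `outputs`, `expectations`, `trace`). The `has_citation` check is an illustrative, hypothetical criterion.

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def has_citation(outputs) -> Feedback:
    """Checks that the response cites at least one source URL."""
    cited = "http://" in str(outputs) or "https://" in str(outputs)
    return Feedback(
        value="yes" if cited else "no",
        rationale="Found a URL in the response." if cited else "No URL found in the response.",
    )

# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[has_citation])
```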
If you are still not sure which scorer to use, ask the Ask AI widget at the bottom right of the page.
How to Write a Good Scorer?
Generic metrics such as 'Hallucination' or 'Toxicity' rarely work well in practice. Successful practitioners analyze real data to uncover domain-specific failure modes and then define custom evaluation criteria from the ground up. Here is the general workflow for defining a good scorer and iterating on it with MLflow.
1. Generate traces or collect them from production.
2. Gather human feedback.
3. Perform error analysis. To organize traces into error categories, use Trace Tags to label and filter traces (see the tagging sketch after this list).
4. Translate failure modes into Scorers.
5. Align scorers with human feedback.
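A sketch of labeling traces with error categories during error analysis, assuming the MLflow tracing APIs `mlflow.set_trace_tag` and `mlflow.search_traces`; the trace ID and tag values are hypothetical.

```python
import mlflow

# Tag a trace with the failure mode you observed while reviewing it.
trace_id = "tr-1234"  # hypothetical trace ID, copied from the UI or search_traces
mlflow.set_trace_tag(trace_id, "failure_mode", "missing_citation")

# Later, pull back every trace in that category to design a targeted scorer.
failing_traces = mlflow.search_traces(
    filter_string="tags.failure_mode = 'missing_citation'"
)
```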
As you iterate on the scorer, version control becomes important. MLflow can track Scorer Versions to help you maintain changes and share the improved scorers with your team.