LLM Evaluation Examples

The notebooks listed below contain step-by-step tutorials on how to use MLflow to evaluate LLMs. The first notebook evaluates an LLM on a question-answering task built with a prompt engineering approach. The second evaluates a RAG system. Both demonstrate MLflow's built-in metrics, such as token_count and toxicity, alongside LLM-judged metrics such as answer_relevance (a minimal usage sketch follows this paragraph). The third notebook mirrors the second, but uses a Databricks-served llama2-70b model as the judge instead of gpt-4. The fourth notebook shows how to evaluate an open-source 🤗 Hugging Face LLM.
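As a rough illustration of what these notebooks walk through, the sketch below scores a small set of pre-computed answers with mlflow.evaluate, combining the built-in metrics added by the question-answering model type with the LLM-judged answer_relevance metric. The example data and the gpt-4 judge URI are illustrative assumptions, not contents of the notebooks themselves.

```python
import mlflow
import pandas as pd

# Hypothetical evaluation set: questions paired with pre-computed model answers.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Databricks?",
        ],
        "predictions": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Databricks is a unified data analytics and AI platform.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",        # column holding the model outputs
        model_type="question-answering",  # adds built-in metrics such as toxicity and token_count
        extra_metrics=[
            # LLM-judged metric; requires an OpenAI API key for the gpt-4 judge
            mlflow.metrics.genai.answer_relevance(model="openai:/gpt-4"),
        ],
    )

print(results.metrics)
```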

QA Evaluation Notebook

If you would like a copy of this notebook to execute in your environment, download the notebook here:

Download the notebook

To follow along and see the sections of the notebook guide, click below:

View the Notebook

RAG Evaluation Notebook (using gpt-4-as-judge)

If you would like a copy of this notebook to execute in your environment, download the notebook here:

Download the notebook

To follow along and see the sections of the notebook guide, click below:

View the Notebook

RAG Evaluation Notebook (using llama2-70b-as-judge)

If you would like a copy of this notebook to execute in your environment, download the notebook here:

Download the notebook

To follow along and see the sections of the notebook guide, click below:

View the Notebook
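The difference between the two RAG notebooks essentially comes down to which judge the LLM-judged metrics point at. The sketch below, using toy data and an assumed Databricks endpoint name, shows RAG-oriented metrics such as faithfulness and relevance graded by a served llama2-70b model instead of gpt-4.

```python
import mlflow
import pandas as pd

# Toy RAG outputs: question, retrieved context, and generated answer (illustrative only).
rag_data = pd.DataFrame(
    {
        "inputs": ["How do I log a model in MLflow?"],
        "context": ["Models are logged with mlflow.<flavor>.log_model() inside an active run."],
        "predictions": ["Call mlflow.sklearn.log_model(model, 'model') within mlflow.start_run()."],
    }
)

# Assumed endpoint name for a Databricks-served llama2-70b chat model.
judge = "endpoints:/databricks-llama-2-70b-chat"

with mlflow.start_run():
    results = mlflow.evaluate(
        data=rag_data,
        predictions="predictions",
        model_type="question-answering",
        extra_metrics=[
            # Both metrics grade the answer against the retrieved "context" column
            mlflow.metrics.genai.faithfulness(model=judge),
            mlflow.metrics.genai.relevance(model=judge),
        ],
    )

print(results.metrics)
```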

Evaluating a 🤗 Hugging Face LLM Notebook (using gpt-4-as-judge)

Learn how to evaluate an open-source 🤗 Hugging Face LLM with MLflow evaluate by downloading the notebook here:

Download the notebook

Or follow along directly in the docs here:

View the Notebook
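As a rough preview of the Hugging Face notebook's workflow, the sketch below logs a small text-generation pipeline with the transformers flavor and evaluates it with the text model type's built-in metrics. The gpt2 checkpoint and prompt are stand-in assumptions; the notebook itself works with a larger open-source model and layers gpt-4-judged metrics on top.

```python
import mlflow
import pandas as pd
from transformers import pipeline

# Stand-in open-source model; the notebook evaluates a larger Hugging Face LLM.
generator = pipeline("text-generation", model="gpt2")

eval_data = pd.DataFrame({"inputs": ["Explain what MLflow is used for."]})

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="model",
    )
    results = mlflow.evaluate(
        model=model_info.model_uri,
        data=eval_data,
        model_type="text",  # built-in metrics such as toxicity and readability scores
        # LLM-judged metrics (e.g. answer_relevance with a gpt-4 judge) can be
        # added via extra_metrics, as in the earlier sketches.
    )

print(results.metrics)
```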