MLflow

13 posts tagged with "evaluation"

Agent Optimization Pipeline

Build a tool-calling agent, evaluate it with domain-specific judges, align those judges to expert feedback, and optimize the system prompt with GEPA.

evaluation, optimization, agents, prompts

Cost-Quality Tradeoff Analysis Across LLM Providers

Compare quality and cost across LLM providers using MLflow evaluation and tracing.

evaluation, cost, tracing

Building Custom LLM Judges

Evaluate GenAI outputs using built-in guideline scorers, custom programmatic scorers, and custom LLM-based judges.

evaluation, scorers, judges

Evaluation-Driven Development

Use MLflow evaluation to find weaknesses in a GenAI application, fix them, and measure the improvement in a repeatable loop.

evaluation, development, prompts

Tracing and Evaluating a LangGraph Agent

Build a tool-calling travel planning agent with LangGraph, trace every step with MLflow, and evaluate tool selection accuracy.

agents, tracing, evaluation, langgraph

Evaluating a Multi-Turn Conversational Agent

Evaluate multi-turn customer support chat quality with MLflow's conversational scorers.

agents, evaluation, multi-turn

Tracing and Evaluating OpenAI Agents

Build an e-commerce agent with OpenAI function calling, trace it with MLflow, and evaluate tool selection accuracy.

agents, tracing, evaluation, openai

Prompt Engineering Lifecycle

Version, evaluate, and promote prompt templates using MLflow's prompt registry and evaluation framework.

prompts, evaluation, registry

End-to-End RAG Evaluation

Build a RAG pipeline, trace it with MLflow, and evaluate retrieval and generation quality with built-in judges.

evaluation, rag, retrieval

Red-Teaming Your LLM Application

Test your LLM application against adversarial inputs using MLflow evaluation with safety scorers and custom guidelines.

evaluation, safety, red-teaming

Evaluating Databricks Genie Spaces

A complete pipeline for tracing, evaluating, and improving a Databricks Genie space using MLflow.

databricks, genie, evaluation, tracing, agents

Genie Evaluation with LLM Judges

Score Genie traces with built-in and custom judges to find quality issues in responses and SQL generation.

databricks, genie, evaluation, agents

Genie Space Improvement Generator

Take traces that failed evaluation, combine them with your Genie space config, and generate copy-paste-ready fixes with an LLM.

databricks, genie, evaluation, agents