13 posts tagged with "evaluation"
Agent Optimization Pipeline
Build a tool-calling agent, evaluate it with domain-specific judges, align those judges to expert feedback, and optimize the system prompt with GEPA.
evaluation, optimization, agents, prompts
Cost-Quality Tradeoff Analysis Across LLM Providers
Compare quality and cost across LLM providers using MLflow evaluation and tracing.
evaluation, cost, tracing
Building Custom LLM Judges
Evaluate GenAI outputs using built-in guideline scorers, custom programmatic scorers, and custom LLM-based judges.
evaluation, scorers, judges
Evaluation-Driven Development
Use MLflow evaluation to find weaknesses in a GenAI application, fix them, and measure the improvement in a repeatable loop.
evaluation, development, prompts
Tracing and Evaluating a LangGraph Agent
Build a tool-calling travel planning agent with LangGraph, trace every step with MLflow, and evaluate tool selection accuracy.
agents, tracing, evaluation, langgraph
Evaluating a Multi-Turn Conversational Agent
Evaluate multi-turn customer support chat quality with MLflow's conversational scorers.
agents, evaluation, multi-turn
Tracing and Evaluating OpenAI Agents
Build an e-commerce agent with OpenAI function calling, trace it with MLflow, and evaluate tool selection accuracy.
agents, tracing, evaluation, openai
Prompt Engineering Lifecycle
Version, evaluate, and promote prompt templates using MLflow's prompt registry and evaluation framework.
prompts, evaluation, registry
End-to-End RAG Evaluation
Build a RAG pipeline, trace it with MLflow, and evaluate retrieval and generation quality with built-in judges.
evaluation, rag, retrieval
Red-Teaming Your LLM Application
Test your LLM application against adversarial inputs using MLflow evaluation with safety scorers and custom guidelines.
evaluation, safety, red-teaming
Evaluating Databricks Genie Spaces
A complete pipeline for tracing, evaluating, and improving a Databricks Genie space using MLflow.
databricks, genie, evaluation, tracing, agents
Genie Evaluation with LLM Judges
Score Genie traces with built-in and custom judges to find quality issues in responses and SQL generation.
databricks, genie, evaluation, agents
Genie Space Improvement Generator
Take traces that failed evaluation, combine them with your Genie space config, and generate copy-paste-ready fixes with an LLM.
databricks, genie, evaluation, agents