Supported Models

When no model is specified, MLflow uses a default based on your environment:

Databricks: "databricks" (a Databricks-hosted model designed for LLM and AI agent quality assessments)
Other environments: "openai:/gpt-4o-mini"
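The fallback rule above can be sketched in a few lines. This is only an illustration of the documented behavior, not MLflow's actual implementation: `resolve_judge_model` and the `on_databricks` flag are hypothetical names, and how MLflow detects a Databricks environment is not covered here.

```python
from typing import Optional

# Defaults as stated in the text above.
DATABRICKS_DEFAULT = "databricks"
OTHER_DEFAULT = "openai:/gpt-4o-mini"


def resolve_judge_model(model: Optional[str], on_databricks: bool) -> str:
    """Return the explicit model if given, else the environment default.

    Hypothetical helper: `on_databricks` stands in for whatever
    environment detection MLflow performs internally.
    """
    if model:
        return model
    return DATABRICKS_DEFAULT if on_databricks else OTHER_DEFAULT


# No model specified: the default depends on the environment.
local_default = resolve_judge_model(None, on_databricks=False)
databricks_default = resolve_judge_model(None, on_databricks=True)

# An explicit model always wins over the default.
explicit = resolve_judge_model("gateway:/my-chat-endpoint", on_databricks=True)
```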

You can also explicitly specify a model from any of the following sources:

AI Gateway Endpoints

AI Gateway endpoints are the recommended way to configure judge models, especially when creating judges from the UI. Benefits include:

Run judges directly from the UI - Test and execute judges without leaving the browser
Centralized API key management - No need to configure API keys locally
Traffic routing and fallbacks - Configure load balancing and provider fallbacks

To use an AI Gateway endpoint, select it from the dropdown in the UI, or pass its name in the SDK with the gateway:/ prefix (for example, gateway:/my-chat-endpoint).

Direct Model Providers

MLflow supports calling model providers directly using the format provider:/model-name. Each provider may require specific credentials set as environment variables:

Provider	URI Format	Environment Variables
OpenAI	`openai:/gpt-5.4-mini`	`OPENAI_API_KEY`
Azure OpenAI	`azure:/my-deployment`	`AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`
Anthropic	`anthropic:/claude-sonnet-4-5`	`ANTHROPIC_API_KEY`
Amazon Bedrock	`bedrock:/google.gemma-3-4b-it`	`AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` (optional). Alternatively: `AWS_BEARER_TOKEN_BEDROCK` for API key auth, or `AWS_ROLE_ARN` for IAM role.
Google Gemini	`gemini:/gemini-3.1-pro-preview`	`GEMINI_API_KEY`
Mistral	`mistral:/mistral-small-2603`	`MISTRAL_API_KEY`
xAI	`xai:/grok-4.20-0309-reasoning`	`XAI_API_KEY`
Vertex AI	`vertex_ai:/gemini-3-flash-preview`	`VERTEX_PROJECT`, `VERTEX_LOCATION` (optional), `VERTEX_CREDENTIALS` (optional)
Groq	`groq:/llama-3.3-70b-versatile`	`GROQ_API_KEY`
DeepSeek	`deepseek:/deepseek-chat`	`DEEPSEEK_API_KEY`
OpenRouter	`openrouter:/openai/gpt-5.4-nano`	`OPENROUTER_API_KEY`. See OpenRouter model list for model names.
Together AI	`togetherai:/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`	`TOGETHERAI_API_KEY`
Ollama	`ollama:/llama3.2`	None (local)
Databricks	`databricks:/databricks-claude-sonnet-4-5`	Databricks SDK authentication (e.g. `DATABRICKS_HOST` + `DATABRICKS_TOKEN`, or other supported methods)
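The table above can be turned into a quick pre-flight check that reports which credentials are missing before a judge is run. The sketch below is not part of MLflow: `REQUIRED_ENV_VARS` and `missing_credentials` are hypothetical names, and the mapping copies only a subset of the providers listed in the table.

```python
import os
from typing import List, Mapping, Optional

# Required variables copied from the provider table (subset only).
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "groq": ["GROQ_API_KEY"],
    "ollama": [],  # local, no credentials needed
}


def missing_credentials(
    model_uri: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    # Split on the first ":/" so model names containing "/" stay intact.
    provider, sep, model_name = model_uri.partition(":/")
    if not sep or not provider or not model_name:
        raise ValueError(f"expected 'provider:/model-name', got {model_uri!r}")
    if provider not in REQUIRED_ENV_VARS:
        raise KeyError(f"provider {provider!r} is not covered by this sketch")
    return [name for name in REQUIRED_ENV_VARS[provider] if not env.get(name)]
```

Calling `missing_credentials("azure:/my-deployment")` with only `AZURE_API_KEY` set would return the two remaining Azure variables, which makes a misconfigured environment visible before any request is sent.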

Warning

Judges configured with direct model providers require credentials to be available locally (typically via environment variables) and cannot be run from the UI. Use AI Gateway endpoints if you want to run the judges from the UI.

For models that are not supported natively, you can use LiteLLM. LiteLLM is not a dependency of MLflow, so install it separately with pip install litellm. Then specify the provider and model name in the same provider:/model-name format used for natively supported providers.

Databricks-Hosted Models

When using MLflow in Databricks, you can select Databricks-hosted models with the following formats:

"databricks" (default): a default Databricks-hosted model designed for LLM and AI agent quality assessments.
"databricks:/<model-name>": any other Databricks-hosted model of your choice (e.g., databricks:/databricks-gpt-5-mini, databricks:/databricks-claude-sonnet-4-5). For a full list, see LiteLLM Models and select "databricks" as the provider.
"databricks:/<endpoint-name>" or "endpoints:/<endpoint-name>": Custom model endpoints on Databricks (e.g., databricks:/my-endpoint).
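The three formats above differ only in whether a prefix and a name are present. As a rough illustration, the following sketch splits a model string into its parts. `JudgeModel` and `parse_judge_model` are hypothetical names introduced here, not MLflow APIs, and the sketch does not try to tell a hosted model from a custom endpoint, since both use the same databricks:/ prefix.

```python
from typing import NamedTuple, Optional


class JudgeModel(NamedTuple):
    provider: str
    name: Optional[str]  # None means "use the provider's default judge model"


def parse_judge_model(model: str) -> JudgeModel:
    """Split a judge model string into (provider, name)."""
    # The bare string "databricks" selects the default Databricks-hosted judge.
    if model == "databricks":
        return JudgeModel("databricks", None)
    # Split on the first ":/" only, so names containing "/" are preserved.
    provider, sep, name = model.partition(":/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider:/name', got {model!r}")
    return JudgeModel(provider, name)


default_judge = parse_judge_model("databricks")
hosted = parse_judge_model("databricks:/databricks-claude-sonnet-4-5")
custom = parse_judge_model("endpoints:/my-endpoint")
```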

Choosing the Right LLM for Your Judge

The choice of LLM significantly affects both judge quality and cost. The guidance below depends on your development stage and use case:

Early Development Stage (Inner Loop)

Recommended: Start with powerful models like GPT-4o or Claude Opus
Why: When you're beginning your agent development journey, you typically lack:
- Use-case-specific grading criteria
- Labeled data for optimization
Benefits: More intelligent models can deeply explore traces, identify patterns, and help you understand common issues in your system
Trade-off: Higher cost, but lower evaluation volume during development makes this acceptable

Production & Scaling Stage

Recommended: Transition to smaller models (GPT-4o-mini, Claude Haiku) with smarter optimizers
Why: As you move toward production:
- You've collected labeled data and established grading criteria
- Cost becomes a critical factor at scale
- You can align smaller judges using more powerful optimizers
Approach: Use a smaller judge model paired with a powerful optimizer model (e.g., GPT-4o-mini judge aligned using Claude Opus optimizer)
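The cost trade-off between the two stages comes down to simple arithmetic: evaluation volume multiplied by tokens per evaluation and by the per-token price. The sketch below makes that explicit. Every number in it is a made-up placeholder chosen for easy arithmetic, not real pricing or real traffic, and `judge_cost` is a hypothetical helper.

```python
def judge_cost(
    num_traces: int, tokens_per_trace: int, price_per_million_tokens: float
) -> float:
    """Estimate total judge cost in the same unit as the price."""
    total_tokens = num_traces * tokens_per_trace
    return total_tokens * price_per_million_tokens / 1_000_000


# Placeholder prices in arbitrary units per million tokens (NOT real pricing).
LARGE_MODEL_PRICE = 10.0
SMALL_MODEL_PRICE = 0.5

# Development: a few hundred traces, so a large judge stays cheap.
dev_large = judge_cost(200, 2_000, LARGE_MODEL_PRICE)

# Production: the same large judge at a much higher volume.
prod_large = judge_cost(1_000_000, 2_000, LARGE_MODEL_PRICE)

# Production with a smaller judge at the same volume.
prod_small = judge_cost(1_000_000, 2_000, SMALL_MODEL_PRICE)
```

With these placeholder inputs the large judge costs 4.0 units in development but 20000.0 units in production, while the smaller judge costs 1000.0 units at the same production volume. Substitute your own volumes and your provider's current prices to get a meaningful estimate.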
