AI monitoring is the practice of continuously evaluating the quality, performance, cost, and safety of AI applications running in production. LLM monitoring focuses on individual model calls, tracking output quality, hallucinations, token costs, and latency, while agent monitoring extends this to multi-step reasoning, tool selection, and task completion. Both go beyond uptime and error rates to assess the quality of non-deterministic outputs and detect when behavior drifts from expected standards. Production tracing captures the execution data that makes this possible.
Unlike classical ML monitoring (which tracks feature distributions and prediction accuracy on structured data), AI monitoring must evaluate free-form language outputs, multi-step agent reasoning, tool call chains, retrieval accuracy, and token costs. Traditional monitoring can tell you the system is running; AI monitoring tells you whether it's working well.
MLflow provides a complete AI monitoring stack: automatic online evaluation with LLM judges that score traces asynchronously, configurable trace sampling for cost control, user and session context tracking for debugging, human feedback collection, and built-in scorers for hallucination detection, safety, and more. Explore the evaluation and monitoring docs.
Agents and LLM applications in production face challenges that don't exist during development:
Problem: Agent outputs degrade silently from model updates, prompt changes, or shifting user inputs.
Solution: Continuous LLM judges and human feedback detect quality regressions before users lose trust.
Problem: Token costs and latency can spiral without visibility into per-request spending and response times.
Solution: Automatic cost/token tracking with per-model breakdowns and anomaly detection.
Problem: Production agents face prompt injection, PII leakage, jailbreaks, and policy violations that don't exist in development.
Solution: Real-time safety scoring with deterministic and LLM-based detectors on every request.
Problem: When quality drops or errors spike, tracing the root cause across multi-step agent workflows is complex.
Solution: Full execution traces with assessment scores enable rapid root-cause analysis.
MLflow provides an open-source AI monitoring stack that covers tracing, automatic quality evaluation with LLM judges, cost and token tracking, human feedback collection, and real-time safety guardrails, compatible with any LLM provider and any agent framework. Here's how to set it up.
Use @mlflow.trace to capture execution graphs, attaching user, session, and deployment context.
Register LLM judges to score sampled production traces automatically and asynchronously.
Use mlflow.log_feedback() to record user ratings linked to traces, catching quality issues that automated judges miss and calibrating scoring over time.

Trace production requests with context
```python
import mlflow
import os

from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")
@mlflow.trace
def handle_chat(request: Request, chat_request: ChatRequest):
    # Attach production context to every trace
    mlflow.update_current_trace(
        client_request_id=request.headers.get("X-Request-ID"),
        tags={
            "mlflow.trace.session": request.headers.get("X-Session-ID"),
            "mlflow.trace.user": request.headers.get("X-User-ID"),
            "environment": "production",
            "app_version": os.getenv("APP_VERSION", "1.0.0"),
            "deployment_id": os.getenv("DEPLOYMENT_ID", "unknown"),
        },
    )
    response = generate_response(chat_request.message)
    return {"response": response}
```
Register judges for automatic online evaluation
Collect user feedback on traces
```python
import mlflow
from mlflow.entities import AssessmentSource
from fastapi import FastAPI

app = FastAPI()


@app.post("/feedback")
def submit_feedback(trace_id: str, is_correct: bool, user_id: str):
    mlflow.log_feedback(
        trace_id=trace_id,
        name="response_is_correct",
        value=is_correct,
        source=AssessmentSource(
            source_type="HUMAN",
            source_id=user_id,
        ),
    )
```
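For completeness, a caller-side sketch of that endpoint. The base URL is a placeholder for your own deployment, and the parameters travel in the query string (FastAPI's default for scalar arguments); only the standard library is used.

```python
import urllib.parse
import urllib.request


def send_feedback(base_url: str, trace_id: str, is_correct: bool, user_id: str) -> int:
    """POST a user rating to the /feedback endpoint sketched above."""
    params = urllib.parse.urlencode(
        {
            "trace_id": trace_id,
            "is_correct": str(is_correct).lower(),  # FastAPI parses "true"/"false"
            "user_id": user_id,
        }
    )
    req = urllib.request.Request(f"{base_url}/feedback?{params}", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 on success
```

In a real UI this would fire from a thumbs-up/down widget, passing along the trace ID the chat endpoint returned with its response.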
MLflow is the largest open-source AI engineering platform, with over 30 million monthly downloads. Thousands of organizations use MLflow to debug, evaluate, monitor, and optimize production-quality AI agents and LLM applications while controlling costs and managing access to models and data. Backed by the Linux Foundation and licensed under Apache 2.0, MLflow provides a complete AI monitoring solution with no vendor lock-in. Get started →
When choosing an AI monitoring platform for agents and LLM applications, the decision between open source and proprietary SaaS tools has significant long-term implications for your team, infrastructure, and data ownership.
Open Source (MLflow): With MLflow, you maintain complete control over your production traces and monitoring data. Deploy on your own infrastructure or use managed versions on Databricks, AWS, or other platforms. There are no per-trace fees, no usage limits, and no vendor lock-in. Your production data stays under your control, and OpenTelemetry compatibility ensures you can export traces to any backend.
Proprietary SaaS Tools: Commercial monitoring platforms offer convenience but at the cost of flexibility and control. They typically charge per trace or per seat, which can become expensive at scale. Your production data is sent to their servers, raising privacy and compliance concerns for sensitive traces. You're locked into their ecosystem, making it difficult to switch providers or customize functionality.
Why Teams Choose Open Source: Organizations running production agents increasingly choose MLflow because it offers enterprise-grade monitoring without compromising on data sovereignty, cost predictability, or flexibility. The Apache 2.0 license and Linux Foundation backing ensure MLflow remains truly open and community-driven, not controlled by a single vendor.