MLflow for LLM Ops: Track, Evaluate, and Govern Your AI Models
By Gennoor Tech · February 28, 2026
MLflow has evolved into a full LLM operations platform supporting prompt versioning, model evaluation with custom metrics, A/B testing of prompts, and centralized model governance — essential for production AI systems.
As large language models move from proof-of-concept to production, organizations face a critical challenge: how do you track experiments, evaluate model quality, govern deployments, and monitor performance at scale? MLflow, originally built for traditional machine learning operations, has evolved into a comprehensive platform for LLM operations. This guide explores how to leverage MLflow to bring rigor, repeatability, and governance to your AI initiatives.
MLflow Architecture for LLM Operations
MLflow's architecture consists of several interconnected components that work together to support the entire LLM lifecycle:
Core Components
- Tracking Server: Central repository for logging experiments, metrics, parameters, and artifacts
- Model Registry: Version control system for models with stage transitions and approval workflows
- Evaluation Engine: Framework for assessing model quality using built-in and custom metrics
- Tracing System: Distributed tracing for complex LLM chains and agent workflows
- AI Gateway: Unified interface for multiple LLM providers with routing and fallback logic
Storage Backend
MLflow requires two types of storage:
- Metadata Store: SQL database (PostgreSQL, MySQL, or SQL Server) that stores experiment metadata, metrics, and parameters
- Artifact Store: Object storage (Azure Blob Storage, S3, or ADLS Gen2) for large artifacts like model files, datasets, and visualizations
For enterprise deployments, host the tracking server on Azure Container Apps or Azure Kubernetes Service for scalability and reliability. Use managed database services like Azure Database for PostgreSQL and private Azure Blob Storage accounts for security.
Authentication and Authorization
MLflow 2.9+ includes built-in authentication and authorization features essential for enterprise use:
- Azure AD integration for SSO
- Role-based access control (RBAC) for experiments and models
- API token management for programmatic access
- Audit logging for compliance
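Assuming the tracking server has token or basic authentication enabled, clients can authenticate through the environment variables MLflow reads when contacting a remote server. The token value below is a placeholder:

```python
import os

# MLflow picks these up automatically when talking to a remote
# tracking server. Set a bearer token (e.g. issued via Azure AD) ...
os.environ["MLFLOW_TRACKING_TOKEN"] = "<token-from-azure-ad>"

# ... or, for servers using basic auth, a username/password pair:
# os.environ["MLFLOW_TRACKING_USERNAME"] = "svc-mlflow"
# os.environ["MLFLOW_TRACKING_PASSWORD"] = "<secret>"
```

In production, pull these values from Azure Key Vault at startup rather than hard-coding them.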
Setting Up MLflow for LLM Tracking: Step-by-Step
Infrastructure Deployment
Start by deploying the MLflow tracking server infrastructure:
# Create resource group
az group create --name mlflow-rg --location eastus

# Deploy PostgreSQL database
az postgres flexible-server create \
  --name mlflow-db-12345 \
  --resource-group mlflow-rg \
  --location eastus \
  --admin-user mlflowin \
  --admin-password [secure-password] \
  --sku-name Standard_B2s \
  --tier Burstable

# Create storage account
az storage account create \
  --name mlflowartifacts12345 \
  --resource-group mlflow-rg \
  --location eastus \
  --sku Standard_LRS
Server Configuration
Configure the MLflow tracking server with appropriate connection strings and security settings. Store sensitive configuration in Azure Key Vault and reference it via managed identity. Enable HTTPS with a proper SSL certificate for production use.
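A sketch of what the launch command might look like under these assumptions: the hostnames, account names, and database name are placeholders carried over from the deployment commands above, and the artifact container is assumed to already exist.

```shell
# Sketch of a production launch command; hostnames and secrets are
# placeholders. In practice, resolve DB_PASSWORD from Key Vault.
export MLFLOW_BACKEND_URI="postgresql://mlflowin:${DB_PASSWORD}@mlflow-db-12345.postgres.database.azure.com:5432/mlflow"

mlflow server \
  --backend-store-uri "${MLFLOW_BACKEND_URI}" \
  --artifacts-destination "wasbs://artifacts@mlflowartifacts12345.blob.core.windows.net" \
  --host 0.0.0.0 \
  --port 5000
```

Put a TLS-terminating reverse proxy or ingress in front of this process rather than exposing port 5000 directly.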
Client Installation
Install MLflow with LLM-specific dependencies:
pip install "mlflow[gateway]>=2.9.0"
pip install openai langchain transformers sentence-transformers
Connection Setup
Configure your development environment to connect to the tracking server:
import mlflow

# Point the client at the tracking server
mlflow.set_tracking_uri("https://mlflow.yourcompany.com")

# Select (or create) the experiment to log into
mlflow.set_experiment("loan-chatbot-v1")
Logging Prompts, Responses, Tokens, and Costs
Comprehensive logging is the foundation of effective LLM operations. MLflow provides specialized APIs for capturing LLM interactions:
Basic Prompt and Response Logging
import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "gpt-4")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("max_tokens", 500)

    # Log prompt and response
    mlflow.log_text(prompt, "prompt.txt")
    mlflow.log_text(response, "response.txt")

    # Log metrics
    mlflow.log_metric("tokens_used", token_count)
    mlflow.log_metric("latency_ms", latency)
    mlflow.log_metric("cost_usd", cost)
Advanced Token and Cost Tracking
For production systems, implement automated token counting and cost calculation:
import tiktoken

def calculate_cost(model, prompt_tokens, completion_tokens):
    # Pricing as of 2026
    pricing = {
        "gpt-4": {"prompt": 0.03 / 1000, "completion": 0.06 / 1000},
        "gpt-3.5-turbo": {"prompt": 0.0015 / 1000, "completion": 0.002 / 1000},
    }
    model_pricing = pricing.get(model, {"prompt": 0, "completion": 0})
    return (prompt_tokens * model_pricing["prompt"]
            + completion_tokens * model_pricing["completion"])

# Count tokens accurately
encoding = tiktoken.encoding_for_model("gpt-4")
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(response))

cost = calculate_cost("gpt-4", prompt_tokens, completion_tokens)
mlflow.log_metric("cost_usd", cost)
Structured Logging for Analysis
Log structured data as JSON for easier querying and analysis:
from datetime import datetime, timezone

interaction_data = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "user_id": user_id,
    "session_id": session_id,
    "model": model_name,
    "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
    "latency_ms": latency,
    "cost_usd": cost,
    "feedback": user_feedback,
}

mlflow.log_dict(interaction_data, "interaction_metadata.json")
This structured approach enables sophisticated analysis like cost per user, average latency by model, and correlation between parameters and user satisfaction.
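Once interactions are collected as structured records like the dict above, these analyses reduce to simple pandas passes. A sketch over synthetic records (field names match `interaction_data`; values are illustrative):

```python
import pandas as pd

# Synthetic records shaped like interaction_metadata.json
records = [
    {"user_id": "u1", "model": "gpt-4", "latency_ms": 820, "cost_usd": 0.042},
    {"user_id": "u1", "model": "gpt-3.5-turbo", "latency_ms": 310, "cost_usd": 0.003},
    {"user_id": "u2", "model": "gpt-4", "latency_ms": 950, "cost_usd": 0.051},
]
df = pd.DataFrame(records)

# Cost per user and average latency by model
cost_per_user = df.groupby("user_id")["cost_usd"].sum()
latency_by_model = df.groupby("model")["latency_ms"].mean()
```

The same pattern extends to correlating parameters against the logged `feedback` field.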
MLflow Evaluate Deep Dive
MLflow Evaluate provides a framework for systematically assessing LLM quality, combining automated metrics with human judgment patterns.
Built-in Metrics
MLflow includes several pre-built metrics for common evaluation scenarios:
- Perplexity: Measures how well a language model predicts text (lower is better)
- BLEU Score: Compares generated text to reference translations
- ROUGE Score: Evaluates summarization quality by comparing to reference summaries
- Toxicity: Detects harmful or inappropriate content using a pretrained toxicity classifier
- Flesch-Kincaid: Assesses text readability by grade level
LLM-as-Judge Metrics
One of the most powerful evaluation approaches uses a strong LLM like GPT-4 to judge responses from the model being evaluated:
import pandas as pd
import mlflow
from mlflow.metrics.genai import answer_relevance, answer_correctness

# Define evaluation data as a DataFrame
eval_data = pd.DataFrame([
    {
        "inputs": "What is the capital of France?",
        "ground_truth": "Paris is the capital of France.",
        "predictions": model_response,
    }
])

# Run evaluation; each judge metric calls GPT-4 to score the predictions
results = mlflow.evaluate(
    data=eval_data,
    predictions="predictions",
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[
        answer_relevance(model="openai:/gpt-4"),
        answer_correctness(model="openai:/gpt-4"),
    ],
)
Custom Metrics
For domain-specific evaluation, create custom metrics:
from mlflow.metrics import MetricValue, make_metric

def contains_required_fields(predictions, targets, metrics):
    """Check if each response includes all required business fields"""
    required_fields = ["loan_amount", "interest_rate", "term"]
    scores = []
    for pred in predictions:
        fields_present = sum(1 for field in required_fields if field in pred.lower())
        scores.append(fields_present / len(required_fields))
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )

field_coverage_metric = make_metric(
    eval_fn=contains_required_fields,
    greater_is_better=True,
    name="field_coverage",
)
Evaluation Workflows
Establish systematic evaluation workflows:
- Baseline Establishment: Evaluate initial model performance to set benchmarks
- A/B Testing: Compare multiple model versions or prompt templates side-by-side
- Regression Testing: Verify new versions don't degrade performance on known test cases
- Continuous Evaluation: Run automated evaluations on production traffic samples
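The regression-testing step can be reduced to a simple gate that compares a candidate version's metrics against the stored baseline. A minimal sketch (the metric names and the 2% tolerance are illustrative, not part of MLflow itself):

```python
def passes_regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """Return True if no metric drops more than `tolerance` below baseline.

    Assumes higher-is-better metrics (e.g. answer relevance, field coverage).
    """
    return all(
        candidate.get(name, 0.0) >= score - tolerance
        for name, score in baseline.items()
    )

baseline = {"answer_relevance": 0.91, "field_coverage": 0.88}
candidate = {"answer_relevance": 0.92, "field_coverage": 0.87}
passes_regression_gate(baseline, candidate)  # → True (small dip within tolerance)
```

Wire this check into CI so a failing gate blocks the registry transition described later.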
For comprehensive training on evaluation strategies, check out our MLOps training programs.
Distributed Tracing for RAG and Agent Chains
Modern LLM applications involve complex chains of operations: retrieval, reranking, prompt construction, LLM inference, and post-processing. MLflow's tracing system provides visibility into these multi-step workflows.
Enabling Tracing
import mlflow
# Enable automatic tracing for LangChain
mlflow.langchain.autolog()
Manual Span Creation
For custom code, manually create spans that represent logical operations:
with mlflow.start_span(name="retrieve_documents") as span:
    documents = vector_store.similarity_search(query, k=5)
    span.set_attribute("num_documents", len(documents))
    span.set_attribute("retrieval_latency_ms", retrieval_time)

with mlflow.start_span(name="rerank_documents") as span:
    reranked_docs = reranker.rerank(query, documents)
    span.set_attribute("rerank_model", "cross-encoder")

with mlflow.start_span(name="generate_response") as span:
    response = llm.generate(prompt)
    span.set_attribute("llm_model", "gpt-4")
    span.set_attribute("tokens_used", response.tokens)
Trace Analysis
Use the MLflow UI to visualize trace timelines, identify bottlenecks, and understand the flow of data through your application. Filter traces by user, session, error status, or custom attributes. Export trace data for detailed offline analysis.
AI Gateway Configuration and Multi-Provider Routing
MLflow's AI Gateway provides a unified interface to multiple LLM providers with sophisticated routing logic.
Gateway Configuration
# gateway_config.yaml
routes:
  - name: gpt-4-primary
    route_type: llm/v1/completions
    model:
      provider: openai
      name: gpt-4
      config:
        openai_api_key: $OPENAI_API_KEY
  - name: azure-gpt-4-fallback
    route_type: llm/v1/completions
    model:
      provider: azure-openai
      name: gpt-4
      config:
        azure_endpoint: https://yourservice.openai.azure.com
        api_key: $AZURE_OPENAI_KEY
  - name: claude-alternative
    route_type: llm/v1/completions
    model:
      provider: anthropic
      name: claude-3-opus
      config:
        api_key: $ANTHROPIC_API_KEY
Routing Logic
Implement intelligent routing based on request characteristics:
- Load Balancing: Distribute requests across multiple providers to avoid rate limits
- Cost Optimization: Route simple queries to cheaper models, complex ones to premium models
- Failover: Automatically retry failed requests with alternative providers
- A/B Testing: Route a percentage of traffic to experimental models
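The failover pattern above can be sketched as a thin wrapper that walks the configured routes in order. Here `query_route` is a hypothetical callable standing in for however your app sends a request to one gateway route; route names mirror the gateway_config.yaml example, and the demo stubs the actual network call:

```python
def complete_with_failover(prompt, routes, query_route):
    """Try each route in order, returning the first successful response.

    `query_route(route, prompt)` should raise on failure (rate limit,
    timeout, provider outage); errors are collected for diagnostics.
    """
    errors = {}
    for route in routes:
        try:
            return route, query_route(route, prompt)
        except Exception as exc:
            errors[route] = exc
    raise RuntimeError(f"all routes failed: {errors}")

# Stubbed demo: the primary route fails, the Azure fallback answers.
def fake_query(route, prompt):
    if route == "gpt-4-primary":
        raise TimeoutError("rate limited")
    return f"[{route}] answer"

route, text = complete_with_failover(
    "hello", ["gpt-4-primary", "azure-gpt-4-fallback"], fake_query
)
```

Adding retry backoff or per-route health tracking follows naturally from the same loop.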
Gateway Benefits
The AI Gateway provides several operational advantages:
- Single API interface regardless of underlying provider
- Centralized API key management and rotation
- Request/response logging without application code changes
- Rate limiting and quota management
- Cost tracking and attribution
Model Registry Workflows
The MLflow Model Registry provides version control and lifecycle management for LLM applications.
Registering Models
# Register a model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "loan-chatbot")
# Add description and tags
client = mlflow.tracking.MlflowClient()
client.update_registered_model(
name="loan-chatbot",
description="Customer-facing chatbot for loan inquiries"
)
client.set_registered_model_tag(
name="loan-chatbot",
key="team",
value="retail-banking"
)
Stage Transitions
Models progress through stages representing their lifecycle:
- None: Initial registration, not yet ready for use
- Staging: Deployed to test environment for validation
- Production: Serving live user traffic
- Archived: Deprecated, maintained for historical reference
# Promote to staging
client.transition_model_version_stage(
name="loan-chatbot",
version=3,
stage="Staging",
archive_existing_versions=True
)
# After validation, promote to production
client.transition_model_version_stage(
name="loan-chatbot",
version=3,
stage="Production"
)
Approval Workflows
Implement approval gates before production deployment:
- Data scientist registers model version
- Automated evaluation pipeline runs quality checks
- If checks pass, model moves to "Pending Approval" status
- ML engineer reviews evaluation results and approves transition
- Model automatically deploys to production environment
- Monitoring dashboards track performance
Integrate with ServiceNow, Jira, or Azure DevOps for formal change management in regulated industries.
Model Aliases
Use aliases for flexible deployments:
# Set alias for blue-green deployment
client.set_registered_model_alias(
name="loan-chatbot",
alias="champion",
version=3
)
# Application code references alias
model = mlflow.pyfunc.load_model("models:/loan-chatbot@champion")
This allows zero-downtime deployments by updating the alias to point to new versions.
Integration with Azure ML and Databricks
Azure Machine Learning Integration
Azure ML provides managed MLflow tracking with enterprise features:
- Automatic infrastructure provisioning and scaling
- Azure AD authentication integration
- VNet isolation for security
- Managed endpoints for model serving
- Cost management and budgeting
Configure Azure ML as your MLflow backend:
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="your-rg",
    workspace_name="your-workspace",
)

mlflow_tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)
Databricks Integration
Databricks provides deep MLflow integration with additional features:
- Notebook-based experiment tracking
- Distributed training on Spark clusters
- Feature Store integration
- Model serving with auto-scaling
- Unity Catalog for data governance
MLflow is pre-configured in Databricks notebooks—just start logging. For more advanced patterns, explore our blog posts on Databricks architecture.
Comparing MLflow with Alternatives
MLflow vs. Weights & Biases
| Feature | MLflow | Weights & Biases |
|---|---|---|
| Cost | Open source, self-hosted or managed | SaaS with free tier, paid plans for teams |
| Visualization | Basic built-in UI | Rich interactive dashboards |
| Collaboration | RBAC in enterprise version | Strong team features, comments, reports |
| LLM Support | Native LLM tracking and evaluation | Growing LLM support with Prompts feature |
MLflow vs. LangSmith
LangSmith (by LangChain creators) is purpose-built for LLM applications:
- Pros: Excellent tracing for LangChain apps, intuitive UI, fast iteration cycles
- Cons: SaaS-only (no self-hosting), less mature than MLflow for traditional ML, smaller ecosystem
Consider LangSmith if you're heavily invested in LangChain. Choose MLflow for broader ML ops needs or if you require self-hosting.
MLflow vs. Arize
Arize focuses on production monitoring rather than experiment tracking:
- Arize Strengths: Advanced drift detection, model performance degradation alerts, root-cause analysis
- MLflow Strengths: Experiment tracking, model registry, broader ML lifecycle support
Many organizations use both: MLflow for development and deployment, Arize for production monitoring.
Real Enterprise Deployment Patterns
Pattern 1: Centralized MLflow with Azure ML
One Azure ML workspace hosts MLflow tracking for the entire organization. Teams get separate experiments with RBAC. Models deploy to Azure ML managed endpoints. Works well for organizations already standardized on Azure.
Pattern 2: Federated MLflow per Business Unit
Each business unit runs its own MLflow server on Azure Container Apps. A central registry tracks all models across units. Appropriate for large organizations with distinct compliance requirements per unit.
Pattern 3: Databricks-Centric
All data science work happens in Databricks with built-in MLflow. Models export to external systems via REST APIs. Best for organizations with heavy Spark usage and large-scale data processing needs.
Pattern 4: Hybrid Cloud
MLflow tracking server on-premises for sensitive experiments, with selective model promotion to Azure for production serving. Addresses data residency and compliance concerns.
Production Monitoring Dashboards
Create comprehensive dashboards that surface key operational metrics:
Model Performance Dashboard
- Prediction latency (p50, p95, p99)
- Error rates by error type
- Model confidence score distributions
- User feedback ratings over time
Cost Dashboard
- Daily/weekly/monthly LLM API costs
- Cost per user or per session
- Token usage by model and application
- Cost trends and forecasts
Usage Dashboard
- Request volume by hour/day
- Active users and sessions
- Most common query patterns
- Feature usage (which tools/capabilities are most used)
Quality Dashboard
- Automated evaluation metric trends
- Human feedback scores
- Toxicity detection alerts
- Hallucination rate estimates
Build these dashboards in Power BI, Grafana, or Azure Monitor Workbooks depending on your organization's tooling standards.
Cost Tracking and Optimization
LLM costs can escalate quickly without proper tracking and optimization.
Cost Attribution
Tag every MLflow run with cost center, project, and user information:
mlflow.set_tags({
"cost_center": "retail-banking",
"project": "loan-chatbot",
"user": user_email
})
Query MLflow API to aggregate costs and generate chargebacks.
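`mlflow.search_runs` returns a pandas DataFrame in which tags and metrics appear as `tags.*` and `metrics.*` columns, so chargeback rollups reduce to a groupby. A sketch over a synthetic frame with that column shape (values are illustrative):

```python
import pandas as pd

# Shaped like the DataFrame mlflow.search_runs() returns: tags and
# metrics are flattened into "tags.*" / "metrics.*" columns.
runs = pd.DataFrame([
    {"tags.cost_center": "retail-banking", "tags.project": "loan-chatbot", "metrics.cost_usd": 12.40},
    {"tags.cost_center": "retail-banking", "tags.project": "loan-chatbot", "metrics.cost_usd": 9.10},
    {"tags.cost_center": "wealth-mgmt", "tags.project": "advisor-copilot", "metrics.cost_usd": 4.75},
])

# Monthly chargeback per cost center and project
chargeback = runs.groupby(["tags.cost_center", "tags.project"])["metrics.cost_usd"].sum()
```

Schedule this as a periodic job and publish the result to your finance team's reporting tool.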
Optimization Strategies
- Prompt Optimization: Shorter prompts with same quality reduce costs 20-40%
- Model Selection: Use GPT-3.5 for simple queries, GPT-4 only when necessary
- Caching: Cache responses to identical queries to avoid redundant API calls
- Batch Processing: Process multiple items in single API calls when possible
- Streaming: Use streaming responses to provide faster perceived performance without increasing costs
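The caching strategy can be as simple as hashing the request into a key and memoizing the response. A sketch with an in-memory dict and a stubbed LLM call; in production you would back this with Redis or similar, and only cache deterministic requests (e.g. temperature 0):

```python
import hashlib

def cache_key(model: str, prompt: str, temperature: float) -> str:
    """Deterministic key: identical requests hash to the same entry."""
    raw = f"{model}|{temperature}|{prompt}".encode()
    return hashlib.sha256(raw).hexdigest()

_cache: dict[str, str] = {}

def cached_complete(model, prompt, temperature, call_llm):
    """Return a cached response when available; otherwise call the LLM.

    `call_llm` stands in for whatever provider client your app uses.
    """
    key = cache_key(model, prompt, temperature)
    if key not in _cache:
        _cache[key] = call_llm(model, prompt, temperature)
    return _cache[key]

# Stubbed demo: the second identical request never reaches the "LLM".
calls = []
def fake_llm(model, prompt, temperature):
    calls.append(prompt)
    return f"echo: {prompt}"

cached_complete("gpt-4", "hi", 0.0, fake_llm)
cached_complete("gpt-4", "hi", 0.0, fake_llm)  # served from cache
```

Log cache hit rates to MLflow alongside cost metrics to quantify the savings.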
Team Collaboration Features
MLflow enables effective collaboration across data science teams:
Experiment Sharing
Share experiment URLs with colleagues for code review and knowledge transfer. Experiments are automatically versioned with full reproducibility.
Model Reviews
Use model version descriptions and tags to document review findings:
client.update_model_version(
name="loan-chatbot",
version=3,
description="Passed security review 2026-03-15. Approved by J. Smith."
)
client.set_model_version_tag(
name="loan-chatbot",
version=3,
key="security_review_status",
value="approved"
)
Runbooks and Documentation
Store runbooks, troubleshooting guides, and deployment procedures as artifacts in MLflow. This keeps documentation version-controlled alongside models.
Governance for Regulated Industries
Financial services, healthcare, and other regulated industries require additional governance controls:
Audit Trails
MLflow automatically logs who created each experiment, when models were registered, and who approved stage transitions. Export audit logs to SIEM systems for compliance reporting.
Data Lineage
Tag runs with dataset versions and feature engineering logic. This provides traceability from model predictions back to source data for regulatory inquiries.
Model Cards
Generate model cards documenting intended use, training data characteristics, evaluation results, and known limitations. Store model cards as artifacts in the Model Registry.
Access Controls
Implement least-privilege access:
- Data scientists can create experiments and register models
- ML engineers can transition models to staging
- Only designated approvers can promote to production
- Auditors have read-only access across all experiments and models
Conclusion
MLflow provides a comprehensive platform for operationalizing LLM applications from experiment tracking through production monitoring. By implementing systematic tracking, rigorous evaluation, governed deployment workflows, and continuous monitoring, organizations can move beyond ad-hoc LLM experimentation to production-grade AI systems.
Success with MLflow requires both technical implementation and organizational commitment to MLOps principles. Start with basic experiment tracking, gradually add evaluation frameworks, implement model registry workflows, and finally deploy comprehensive production monitoring. As your practices mature, MLflow scales with you from individual data scientists to enterprise-wide AI governance.
Whether you're deploying RAG pipelines, agent workflows, or traditional LLM applications, MLflow provides the infrastructure necessary for reliability, reproducibility, and accountability at scale.
Jalal Ahmed Khan
Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech
14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.