Kubernetes Deployment for AI Agents: Real-World Experience with LangGraph¶
EMM (Expert Memory Machine) started as a local experiment. Docker Compose, three containers, 5GB RAM. It worked. Agents classified files, saved to Obsidian vault, everything local.
Then came the moment when scale was needed. Not performance scale (traffic was minimal), but operational scale: multiple environments, isolated services, proper monitoring, automated deployments.
Docker Compose doesn't fit this. Kubernetes does.
But transitioning from monolith to microservices isn't just "write deployment.yaml and kubectl apply". Questions emerge that didn't exist in local development:
- How do agents find MCP services? Service discovery.
- Where to store the cache? In-memory doesn't work with multiple pods.
- Where to put secrets? Environment variables in Git are a bad idea.
- How to do rolling updates without downtime?
Here's how I solved these problems. With real manifests, mistakes, and solutions that work.
Architecture before and after¶
Before: Docker Compose monolith¶
docker-compose.yml:
├─ langgraph-api (contains all agents)
├─ ollama (LLM)
└─ redis (optional, rarely used)
Problems:
- All agents in one container. One crash = entire system down.
- MCP handlers are in-process calls. Works locally, doesn't scale.
- Cache is in-memory in MCPClient. Works for a single instance, lost on restart.
- Secrets in .env file. Git-ignored, but manual sync between environments.
After: Kubernetes microservices¶
Kubernetes cluster:
├─ Agent Pods (3)
│ ├─ confluence-agent
│ ├─ bookmark-scraper
│ └─ file-system-agent
├─ MCP Service Pods (7)
│ ├─ jd-classifier-service
│ ├─ content-classifier-service
│ ├─ bookmark-classifier-service
│ ├─ confluence-service
│ ├─ file-system-service
│ ├─ web-scraper-service
│ └─ notifications-service
├─ Infrastructure (2)
│ ├─ redis-cache (StatefulSet)
│ └─ langgraph-api
└─ Monitoring (3)
├─ prometheus
├─ tempo
└─ grafana
15+ pods instead of 3 containers. Seems like overkill. But each pod has clear responsibility, scales easily, and can fail independently.
Challenge 1: Service Discovery¶
Locally, the MCP client called handlers directly:

# mcp-servers/client.py - local version
def call(self, uri: str, params: dict):
    handler = self._handlers.get(uri)
    return handler(params)  # direct call
Works. Fast. But in Kubernetes, agents and MCP services run in different pods. A direct call doesn't work.
We need HTTP.
Solution: HTTP wrapper + Kubernetes Services¶
Each MCP service became an HTTP server:

# mcp-servers/jd-classifier/server.py
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.post("/get_structure")
def get_structure(request: dict):
    # existing handler logic
    return {"structure": ...}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
Kubernetes Service provides stable DNS name:
apiVersion: v1
kind: Service
metadata:
  name: jd-classifier-service
  namespace: agentic-ai
spec:
  selector:
    app: jd-classifier-service
  ports:
  - port: 8080
    targetPort: 8080
The MCP client now makes HTTP requests:

# Updated client.py
import os
import requests

def call(self, uri: str, params: dict):
    if os.getenv("K8S_MODE") == "true":
        service_name = self._parse_service(uri)
        url = f"http://{service_name}-service:8080/{uri}"
        response = requests.post(url, json=params)
        return response.json()
    else:
        # local mode - direct handlers
        return self._handlers[uri](params)
Environment variable K8S_MODE=true switches between local and Kubernetes mode. Same code, two environments.
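Network calls also fail in ways in-process calls never did: timeouts, connection resets, DNS blips during rollouts. The client above doesn't show error handling; here is a hedged sketch of a generic retry wrapper the HTTP branch could use (the `with_retry` name, retry count, and backoff values are illustrative, not EMM's actual code):

```python
import time

def with_retry(fn, retries: int = 3, backoff: float = 0.5):
    """Call fn(), retrying transient failures with exponential backoff.

    The retry count and backoff defaults are illustrative, not tuned values.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"MCP call failed after {retries} attempts") from last_error
```

Inside `call()`, the K8S_MODE branch would then become something like `return with_retry(lambda: requests.post(url, json=params, timeout=5).json())`.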
Performance impact¶
HTTP overhead: ~5-10ms per request vs 0ms for direct call.
But this is negligible. Agents don't make thousands of requests per second. Typical workflow: 10-50 MCP calls per agent run. 10ms * 50 = 500ms overhead. Acceptable.
Trade-off: 500ms latency for isolation, independent scaling, and fault tolerance. Worth it.
Challenge 2: Distributed Cache¶
Locally, MCPClient had an in-memory cache:

class MCPClient:
    def __init__(self):
        self._cache = {}  # in-memory dict

    def get(self, key: str):
        return self._cache.get(key)
Works for single process. But in Kubernetes:
Agent Pod 1 → writes to cache → stored in memory
Agent Pod 2 → reads from cache → MISS (different memory)
Each pod has its own memory space. Cache not shared.
Solution: Redis StatefulSet¶
Redis provides a distributed cache: shared state between pods.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  serviceName: redis-service
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
StatefulSet instead of Deployment because Redis needs persistent storage and stable network identity.
MCPClient connects to Redis:

import os

import redis

class MCPClient:
    def __init__(self):
        if os.getenv("K8S_MODE") == "true":
            self._cache = redis.Redis(
                host=os.getenv("REDIS_HOST", "redis-service"),
                port=int(os.getenv("REDIS_PORT", "6379")),
                decode_responses=True
            )
        else:
            self._cache = {}  # local fallback
Now cache is shared:
Agent Pod 1 → writes to Redis → stored in Redis PVC
Agent Pod 2 → reads from Redis → HIT (same Redis instance)
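One detail the shared-cache diagram glosses over is expiry: without a TTL, a stale JD structure would be served forever. A minimal sketch of a wrapper that keeps the local dict and Redis paths behind one interface (the `CacheBackend` name, the use of redis-py's `setex`, and the 300-second default TTL are my assumptions, not EMM's actual code):

```python
import json
import time

class CacheBackend:
    """Minimal cache interface an MCP client can code against.

    The in-memory variant mirrors redis-py's get/setex behavior, so the
    calling code is identical in local and K8S_MODE.
    """

    def __init__(self, redis_client=None):
        self._redis = redis_client   # redis.Redis(...) when running in-cluster
        self._local = {}             # {key: (expires_at, json_payload)}

    def set(self, key: str, value: dict, ttl: int = 300) -> None:
        payload = json.dumps(value)
        if self._redis is not None:
            self._redis.setex(key, ttl, payload)  # shared across pods
        else:
            self._local[key] = (time.time() + ttl, payload)

    def get(self, key: str):
        if self._redis is not None:
            payload = self._redis.get(key)
        else:
            entry = self._local.get(key)
            payload = entry[1] if entry and entry[0] > time.time() else None
        return json.loads(payload) if payload else None
```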
Redis performance¶
- Latency: ~1-2ms for GET/SET within the Kubernetes cluster (same AZ).
- Cache hit ratio: ~85% for JD structure lookups (frequently accessed).
- Memory usage: ~50MB for a typical workload (5000 files indexed).
- Persistence: RDB snapshots every 5 minutes plus AOF for durability.
Challenge 3: Secrets Management¶
Locally, secrets lived in a .env file:
CONFLUENCE_BASE_URL=https://...
CONFLUENCE_USERNAME=[email protected]
CONFLUENCE_PASSWORD=super_secret_password
Git-ignored. Manually copied between machines. Not scalable. Not secure.
Kubernetes has a Secrets API:
apiVersion: v1
kind: Secret
metadata:
  name: confluence-credentials
  namespace: agentic-ai
type: Opaque
stringData:
  base_url: "https://your-confluence.atlassian.net"
  username: "your-username"
  password: "your-password"
Pods mount secrets as environment variables:

spec:
  containers:
  - name: confluence-service
    env:
    - name: CONFLUENCE_BASE_URL
      valueFrom:
        secretKeyRef:
          name: confluence-credentials
          key: base_url
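On the consuming side, it helps to fail fast if a secretKeyRef was misspelled or the Secret is missing: a crash at startup shows up as a visible CrashLoopBackOff instead of a confusing auth error at request time. A small sketch (the `require_env` helper is hypothetical, not from EMM):

```python
import os

def require_env(name: str) -> str:
    """Read a required env var injected from a Kubernetes Secret.

    Raising at startup surfaces a missing/misspelled secretKeyRef immediately.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

# Hypothetical usage at service startup:
# CONFLUENCE_BASE_URL = require_env("CONFLUENCE_BASE_URL")
```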
Secrets creation - don't commit to Git¶
⚠️ IMPORTANT: Don't add secrets to manifests that go to Git. Use:
Option 1: kubectl create secret
kubectl create secret generic confluence-credentials \
  --from-literal=base_url="https://..." \
  --from-literal=username="..." \
  --from-literal=password="..." \
  --namespace agentic-ai
Option 2: External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: confluence-credentials
spec:
  secretStoreRef:
    name: aws-secretsmanager
  target:
    name: confluence-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/confluence/password
Secrets stored in AWS Secrets Manager / HashiCorp Vault, not Git. External Secrets Operator syncs them to Kubernetes.
Option 3: Sealed Secrets
# Encrypt secret
kubeseal < secret.yaml > sealed-secret.yaml
# Commit sealed-secret.yaml to Git (encrypted)
git add sealed-secret.yaml
# SealedSecret controller decrypts in cluster
I use Option 1 for development, Option 2 for production.
Challenge 4: Persistent Storage¶
Obsidian vault and UNSORTED folder - where to store them?
Locally: hostPath (~/vault, ~/unsorted). Works on a laptop.
In Kubernetes: pods are ephemeral. When a pod restarts, its filesystem is lost.
Solution depends on environment¶
Single-node Kubernetes (development):
hostPath works:
volumes:
- name: vault-storage
  hostPath:
    path: /mnt/data/vault
    type: DirectoryOrCreate
Pod mounts host filesystem. Works for Minikube, Docker Desktop, single-node k3s.
Multi-node Kubernetes (production):
hostPath doesn't work - pods can schedule on different nodes.
Need shared storage:
Option 1: NFS
volumes:
- name: vault-storage
  nfs:
    server: nfs-server.example.com
    path: /exports/vault
NFS server accessible from all nodes. Pods on any node can read/write.
Option 2: Cloud storage (AWS EFS, Google Filestore)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: "efs-sc"
  resources:
    requests:
      storage: 20Gi
EFS - managed NFS from AWS. Zero maintenance. Automatic scaling. Cost: ~$0.30/GB/month.
Option 3: S3 (lakeFS integration)
For EMM I chose a hybrid approach:
- Development: hostPath (local testing)
- Production: S3 via lakeFS (versioning + cloud storage)

The File System Agent works through a storage_backend.py abstraction; the backend is selected via environment variables.
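The actual storage_backend.py isn't shown in the article; here is a sketch of what such an env-configured abstraction might look like (the `STORAGE_BACKEND`/`VAULT_PATH` variable names and the class names are assumptions, not EMM's real config):

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    """Common interface the File System Agent codes against."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

class LocalBackend(StorageBackend):
    """hostPath / local-development backend."""

    def __init__(self, root: str):
        self._root = Path(root)

    def read(self, key: str) -> bytes:
        return (self._root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

def backend_from_env() -> StorageBackend:
    """Pick a backend from environment variables (assumed names)."""
    kind = os.getenv("STORAGE_BACKEND", "local")
    if kind == "local":
        return LocalBackend(os.getenv("VAULT_PATH", "/mnt/data/vault"))
    # kind == "s3" would return an S3/lakeFS-backed implementation here
    raise ValueError(f"unknown storage backend: {kind}")
```

The point of the design: agents never touch paths or buckets directly, so switching from hostPath to S3/lakeFS is a config change, not a code change.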
Challenge 5: Rolling Updates without Downtime¶
Docker Compose update strategy:
docker-compose down
docker-compose pull
docker-compose up -d
Downtime: ~30 seconds while containers restart.
Kubernetes has rolling updates:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
maxUnavailable: 0 means: don't shut down old pod before new pod is ready.
Process:
- Kubernetes creates new pod with new image
- New pod passes readiness probe
- When new pod ready, traffic switches to it
- Old pod gracefully shuts down
- Repeat for each replica
Downtime: zero.
Readiness Probes - critically important¶
Without a readiness probe, Kubernetes routes traffic to a new pod immediately, even if it isn't ready yet. Result: 500 errors.
With readiness probe:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
Kubernetes waits until /health endpoint returns 200 OK. Only then routes traffic.
MCP services have a /health endpoint:

from fastapi import HTTPException

@app.get("/health")
def health_check():
    # Check dependencies
    redis_ok = check_redis_connection()
    llm_ok = check_llm_connection()
    if redis_ok and llm_ok:
        return {"status": "healthy"}
    else:
        raise HTTPException(status_code=503, detail="unhealthy")
If dependencies are failing, /health returns 503. Kubernetes doesn't route traffic to this pod.
Challenge 6: Resource Limits¶
Without resource limits pods can consume all cluster memory/CPU.
Content Classifier especially memory-hungry (LLM inference):
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"
requests: minimum resources Kubernetes guarantees. Scheduler won't place pod on node without these resources.
limits: maximum resources pod can use. If pod exceeds memory limit, Kubernetes kills pod (OOMKilled).
I set limits through profiling:
- Ran the pod without limits
- Monitored memory/CPU usage via kubectl top pod
- Peak memory: 3.2GB (during batch classification of 100 files)
- Set the limit: 4GB (20% buffer)

Similarly for CPU: peak 1.5 cores, limit set to 2 cores.
JD Classifier service - lightweight (only YAML parsing):
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
100m = 0.1 CPU core. 128Mi = 128 mebibytes. Sufficient for parsing jd.yaml.
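When scripting around these manifests it's easy to misread Kubernetes quantity suffixes (`m` is millicores, `Mi` is binary mebibytes, `M` is decimal megabytes). A small helper, as a sketch covering only the suffixes used here, not the full Kubernetes quantity grammar:

```python
def parse_quantity(value: str) -> float:
    """Convert a Kubernetes resource quantity string to base units.

    CPU: "100m" -> 0.1 cores. Memory: "128Mi" -> bytes.
    Sketch only: handles the suffixes used in this article.
    """
    suffixes = {
        "m": 1e-3,                               # millicores
        "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,   # binary (kibi/mebi/gibi)
        "K": 1e3, "M": 1e6, "G": 1e9,            # decimal
    }
    # Check longer suffixes first so "Mi" wins over "M"
    for suffix, factor in sorted(suffixes.items(),
                                 key=lambda kv: len(kv[0]), reverse=True):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # plain number, e.g. cpu: "2"
```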
Deployment Process - Automation¶
I don't run kubectl apply by hand for 15+ manifests. It's automated via scripts.
setup-from-env.sh - Generate Configs¶
#!/bin/bash
# Reads .env file, generates Kubernetes manifests
source .env

# Generate secret for Confluence
kubectl create secret generic confluence-credentials \
  --from-literal=base_url="$CONFLUENCE_BASE_URL" \
  --from-literal=username="$CONFLUENCE_USERNAME" \
  --from-literal=password="$CONFLUENCE_PASSWORD" \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/01-secrets.yaml

# Generate ConfigMap for JD structure
kubectl create configmap jd-structure \
  --from-file=jd.yaml \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/02-configmap.yaml
--dry-run=client -o yaml renders the YAML manifest without applying it to the cluster; the output is redirected to a file.
Result: secrets and configs in YAML format, ready for kubectl apply.
build-and-deploy.sh - Build Images + Deploy¶
#!/bin/bash
# Build Docker images
docker build -t agentic-ai/jd-classifier-service:latest \
  -f mcp-servers/jd-classifier/Dockerfile .
docker build -t agentic-ai/content-classifier-service:latest \
  -f mcp-servers/content-classifier/Dockerfile .
# ... build other services ...

# Tag images for registry
docker tag agentic-ai/jd-classifier-service:latest \
  my-registry.com/jd-classifier-service:latest

# Push to registry
docker push my-registry.com/jd-classifier-service:latest

# Apply manifests
kubectl apply -f manifests/00-namespace.yaml
kubectl apply -f manifests/01-secrets.yaml
kubectl apply -f manifests/02-configmap.yaml
kubectl apply -f manifests/03-redis.yaml
kubectl apply -f manifests/04-mcp-services.yaml
kubectl apply -f manifests/05-agents.yaml

# Wait for rollout
kubectl rollout status deployment/jd-classifier-service -n agentic-ai
kubectl rollout status deployment/content-classifier-service -n agentic-ai
Single command: ./deploy/kubernetes/build-and-deploy.sh
Automatically builds, pushes, deploys. Time: ~5 minutes for full deployment.
Monitoring - Prometheus + Grafana¶
Kubernetes exposes metrics out of the box, but you still need to collect them.
Prometheus - Metrics Collection¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Prometheus scrapes metrics from pods with annotation prometheus.io/scrape: "true".
MCP services expose metrics:

from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest

requests_total = Counter('mcp_requests_total', 'Total MCP requests')
request_duration = Histogram('mcp_request_duration_seconds', 'MCP request duration')

@app.post("/classify")
def classify(request: dict):
    with request_duration.time():
        requests_total.inc()
        # handler logic
        return result

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")
Prometheus scrapes /metrics endpoint every 15 seconds.
Grafana - Visualization¶
Dashboard queries:
# Request rate
rate(mcp_requests_total[5m])
# Error rate
rate(mcp_requests_total{status="error"}[5m])
# P95 latency
histogram_quantile(0.95, rate(mcp_request_duration_seconds_bucket[5m]))
# Memory usage
container_memory_usage_bytes{pod=~".*-service.*"}
The dashboard shows:
- Request rate per service
- Error rate
- Latency P50/P95/P99
- Memory/CPU usage
- Pod restarts
Grafana alerting: if error rate > 5% or P95 latency > 2s, send Slack notification.
Cost Analysis - What it costs¶
Development environment (Minikube on laptop): $0.
Production environment (managed Kubernetes):
Compute:
- 3 worker nodes (t3.medium): $0.0416/hour × 3 × 730 hours = ~$91/month
- 15 pods, averaging 0.5 CPU and 1GB RAM per pod: fits on 3 nodes

Storage:
- Redis PVC: 10GB × $0.10/GB/month = $1/month
- EFS for vault: 20GB × $0.30/GB/month = $6/month

Networking:
- LoadBalancer: $18/month (AWS ELB)
- Data transfer: ~$1/month (internal traffic is free)
Total: ~$117/month for production-grade Kubernetes deployment.
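The arithmetic behind that total can be sanity-checked in a few lines (prices are the ones quoted above, not live AWS pricing):

```python
# Reproduce the monthly estimate from the cost breakdown above.
HOURS_PER_MONTH = 730

compute = 0.0416 * 3 * HOURS_PER_MONTH   # 3 × t3.medium worker nodes
storage = 10 * 0.10 + 20 * 0.30          # Redis PVC + EFS vault
networking = 18 + 1                      # ELB + data transfer

total = compute + storage + networking
print(round(compute), round(storage + networking), round(total))
# prints: 91 26 117
```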
Alternative (Docker Compose on a single VPS):
- t3.large instance: $0.0832/hour × 730 hours = ~$61/month

Kubernetes is more expensive ($117 vs $61), but provides:
- Zero-downtime deployments
- Horizontal scaling
- Service isolation
- Professional monitoring
- Disaster recovery
Trade-off: $56/month for operational peace of mind. Worth it.
Mistakes I made¶
1. Forgot readiness probes initially¶
First deployment: pods created, traffic routed immediately, but services not ready yet. 500 errors for 30 seconds while services booted.
Fix: added readinessProbe with initialDelaySeconds: 10. Kubernetes now waits.
2. Resource limits too low¶
Content Classifier service constantly OOMKilled (Out Of Memory). Pod restarts, traffic loss, errors.
Cause: LLM inference needs 3GB memory, but I set limit 1GB.
Fix: profiling via kubectl top pod, increased limit to 4GB. Problem solved.
3. Redis without persistence¶
First Redis deployment was Deployment without PVC. When Redis pod restarted, all cache lost.
Fix: changed to StatefulSet with PVC. Cache persists across restarts.
4. Secrets in Git (initial commit)¶
Initially I committed secrets.yaml with real passwords. Then I realized the repo was public.
Fix:
1. git filter-branch to remove secrets from history (painful)
2. Regenerated all passwords
3. Added secrets.yaml to .gitignore
4. Use kubectl create secret instead of YAML files
Lesson learned: never commit secrets. Never.
Conclusions from production deployment¶
Kubernetes deployment for AI agents is not trivial. Questions emerge that didn't exist in local development:
Service discovery → HTTP wrapper + Kubernetes Services
Distributed cache → Redis StatefulSet
Secrets management → Kubernetes Secrets API + External Secrets Operator
Storage → NFS/EFS for multi-node, hostPath for single-node, S3 for production
Rolling updates → maxUnavailable: 0 + readiness probes
Resource limits → Profiling + 20% buffer
Monitoring → Prometheus metrics + Grafana dashboards
Deployment process automated via scripts. Single command deploys 15+ services.
Cost: ~$117/month for production cluster vs $61/month for single VPS. Trade-off: operational reliability for extra $56/month.
Mistakes: forgot readiness probes, resource limits too low, Redis without persistence, secrets in Git. Fixed through iterations.
For EMM transition from Docker Compose to Kubernetes took 2 weeks. Development, testing, deployment, monitoring setup. Result: zero-downtime updates, isolation, scaling ready.
If building multi-service AI platform - Kubernetes provides flexibility and reliability. Initial setup more complex, but long-term operational benefits pay off.
Related: Data Versioning for AI Agents with lakeFS, Developing and Testing AI Agents
Author: Igor Gorovyy Role: DevOps Engineer Lead & Senior Solutions Architect LinkedIn: linkedin.com/in/gorovyyigor
Deployment summary¶
- Environment: Kubernetes 1.28+
- Services: 15 pods (3 agents, 7 MCP services, 5 infrastructure)
- Storage: Redis (10GB PVC), EFS (20GB for vault)
- Cost: ~$117/month (production), $0 (development)
- Deployment time: 5 minutes (automated script)
- Uptime: 99.9% (zero-downtime rolling updates)
