Kubernetes Deployment for AI Agents: Real-World Experience with LangGraph¶
EMM (Expert Memory Machine) started as a local experiment. Docker Compose, three containers, 5GB RAM. It worked. Agents classified files, saved to Obsidian vault, everything local.
Then came the moment when scale was needed. Not performance scale (traffic was minimal), but operational scale: multiple environments, isolated services, proper monitoring, automated deployments.
Docker Compose doesn't fit this. Kubernetes does.
But transitioning from monolith to microservices isn't just "write deployment.yaml and kubectl apply". Questions emerge that didn't exist in local development:
- How do agents find MCP services? Service discovery.
- Where to store the cache? In-memory doesn't work with multiple pods.
- Where to put secrets? Environment variables in Git are a bad idea.
- How to do rolling updates without downtime?
Here's how I solved these problems. With real manifests, mistakes, and solutions that work.
Architecture before and after¶
Before: Docker Compose monolith¶
docker-compose.yml:
├─ langgraph-api (contains all agents)
├─ ollama (LLM)
└─ redis (optional, rarely used)
Problems:
- All agents in one container. One crash = entire system down.
- MCP handlers are in-process calls. Works locally, doesn't scale.
- Cache is in-memory in MCPClient. Works for a single instance, lost on restart.
- Secrets in .env file. Git-ignored, but manual sync between environments.
After: Kubernetes microservices¶
Kubernetes cluster:
├─ Agent Pods (3)
│ ├─ confluence-agent
│ ├─ bookmark-scraper
│ └─ file-system-agent
├─ MCP Service Pods (7)
│ ├─ jd-classifier-service
│ ├─ content-classifier-service
│ ├─ bookmark-classifier-service
│ ├─ confluence-service
│ ├─ file-system-service
│ ├─ web-scraper-service
│ └─ notifications-service
├─ Infrastructure (2)
│ ├─ redis-cache (StatefulSet)
│ └─ langgraph-api
└─ Monitoring (3)
├─ prometheus
├─ tempo
└─ grafana
15+ pods instead of 3 containers. Seems like overkill. But each pod has clear responsibility, scales easily, and can fail independently.
Challenge 1: Service Discovery¶
Locally, the MCP client called handlers directly:

# mcp-servers/client.py - local version
def call(self, uri: str, params: dict):
    handler = self._handlers.get(uri)
    return handler(params)  # direct call
Works. Fast. But in Kubernetes, agents and MCP services run in different pods. A direct call doesn't work.
We need HTTP.
Solution: HTTP wrapper + Kubernetes Services¶
Each MCP service became an HTTP server:

# mcp-servers/jd-classifier/server.py
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.post("/get_structure")
def get_structure(request: dict):
    # existing handler logic
    return {"structure": ...}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
Kubernetes Service provides stable DNS name:
apiVersion: v1
kind: Service
metadata:
  name: jd-classifier-service
  namespace: agentic-ai
spec:
  selector:
    app: jd-classifier-service
  ports:
  - port: 8080
    targetPort: 8080
The MCP client now makes HTTP requests:

# Updated client.py
import os
import requests

def call(self, uri: str, params: dict):
    if os.getenv("K8S_MODE") == "true":
        service_name = self._parse_service(uri)
        url = f"http://{service_name}-service:8080/{uri}"
        response = requests.post(url, json=params)
        return response.json()
    else:
        # local mode - direct handlers
        return self._handlers[uri](params)
Environment variable K8S_MODE=true switches between local and Kubernetes mode. Same code, two environments.
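Network calls also fail in ways in-process calls never did: timeouts, connection resets, DNS blips during rollouts. The client above doesn't show error handling; here is a hedged sketch of a generic retry wrapper the HTTP branch could use (the `with_retry` name, retry count, and backoff values are illustrative, not EMM's actual code):

```python
import time

def with_retry(fn, retries: int = 3, backoff: float = 0.5):
    """Call fn(), retrying transient failures with exponential backoff.

    The retry count and backoff defaults are illustrative, not tuned values.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"MCP call failed after {retries} attempts") from last_error
```

Inside `call()`, the K8S_MODE branch would then become something like `return with_retry(lambda: requests.post(url, json=params, timeout=5).json())`.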
Performance impact¶
HTTP overhead: ~5-10ms per request vs 0ms for direct call.
But this is negligible. Agents don't make thousands of requests per second. Typical workflow: 10-50 MCP calls per agent run. 10ms * 50 = 500ms overhead. Acceptable.
Trade-off: 500ms latency for isolation, independent scaling, and fault tolerance. Worth it.
Challenge 2: Distributed Cache¶
Locally, MCPClient had an in-memory cache:

class MCPClient:
    def __init__(self):
        self._cache = {}  # in-memory dict

    def get(self, key: str):
        return self._cache.get(key)
Works for single process. But in Kubernetes:
Agent Pod 1 → writes to cache → stored in memory
Agent Pod 2 → reads from cache → MISS (different memory)
Each pod has its own memory space. Cache not shared.
Solution: Redis StatefulSet¶
Redis provides a distributed cache: shared state between pods.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  serviceName: redis-service
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
StatefulSet instead of Deployment because Redis needs persistent storage and stable network identity.
MCPClient connects to Redis:

import os

import redis

class MCPClient:
    def __init__(self):
        if os.getenv("K8S_MODE") == "true":
            self._cache = redis.Redis(
                host=os.getenv("REDIS_HOST", "redis-service"),
                port=int(os.getenv("REDIS_PORT", "6379")),
                decode_responses=True
            )
        else:
            self._cache = {}  # local fallback
Now cache is shared:
Agent Pod 1 → writes to Redis → stored in Redis PVC
Agent Pod 2 → reads from Redis → HIT (same Redis instance)
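One detail the shared-cache diagram glosses over is expiry: without a TTL, a stale JD structure would be served forever. A minimal sketch of a wrapper that keeps the local dict and Redis paths behind one interface (the `CacheBackend` name, the use of redis-py's `setex`, and the 300-second default TTL are my assumptions, not EMM's actual code):

```python
import json
import time

class CacheBackend:
    """Minimal cache interface an MCP client can code against.

    The in-memory variant mirrors redis-py's get/setex behavior, so the
    calling code is identical in local and K8S_MODE.
    """

    def __init__(self, redis_client=None):
        self._redis = redis_client   # redis.Redis(...) when running in-cluster
        self._local = {}             # {key: (expires_at, json_payload)}

    def set(self, key: str, value: dict, ttl: int = 300) -> None:
        payload = json.dumps(value)
        if self._redis is not None:
            self._redis.setex(key, ttl, payload)  # shared across pods
        else:
            self._local[key] = (time.time() + ttl, payload)

    def get(self, key: str):
        if self._redis is not None:
            payload = self._redis.get(key)
        else:
            entry = self._local.get(key)
            payload = entry[1] if entry and entry[0] > time.time() else None
        return json.loads(payload) if payload else None
```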
Redis performance¶
- Latency: ~1-2ms for GET/SET within the Kubernetes cluster (same AZ).
- Cache hit ratio: ~85% for JD structure lookups (frequently accessed).
- Memory usage: ~50MB for a typical workload (5000 files indexed).
- Persistence: RDB snapshots every 5 minutes plus AOF for durability.
Challenge 3: Secrets Management¶
Locally, secrets lived in a .env file:
CONFLUENCE_BASE_URL=https://...
CONFLUENCE_USERNAME=[email protected]
CONFLUENCE_PASSWORD=super_secret_password
Git-ignored. Manually copied between machines. Not scalable. Not secure.
Kubernetes has a Secrets API:
apiVersion: v1
kind: Secret
metadata:
  name: confluence-credentials
  namespace: agentic-ai
type: Opaque
stringData:
  base_url: "https://your-confluence.atlassian.net"
  username: "your-username"
  password: "your-password"
Pods mount secrets as environment variables:

spec:
  containers:
  - name: confluence-service
    env:
    - name: CONFLUENCE_BASE_URL
      valueFrom:
        secretKeyRef:
          name: confluence-credentials
          key: base_url
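On the consuming side, it helps to fail fast if a secretKeyRef was misspelled or the Secret is missing: a crash at startup shows up as a visible CrashLoopBackOff instead of a confusing auth error at request time. A small sketch (the `require_env` helper is hypothetical, not from EMM):

```python
import os

def require_env(name: str) -> str:
    """Read a required env var injected from a Kubernetes Secret.

    Raising at startup surfaces a missing/misspelled secretKeyRef immediately.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

# Hypothetical usage at service startup:
# CONFLUENCE_BASE_URL = require_env("CONFLUENCE_BASE_URL")
```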
Secrets creation - don't commit to Git¶
⚠️ IMPORTANT: Don't add secrets to manifests that go to Git. Use:
Option 1: kubectl create secret
kubectl create secret generic confluence-credentials \
  --from-literal=base_url="https://..." \
  --from-literal=username="..." \
  --from-literal=password="..." \
  --namespace agentic-ai
Option 2: External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: confluence-credentials
spec:
  secretStoreRef:
    name: aws-secretsmanager
  target:
    name: confluence-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/confluence/password
Secrets stored in AWS Secrets Manager / HashiCorp Vault, not Git. External Secrets Operator syncs them to Kubernetes.
Option 3: Sealed Secrets
# Encrypt secret
kubeseal < secret.yaml > sealed-secret.yaml
# Commit sealed-secret.yaml to Git (encrypted)
git add sealed-secret.yaml
# SealedSecret controller decrypts in cluster
I use Option 1 for development, Option 2 for production.
Challenge 4: Persistent Storage¶
Obsidian vault and UNSORTED folder - where to store them?
Locally: hostPath (~/vault, ~/unsorted). Works on a laptop.
In Kubernetes: pods are ephemeral. When a pod restarts, its filesystem is lost.
Solution depends on environment¶
Single-node Kubernetes (development):
hostPath works:
volumes:
- name: vault-storage
  hostPath:
    path: /mnt/data/vault
    type: DirectoryOrCreate
Pod mounts host filesystem. Works for Minikube, Docker Desktop, single-node k3s.
Multi-node Kubernetes (production):
hostPath doesn't work - pods can schedule on different nodes.
Need shared storage:
Option 1: NFS
volumes:
- name: vault-storage
  nfs:
    server: nfs-server.example.com
    path: /exports/vault
NFS server accessible from all nodes. Pods on any node can read/write.
Option 2: Cloud storage (AWS EFS, Google Filestore)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: "efs-sc"
  resources:
    requests:
      storage: 20Gi
EFS - managed NFS from AWS. Zero maintenance. Automatic scaling. Cost: ~$0.30/GB/month.
Option 3: S3 (lakeFS integration)
For EMM I chose a hybrid approach:
- Development: hostPath (local testing)
- Production: S3 via lakeFS (versioning + cloud storage)

The File System Agent works through a storage_backend.py abstraction; the backend is selected via environment variables.
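The actual storage_backend.py isn't shown in the article; here is a sketch of what such an env-configured abstraction might look like (the `STORAGE_BACKEND`/`VAULT_PATH` variable names and the class names are assumptions, not EMM's real config):

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    """Common interface the File System Agent codes against."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

class LocalBackend(StorageBackend):
    """hostPath / local-development backend."""

    def __init__(self, root: str):
        self._root = Path(root)

    def read(self, key: str) -> bytes:
        return (self._root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

def backend_from_env() -> StorageBackend:
    """Pick a backend from environment variables (assumed names)."""
    kind = os.getenv("STORAGE_BACKEND", "local")
    if kind == "local":
        return LocalBackend(os.getenv("VAULT_PATH", "/mnt/data/vault"))
    # kind == "s3" would return an S3/lakeFS-backed implementation here
    raise ValueError(f"unknown storage backend: {kind}")
```

The point of the design: agents never touch paths or buckets directly, so switching from hostPath to S3/lakeFS is a config change, not a code change.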
Challenge 5: Rolling Updates without Downtime¶
Docker Compose update strategy:
docker-compose down
docker-compose pull
docker-compose up -d
Downtime: ~30 seconds while containers restart.
Kubernetes has rolling updates:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
maxUnavailable: 0 means: don't shut down old pod before new pod is ready.
Process:
- Kubernetes creates new pod with new image
- New pod passes readiness probe
- When new pod ready, traffic switches to it
- Old pod gracefully shuts down
- Repeat for each replica
Downtime: zero.
Readiness Probes - critically important¶
Without a readiness probe, Kubernetes routes traffic to a new pod immediately, even if it isn't ready yet. Result: 500 errors.
With readiness probe:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
Kubernetes waits until /health endpoint returns 200 OK. Only then routes traffic.
MCP services have a /health endpoint:

from fastapi import HTTPException

@app.get("/health")
def health_check():
    # Check dependencies
    redis_ok = check_redis_connection()
    llm_ok = check_llm_connection()
    if redis_ok and llm_ok:
        return {"status": "healthy"}
    else:
        raise HTTPException(status_code=503, detail="unhealthy")
If dependencies are failing, /health returns 503. Kubernetes doesn't route traffic to this pod.
Challenge 6: Resource Limits¶
Without resource limits pods can consume all cluster memory/CPU.
Content Classifier especially memory-hungry (LLM inference):
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"
requests: minimum resources Kubernetes guarantees. Scheduler won't place pod on node without these resources.
limits: maximum resources pod can use. If pod exceeds memory limit, Kubernetes kills pod (OOMKilled).
I set limits through profiling:
- Ran the pod without limits
- Monitored memory/CPU usage via kubectl top pod
- Peak memory: 3.2GB (during batch classification of 100 files)
- Set the limit: 4GB (20% buffer)

Similarly for CPU: peak 1.5 cores, limit set to 2 cores.
JD Classifier service - lightweight (only YAML parsing):
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
100m = 0.1 CPU core. 128Mi = 128 mebibytes. Sufficient for parsing jd.yaml.
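When scripting around these manifests it's easy to misread Kubernetes quantity suffixes (`m` is millicores, `Mi` is binary mebibytes, `M` is decimal megabytes). A small helper, as a sketch covering only the suffixes used here, not the full Kubernetes quantity grammar:

```python
def parse_quantity(value: str) -> float:
    """Convert a Kubernetes resource quantity string to base units.

    CPU: "100m" -> 0.1 cores. Memory: "128Mi" -> bytes.
    Sketch only: handles the suffixes used in this article.
    """
    suffixes = {
        "m": 1e-3,                               # millicores
        "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,   # binary (kibi/mebi/gibi)
        "K": 1e3, "M": 1e6, "G": 1e9,            # decimal
    }
    # Check longer suffixes first so "Mi" wins over "M"
    for suffix, factor in sorted(suffixes.items(),
                                 key=lambda kv: len(kv[0]), reverse=True):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # plain number, e.g. cpu: "2"
```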
Deployment Process - Automation¶
I don't run kubectl apply by hand for 15+ manifests. It's automated via scripts.
setup-from-env.sh - Generate Configs¶
#!/bin/bash
# Reads .env file, generates Kubernetes manifests
source .env

# Generate secret for Confluence
kubectl create secret generic confluence-credentials \
  --from-literal=base_url="$CONFLUENCE_BASE_URL" \
  --from-literal=username="$CONFLUENCE_USERNAME" \
  --from-literal=password="$CONFLUENCE_PASSWORD" \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/01-secrets.yaml

# Generate ConfigMap for JD structure
kubectl create configmap jd-structure \
  --from-file=jd.yaml \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/02-configmap.yaml
--dry-run=client -o yaml renders the YAML manifest without applying it to the cluster; the output is redirected to a file.
Result: secrets and configs in YAML format, ready for kubectl apply.
build-and-deploy.sh - Build Images + Deploy¶
#!/bin/bash
# Build Docker images
docker build -t agentic-ai/jd-classifier-service:latest \
  -f mcp-servers/jd-classifier/Dockerfile .
docker build -t agentic-ai/content-classifier-service:latest \
  -f mcp-servers/content-classifier/Dockerfile .
# ... build other services ...

# Tag images for registry
docker tag agentic-ai/jd-classifier-service:latest \
  my-registry.com/jd-classifier-service:latest

# Push to registry
docker push my-registry.com/jd-classifier-service:latest

# Apply manifests
kubectl apply -f manifests/00-namespace.yaml
kubectl apply -f manifests/01-secrets.yaml
kubectl apply -f manifests/02-configmap.yaml
kubectl apply -f manifests/03-redis.yaml
kubectl apply -f manifests/04-mcp-services.yaml
kubectl apply -f manifests/05-agents.yaml

# Wait for rollout
kubectl rollout status deployment/jd-classifier-service -n agentic-ai
kubectl rollout status deployment/content-classifier-service -n agentic-ai
Single command: ./deploy/kubernetes/build-and-deploy.sh
Automatically builds, pushes, deploys. Time: ~5 minutes for full deployment.
Monitoring - Prometheus + Grafana¶
Kubernetes exposes metrics out of the box, but you still need to collect them.
Prometheus - Metrics Collection¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Prometheus scrapes metrics from pods with annotation prometheus.io/scrape: "true".
MCP services expose metrics:

from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest

requests_total = Counter('mcp_requests_total', 'Total MCP requests')
request_duration = Histogram('mcp_request_duration_seconds', 'MCP request duration')

@app.post("/classify")
def classify(request: dict):
    with request_duration.time():
        requests_total.inc()
        # handler logic
        return result

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")
Prometheus scrapes /metrics endpoint every 15 seconds.
Grafana - Visualization¶
Dashboard queries:
# Request rate
rate(mcp_requests_total[5m])
# Error rate
rate(mcp_requests_total{status="error"}[5m])
# P95 latency
histogram_quantile(0.95, rate(mcp_request_duration_seconds_bucket[5m]))
# Memory usage
container_memory_usage_bytes{pod=~".*-service.*"}
The dashboard shows:
- Request rate per service
- Error rate
- Latency P50/P95/P99
- Memory/CPU usage
- Pod restarts
Grafana alerting: if error rate > 5% or P95 latency > 2s, send Slack notification.
Cost Analysis - What it costs¶
Development environment (Minikube on laptop): $0.
Production environment (managed Kubernetes):
Compute:
- 3 worker nodes (t3.medium): $0.0416/hour × 3 × 730 hours = ~$91/month
- 15 pods, averaging 0.5 CPU and 1GB RAM per pod: fits on 3 nodes

Storage:
- Redis PVC: 10GB × $0.10/GB/month = $1/month
- EFS for vault: 20GB × $0.30/GB/month = $6/month

Networking:
- LoadBalancer: $18/month (AWS ELB)
- Data transfer: ~$1/month (internal traffic is free)
Total: ~$117/month for production-grade Kubernetes deployment.
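The arithmetic behind that total can be sanity-checked in a few lines (prices are the ones quoted above, not live AWS pricing):

```python
# Reproduce the monthly estimate from the cost breakdown above.
HOURS_PER_MONTH = 730

compute = 0.0416 * 3 * HOURS_PER_MONTH   # 3 × t3.medium worker nodes
storage = 10 * 0.10 + 20 * 0.30          # Redis PVC + EFS vault
networking = 18 + 1                      # ELB + data transfer

total = compute + storage + networking
print(round(compute), round(storage + networking), round(total))
# prints: 91 26 117
```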
Alternative (Docker Compose on a single VPS):
- t3.large instance: $0.0832/hour × 730 hours = ~$61/month

Kubernetes is more expensive ($117 vs $61), but provides:
- Zero-downtime deployments
- Horizontal scaling
- Service isolation
- Professional monitoring
- Disaster recovery
Trade-off: $56/month for operational peace of mind. Worth it.
Mistakes I made¶
1. Forgot readiness probes initially¶
First deployment: pods created, traffic routed immediately, but services not ready yet. 500 errors for 30 seconds while services booted.
Fix: added readinessProbe with initialDelaySeconds: 10. Kubernetes now waits.
2. Resource limits too low¶
Content Classifier service constantly OOMKilled (Out Of Memory). Pod restarts, traffic loss, errors.
Cause: LLM inference needs 3GB memory, but I set limit 1GB.
Fix: profiling via kubectl top pod, increased limit to 4GB. Problem solved.
3. Redis without persistence¶
First Redis deployment was Deployment without PVC. When Redis pod restarted, all cache lost.
Fix: changed to StatefulSet with PVC. Cache persists across restarts.
4. Secrets in Git (initial commit)¶
Initially I committed secrets.yaml with real passwords. Then I realized the repo was public.
Fix:
1. git filter-branch to remove secrets from history (painful)
2. Regenerated all passwords
3. Added secrets.yaml to .gitignore
4. Use kubectl create secret instead of YAML files
Lesson learned: never commit secrets. Never.
Conclusions from production deployment¶
Kubernetes deployment for AI agents is not trivial. Questions emerge that didn't exist in local development:
Service discovery → HTTP wrapper + Kubernetes Services
Distributed cache → Redis StatefulSet
Secrets management → Kubernetes Secrets API + External Secrets Operator
Storage → NFS/EFS for multi-node, hostPath for single-node, S3 for production
Rolling updates → maxUnavailable: 0 + readiness probes
Resource limits → Profiling + 20% buffer
Monitoring → Prometheus metrics + Grafana dashboards
Deployment process automated via scripts. Single command deploys 15+ services.
Cost: ~$117/month for production cluster vs $61/month for single VPS. Trade-off: operational reliability for extra $56/month.
Mistakes: forgot readiness probes, resource limits too low, Redis without persistence, secrets in Git. Fixed through iterations.
For EMM transition from Docker Compose to Kubernetes took 2 weeks. Development, testing, deployment, monitoring setup. Result: zero-downtime updates, isolation, scaling ready.
If building multi-service AI platform - Kubernetes provides flexibility and reliability. Initial setup more complex, but long-term operational benefits pay off.
Related: Data Versioning for AI Agents with lakeFS, Developing and Testing AI Agents
Author: Igor Gorovyy Role: DevOps Engineer Lead & Senior Solutions Architect LinkedIn: linkedin.com/in/gorovyyigor
Deployment summary¶
- Environment: Kubernetes 1.28+
- Services: 15 pods (3 agents, 7 MCP services, 5 infrastructure)
- Storage: Redis (10GB PVC), EFS (20GB for vault)
- Cost: ~$117/month (production), $0 (development)
- Deployment time: 5 minutes (automated script)
- Uptime: 99.9% (zero-downtime rolling updates)
