
Kubernetes Deployment for AI Agents: Real-World Experience with LangGraph


[Figure: Kubernetes Deployment Architecture]

EMM (Expert Memory Machine) started as a local experiment. Docker Compose, three containers, 5GB RAM. It worked. Agents classified files, saved to Obsidian vault, everything local.

Then came the moment when scale was needed. Not performance scale (traffic was minimal), but operational scale: multiple environments, isolated services, proper monitoring, automated deployments.

Docker Compose doesn't fit this. Kubernetes does.

But transitioning from monolith to microservices isn't just "write deployment.yaml and kubectl apply". Questions emerge that didn't exist in local development:

- How do agents find MCP services? Service discovery.
- Where to store the cache? In-memory doesn't work with multiple pods.
- Where to put secrets? Environment variables committed to Git are a bad idea.
- How to do rolling updates without downtime?

Here's how I solved these problems. With real manifests, mistakes, and solutions that work.

Architecture before and after

Before: Docker Compose monolith

docker-compose.yml:
  ├─ langgraph-api (contains all agents)
  ├─ ollama (LLM)
  └─ redis (optional, rarely used)

Problems:

- All agents in one container. One crash = entire system down.
- MCP handlers are in-process calls. Works locally, doesn't scale.
- Cache is in-memory inside MCPClient. Fine for a single instance, lost on restart.
- Secrets in a .env file. Git-ignored, but manually synced between environments.

After: Kubernetes microservices

Kubernetes cluster:
  ├─ Agent Pods (3)
  │   ├─ confluence-agent
  │   ├─ bookmark-scraper
  │   └─ file-system-agent
  ├─ MCP Service Pods (7)
  │   ├─ jd-classifier-service
  │   ├─ content-classifier-service
  │   ├─ bookmark-classifier-service
  │   ├─ confluence-service
  │   ├─ file-system-service
  │   ├─ web-scraper-service
  │   └─ notifications-service
  ├─ Infrastructure (2)
  │   ├─ redis-cache (StatefulSet)
  │   └─ langgraph-api
  └─ Monitoring (3)
      ├─ prometheus
      ├─ tempo
      └─ grafana

15+ pods instead of 3 containers. Seems like overkill. But each pod has clear responsibility, scales easily, and can fail independently.

Challenge 1: Service Discovery

Locally, the MCP Client called handlers directly:

# mcp-servers/client.py - local version
def call(self, uri: str, params: dict):
    handler = self._handlers.get(uri)
    if handler is None:
        raise KeyError(f"No handler registered for {uri}")
    return handler(params)  # direct call

Works. Fast. But in Kubernetes, agents and MCP services run in separate pods, so a direct in-process call is impossible.

Need HTTP.

Solution: HTTP wrapper + Kubernetes Services

Each MCP service became an HTTP server:

# mcp-servers/jd-classifier/server.py
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.post("/get_structure")
def get_structure(request: dict):
    # existing handler logic
    return {"structure": ...}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)

A Kubernetes Service provides a stable DNS name:

apiVersion: v1
kind: Service
metadata:
  name: jd-classifier-service
  namespace: agentic-ai
spec:
  selector:
    app: jd-classifier-service
  ports:
  - port: 8080
    targetPort: 8080

The MCP Client now makes HTTP requests:

# Updated client.py
import os

import requests

def call(self, uri: str, params: dict):
    if os.getenv("K8S_MODE") == "true":
        service_name = self._parse_service(uri)
        url = f"http://{service_name}-service:8080/{uri}"
        response = requests.post(url, json=params, timeout=30)
        response.raise_for_status()
        return response.json()
    else:
        # local mode - direct handlers
        return self._handlers[uri](params)

The K8S_MODE=true environment variable switches between local and Kubernetes modes. Same code, two environments.
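One consequence of going over the network: transient failures (a pod restarting mid-rollout, a brief DNS blip) now exist where the direct call never failed. A small retry helper can absorb them; this is a sketch of my own, not part of the original client, and the names are hypothetical:

```python
import time

def call_with_retry(fn, retries=3, backoff=0.1):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries - surface the error
            time.sleep(backoff * (2 ** attempt))
```

In the K8S_MODE branch, the POST could then be wrapped as call_with_retry(lambda: requests.post(url, json=params, timeout=30)).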

Performance impact

HTTP overhead: ~5-10ms per request vs 0ms for direct call.

But this is negligible. Agents don't make thousands of requests per second. A typical workflow makes 10-50 MCP calls per agent run: 10ms × 50 = 500ms overhead in the worst case. Acceptable.

Trade-off: 500ms latency for isolation, independent scaling, and fault tolerance. Worth it.

Challenge 2: Distributed Cache

Locally, MCPClient had an in-memory cache:

class MCPClient:
    def __init__(self):
        self._cache = {}  # in-memory dict

    def get(self, key: str):
        return self._cache.get(key)

Works for single process. But in Kubernetes:

Agent Pod 1 → writes to cache → stored in memory
Agent Pod 2 → reads from cache → MISS (different memory)

Each pod has its own memory space. Cache not shared.

Solution: Redis StatefulSet

Redis is a distributed cache: shared state between pods instead of per-process memory.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  serviceName: redis-service
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

StatefulSet instead of Deployment because Redis needs persistent storage and stable network identity.

MCPClient connects to Redis:

import os

import redis

class MCPClient:
    def __init__(self):
        if os.getenv("K8S_MODE") == "true":
            self._cache = redis.Redis(
                host=os.getenv("REDIS_HOST", "redis-service"),
                port=int(os.getenv("REDIS_PORT", "6379")),
                decode_responses=True
            )
        else:
            self._cache = {}  # local fallback (dict.get mirrors Redis GET)

Now cache is shared:

Agent Pod 1 → writes to Redis → stored in Redis PVC
Agent Pod 2 → reads from Redis → HIT (same Redis instance)
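On top of the shared cache, the access pattern is the classic cache-aside flow: check Redis first, compute on a miss, write back with a TTL. A minimal sketch (the function and its parameters are assumptions for illustration, not the actual EMM code):

```python
import json

def get_or_compute(cache, key, compute, ttl=300):
    """Cache-aside: return the cached value, or compute, store, and return it."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = compute()
    # ex= is the TTL in seconds, matching the redis-py set() signature
    cache.set(key, json.dumps(value), ex=ttl)
    return value
```

The dict-backed local fallback needs a small adapter for the ex= keyword, which is one reason the Redis and in-memory code paths are worth keeping behind one interface.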

Redis performance

Latency: ~1-2ms for GET/SET in Kubernetes cluster (same AZ).

Cache hit ratio: ~85% for JD structure lookups (frequently accessed).

Memory usage: 50MB for typical workload (5000 files indexed).

Persistence: RDB snapshots every 5 minutes + AOF for durability.
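Those persistence settings correspond roughly to this redis.conf fragment (the exact values here are assumptions matching the description above, not the production config):

```
# RDB: snapshot if at least 1 key changed in the last 300 seconds (5 minutes)
save 300 1
# AOF: append-only file for durability between snapshots
appendonly yes
appendfsync everysec
```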

Challenge 3: Secrets Management

Locally secrets in .env file:

CONFLUENCE_BASE_URL=https://...
CONFLUENCE_USERNAME=[email protected]
CONFLUENCE_PASSWORD=super_secret_password

Git-ignored. Manually copied between machines. Not scalable. Not secure.

Kubernetes has Secrets API:

apiVersion: v1
kind: Secret
metadata:
  name: confluence-credentials
  namespace: agentic-ai
type: Opaque
stringData:
  base_url: "https://your-confluence.atlassian.net"
  username: "your-username"
  password: "your-password"

Pods consume secrets as environment variables:

spec:
  containers:
  - name: confluence-service
    env:
    - name: CONFLUENCE_BASE_URL
      valueFrom:
        secretKeyRef:
          name: confluence-credentials
          key: base_url

Secrets creation - don't commit to Git

⚠️ IMPORTANT: Don't add secrets to manifests that go to Git. Use:

Option 1: kubectl create secret

kubectl create secret generic confluence-credentials \
  --from-literal=base_url="https://..." \
  --from-literal=username="..." \
  --from-literal=password="..." \
  --namespace agentic-ai

Option 2: External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: confluence-credentials
spec:
  secretStoreRef:
    name: aws-secretsmanager
  target:
    name: confluence-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/confluence/password

Secrets stored in AWS Secrets Manager / HashiCorp Vault, not Git. External Secrets Operator syncs them to Kubernetes.

Option 3: Sealed Secrets

# Encrypt secret
kubeseal < secret.yaml > sealed-secret.yaml

# Commit sealed-secret.yaml to Git (encrypted)
git add sealed-secret.yaml

# SealedSecret controller decrypts in cluster

I use Option 1 for development, Option 2 for production.

Challenge 4: Persistent Storage

Obsidian vault and UNSORTED folder - where to store them?

Locally: hostPath (~/vault, ~/unsorted). Works on laptop.

Kubernetes: pods are ephemeral. When a pod restarts, its filesystem is lost.

Solution depends on environment

Single-node Kubernetes (development):

hostPath works:

volumes:
- name: vault-storage
  hostPath:
    path: /mnt/data/vault
    type: DirectoryOrCreate

Pod mounts host filesystem. Works for Minikube, Docker Desktop, single-node k3s.

Multi-node Kubernetes (production):

hostPath doesn't work: pods can be scheduled on different nodes.

Need shared storage:

Option 1: NFS

volumes:
- name: vault-storage
  nfs:
    server: nfs-server.example.com
    path: /exports/vault

NFS server accessible from all nodes. Pods on any node can read/write.

Option 2: Cloud storage (AWS EFS, Google Filestore)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: "efs-sc"
  resources:
    requests:
      storage: 20Gi

EFS - managed NFS from AWS. Zero maintenance. Automatic scaling. Cost: ~$0.30/GB/month.

Option 3: S3 (lakeFS integration)

For EMM I chose a hybrid approach:

- Development: hostPath (local testing)
- Production: S3 via lakeFS (versioning + cloud storage)

The File System Agent works through a storage_backend.py abstraction; the backend is selected via environment variables.
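The storage_backend.py code itself isn't shown here; as a rough sketch of what such an abstraction might look like (class and variable names are mine, and the S3/lakeFS branch is omitted):

```python
import os
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalBackend(StorageBackend):
    """hostPath-style backend: plain files under a root directory."""
    def __init__(self, root: str):
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path)

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def read(self, path):
        with open(self._full(path), "rb") as f:
            return f.read()

def make_backend() -> StorageBackend:
    # Backend selected via environment variables, as described above
    # (variable names here are hypothetical)
    if os.getenv("STORAGE_BACKEND", "local") == "local":
        return LocalBackend(os.getenv("VAULT_PATH", "/mnt/data/vault"))
    raise NotImplementedError("S3/lakeFS backend omitted from this sketch")
```

Agents depend only on StorageBackend, so the same agent code runs against hostPath in development and S3/lakeFS in production.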

Challenge 5: Rolling Updates without Downtime

Docker Compose update strategy:

docker-compose down
docker-compose pull
docker-compose up -d

Downtime: ~30 seconds while containers restart.

Kubernetes has rolling updates:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

maxUnavailable: 0 means: don't shut down old pod before new pod is ready.

Process:

  1. Kubernetes creates new pod with new image
  2. New pod passes readiness probe
  3. When new pod ready, traffic switches to it
  4. Old pod gracefully shuts down
  5. Repeat for each replica

Downtime: zero.

Readiness Probes - critically important

Without readiness probe Kubernetes immediately routes traffic to new pod, even if it's not ready yet. Result: 500 errors.

With readiness probe:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

Kubernetes waits until /health endpoint returns 200 OK. Only then routes traffic.

MCP services have /health endpoint:

from fastapi import HTTPException

@app.get("/health")
def health_check():
    # Check dependencies
    redis_ok = check_redis_connection()
    llm_ok = check_llm_connection()

    if redis_ok and llm_ok:
        return {"status": "healthy"}
    else:
        raise HTTPException(status_code=503, detail="unhealthy")

If dependencies are failing, /health returns 503. Kubernetes doesn't route traffic to this pod.

Challenge 6: Resource Limits

Without resource limits pods can consume all cluster memory/CPU.

Content Classifier especially memory-hungry (LLM inference):

resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"

requests: the minimum resources Kubernetes guarantees. The scheduler won't place a pod on a node that lacks them.

limits: the maximum a pod can use. If a container exceeds its memory limit, it is killed (OOMKilled).

I set limits through profiling:

  1. Ran the pod without limits
  2. Monitored memory/CPU usage via kubectl top pod
  3. Peak memory: 3.2GB (during batch classification of 100 files)
  4. Set limit: 4GB (~20% buffer)

Similarly for CPU: peak 1.5 cores, limit set to 2 cores.

JD Classifier service - lightweight (only YAML parsing):

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"

100m = 0.1 CPU core. 128Mi = 128 mebibytes. Sufficient for parsing jd.yaml.

Deployment Process - Automation

I don't write kubectl apply manually for 15+ manifests. Automated via scripts.

setup-from-env.sh - Generate Configs

#!/bin/bash
# Reads .env file, generates Kubernetes manifests
set -euo pipefail

source .env

# Generate secret for Confluence
# (manifests/01-secrets.yaml contains real credentials - keep it git-ignored)
kubectl create secret generic confluence-credentials \
  --from-literal=base_url="$CONFLUENCE_BASE_URL" \
  --from-literal=username="$CONFLUENCE_USERNAME" \
  --from-literal=password="$CONFLUENCE_PASSWORD" \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/01-secrets.yaml

# Generate ConfigMap for JD structure
kubectl create configmap jd-structure \
  --from-file=jd.yaml \
  --namespace agentic-ai \
  --dry-run=client -o yaml > manifests/02-configmap.yaml

--dry-run=client -o yaml generates YAML manifest without applying to cluster. Output redirected to file.

Result: secrets and configs in YAML format, ready for kubectl apply.

build-and-deploy.sh - Build Images + Deploy

#!/bin/bash
set -euo pipefail

# Build Docker images
docker build -t agentic-ai/jd-classifier-service:latest \
  -f mcp-servers/jd-classifier/Dockerfile .

docker build -t agentic-ai/content-classifier-service:latest \
  -f mcp-servers/content-classifier/Dockerfile .

# ... build other services ...

# Tag images for registry
docker tag agentic-ai/jd-classifier-service:latest \
  my-registry.com/jd-classifier-service:latest

# Push to registry
docker push my-registry.com/jd-classifier-service:latest

# Apply manifests
kubectl apply -f manifests/00-namespace.yaml
kubectl apply -f manifests/01-secrets.yaml
kubectl apply -f manifests/02-configmap.yaml
kubectl apply -f manifests/03-redis.yaml
kubectl apply -f manifests/04-mcp-services.yaml
kubectl apply -f manifests/05-agents.yaml

# Wait for rollout
kubectl rollout status deployment/jd-classifier-service -n agentic-ai
kubectl rollout status deployment/content-classifier-service -n agentic-ai

Single command: ./deploy/kubernetes/build-and-deploy.sh

Automatically builds, pushes, deploys. Time: ~5 minutes for full deployment.

Monitoring - Prometheus + Grafana

Kubernetes provides metrics out of the box, but you still need to collect them.

Prometheus - Metrics Collection

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Prometheus scrapes metrics from pods with annotation prometheus.io/scrape: "true".
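For a pod to be picked up by that scrape config, its template needs the matching annotation. A typical annotation block (the port and path annotations are honored only if the relabel rules map them, which the snippet above does not show):

```yaml
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8080"
      prometheus.io/path: "/metrics"
```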

MCP services expose metrics:

from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest

# Label by status so the error rate can be queried separately
requests_total = Counter('mcp_requests_total', 'Total MCP requests', ['status'])
request_duration = Histogram('mcp_request_duration_seconds', 'MCP request duration')

@app.post("/classify")
def classify(request: dict):
    with request_duration.time():
        try:
            # handler logic
            requests_total.labels(status="success").inc()
            return result
        except Exception:
            requests_total.labels(status="error").inc()
            raise

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")

Prometheus scrapes /metrics endpoint every 15 seconds.

Grafana - Visualization

Dashboard queries:

# Request rate
rate(mcp_requests_total[5m])

# Error rate
rate(mcp_requests_total{status="error"}[5m])

# P95 latency
histogram_quantile(0.95, rate(mcp_request_duration_seconds_bucket[5m]))

# Memory usage
container_memory_usage_bytes{pod=~".*-service.*"}

Dashboard shows:

- Request rate per service
- Error rate
- Latency P50/P95/P99
- Memory/CPU usage
- Pod restarts

Grafana alerting: if error rate > 5% or P95 latency > 2s, send Slack notification.
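The same thresholds could be expressed as a Prometheus alerting rule instead of (or alongside) Grafana alerting; a sketch, with the rule and group names invented for illustration:

```yaml
groups:
- name: mcp-alerts
  rules:
  - alert: MCPHighErrorRate
    expr: rate(mcp_requests_total{status="error"}[5m]) / rate(mcp_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "MCP error rate above 5% for 5 minutes"
  - alert: MCPHighLatency
    expr: histogram_quantile(0.95, rate(mcp_request_duration_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "MCP P95 latency above 2s for 5 minutes"
```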

Cost Analysis - What it costs

Development environment (Minikube on laptop): $0.

Production environment (managed Kubernetes):

Compute:

- 3 worker nodes (t3.medium): $0.0416/hour × 3 × 730 hours = ~$91/month
- 15 pods, averaging 0.5 CPU and 1GB RAM each: fits on 3 nodes

Storage:

- Redis PVC: 10GB × $0.10/GB/month = $1/month
- EFS for vault: 20GB × $0.30/GB/month = $6/month

Networking:

- LoadBalancer: $18/month (AWS ELB)
- Data transfer: ~$1/month (internal traffic is free)

Total: ~$117/month for a production-grade Kubernetes deployment.

Alternative (Docker Compose on a single VPS):

- t3.large instance: $0.0832/hour × 730 = ~$61/month

Kubernetes is more expensive ($117 vs $61), but provides:

- Zero-downtime deployments
- Horizontal scaling
- Service isolation
- Professional monitoring
- Disaster recovery

Trade-off: $56/month for operational peace of mind. Worth it.

Mistakes I made

1. Forgot readiness probes initially

First deployment: pods created, traffic routed immediately, but services not ready yet. 500 errors for 30 seconds while services booted.

Fix: added readinessProbe with initialDelaySeconds: 10. Kubernetes now waits.

2. Resource limits too low

The Content Classifier service was constantly OOMKilled (Out Of Memory). Pod restarts, traffic loss, errors.

Cause: LLM inference needs 3GB of memory, but I had set the limit to 1GB.

Fix: profiled via kubectl top pod, increased the limit to 4GB. Problem solved.

3. Redis without persistence

The first Redis deployment was a Deployment without a PVC. When the Redis pod restarted, the entire cache was lost.

Fix: changed to a StatefulSet with a PVC. Cache now persists across restarts.

4. Secrets in Git (initial commit)

Initially I committed secrets.yaml with real passwords. Then I realized this was a public repo.

Fix:

  1. git filter-branch to remove secrets from history (painful)
  2. Regenerated all passwords
  3. Added secrets.yaml to .gitignore
  4. Switched to kubectl create secret instead of YAML files

Lesson learned: never commit secrets. Never.

Conclusions from production deployment

Kubernetes deployment for AI agents is not trivial. Questions emerge that didn't exist in local development:

- Service discovery → HTTP wrapper + Kubernetes Services
- Distributed cache → Redis StatefulSet
- Secrets management → Kubernetes Secrets API + External Secrets Operator
- Storage → NFS/EFS for multi-node, hostPath for single-node, S3 for production
- Rolling updates → maxUnavailable: 0 + readiness probes
- Resource limits → profiling + 20% buffer
- Monitoring → Prometheus metrics + Grafana dashboards

Deployment process automated via scripts. Single command deploys 15+ services.

Cost: ~$117/month for production cluster vs $61/month for single VPS. Trade-off: operational reliability for extra $56/month.

Mistakes: forgot readiness probes, resource limits too low, Redis without persistence, secrets in Git. Fixed through iterations.

For EMM, the transition from Docker Compose to Kubernetes took 2 weeks: development, testing, deployment, monitoring setup. Result: zero-downtime updates, isolation, scaling ready.

If you're building a multi-service AI platform, Kubernetes provides the flexibility and reliability. The initial setup is more complex, but the long-term operational benefits pay off.


Related: Data Versioning for AI Agents with lakeFS, Developing and Testing AI Agents


Author: Igor Gorovyy
Role: DevOps Engineer Lead & Senior Solutions Architect
LinkedIn: linkedin.com/in/gorovyyigor

Deployment summary

Environment: Kubernetes 1.28+
Services: 15 pods (3 agents, 7 MCP services, 5 infrastructure)
Storage: Redis (10GB PVC), EFS (20GB for vault)
Cost: ~$117/month (production), $0 (development)
Deployment time: 5 minutes (automated script)
Uptime: 99.9% (zero-downtime rolling updates)