Izabella assistant. Classification module. Migration from Confluence to own storage.

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

Diagram

graph LR
    subgraph User
        A[Author / Reader]
    end

    subgraph Web UI
        B[Markdown Editor / Viewer]
        C[JD Tree Navigator]
        D[Search Field - LLM + Vector]
    end

    subgraph Content Layer
        E[Content Service - Golang API]
        F[JD Classifier - LLM / Embedding]
        G[Markdown Renderer]
    end

    subgraph Storage
        H[File System / Git / S3]
        I[JD Structure File - jd.yaml]
        J[Attachments Bucket]
    end

    subgraph Indexing Layer
        K[Embedding Generator - OpenAI API]
        L[Vector DB - Qdrant / Weaviate]
    end

    subgraph LLM Layer
        M[LLM Gateway - OpenAI / Ollama]
        N[RAG Processor - LangChain / Custom]
    end

    A --> B
    B --> E
    C --> E
    D --> N

    E -->|Save Markdown| H
    E -->|Read JD Structure| I
    E -->|Store Attachments| J

    E --> F
    F -->|JD Code| E

    E --> K
    K --> L
    D --> L
    L --> N
    N --> M
    M --> A

🧠 Knowledge System Architecture Review: Markdown + Vector + Johnny.Decimal

❌ Potential Risks & Limitations

1. Slow Access and Filtering

The Johnny.Decimal (JD) structure lives in a static jd.yaml or on the file system and is not indexed.
There is no native filtering, grouping, or querying across the JD hierarchy.

2. Embedding During Save is Blocking

Synchronous OpenAI embedding generation slows down document save.
Risk of rate limiting and timeout under high load.

3. RAG Quality Degrades Without Contextual Chunking

If chunks lack JD context (code, section, etc.), retrieval may return irrelevant results.
LLMs might hallucinate or retrieve semantically similar but functionally incorrect data.

4. Git/FS Backend Doesn't Scale Linearly

Git slows down on very large file trees (>10k pages), and history access becomes slow.
A plain file system lacks full-text search, relational queries, and concurrency control.

5. Missing Access Control

No RBAC or per-user access levels in Markdown alone.
Impossible to enforce read/write policies or audit access without external systems.

✅ Improvements for Current Architecture

1. Asynchronous Embedding Pipeline

Introduce a task queue (Redis Queue, NATS, or SQS).
Save Markdown immediately, queue embedding request for background processing.
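
The save-then-enqueue flow described above can be sketched as follows. This is a minimal illustration, not the service's implementation: it uses Python's standard-library queue and a worker thread as a stand-in for Redis Queue, NATS, or SQS, and the names save_document, embed, and embedding_worker, as well as the in-memory stores, are hypothetical.

```python
import queue
import threading

# In-memory stand-ins (assumptions): a real system would use the file store
# and a broker such as Redis Queue, NATS, or SQS.
documents: dict[str, str] = {}
embeddings: dict[str, list[float]] = {}
embedding_queue: "queue.Queue[str | None]" = queue.Queue()


def embed(text: str) -> list[float]:
    """Placeholder embedding: a real worker would call the embedding API here."""
    return [float(len(text)), float(text.count(" "))]


def save_document(doc_id: str, markdown: str) -> None:
    """Persist the Markdown immediately, then queue the embedding request."""
    documents[doc_id] = markdown
    embedding_queue.put(doc_id)  # returns at once; no API call on the save path


def embedding_worker() -> None:
    """Background worker: drains the queue until it receives the None sentinel."""
    while True:
        doc_id = embedding_queue.get()
        try:
            if doc_id is None:
                return
            embeddings[doc_id] = embed(documents[doc_id])
        finally:
            embedding_queue.task_done()


worker = threading.Thread(target=embedding_worker, daemon=True)
worker.start()
save_document("30.03-onboarding", "# Onboarding\nWelcome to the team")
embedding_queue.join()      # wait for the background embedding (demo only)
embedding_queue.put(None)   # stop the worker
worker.join()
```

The point of the design is that save_document never waits on the embedding provider, so rate limits and timeouts affect only the worker, not the author's save.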

2. JD Structure in a Database

Load jd.yaml into SQL (jd_codes table).
Allow filtering/searching JD codes, multilingual labels, parent-child mapping.
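
One possible shape for this, sketched with SQLite from the Python standard library. Only the jd_codes table name comes from the text above; the jd_labels table, all column names, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: everything except the jd_codes table name is assumed.
SCHEMA = """
CREATE TABLE jd_codes (
    code        TEXT PRIMARY KEY,                 -- e.g. '30' or '30.03'
    parent_code TEXT REFERENCES jd_codes(code)    -- NULL for top-level entries
);
CREATE TABLE jd_labels (
    code  TEXT NOT NULL REFERENCES jd_codes(code),
    lang  TEXT NOT NULL,                          -- e.g. 'en', 'uk'
    label TEXT NOT NULL,
    PRIMARY KEY (code, lang)
);
"""


def load_jd(conn, entries):
    """Insert (code, parent_code, {lang: label}) tuples, as parsed from jd.yaml."""
    for code, parent, labels in entries:
        conn.execute("INSERT INTO jd_codes VALUES (?, ?)", (code, parent))
        conn.executemany(
            "INSERT INTO jd_labels VALUES (?, ?, ?)",
            [(code, lang, label) for lang, label in labels.items()],
        )


def children(conn, parent_code, lang="en"):
    """Return (code, label) pairs directly under parent_code, ordered by code."""
    rows = conn.execute(
        """SELECT c.code, l.label FROM jd_codes c
           JOIN jd_labels l ON l.code = c.code AND l.lang = ?
           WHERE c.parent_code = ? ORDER BY c.code""",
        (lang, parent_code),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_jd(conn, [  # made-up sample entries
    ("30", None, {"en": "Engineering"}),
    ("30.03", "30", {"en": "Onboarding", "uk": "Онбординг"}),
    ("30.04", "30", {"en": "Runbooks"}),
])
```

Keeping labels in a separate table keyed by (code, lang) is one way to support multilingual labels without widening jd_codes for every new language.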

3. Chunk-Aware Vector Embedding

Attach to each chunk: jd_code, page_title, local_heading.
This boosts retrieval accuracy and context during RAG.
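
A minimal sketch of such chunking, assuming pages are split at Markdown headings. The function name chunk_markdown and the text prefix format are hypothetical; only the three metadata fields come from the text above.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, jd_code: str, page_title: str) -> list[dict]:
    """Split on Markdown headings; attach jd_code, page_title, local_heading."""
    chunks = []
    heading, lines = page_title, []  # text before the first heading uses the title

    def flush():
        text = "\n".join(lines).strip()
        if text:  # skip empty sections
            chunks.append({
                "jd_code": jd_code,
                "page_title": page_title,
                "local_heading": heading,
                # Prefixing the context means the embedded text itself carries it.
                "text": f"[{jd_code}] {page_title} > {heading}\n{text}",
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

For example, chunk_markdown("# Setup\nInstall Go.\n\n## Deploy\nRun make deploy.", "30.03", "Onboarding") yields two chunks whose local_heading values are "Setup" and "Deploy".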

🧭 Future-Proofing the System

1. Structured Content Storage

Store Markdown frontmatter in PostgreSQL or SQLite.
Add vector support via pgvector or Qdrant with metadata.

2. Vector DB Features

Use filterable, scalable vector DB: Qdrant, Weaviate, or Vespa.
Filter by: jd_code, created_by, language, tags.
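
To illustrate what metadata filtering buys, here is a pure-Python sketch of filter-then-rank search. It does not use the Qdrant, Weaviate, or Vespa client APIs; the point layout (id, vector, payload) and the search function are assumptions, and the sample vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(points, query_vector, filters, top_k=3):
    """Keep points whose payload matches every filter, then rank by cosine."""
    def matches(payload):
        for key, wanted in filters.items():
            value = payload.get(key)
            if isinstance(value, list):      # e.g. tags: match if any tag equals
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    candidates = [p for p in points if matches(p["payload"])]
    candidates.sort(key=lambda p: cosine(p["vector"], query_vector), reverse=True)
    return candidates[:top_k]


points = [  # made-up sample data
    {"id": 1, "vector": [1.0, 0.0],
     "payload": {"jd_code": "30.03", "language": "en", "tags": ["onboarding"]}},
    {"id": 2, "vector": [0.9, 0.1],
     "payload": {"jd_code": "70.02", "language": "en", "tags": ["billing"]}},
    {"id": 3, "vector": [0.0, 1.0],
     "payload": {"jd_code": "30.03", "language": "en", "tags": ["vpn"]}},
]
hits = search(points, [1.0, 0.0], {"jd_code": "30.03"})
```

Here point 2 is semantically close to the query but is excluded because its jd_code differs, which is the behaviour the filter is meant to guarantee.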

3. Role-Based Access Control (RBAC)

Integrate OpenID Connect (Auth0, Keycloak, AWS Cognito).
Add access metadata in frontmatter: visible_for: [team1, admin].
Log access and mutation events.
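
A sketch of how a visible_for check and an access log might fit together. It assumes the frontmatter is already parsed into a dict and that the user's groups arrive from the identity provider (for example as a claim in the OIDC token); the function names and the log record shape are hypothetical.

```python
from datetime import datetime, timezone

audit_log: list[dict] = []  # stand-in for a durable audit store


def can_read(frontmatter: dict, user_groups: set[str]) -> bool:
    """Allow access if the user shares at least one group with visible_for.

    Assumption: a page without visible_for is readable by everyone.
    """
    allowed = frontmatter.get("visible_for")
    if allowed is None:
        return True
    return bool(user_groups & set(allowed))


def read_page(page_id: str, frontmatter: dict, user: str, user_groups: set[str]) -> bool:
    """Check access and record the attempt, whether allowed or denied."""
    allowed = can_read(frontmatter, user_groups)
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "page": page_id,
        "action": "view",
        "allowed": allowed,
    })
    return allowed
```

Whether pages without visible_for should default to open or closed is a policy decision; the sketch picks open only to keep existing pages readable.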

4. JD Classifier Moderation UI

Interface to show JD predictions, allow reviewer confirmation.
Track changes to JD classification (e.g. page migrated from 30.03 → 70.02).
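
The review-and-track idea could be modelled roughly like this. The PageClassification class, its field names, and the status values are assumptions for illustration, not an existing data model.

```python
from dataclasses import dataclass, field


@dataclass
class PageClassification:
    page_id: str
    jd_code: str                      # current code (initially the prediction)
    status: str = "pending"           # pending -> confirmed
    history: list = field(default_factory=list)

    def confirm(self, reviewer: str) -> None:
        """Reviewer accepts the current code as-is."""
        self.status = "confirmed"
        self.history.append({"from": self.jd_code, "to": self.jd_code, "by": reviewer})

    def reclassify(self, new_code: str, reviewer: str) -> None:
        """Reviewer overrides the code; the move is kept in history."""
        self.history.append({"from": self.jd_code, "to": new_code, "by": reviewer})
        self.jd_code = new_code
        self.status = "confirmed"
```

With this shape, a page moved from 30.03 to 70.02 keeps a history entry recording both codes and the reviewer who made the change.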

5. GitOps Integration

With the knowledge base stored in Git, CI/CD pipelines can run:
JD structure validation
Broken link checks
Auto-index new pages
Versioned RAG-ready snapshots
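
Two of these checks, JD structure validation and broken link detection, can be sketched as plain functions that a pipeline step would call. The accepted code format (NN or NN.NN), the assumption that links are repository-relative paths, and both function names are hypothetical.

```python
import re

JD_CODE = re.compile(r"^\d{2}(\.\d{2})?$")      # assumed format: '30' or '30.03'
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")    # target of [text](target)


def validate_jd(parents: dict) -> list[str]:
    """parents maps code -> parent code (or None); returns error messages."""
    errors = []
    for code, parent in parents.items():
        if not JD_CODE.match(code):
            errors.append(f"{code}: malformed code")
        if parent is not None and parent not in parents:
            errors.append(f"{code}: unknown parent {parent}")
    return errors


def broken_links(pages: dict) -> list[tuple[str, str]]:
    """pages maps relative path -> Markdown text; returns (page, target) pairs.

    Assumption: internal links are written as repository-relative paths.
    """
    broken = []
    for path, text in pages.items():
        for target in LINK.findall(text):
            if "://" in target or target.startswith("mailto:"):
                continue  # external links are out of scope for this check
            if target not in pages:
                broken.append((path, target))
    return broken
```

A CI job would typically fail the build when either function returns a non-empty list.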

📋 Feature Backlog (Epics)

Epic 1: JD-Aware Storage System

Migrate jd.yaml to SQL DB schema
Create API for JD structure querying
Enable JD code suggestions during editing

Epic 2: Async Embedding Pipeline

Set up Redis Queue or NATS
Decouple save from embedding via a worker
Implement retries and a dead-letter queue
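
A minimal sketch of retries with a dead-letter queue, independent of any particular broker. The function name, the exponential backoff policy, and the dead-letter record shape are assumptions; Redis Queue, NATS, and SQS each have their own mechanisms that may be preferable.

```python
import time


def process_with_retries(job, handler, dead_letters,
                         max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run handler(job); retry with exponential backoff, then dead-letter it.

    Returns the handler's result, or None if the job was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(job)
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: park the job for manual inspection.
                dead_letters.append(
                    {"job": job, "error": str(exc), "attempts": attempt}
                )
                return None
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Passing sleep as a parameter keeps the function testable without real delays.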

Epic 3: Vector Search Optimization

Add JD-aware chunking logic
Store chunk metadata in Vector DB
Evaluate embedding models (OpenAI, E5, DeepSeek)

Epic 4: Access & Moderation Layer

Integrate Auth0 / Keycloak
Define frontmatter-based RBAC rules
Add audit logging (create/edit/view)
Build JD-classification review screen

Epic 5: GitOps Automation

GitHub Action to validate JD structure
Action to auto-index changed files
Export vector metadata snapshots to JSON
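
For the snapshot export, one option is a deterministic JSON dump of the chunk metadata with a content hash, so that two snapshots can be compared or versioned in Git. The export_snapshot function and the point layout (id, vector, payload) are hypothetical.

```python
import hashlib
import json


def export_snapshot(points):
    """Serialize point metadata (without vectors) to stable JSON plus a digest."""
    records = sorted(
        ({"id": p["id"], **p["payload"]} for p in points),
        key=lambda r: str(r["id"]),
    )
    # sort_keys and a fixed record order make the output reproducible.
    body = json.dumps(records, sort_keys=True, ensure_ascii=False, indent=2)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body, digest
```

Because the output does not depend on input order, an unchanged index produces an identical file and digest, which keeps Git diffs quiet.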