LakeFlow
Get Started

Data Lake Architecture

LakeFlow uses a layered data lake with six zones. Data flows: inbox → raw → staging → processed → embeddings → Qdrant.

Inbox structure (000_inbox)

Place files in 000_inbox/<domain>/, where <domain> is a subfolder name (e.g. regulations, syllabus). Nested subpaths are allowed: 000_inbox/<domain>/subfolder/subfolder2/.

Supported formats: PDF, .docx, .xlsx, .xls, .pptx, .txt. Domain names may contain only letters, numbers, _, and -.

000_inbox/
├── regulations/
│   ├── doc1.pdf
│   └── subfolder/doc2.docx
└── syllabus/
    └── course_a.pdf
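The domain-name rule (letters, numbers, _, and - only) can be checked with a short pattern. This is an illustrative sketch, not LakeFlow's actual validation code:

```python
import re

# Letters, digits, underscore, and hyphen only (per the rule above)
DOMAIN_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_domain(name: str) -> bool:
    """Return True if `name` is an acceptable inbox domain folder name."""
    return bool(DOMAIN_RE.fullmatch(name))
```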

Zone descriptions

| Zone | Path | Description |
| --- | --- | --- |
| 000_inbox | LAKE_ROOT/000_inbox | Drop raw files here. step0 scans and moves to 100_raw. Organized by domain. |
| 100_raw | LAKE_ROOT/100_raw | step1: copy + hash (SHA256) + dedup. Structure: <domain>/<file_hash>.pdf. Writes the SQLite catalog. |
| 200_staging | LAKE_ROOT/200_staging | step2: extract text (pdf_analyzer, word_analyzer, excel_analyzer). Each file gets a validation.json with its section structure. |
| 300_processed | LAKE_ROOT/300_processed | step2 output: chunking → chunks.json. Each chunk has text, section_id, token_estimate. Ready for embedding. |
| 400_embeddings | LAKE_ROOT/400_embeddings | step3: create embedding.npy via Ollama. Local cache before pushing to Qdrant. |
| 500_catalog | LAKE_ROOT/500_catalog | Metadata catalog (SQLite). Vectors are stored in Qdrant (step4), not in 500_catalog. |
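The hash + dedup step can be sketched with Python's hashlib. The <domain>/<file_hash> naming follows the table above; everything else here (block size, helper names) is illustrative, not the actual step1 code:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large PDFs are never fully loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def raw_target(lake_root: Path, domain: str, src: Path) -> Path:
    """Target path in 100_raw: <domain>/<file_hash> plus the original extension."""
    return lake_root / "100_raw" / domain / (sha256_of(src) + src.suffix)
```

Because the file name is the content hash, copying the same file twice produces the same target path, which is what makes dedup a simple existence check.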

Pipeline steps

| Step | Script | Input | Output |
| --- | --- | --- | --- |
| step0 | step0_inbox.py | 000_inbox | Scan inbox, ingest → 100_raw (hash, dedup, write catalog). Can use only_folders. |
| step1 | step1_raw.py | 000_inbox (via catalog) | 100_raw: copy file, hash, dedup |
| step2 | step2_staging.py | 100_raw | 200_staging (validation.json), 300_processed (chunks.json) |
| step3 | step3_processed_files.py | 300_processed | 400_embeddings: embedding.npy (Ollama). Model selectable via UI or embed_model in the API. |
| step4 | step3_processed_qdrant.py | 400_embeddings | Push vectors to the Qdrant collection. Can specify collection_name, qdrant_url. |
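The chunk records that step2 writes to chunks.json might be produced along these lines. This is a sketch only: the fixed 1000-character window and the characters-per-token heuristic are assumptions, not LakeFlow's actual chunking parameters:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (an assumption, not the real formula)
    return max(1, len(text) // 4)

def make_chunks(sections):
    """Turn (section_id, text) pairs into chunk records like those in chunks.json."""
    chunks = []
    for section_id, text in sections:
        # Fixed 1000-character windows, purely for illustration
        for start in range(0, len(text), 1000):
            piece = text[start:start + 1000]
            chunks.append({
                "chunk_id": len(chunks),
                "section_id": section_id,
                "text": piece,
                "token_estimate": estimate_tokens(piece),
            })
    return chunks
```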

Important file structures

  • validation.json (200_staging): Metadata, sections, extracted text
  • chunks.json (300_processed): Array of chunks, each with text, section_id, chunk_id, token_estimate
  • embedding.npy (400_embeddings): Numpy array of vectors, shape (n_chunks, dim)
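A consumer can sanity-check that chunks.json and embedding.npy stay aligned, since the vector array's first dimension must equal the number of chunks. A sketch, with file locations taken from the layout above:

```python
import json
from pathlib import Path

import numpy as np

def load_chunks_and_vectors(chunks_path: Path, embedding_path: Path):
    """Load chunk records and their vectors, enforcing one vector per chunk."""
    chunks = json.loads(chunks_path.read_text(encoding="utf-8"))
    vectors = np.load(embedding_path)  # shape (n_chunks, dim)
    if vectors.shape[0] != len(chunks):
        raise ValueError(
            f"{vectors.shape[0]} vectors for {len(chunks)} chunks; files out of sync"
        )
    return chunks, vectors
```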

Idempotency

The pipeline is idempotent: re-running the same data yields the same deterministic UUIDs in Qdrant, so upserts overwrite rather than duplicate. Use force_rerun to overwrite cached results.
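Deterministic IDs are what make re-runs idempotent: a name-based UUID (uuid5) always maps the same input to the same ID. A sketch of the idea, using a hypothetical key of file hash plus chunk id (LakeFlow's actual key scheme is not documented here):

```python
import uuid

# Namespace choice is arbitrary, but it must stay fixed across runs
NAMESPACE = uuid.NAMESPACE_URL

def point_id(file_hash: str, chunk_id: int) -> str:
    """Same (file_hash, chunk_id) -> same UUID on every run, so upserts overwrite."""
    return str(uuid.uuid5(NAMESPACE, f"{file_hash}:{chunk_id}"))
```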

Supported formats (details)

PDF, Word (.docx), Excel (.xlsx, .xls), PPTX, TXT. Staging uses analyzers (pdf_analyzer, word_analyzer, excel_analyzer) to extract text and structure (tables, headings).

Run pipeline by domain

The API endpoint POST /pipeline/run/{step} accepts only_folders, a list of domains or file_hashes. Only the selected folders are processed.

# Run regulations domain only
{"only_folders": ["regulations"]}

# Run multiple domains
{"only_folders": ["regulations", "syllabus"]}
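Assuming the API is served at http://localhost:8000 (host and port are illustrative), a run can be triggered with Python's standard library like this:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # illustrative; use your deployment's address

def run_step(step: str, only_folders=None) -> urllib.request.Request:
    """Build the POST /pipeline/run/{step} request; send it with urlopen."""
    body = {"only_folders": list(only_folders)} if only_folders else {}
    return urllib.request.Request(
        f"{API_BASE}/pipeline/run/{step}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(run_step("step2", ["regulations"]))  # actually sends it
```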

NAS-friendly

SQLite runs in non-WAL journal mode so it works on Synology/NFS shares. Set LAKE_ROOT to a path on the NAS if needed.
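WAL mode relies on -wal/-shm side files and shared-memory locking that network filesystems often handle badly, which is why the catalog sticks to the rollback journal. A sketch of opening a catalog database that way (function name and path are illustrative):

```python
import sqlite3

def connect_catalog(db_path: str) -> sqlite3.Connection:
    """Open the catalog with the default rollback journal, which is NFS/SMB-safe."""
    conn = sqlite3.connect(db_path)
    # Explicitly avoid WAL: its -wal/-shm files need shared memory and file
    # locking that NFS and Synology SMB shares don't reliably provide.
    conn.execute("PRAGMA journal_mode=DELETE")
    return conn
```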