Data Lake Architecture
LakeFlow uses a layered Data Lake with six zones. Data flows: inbox → raw → staging → processed → embeddings → Qdrant.
Inbox structure (000_inbox)
Place files in 000_inbox/<domain>/, where <domain> is a subfolder name (e.g. regulations, syllabus). Nested subpaths are allowed: 000_inbox/<domain>/subfolder/subfolder2/.
Supported formats: PDF, .docx, .xlsx, .xls, .pptx, .txt. Domain names use letters, numbers, _, - only.
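The domain naming rule above (letters, numbers, `_`, `-` only) can be sketched as a small validator; the function name and regex are illustrative, not part of the actual pipeline API:

```python
import re

# Hypothetical helper matching the inbox rule: letters, digits, underscore, hyphen.
DOMAIN_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_domain(name: str) -> bool:
    """Return True if `name` is an acceptable 000_inbox domain folder name."""
    return bool(DOMAIN_RE.fullmatch(name))
```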
000_inbox/
├── regulations/
│   ├── doc1.pdf
│   └── subfolder/doc2.docx
└── syllabus/
    └── course_a.pdf

Zone descriptions
| Zone | Path | Description |
|---|---|---|
| 000_inbox | LAKE_ROOT/000_inbox | Drop raw files here. step0 scans and moves to 100_raw. Organized by domain. |
| 100_raw | LAKE_ROOT/100_raw | step1: copy + hash (SHA256) + dedup. Structure: <domain>/<file_hash>.pdf. Writes the catalog SQLite. |
| 200_staging | LAKE_ROOT/200_staging | step2: extract text (pdf_analyzer, word_analyzer, excel_analyzer). Each file gets a validation.json with section structure. |
| 300_processed | LAKE_ROOT/300_processed | step2 output: chunking → chunks.json. Each chunk has text, section_id, token_estimate. Ready for embedding. |
| 400_embeddings | LAKE_ROOT/400_embeddings | step3: create embedding.npy via Ollama. Local cache before pushing to Qdrant. |
| 500_catalog | LAKE_ROOT/500_catalog | Metadata catalog (SQLite). Vectors are stored in Qdrant (step4), not in 500_catalog. |
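The zone layout above can be expressed as a small path helper; the zone names come from the table, but the helper itself is a sketch, not the pipeline's actual code:

```python
from pathlib import Path

# Zone names as listed in the table above; the helper is illustrative.
ZONES = ["000_inbox", "100_raw", "200_staging",
         "300_processed", "400_embeddings", "500_catalog"]

def zone_path(lake_root: str, zone: str) -> Path:
    """Resolve a zone directory under LAKE_ROOT."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return Path(lake_root) / zone
```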
Pipeline steps
| Step | Script | Input | Output |
|---|---|---|---|
| step0 | step0_inbox.py | 000_inbox | Scan inbox, ingest → 100_raw (hash, dedup, write catalog). Can use only_folders. |
| step1 | step1_raw.py | 000_inbox (via catalog) | 100_raw: copy file, hash, dedup |
| step2 | step2_staging.py | 100_raw | 200_staging (validation.json), 300_processed (chunks.json) |
| step3 | step3_processed_files.py | 300_processed | 400_embeddings: embedding.npy (Ollama). Model selectable via UI or embed_model in the API. |
| step4 | step3_processed_qdrant.py | 400_embeddings | Push vectors to the Qdrant collection. Can specify collection_name, qdrant_url. |
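step1's content-addressed naming (100_raw/<domain>/<file_hash>.pdf, per the zone table) can be sketched as follows; the function name is hypothetical, only the SHA256-based layout comes from the source:

```python
import hashlib
from pathlib import Path

# Sketch of step1's dedup-friendly naming: the SHA256 of the file content
# becomes the filename, so identical files collapse to one path.
def raw_target(lake_root: str, domain: str, data: bytes, suffix: str) -> Path:
    file_hash = hashlib.sha256(data).hexdigest()
    return Path(lake_root) / "100_raw" / domain / f"{file_hash}{suffix}"
```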
Important file structures
- validation.json (200_staging): Metadata, sections, extracted text
- chunks.json (300_processed): Array of chunks, each with text, section_id, chunk_id, token_estimate
- embedding.npy (400_embeddings): Numpy array of vectors, shape (n_chunks, dim)
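Since embedding.npy has shape (n_chunks, dim), each chunk in chunks.json should map to exactly one vector row. A hedged sanity check (the helper is illustrative, not part of the pipeline):

```python
import json

import numpy as np

# Verify the chunks.json / embedding.npy contract: one vector per chunk.
def check_alignment(chunks_path: str, emb_path: str) -> int:
    """Return the embedding dimension, or raise if counts mismatch."""
    with open(chunks_path) as f:
        chunks = json.load(f)  # array of chunk objects
    emb = np.load(emb_path)    # shape (n_chunks, dim)
    assert emb.shape[0] == len(chunks), "expected one embedding row per chunk"
    return emb.shape[1]
```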
Idempotency
The pipeline is idempotent: re-running on the same data yields the same deterministic UUIDs in Qdrant. Use force_rerun to overwrite cached results.
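One common way to get deterministic IDs like this is a name-based UUID (uuid5) over stable chunk identity; note this is an assumption about the scheme, the pipeline may derive its Qdrant IDs differently:

```python
import uuid

# Assumed scheme: derive a stable point ID from file_hash + chunk_id,
# so re-running the pipeline produces the same UUID for the same chunk.
def point_id(file_hash: str, chunk_id: str) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_hash}/{chunk_id}"))
```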
Supported formats (details)
PDF, Word (.docx), Excel (.xlsx, .xls), PPTX, TXT. Staging uses analyzers (pdf_analyzer, word_analyzer, excel_analyzer) to extract text and structure (tables, headings).
Run pipeline by domain
The API endpoint POST /pipeline/run/{step} accepts only_folders, a list of domains or file_hashes; only the selected folders are processed.
# Run regulations domain only
{"only_folders": ["regulations"]}
# Run multiple domains
{"only_folders": ["regulations", "syllabus"]}

NAS-friendly
SQLite uses a non-WAL journal mode so it works on Synology/NFS shares. Set LAKE_ROOT to a path on the NAS if needed.
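Opening the catalog with an explicit non-WAL journal mode looks roughly like this (the function name and path are illustrative; only the non-WAL requirement comes from the source). WAL relies on shared-memory files that misbehave on NFS/SMB, so the default rollback journal (DELETE) is the safe choice here:

```python
import sqlite3

# Sketch: force the NAS-safe rollback journal instead of WAL.
def open_catalog(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=DELETE")
    return conn
```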