LakeFlow
Get Started

Data Lake Architecture

LakeFlow uses a layered data lake with six zones. Data flows: inbox → raw → staging → processed → embeddings → Qdrant.

Inbox structure (000_inbox)

Place files in 000_inbox/<domain>/, where <domain> is a subfolder name (e.g. regulations, syllabus). Nested subpaths are allowed: 000_inbox/<domain>/subfolder/subfolder2/.

Supported formats: PDF, .docx, .xlsx, .xls, .pptx, .txt. Domain names may contain only letters, numbers, _, and -.

000_inbox/
├── regulations/
│   ├── doc1.pdf
│   └── subfolder/doc2.docx
└── syllabus/
    └── course_a.pdf
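The domain-name rule (letters, numbers, _, and - only) can be checked with a short pattern. This is an illustrative sketch, not LakeFlow's actual validation code:

```python
import re

# Letters, digits, underscore, and hyphen only (per the rule above)
DOMAIN_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_domain(name: str) -> bool:
    """Return True if `name` is an acceptable inbox domain folder name."""
    return bool(DOMAIN_RE.fullmatch(name))
```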

Zone descriptions

| Zone | Path | Description |
| --- | --- | --- |
| 000_inbox | LAKE_ROOT/000_inbox | Drop raw files here. step0 scans and moves to 100_raw. Organized by domain. |
| 100_raw | LAKE_ROOT/100_raw | step1: copy + hash (SHA256) + dedup. Structure: <domain>/<file_hash>.pdf. Writes the SQLite catalog. |
| 200_staging | LAKE_ROOT/200_staging | step2: extract text (pdf_analyzer, word_analyzer, excel_analyzer). Each file gets a validation.json with its section structure. |
| 300_processed | LAKE_ROOT/300_processed | step2 output: chunking → chunks.json. Each chunk has text, section_id, token_estimate. Ready for embedding. |
| 400_embeddings | LAKE_ROOT/400_embeddings | step3: create embedding.npy via Ollama. Local cache before pushing to Qdrant. |
| 500_catalog | LAKE_ROOT/500_catalog | Metadata catalog (SQLite). Vectors are stored in Qdrant (step4), not in 500_catalog. |
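The hash + dedup step can be sketched with Python's hashlib. The <domain>/<file_hash> naming follows the table above; everything else here (block size, helper names) is illustrative, not the actual step1 code:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large PDFs are never fully loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def raw_target(lake_root: Path, domain: str, src: Path) -> Path:
    """Target path in 100_raw: <domain>/<file_hash> plus the original extension."""
    return lake_root / "100_raw" / domain / (sha256_of(src) + src.suffix)
```

Because the file name is the content hash, copying the same file twice produces the same target path, which is what makes dedup a simple existence check.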

Pipeline steps

| Step | Script | Input | Output |
| --- | --- | --- | --- |
| step0 | step0_inbox.py | 000_inbox | Scan inbox, ingest → 100_raw (hash, dedup, write catalog). Can use only_folders. |
| step1 | step1_raw.py | 000_inbox (via catalog) | 100_raw: copy file, hash, dedup |
| step2 | step2_staging.py | 100_raw | 200_staging (validation.json), 300_processed (chunks.json) |
| step3 | step3_processed_files.py | 300_processed | 400_embeddings: embedding.npy (Ollama). Model selectable via UI or embed_model in the API. |
| step4 | step3_processed_qdrant.py | 400_embeddings | Push vectors to the Qdrant collection. Can specify collection_name, qdrant_url. |
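The chunk records that step2 writes to chunks.json might be produced along these lines. This is a sketch only: the fixed 1000-character window and the characters-per-token heuristic are assumptions, not LakeFlow's actual chunking parameters:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (an assumption, not the real formula)
    return max(1, len(text) // 4)

def make_chunks(sections):
    """Turn (section_id, text) pairs into chunk records like those in chunks.json."""
    chunks = []
    for section_id, text in sections:
        # Fixed 1000-character windows, purely for illustration
        for start in range(0, len(text), 1000):
            piece = text[start:start + 1000]
            chunks.append({
                "chunk_id": len(chunks),
                "section_id": section_id,
                "text": piece,
                "token_estimate": estimate_tokens(piece),
            })
    return chunks
```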

Important file structures

  • validation.json (200_staging): Metadata, sections, extracted text
  • chunks.json (300_processed): Array of chunks, each with text, section_id, chunk_id, token_estimate
  • embedding.npy (400_embeddings): Numpy array of vectors, shape (n_chunks, dim)
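A consumer can sanity-check that chunks.json and embedding.npy stay aligned, since the vector array's first dimension must equal the number of chunks. A sketch, with file locations taken from the layout above:

```python
import json
from pathlib import Path

import numpy as np

def load_chunks_and_vectors(chunks_path: Path, embedding_path: Path):
    """Load chunk records and their vectors, enforcing one vector per chunk."""
    chunks = json.loads(chunks_path.read_text(encoding="utf-8"))
    vectors = np.load(embedding_path)  # shape (n_chunks, dim)
    if vectors.shape[0] != len(chunks):
        raise ValueError(
            f"{vectors.shape[0]} vectors for {len(chunks)} chunks; files out of sync"
        )
    return chunks, vectors
```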

Idempotency

The pipeline is idempotent: re-running the same data yields the same deterministic UUIDs in Qdrant, so upserts overwrite rather than duplicate. Use force_rerun to overwrite cached results.
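Deterministic IDs are what make re-runs idempotent: a name-based UUID (uuid5) always maps the same input to the same ID. A sketch of the idea, using a hypothetical key of file hash plus chunk id (LakeFlow's actual key scheme is not documented here):

```python
import uuid

# Namespace choice is arbitrary, but it must stay fixed across runs
NAMESPACE = uuid.NAMESPACE_URL

def point_id(file_hash: str, chunk_id: int) -> str:
    """Same (file_hash, chunk_id) -> same UUID on every run, so upserts overwrite."""
    return str(uuid.uuid5(NAMESPACE, f"{file_hash}:{chunk_id}"))
```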

Supported formats (details)

PDF, Word (.docx), Excel (.xlsx, .xls), PPTX, TXT. Staging uses analyzers (pdf_analyzer, word_analyzer, excel_analyzer) to extract text and structure (tables, headings).

Run pipeline by domain

The API endpoint POST /pipeline/run/{step} accepts only_folders, a list of domains or file_hashes. Only the selected folders are processed.

# Run regulations domain only
{"only_folders": ["regulations"]}

# Run multiple domains
{"only_folders": ["regulations", "syllabus"]}
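Assuming the API is served at http://localhost:8000 (host and port are illustrative), a run can be triggered with Python's standard library like this:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # illustrative; use your deployment's address

def run_step(step: str, only_folders=None) -> urllib.request.Request:
    """Build the POST /pipeline/run/{step} request; send it with urlopen."""
    body = {"only_folders": list(only_folders)} if only_folders else {}
    return urllib.request.Request(
        f"{API_BASE}/pipeline/run/{step}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(run_step("step2", ["regulations"]))  # actually sends it
```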

NAS-friendly

SQLite runs in non-WAL journal mode so it works on Synology/NFS shares. Set LAKE_ROOT to a path on the NAS if needed.
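WAL mode relies on -wal/-shm side files and shared-memory locking that network filesystems often handle badly, which is why the catalog sticks to the rollback journal. A sketch of opening a catalog database that way (function name and path are illustrative):

```python
import sqlite3

def connect_catalog(db_path: str) -> sqlite3.Connection:
    """Open the catalog with the default rollback journal, which is NFS/SMB-safe."""
    conn = sqlite3.connect(db_path)
    # Explicitly avoid WAL: its -wal/-shm files need shared memory and file
    # locking that NFS and Synology SMB shares don't reliably provide.
    conn.execute("PRAGMA journal_mode=DELETE")
    return conn
```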