Data Lake pipeline for RAG. Self-host in minutes.

From documents to semantic search

Ingest PDF, Word, Excel → embed → Qdrant. Q&A, agents, AI Portal integration.

docker compose up --build

or pipx run lake-flow-pipeline init

Get Started

Data Lake pipeline for RAG

Inbox → raw → staging → processed → embeddings → Qdrant. Run step0 to step4. Use with AI Portal agents.
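
The flow above has six stages and five hops, which lines up with the five steps step0 to step4. A minimal sketch of that mapping follows. It assumes each step moves data exactly one stage forward, which this page does not confirm; `STAGES` and `plan_steps` are hypothetical names used only for illustration, not part of LakeFlow.

```python
# Stage names are taken from the flow described above.
STAGES = ("inbox", "raw", "staging", "processed", "embeddings", "Qdrant")


def plan_steps(stages: tuple[str, ...] = STAGES) -> list[tuple[str, str, str]]:
    """Pair each hop between neighbouring stages with a step name.

    Assumption: step N handles the N-th hop. Check the LakeFlow docs
    for what each step really does.
    """
    hops = zip(stages, stages[1:])
    return [(f"step{i}", src, dst) for i, (src, dst) in enumerate(hops)]


for name, src, dst in plan_steps():
    print(f"{name}: {src} -> {dst}")
```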

⚙️

Backend (FastAPI)

REST API: Auth, Search (embed, semantic, Q&A), Pipeline (step0–step4), System, Qdrant proxy, Inbox, Admission agent.

🖥️

Frontend (Streamlit)

Control UI: Dashboard, Data Lake Explorer, Pipeline Runner, Semantic Search, Q&A with AI, Qdrant Inspector.

📁

Layered Data Lake

Zones: 000_inbox → 100_raw → 200_staging → 300_processed → 400_embeddings → 500_catalog. Hash, dedup, catalog.
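
To make "hash, dedup" concrete, here is a rough sketch of content-based deduplication between the first two zones. The zone folder names come from the list above. Everything else is an assumption: the SHA-256 choice, naming files by their hash, and the helper names `file_sha256` and `ingest_inbox` are illustrative and may differ from how LakeFlow actually does it.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large documents do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_inbox(lake_root: Path) -> list[Path]:
    """Copy new files from 000_inbox to 100_raw, skipping duplicate content.

    Assumption: a file in 100_raw is named <sha256><original suffix>, so a
    duplicate is detected by checking whether that name already exists.
    """
    inbox = lake_root / "000_inbox"
    raw = lake_root / "100_raw"
    raw.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(p for p in inbox.iterdir() if p.is_file()):
        target = raw / f"{file_sha256(src)}{src.suffix.lower()}"
        if target.exists():
            continue  # same content was ingested before
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Naming files by content hash makes the copy idempotent: running `ingest_inbox` twice on the same inbox copies nothing the second time.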

🔍

Semantic search & embed

POST /search/embed (text→vector), /search/semantic (Qdrant), /search/qa (RAG). Qdrant vector store.
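
As an illustration, the three endpoints could be called from Python along these lines. Only the paths and the POST method come from this page. The base URL `http://localhost:8000` and the JSON field names (`text`, `query`, `top_k`, `question`) are assumptions, so check the Backend API docs for the real request schema.

```python
import json
from urllib.request import Request

# Assumption: the backend listens here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_search_request(endpoint: str, payload: dict, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a JSON POST request to /search/<endpoint>."""
    return Request(
        url=f"{base_url}/search/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Payload field names below are hypothetical placeholders.
embed_req = build_search_request("embed", {"text": "What is a data lake?"})
semantic_req = build_search_request("semantic", {"query": "data lake zones", "top_k": 5})
qa_req = build_search_request("qa", {"question": "Which zones does the lake have?"})

for req in (embed_req, semantic_req, qa_req):
    print(req.get_method(), req.full_url)
```

With the backend running, pass a request to `urllib.request.urlopen` to send it. The API also lists Auth, so a token header may be required; that part is not covered here.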

🐳

Docker-first

Backend, frontend, and Qdrant run via Docker Compose. No Python needed on the host. On a Mac M1, use a venv instead to get GPU acceleration (Metal/MPS).

🐍

Python & FastAPI

Python 3.10+, FastAPI, sentence-transformers, Qdrant. Easy to extend. PyPI: lake-flow-pipeline.

Solutions for every use case

From research to regulations. LakeFlow adapts to your document types.

For developers

Integrate LakeFlow into your Python stack. REST APIs, Docker, FastAPI backend, Streamlit UI.

Quick start →

For data teams

Ingest documents and run pipelines via UI or API. Embedding and semantic search for RAG and LLM applications.

Documentation →

Enterprise

Self-host on your infrastructure. NAS-compatible (SQLite runs without WAL). Full data control.

Deployment guide →

Built for developers

Docker Compose, REST API, Streamlit UI. Full control of your pipeline.

Quick start

Create a LakeFlow project in one command and run with Docker.

pipx run lake-flow-pipeline init

Documentation

Full docs: Backend API, Frontend UI, Data Lake, Configuration, Deployment.

GitHub

Source code, issues, and contributions.

PyPI

Package lake-flow-pipeline: install with pip, available on pypi.org.

Ready to deploy?

Start with LakeFlow. Run with Docker in minutes.

Get Started

View on GitHub

Open source. You own your data.