LakeFlow Documentation
Data Lake pipeline for RAG: ingest documents, extract text, generate embeddings, and store them in Qdrant. Use it for semantic search, Q&A, or AI Portal agents.
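Between text extraction and embedding, pipelines like this typically split each document into overlapping chunks so embeddings stay within model limits. A minimal sketch of such a chunker; the size and overlap values are illustrative assumptions, not LakeFlow's actual defaults:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so context that
    spans a chunk boundary is not lost at embedding time."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

Each chunk would then be embedded and upserted into Qdrant with a pointer back to its source file.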
Prerequisites
- Docker and Docker Compose
- Disk space: backend ~2GB, Qdrant ~500MB
- Optional: Ollama for embeddings and Q&A
- Optional: Python 3.10+ for local dev
Recommended reading order
- Getting Started: install and first run
- Backend API: REST endpoints
- Data Lake: zones and pipeline steps
- Configuration: environment variables
Quick checklist
- Set HOST_LAKE_PATH in .env
- Create Data Lake zones (000_inbox, 100_raw, etc.)
- Run docker compose up
- Add files to 000_inbox or use POST /inbox/upload
- Run pipeline steps step0 through step4, then use semantic search or Q&A
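The zone-creation step of the checklist can be scripted. A minimal sketch: 000_inbox and 100_raw come from the checklist above, while the remaining zone names are hypothetical placeholders; see the Data Lake section for the actual layout.

```python
from pathlib import Path

# 000_inbox and 100_raw appear in the checklist; the other names
# are hypothetical placeholders for the later pipeline zones.
ZONES = ["000_inbox", "100_raw", "200_text", "300_chunks", "400_vectors"]

def create_zones(lake_root: str) -> list[Path]:
    """Create each Data Lake zone directory under lake_root (idempotent)."""
    root = Path(lake_root)
    paths = []
    for zone in ZONES:
        p = root / zone
        p.mkdir(parents=True, exist_ok=True)  # safe to re-run
        paths.append(p)
    return paths
```

Point lake_root at the directory you set as HOST_LAKE_PATH before running docker compose up.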
Documentation sections
- Getting Started: install with Docker, create zones, first run.
- Backend API: auth, search, pipeline, inbox, admission agent.
- Frontend (Streamlit): Pipeline Runner, Semantic Search, Q&A, System Settings.
- Data Lake: zone layout, pipeline steps, supported formats.
- Configuration: environment variables, .env example.
- Deployment: Portainer, manual deploy, GitHub Actions.
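The variables mentioned on this page can be collected into a .env file. An illustrative fragment; the values shown are placeholders, not LakeFlow defaults, so check the Configuration section for the real ones:

```
# Host path mounted as the Data Lake root (required)
HOST_LAKE_PATH=/srv/lakeflow/lake

# Embedding model used by the embedding steps (example value)
EMBED_MODEL=nomic-embed-text

# Base URL of the Ollama server for embeddings and Q&A (example value)
LLM_BASE_URL=http://localhost:11434
```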
Troubleshooting
- Compose fails: check that the HOST_LAKE_PATH directory exists
- Frontend connection refused: ensure the backend is running
- Search returns no results: run step3 and step4, and check EMBED_MODEL
- Ollama not found: set LLM_BASE_URL and run ollama pull
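A quick way to tell "backend not running" apart from other failures is to probe the Swagger endpoint. A minimal sketch, assuming the default port 8011 from the Quick links; adjust if you changed it:

```python
import urllib.request
import urllib.error

def backend_healthy(base_url: str = "http://localhost:8011") -> bool:
    """Return True if the backend's Swagger UI answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: backend unreachable.
        return False
```

If this returns False, start (or restart) the backend container before debugging the frontend or search.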
Tips
- Use the sidebar to jump between sections.
- Run locally with Docker and follow Getting Started.
- Swagger UI is available at /docs while the backend is running.
- The admission agent is an example of AI Portal integration.
Quick links
- GitHub: Lampx83/LakeFlow
- PyPI: lake-flow-pipeline
- Swagger UI: http://localhost:8011/docs (when backend is running)
- ReDoc: http://localhost:8011/redoc
Next: Getting Started →