# Getting Started

## System requirements
- Docker ≥ 20.x and Docker Compose ≥ 2.x (for the Docker install)
- Python 3.10+ (for local dev without Docker)
- Disk space: backend image ~2 GB (PyTorch CPU), Qdrant ~500 MB
## Quick install (Docker)

Run the backend, frontend, and Qdrant with Docker Compose:
- Clone and prepare env:

  ```bash
  git clone https://github.com/Lampx83/LakeFlow.git LakeFlow
  cd LakeFlow
  cp env.example .env        # or: cp .env.example .env
  ```
- Required: edit `.env` and set `HOST_LAKE_PATH` to the absolute path of the Data Lake directory. Examples:

  - macOS:

    ```bash
    HOST_LAKE_PATH=/Users/you/lakeflow_data
    ```

  - Linux:

    ```bash
    HOST_LAKE_PATH=/datalake/research
    ```
- Create the directory if needed:

  ```bash
  mkdir -p $HOST_LAKE_PATH
  ```

- Create the zones:

  ```bash
  mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging \
           $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  ```

- Run:

  ```bash
  docker compose up --build    # add -d to run in the background
  ```
Note: the Docker volume uses `device: $HOST_LAKE_PATH`. If the variable is empty or the path does not exist, `docker compose` will fail to start.
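The zone layout created above can also be bootstrapped from Python — a minimal sketch, not part of the LakeFlow codebase; the zone names are taken from the `mkdir` commands above:

```python
from pathlib import Path

# Zone directories used by the LakeFlow pipeline (from the mkdir commands above).
ZONES = [
    "000_inbox", "100_raw", "200_staging",
    "300_processed", "400_embeddings", "500_catalog",
]

def bootstrap_lake(root: str) -> list[Path]:
    """Create every zone directory under the Data Lake root, returning the paths."""
    root_path = Path(root).expanduser().resolve()
    created = []
    for zone in ZONES:
        zone_dir = root_path / zone
        zone_dir.mkdir(parents=True, exist_ok=True)  # idempotent, like mkdir -p
        created.append(zone_dir)
    return created
```

Typical use would be `bootstrap_lake(os.environ["HOST_LAKE_PATH"])` after loading `.env`.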
After successful startup, services are available at:
| Service | URL | Notes |
|---|---|---|
| Backend API | http://localhost:8011 | Base URL for API calls |
| Swagger UI | http://localhost:8011/docs | Interactive API docs |
| Streamlit UI | http://localhost:8012 | Login: admin / admin123 |
| Qdrant | http://localhost:8013 | Vector DB (port 6333 in container mapped to 8013) |
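Before scripting against these endpoints, it can help to poll them until they answer. A minimal sketch using only the Python standard library (the URLs and timeout are illustrative, not part of LakeFlow):

```python
import urllib.error
import urllib.request

def service_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` gets any response at all,
    False on connection errors or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered (even with 4xx/5xx), so the service is up.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

For example, `service_ready("http://localhost:8011/docs")` should turn True once the backend has started.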
## Project structure

```
LakeFlow/
├── backend/                  # FastAPI + pipeline scripts (Python)
│   ├── src/lakeflow/         # Main package
│   │   ├── api/              # Routers: auth, search, pipeline, inbox, qdrant, ...
│   │   ├── scripts/          # step0_inbox, step1_raw, step2_staging, step3, step4
│   │   └── ...
│   └── requirements.txt
├── frontend/streamlit/       # Streamlit control UI
│   ├── app.py                # Entry point
│   ├── pages/                # Dashboard, Pipeline Runner, Semantic Search, ...
│   └── services/             # api_client, pipeline_service, ...
├── website/                  # Docs site (Next.js)
├── docker-compose.yml        # Docker config
├── env.example               # Env var template
└── README.md
```
## Local development (without Docker)

Run the backend and frontend directly on your machine for faster debugging. Requires Python 3.10+.
- Qdrant:

  ```bash
  docker compose up -d qdrant
  ```

- Backend:

  ```bash
  cd backend
  python3 -m venv .venv
  source .venv/bin/activate          # Windows: .venv\Scripts\activate
  pip install torch                  # Mac M1: install first for Metal (MPS)
  pip install -r requirements.txt && pip install -e .
  uvicorn lakeflow.main:app --reload --port 8011
  ```
- Frontend (from the repo root):

  ```bash
  python frontend/streamlit/dev_with_reload.py
  # or: streamlit run frontend/streamlit/app.py
  ```
The `.env` in the repo root needs `LAKE_ROOT`, `QDRANT_HOST=localhost`, and `API_BASE_URL=http://localhost:8011`. `dev_with_reload.py` auto-loads `.env` from the repo root.
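The auto-loading behaviour of `dev_with_reload.py` can be approximated with a tiny dotenv parser — a sketch only; the actual implementation in the repo may differ:

```python
import os

def load_env(path: str = ".env", override: bool = False) -> dict[str, str]:
    """Load KEY=VALUE lines from a dotenv-style file into os.environ.
    Blank lines, comments, and malformed lines are skipped."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip('"').strip("'")
            loaded[key] = value
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded
```

Calling `load_env()` from the repo root before starting the frontend makes `API_BASE_URL` and friends visible to the process.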
## First run workflow
- Create the zones (if needed):

  ```bash
  mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging \
           $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  ```

- Add files to the inbox: copy PDF/Word/Excel files to `000_inbox/<domain>/` (e.g. `000_inbox/regulations/doc.pdf`), or call `POST /inbox/upload`.
- Run the pipeline: via Streamlit (Pipeline Runner) or the API: `POST /pipeline/run/step0` → step1 → step2 → step3 → step4
- Test search: use the Semantic Search page in the UI, or `POST /search/semantic`.

Note: step3 (embedding) and Semantic Search require Ollama (`LLM_BASE_URL`). Run `ollama pull qwen3-embedding:8b` (or your chosen model) first.
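The step0 → step4 ordering above can be scripted as a small driver. The sketch below injects the HTTP call as a callable so the ordering and fail-fast behaviour are explicit; the endpoint paths follow the `/pipeline/run/stepN` pattern above, everything else is illustrative:

```python
from typing import Callable

# Pipeline stages in the order they must run (from the workflow above).
PIPELINE_STEPS = ["step0", "step1", "step2", "step3", "step4"]

def run_pipeline(call_api: Callable[[str], bool]) -> list[str]:
    """Run every step in order via `call_api` (e.g. an HTTP POST to the
    given path); stop at the first failure and return the completed steps."""
    completed = []
    for step in PIPELINE_STEPS:
        if not call_api(f"/pipeline/run/{step}"):
            break  # later steps depend on earlier ones, so stop here
        completed.append(step)
    return completed
```

In practice `call_api` could be wired to something like `lambda path: requests.post(API_BASE_URL + path).ok`.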
## Mac M1 / Metal (MPS)

Docker runs Linux containers, so Metal/MPS is not available inside the container. To use the GPU on a Mac M1, run the backend in a venv on macOS: `pip install torch` first, then `pip install -r requirements.txt`. PyTorch will then use MPS.
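Inside your own backend code you can let PyTorch pick the best available device. A hedged sketch (not LakeFlow's actual code) that prefers MPS, then CUDA, then CPU; the torch module is passed in explicitly:

```python
def pick_device(torch_mod) -> str:
    """Return the best available device name for the given torch module:
    'mps' on Apple Silicon, 'cuda' on NVIDIA GPUs, else 'cpu'."""
    mps = getattr(getattr(torch_mod, "backends", None), "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    cuda = getattr(torch_mod, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"
    return "cpu"

# Typical use:
#   import torch
#   device = torch.device(pick_device(torch))
```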
## Build on server without GPU

The backend image defaults to CPU-only PyTorch (~2 GB). Building requires BuildKit (`DOCKER_BUILDKIT=1`):

```bash
DOCKER_BUILDKIT=1 docker compose up --build
```
## Troubleshooting

- Compose errors: ensure `HOST_LAKE_PATH` exists and is an absolute path.
- Frontend shows "Connection refused": the backend must be running first; check `API_BASE_URL`.
- Search returns nothing: confirm step3 and step4 have run, the Qdrant collection contains data, and `EMBED_MODEL` matches the model used in step3.