# Getting Started

## System requirements
- Docker ≥ 20.x and Docker Compose ≥ 2.x (for the Docker install)
- Python 3.10+ (for local dev without Docker)
- Disk space: backend image ~2 GB (PyTorch CPU), Qdrant ~500 MB
## Quick install (Docker)

Run the backend, frontend, and Qdrant with Docker Compose:
- Clone and prepare env:

  ```bash
  git clone https://github.com/Lampx83/LakeFlow.git LakeFlow
  cd LakeFlow
  cp env.example .env        # or: cp .env.example .env
  ```
- Required: edit `.env` and set `HOST_LAKE_PATH` to the absolute path of the Data Lake directory. Examples:

  - macOS:

    ```bash
    HOST_LAKE_PATH=/Users/you/lakeflow_data
    ```

  - Linux:

    ```bash
    HOST_LAKE_PATH=/datalake/research
    ```
- Create the directory if needed:

  ```bash
  mkdir -p $HOST_LAKE_PATH
  ```

- Create the zones:

  ```bash
  mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging \
           $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  ```

- Run:

  ```bash
  docker compose up --build    # add -d to run in the background
  ```
Note: the Docker volume uses `device: $HOST_LAKE_PATH`. If the variable is empty or the path does not exist, `docker compose` will fail to start.
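The zone layout created above can also be bootstrapped from Python — a minimal sketch, not part of the LakeFlow codebase; the zone names are taken from the `mkdir` commands above:

```python
from pathlib import Path

# Zone directories used by the LakeFlow pipeline (from the mkdir commands above).
ZONES = [
    "000_inbox", "100_raw", "200_staging",
    "300_processed", "400_embeddings", "500_catalog",
]

def bootstrap_lake(root: str) -> list[Path]:
    """Create every zone directory under the Data Lake root, returning the paths."""
    root_path = Path(root).expanduser().resolve()
    created = []
    for zone in ZONES:
        zone_dir = root_path / zone
        zone_dir.mkdir(parents=True, exist_ok=True)  # idempotent, like mkdir -p
        created.append(zone_dir)
    return created
```

Typical use would be `bootstrap_lake(os.environ["HOST_LAKE_PATH"])` after loading `.env`.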
After successful startup, services are available at:
| Service | URL | Notes |
|---|---|---|
| Backend API | http://localhost:8011 | Base URL for API calls |
| Swagger UI | http://localhost:8011/docs | Interactive API docs |
| Streamlit UI | http://localhost:8012 | Login: admin / admin123 |
| Qdrant | http://localhost:8013 | Vector DB (port 6333 in container mapped to 8013) |
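Before scripting against these endpoints, it can help to poll them until they answer. A minimal sketch using only the Python standard library (the URLs and timeout are illustrative, not part of LakeFlow):

```python
import urllib.error
import urllib.request

def service_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` gets any response at all,
    False on connection errors or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered (even with 4xx/5xx), so the service is up.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

For example, `service_ready("http://localhost:8011/docs")` should turn True once the backend has started.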
## Project structure

```
LakeFlow/
├── backend/                  # FastAPI + pipeline scripts (Python)
│   ├── src/lakeflow/         # Main package
│   │   ├── api/              # Routers: auth, search, pipeline, inbox, qdrant, ...
│   │   ├── scripts/          # step0_inbox, step1_raw, step2_staging, step3, step4
│   │   └── ...
│   └── requirements.txt
├── frontend/streamlit/       # Streamlit control UI
│   ├── app.py                # Entry point
│   ├── pages/                # Dashboard, Pipeline Runner, Semantic Search, ...
│   └── services/             # api_client, pipeline_service, ...
├── website/                  # Docs site (Next.js)
├── docker-compose.yml        # Docker config
├── env.example               # Env var template
└── README.md
```
## Local development (without Docker)

Run the backend and frontend directly on your machine for faster debugging. Requires Python 3.10+.
- Qdrant:

  ```bash
  docker compose up -d qdrant
  ```

- Backend:

  ```bash
  cd backend
  python3 -m venv .venv
  source .venv/bin/activate          # Windows: .venv\Scripts\activate
  pip install torch                  # Mac M1: install first for Metal (MPS)
  pip install -r requirements.txt && pip install -e .
  uvicorn lakeflow.main:app --reload --port 8011
  ```
- Frontend (from the repo root):

  ```bash
  python frontend/streamlit/dev_with_reload.py
  # or: streamlit run frontend/streamlit/app.py
  ```
The `.env` in the repo root needs `LAKE_ROOT`, `QDRANT_HOST=localhost`, and `API_BASE_URL=http://localhost:8011`. `dev_with_reload.py` auto-loads `.env` from the repo root.
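The auto-loading behaviour of `dev_with_reload.py` can be approximated with a tiny dotenv parser — a sketch only; the actual implementation in the repo may differ:

```python
import os

def load_env(path: str = ".env", override: bool = False) -> dict[str, str]:
    """Load KEY=VALUE lines from a dotenv-style file into os.environ.
    Blank lines, comments, and malformed lines are skipped."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip('"').strip("'")
            loaded[key] = value
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded
```

Calling `load_env()` from the repo root before starting the frontend makes `API_BASE_URL` and friends visible to the process.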
## First run workflow
- Create the zones (if needed):

  ```bash
  mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging \
           $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  ```

- Add files to the inbox: copy PDF/Word/Excel files to `000_inbox/<domain>/` (e.g. `000_inbox/regulations/doc.pdf`), or call `POST /inbox/upload`.
- Run the pipeline: via Streamlit (Pipeline Runner) or the API: `POST /pipeline/run/step0` → step1 → step2 → step3 → step4
- Test search: use the Semantic Search page in the UI, or `POST /search/semantic`.

Note: step3 (embedding) and Semantic Search require Ollama (`LLM_BASE_URL`). Run `ollama pull qwen3-embedding:8b` (or your chosen model) first.
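The step0 → step4 ordering above can be scripted as a small driver. The sketch below injects the HTTP call as a callable so the ordering and fail-fast behaviour are explicit; the endpoint paths follow the `/pipeline/run/stepN` pattern above, everything else is illustrative:

```python
from typing import Callable

# Pipeline stages in the order they must run (from the workflow above).
PIPELINE_STEPS = ["step0", "step1", "step2", "step3", "step4"]

def run_pipeline(call_api: Callable[[str], bool]) -> list[str]:
    """Run every step in order via `call_api` (e.g. an HTTP POST to the
    given path); stop at the first failure and return the completed steps."""
    completed = []
    for step in PIPELINE_STEPS:
        if not call_api(f"/pipeline/run/{step}"):
            break  # later steps depend on earlier ones, so stop here
        completed.append(step)
    return completed
```

In practice `call_api` could be wired to something like `lambda path: requests.post(API_BASE_URL + path).ok`.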
## Mac M1 / Metal (MPS)

Docker runs Linux containers, so Metal/MPS is not available inside the container. To use the GPU on a Mac M1, run the backend in a venv on macOS: `pip install torch` first, then `pip install -r requirements.txt`. PyTorch will then use MPS.
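Inside your own backend code you can let PyTorch pick the best available device. A hedged sketch (not LakeFlow's actual code) that prefers MPS, then CUDA, then CPU; the torch module is passed in explicitly:

```python
def pick_device(torch_mod) -> str:
    """Return the best available device name for the given torch module:
    'mps' on Apple Silicon, 'cuda' on NVIDIA GPUs, else 'cpu'."""
    mps = getattr(getattr(torch_mod, "backends", None), "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    cuda = getattr(torch_mod, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"
    return "cpu"

# Typical use:
#   import torch
#   device = torch.device(pick_device(torch))
```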
## Build on server without GPU

The backend image defaults to CPU-only PyTorch (~2 GB). Building requires BuildKit (`DOCKER_BUILDKIT=1`):

```bash
DOCKER_BUILDKIT=1 docker compose up --build
```
## Troubleshooting

- Compose errors: ensure `HOST_LAKE_PATH` exists and is an absolute path.
- Frontend shows "Connection refused": the backend must be running first; check `API_BASE_URL`.
- Search returns nothing: confirm step3 and step4 have run, the Qdrant collection contains data, and `EMBED_MODEL` matches the model used in step3.