LakeFlow

Getting Started

System requirements

  • Docker ≥ 20.x and Docker Compose ≥ 2.x (for Docker install)
  • Python 3.10+ (for local dev without Docker)
  • Disk space: backend image ~2GB (PyTorch CPU), Qdrant ~500MB

Quick install (Docker)

Run Backend, Frontend and Qdrant with Docker Compose:

  1. Clone and prepare env:
git clone https://github.com/Lampx83/LakeFlow.git LakeFlow
cd LakeFlow
cp env.example .env   # or cp .env.example .env
  2. Required: edit .env and set HOST_LAKE_PATH to the absolute path of the Data Lake directory. Examples:
    • macOS: HOST_LAKE_PATH=/Users/you/lakeflow_data
    • Linux: HOST_LAKE_PATH=/datalake/research
    The directory must exist before running.
  3. Create the directory if needed: mkdir -p $HOST_LAKE_PATH
  4. Create the zones: mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  5. Run: docker compose up --build (add -d to run in the background)

Note: The Docker volume is bound to device: $HOST_LAKE_PATH. If the variable is empty or the path does not exist, docker compose will fail to start.
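The zone-creation commands above can also be scripted; the sketch below is an illustrative helper, not part of the LakeFlow codebase:

```python
import os
from pathlib import Path

# The six zones from the install steps above.
ZONES = ["000_inbox", "100_raw", "200_staging",
         "300_processed", "400_embeddings", "500_catalog"]

def create_zones(lake_root: str) -> list[Path]:
    """Create every zone directory under lake_root (idempotent, like mkdir -p)."""
    root = Path(lake_root)
    paths = [root / zone for zone in ZONES]
    for p in paths:
        p.mkdir(parents=True, exist_ok=True)
    return paths

if __name__ == "__main__":
    # Mirrors the mkdir -p commands: HOST_LAKE_PATH must already be set.
    lake = os.environ.get("HOST_LAKE_PATH")
    if lake:
        create_zones(lake)
```

Running it twice is harmless, since existing directories are left untouched.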

After successful startup, services are available at:

  Service        URL                          Notes
  Backend API    http://localhost:8011        Base URL for API calls
  Swagger UI     http://localhost:8011/docs   Interactive API docs
  Streamlit UI   http://localhost:8012        Login: admin / admin123
  Qdrant         http://localhost:8013        Vector DB (container port 6333 mapped to 8013)
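A quick way to confirm all three services are answering is to poll the URLs above; a minimal sketch, where `is_up` is a hypothetical helper that treats any HTTP response, even an error status, as "up":

```python
import urllib.error
import urllib.request

# Service URLs from the table above.
SERVICES = {
    "Backend API":  "http://localhost:8011/docs",
    "Streamlit UI": "http://localhost:8012",
    "Qdrant":       "http://localhost:8013",
}

def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP status at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:12s} {url:28s} {'UP' if is_up(url) else 'DOWN'}")
```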

Project structure

LakeFlow/
├── backend/                   # FastAPI + pipeline scripts (Python)
│   ├── src/lakeflow/          # Main package
│   │   ├── api/               # Routers: auth, search, pipeline, inbox, qdrant, ...
│   │   ├── scripts/           # step0_inbox, step1_raw, step2_staging, step3, step4
│   │   └── ...
│   └── requirements.txt
├── frontend/streamlit/        # Streamlit control UI
│   ├── app.py                 # Entry point
│   ├── pages/                 # Dashboard, Pipeline Runner, Semantic Search, ...
│   └── services/              # api_client, pipeline_service, ...
├── website/                   # Docs site (Next.js)
├── docker-compose.yml         # Docker config
├── env.example                # Env var template
└── README.md

Local development (without Docker)

Run the backend and frontend directly on your machine for faster debugging. Requires Python 3.10+.

  1. Qdrant: docker compose up -d qdrant
  2. Backend:
    cd backend
    python3 -m venv .venv
    source .venv/bin/activate   # Windows: .venv\Scripts\activate
    pip install torch           # Mac M1: install first for Metal (MPS)
    pip install -r requirements.txt && pip install -e .
    uvicorn lakeflow.main:app --reload --port 8011
  3. Frontend: From repo root:
    python frontend/streamlit/dev_with_reload.py
    # or: streamlit run frontend/streamlit/app.py

A .env file in the repo root must define LAKE_ROOT, QDRANT_HOST=localhost, and API_BASE_URL=http://localhost:8011. dev_with_reload.py auto-loads .env from the repo root.
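A minimal .env for local development might look like this (the LAKE_ROOT value is only an example path):

```shell
# .env at the repo root (auto-loaded by dev_with_reload.py)
LAKE_ROOT=/Users/you/lakeflow_data        # absolute path to the Data Lake
QDRANT_HOST=localhost
API_BASE_URL=http://localhost:8011
```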

First run workflow

  1. Create zones (if needed): mkdir -p $HOST_LAKE_PATH/000_inbox $HOST_LAKE_PATH/100_raw $HOST_LAKE_PATH/200_staging $HOST_LAKE_PATH/300_processed $HOST_LAKE_PATH/400_embeddings $HOST_LAKE_PATH/500_catalog
  2. Add files to inbox: Copy PDF/Word/Excel to 000_inbox/<domain>/ (e.g. 000_inbox/regulations/doc.pdf) or call POST /inbox/upload
  3. Run pipeline: Via Streamlit (Pipeline Runner) or API: POST /pipeline/run/step0 → step1 → step2 → step3 → step4
  4. Test search: Semantic Search page in UI or POST /search/semantic
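The step sequence above can also be driven from code; a sketch against the /pipeline/run endpoints listed, where the error handling is an assumption and Swagger UI at /docs documents the real response schema:

```python
import urllib.request

BASE = "http://localhost:8011"   # Backend API base URL
STEPS = ["step0", "step1", "step2", "step3", "step4"]

def step_url(base: str, step: str) -> str:
    """Build the endpoint for one pipeline step, e.g. /pipeline/run/step0."""
    return f"{base}/pipeline/run/{step}"

def run_pipeline(base: str = BASE) -> None:
    """POST each step in order; an HTTP error aborts the run."""
    for step in STEPS:
        req = urllib.request.Request(step_url(base, step), method="POST")
        with urllib.request.urlopen(req) as resp:
            print(step, resp.status)
```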

Note: step3 (embedding) and Semantic Search require a running Ollama server (configured via LLM_BASE_URL). Run ollama pull qwen3-embedding:8b (or your chosen model) first.
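Step 4 of the workflow can likewise be called from code; a sketch that assumes a JSON body of {"query": ..., "top_k": ...}, which may differ from the actual schema shown in Swagger UI:

```python
import json
import urllib.request

def semantic_search(query: str, base: str = "http://localhost:8011",
                    top_k: int = 5) -> dict:
    """POST a query to /search/semantic and return the decoded JSON reply.

    The request body shape is an assumption; check /docs for the real schema.
    """
    body = json.dumps({"query": query, "top_k": top_k}).encode()
    req = urllib.request.Request(
        f"{base}/search/semantic", data=body,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```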

Mac M1 / Metal (MPS)

Docker containers run Linux, so Metal/MPS is not available inside them. To use the GPU on an M1 Mac, run the backend in a venv directly on macOS: pip install torch first, then pip install -r requirements.txt. PyTorch will then use MPS automatically.

Build on server without GPU

The backend image defaults to CPU-only PyTorch (~2GB). Building requires BuildKit: DOCKER_BUILDKIT=1 docker compose up --build

Troubleshooting

  • Compose error: Ensure HOST_LAKE_PATH exists and is an absolute path.
  • Frontend "Connection refused": Backend must run first; check API_BASE_URL.
  • Search returns empty: Verify that step3 and step4 have run, that the Qdrant collection contains data, and that EMBED_MODEL matches the model used in step3.
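For the "does the collection have data" check, Qdrant's standard REST API can be queried directly on the mapped port; `list_collections` is an illustrative helper:

```python
import json
import urllib.request

def list_collections(qdrant_url: str = "http://localhost:8013") -> list[str]:
    """Return the names of all Qdrant collections via GET /collections."""
    with urllib.request.urlopen(f"{qdrant_url}/collections") as resp:
        data = json.loads(resp.read())
    return [c["name"] for c in data["result"]["collections"]]
```

An empty list (or a missing collection) means step3/step4 have not populated the vector store yet.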