LakeFlow
Get Started

Frontend (Streamlit)

The LakeFlow frontend is a Streamlit control UI served at http://localhost:8012. It connects to the Backend API to run pipelines, explore the Data Lake, and test Semantic Search.

Login

Default credentials: admin / admin123. The JWT token is stored in the session and sent with every API request.

Pages requiring login: Q&A with AI and some operations in System Settings. Other pages work as soon as the backend is running.

Pages overview

Dashboard

Pipeline status overview and run history; quick view of file counts per zone and recent pipeline runs.

Data Lake Explorer

Browse the zone directory tree: inbox → raw → staging → processed → embeddings → catalog. Select a zone and path to list files and preview JSON content (validation.json, chunks.json).
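The per-zone file counts shown on the Dashboard and in System Settings boil down to walking each zone directory. A minimal sketch, assuming the zone names above live directly under the Data Lake root (the function name is illustrative):

```python
from pathlib import Path

# Zone names from the inbox → … → catalog flow above.
ZONES = ["inbox", "raw", "staging", "processed", "embeddings", "catalog"]


def zone_file_counts(lake_root: str) -> dict[str, int]:
    """Count files under each zone; missing zones report 0."""
    root = Path(lake_root)
    counts = {}
    for zone in ZONES:
        zone_dir = root / zone
        counts[zone] = (
            sum(1 for p in zone_dir.rglob("*") if p.is_file())
            if zone_dir.exists() else 0
        )
    return counts
```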

Pipeline Runner

Shown only when LAKEFLOW_MODE=DEV. Manually run step0→step4. Options:

  • Select folder (domain or file_hash) β€” run on subset only
  • Enable Force rerun β€” run again even if already processed
  • Step3: choose embed model from dropdown (from EMBED_MODEL_OPTIONS)
  • Step4: choose collection_name, qdrant_url

Results show the returncode, stdout, and stderr of each run.

SQLite Viewer

View SQLite databases in the Data Lake (e.g. the catalog or app DB). Select a .db file to list its tables and run queries.
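Listing the tables of a selected .db file is a one-query job against sqlite_master; a minimal sketch of what such a viewer runs:

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the table names of a SQLite file, sorted by name."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```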

Qdrant Inspector

List collections and view the points in a collection. Supports a custom Qdrant URL (multi-Qdrant). Useful for verifying vectors after step4.
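The two operations map onto Qdrant's HTTP API: GET /collections lists collections, and POST /collections/{name}/points/scroll pages through points. A small helper for composing those endpoints (the function itself is illustrative, not from qdrant_service.py):

```python
def qdrant_endpoints(base_url: str, collection: str) -> dict[str, str]:
    """Compose the Qdrant REST endpoints the inspector needs."""
    base = base_url.rstrip("/")
    return {
        "list_collections": f"{base}/collections",                          # GET
        "scroll_points": f"{base}/collections/{collection}/points/scroll",  # POST
    }
```

Because base_url is a parameter, the same helper serves the multi-Qdrant case: just pass the custom URL entered in the UI.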

Semantic Search

Enter a natural-language question and get scored results. You can select the collection, Qdrant URL, and top_k. Use it to test search before integrating with the API.

Q&A with AI

RAG Q&A: ask a question → semantic search retrieves context → an LLM (Ollama/OpenAI) answers. Login required. Displays the retrieved contexts and the answer.
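The middle step of that flow, folding retrieved chunks into an LLM prompt, can be sketched as below. The template is illustrative; the real prompt used by the Q&A page may differ.

```python
def build_rag_prompt(question: str, contexts: list[str]) -> str:
    """Join retrieved chunks into a grounded LLM prompt (sketch)."""
    # Number the chunks so the answer can be checked against its sources.
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )
```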

System Settings

Full configuration view:

  • Connection status (Backend, Qdrant)
  • Runtime config table (Data Lake path, Qdrant URL, Embed/LLM model, whether the OpenAI key is set)
  • Zone status (file counts) and a button to create missing zones
  • Data Lake path configuration

Secrets (the API key) are never displayed.

Multi-Qdrant

Semantic Search and Qdrant Inspector accept a custom Qdrant URL and collection. Use this when testing multiple vector stores or environments.

Frontend code structure

frontend/streamlit/
├── app.py                  # Entry, sidebar, routing
├── pages/                  # Each file = one page (Streamlit auto-detect)
│   ├── pipeline_dashboard.py
│   ├── data_lake_explorer.py
│   ├── pipeline_runner.py
│   ├── sqlite_viewer.py
│   ├── qdrant_inspector.py
│   ├── semantic_search.py
│   ├── qa.py               # Q&A with AI
│   ├── system_settings.py
│   ├── admin.py
│   └── login.py
├── state/
│   ├── session.py         # Session init
│   └── token_store.py     # Auth token storage
└── services/
    ├── api_client.py      # HTTP client for backend
    ├── pipeline_service.py  # Calls /pipeline/run/*
    └── qdrant_service.py  # Qdrant API calls

Run locally

# From repo root
# dev_with_reload auto-loads .env from repo root
python frontend/streamlit/dev_with_reload.py

# Or run Streamlit directly (needs .env or exported vars)
streamlit run frontend/streamlit/app.py

When running the backend locally, set API_BASE_URL=http://localhost:8011 in .env. The frontend auto-resolves lakeflow-backend → localhost when the hostname does not resolve (in Docker, it uses the service name).
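The auto-resolution described above amounts to a DNS probe with a localhost fallback. A sketch under that assumption (the function name is illustrative; the resolver is injectable so it can be tested without a network):

```python
import socket
from urllib.parse import urlparse, urlunparse


def resolve_api_base(url: str, resolver=socket.gethostbyname) -> str:
    """Fall back to localhost when the URL's hostname does not resolve."""
    parts = urlparse(url)
    try:
        resolver(parts.hostname)
        return url  # hostname resolves, e.g. inside the Docker network
    except socket.gaierror:
        # Keep the port, swap the Docker service name for localhost.
        host_port = f"localhost:{parts.port}" if parts.port else "localhost"
        return urlunparse(parts._replace(netloc=host_port))
```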

Troubleshooting

  • Connection refused: Check backend is running, API_BASE_URL is correct. In Docker, frontend calls lakeflow-backend:8011.
  • Pipeline Runner not showing: Set LAKEFLOW_MODE=DEV in .env.
  • Q&A 401 error: Log in again; token may have expired.