LakeFlow
Get Started

Backend API

The LakeFlow backend is a FastAPI application. Base URL: http://localhost:8011 (dev) or your deployment URL.

Interactive docs: Swagger UI at /docs, ReDoc at /redoc. Use the Bearer token from POST /auth/login for endpoints that require auth.

Routes overview

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Health check. Response: {status: ok}. For liveness probes. |
| /auth | POST, GET | Login (POST /auth/login), token, current user (GET /auth/me) |
| /search | POST | Embed (POST /embed), semantic search (POST /semantic), Q&A (POST /qa) |
| /pipeline | GET, POST | Run pipeline steps step0–step4 (GET /folders/{step}, POST /run/{step}) |
| /system | GET, POST | Data Lake path: GET/POST /data-path |
| /qdrant | GET, POST | Qdrant proxy: collections, points, filter |
| /inbox | GET, POST | Upload (POST /upload), GET /domains, GET /list |
| /admin | GET, DELETE | Users, delete message history |
| /admission_agent/v1 | GET, POST | Example agent for AI Portal. See the Admission Agent section. |

Auth

Demo mechanism: the username/password pair is hard-coded. A JWT token is used for protected endpoints (Q&A, Admin).

POST /auth/login

Request: { username: string, password: string }

Response: { access_token: string } (a JWT that expires in 24 hours)
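The response does not expose the expiry directly, but since the token is a standard JWT its payload can be inspected client-side. A minimal sketch (no signature verification, and the demo token below is built locally rather than issued by /auth/login):

```python
import base64
import json
import time

def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature.
    Useful for checking the exp claim client-side; never use for trust decisions."""
    payload_b64 = token.split(".")[1]
    # base64url payloads may lack padding; add it back before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def is_expired(token: str) -> bool:
    exp = jwt_payload(token).get("exp")
    return exp is not None and exp < time.time()

# Demo with a locally built (unsigned) token; a real one comes from /auth/login
header = base64.urlsafe_b64encode(b'{"alg":"HS256","typ":"JWT"}').rstrip(b"=").decode()
body = base64.urlsafe_b64encode(
    json.dumps({"sub": "admin", "exp": int(time.time()) + 24 * 3600}).encode()
).rstrip(b"=").decode()
demo_token = f"{header}.{body}.signature"
print(is_expired(demo_token))  # False: still within the 24h window
```

Refreshing simply means calling POST /auth/login again once is_expired returns True.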

# Example: login and save token
TOKEN=$(curl -s -X POST "http://localhost:8011/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin123"}' | jq -r '.access_token')

GET /auth/me

Requires header Authorization: Bearer <token>. Returns { username: string }.

Endpoints requiring Bearer token: POST /search/qa, GET /admin/users, DELETE /admin/users/{username}/messages, GET /admission_agent/v1/metadata, POST /admission_agent/v1/ask, GET /admission_agent/v1/data.

curl -X POST "http://localhost:8011/search/qa" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"question":"What is the enrollment quota?","top_k":5}'

Search APIs

All search endpoints use EMBED_MODEL (served by Ollama). Ensure the model is pulled (ollama pull qwen3-embedding:8b).

POST /search/embed

Converts text to a vector, using the same model as semantic search and step3 embedding.

| Field | Type | Description |
| --- | --- | --- |
| text | string | Text to embed (required) |

curl -X POST "http://localhost:8011/search/embed" \
  -H "Content-Type: application/json" \
  -d '{"text":"University admission regulations"}'

Response: { text, vector, embedding, dim }. vector and embedding are identical; dim depends on the model (e.g. qwen3-embedding:8b ≈ 1024).
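Since /search/semantic ranks by cosine similarity, the score it returns can be reproduced client-side from two /search/embed vectors. A minimal sketch with toy vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the metric /search/semantic uses for ranking."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# These vectors would normally come from two POST /search/embed calls
v1 = [0.1, 0.3, 0.6]
v2 = [0.1, 0.3, 0.6]
print(round(cosine_similarity(v1, v2), 3))  # 1.0 for identical vectors
```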

POST /search/semantic

Semantic search via Qdrant. Returns chunks by cosine similarity.

| Field | Type | Default |
| --- | --- | --- |
| query | string | required |
| top_k | int | 5 |
| collection_name | string? | lakeflow_chunks |
| score_threshold | float? | none |
| qdrant_url | string? | default Qdrant |

Response: { query, results: [{ id, score, file_hash, chunk_id, section_id, text, token_estimate, source }] }

curl -X POST "http://localhost:8011/search/semantic" \
  -H "Content-Type: application/json" \
  -d '{"query":"admission requirements","top_k":5,"collection_name":"lakeflow_chunks"}'
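Results can include several chunks from the same document. If at most one hit per file is wanted, a small client-side pass (a hypothetical helper, not part of the API) can keep the best-scoring chunk per file_hash:

```python
def best_chunk_per_file(results: list[dict]) -> list[dict]:
    """Keep only the highest-scoring chunk per file_hash.
    Assumes each result carries the file_hash and score fields shown above."""
    best: dict[str, dict] = {}
    for r in results:
        key = r["file_hash"]
        if key not in best or r["score"] > best[key]["score"]:
            best[key] = r
    # Preserve ranking: sort survivors by score descending
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

sample = [
    {"file_hash": "a1", "chunk_id": 0, "score": 0.91},
    {"file_hash": "a1", "chunk_id": 3, "score": 0.84},
    {"file_hash": "b2", "chunk_id": 1, "score": 0.88},
]
print([r["file_hash"] for r in best_chunk_per_file(sample)])  # ['a1', 'b2']
```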

POST /search/qa

RAG Q&A: retrieve context via semantic search, then an LLM (Ollama/OpenAI) generates the answer. Requires a Bearer token.

| Field | Type | Default |
| --- | --- | --- |
| question | string | required |
| top_k | int | 5 |
| temperature | float | 0.7 |
| collection_name | string? | none |
| score_threshold | float? | none |
| qdrant_url | string? | none |

Response: { question, answer, contexts, model_used, debug_info }. debug_info includes steps_completed, curl_embed, curl_search, and curl_complete for troubleshooting.

Pipeline API

Runs individual pipeline steps (step0 through step4). Each step executes a Python script in a subprocess with a 1-hour timeout.

GET /pipeline/embed-models

Lists models available for step3. Returns { models: string[], default: string }, sourced from EMBED_MODEL_OPTIONS or a default list.

GET /pipeline/folders/{step}

Lists folders that can be run for a step: step0 lists domains in the inbox; step1 lists file_hash values; step2/3/4 accept a domain or a file_hash.

Response: { step, folders: string[] }

POST /pipeline/run/{step}

Run one pipeline step. Body (optional):

  • only_folders: run only on these folders (domain or file_hash)
  • force_rerun: re-run even if already processed
  • embed_model: step3 only; the Ollama model (e.g. qwen3-embedding:8b, nomic-embed-text)
  • collection_name: step4 only; Qdrant collection name (default lakeflow_chunks)
  • qdrant_url: step4 only; Qdrant URL (e.g. http://host:6333)

# Run all
curl -X POST "http://localhost:8011/pipeline/run/step0" -H "Content-Type: application/json" -d '{}'

# regulations domain only, force rerun
curl -X POST "http://localhost:8011/pipeline/run/step3" -H "Content-Type: application/json" \
  -d '{"only_folders":["regulations"],"force_rerun":true,"embed_model":"nomic-embed-text"}'

Response: { step, script, returncode, stdout, stderr }. returncode=0 means success.
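Because the endpoint reports the subprocess result rather than raising an HTTP error on script failure, callers should check returncode themselves. A hypothetical helper, assuming the response shape above:

```python
def check_pipeline_result(result: dict) -> str:
    """Raise if a /pipeline/run/{step} response signals failure; return stdout otherwise.
    result is the JSON body: { step, script, returncode, stdout, stderr }."""
    if result["returncode"] != 0:
        raise RuntimeError(
            f"{result['step']} ({result['script']}) failed "
            f"with code {result['returncode']}: {result['stderr']}"
        )
    return result["stdout"]

# Simulated successful response; real ones come from POST /pipeline/run/{step}
ok = {"step": "step3", "script": "step3.py", "returncode": 0,
      "stdout": "embedded 12 chunks", "stderr": ""}
print(check_pipeline_result(ok))  # embedded 12 chunks
```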

System API

  • GET /system/health-detail: backend status + Qdrant connection. Returns backend, qdrant_connected, qdrant_error, qdrant_url.
  • GET /system/config: runtime config (no secrets). Used by the System Settings UI.
  • GET /system/zones-status: per-zone status (exists, file_count). Returns zones[], all_zones_exist.
  • POST /system/create-zones: create missing zones in the current path. Idempotent.
  • GET /system/data-path: returns { data_base_path: string | null } (the current LAKE_ROOT)
  • POST /system/data-path: body { path: string }; sets the Data Lake path. The path must exist and contain all 6 zones.

Inbox API

Upload files to the inbox and automatically run the pipeline (step0 through step4) in the background.

POST /inbox/upload

Multipart form:

| Field | Required | Description |
| --- | --- | --- |
| domain | Yes | Subfolder in 000_inbox (e.g. regulations, syllabus). Only a-z, 0-9, _, - |
| path | No | Subpath within the domain (e.g. folder1/folder2) |
| files | Yes | File(s) to upload. Supported: .pdf, .docx, .xlsx, .xls, .pptx, .txt. Max 100 MB/file |
| qdrant_url | No | Qdrant URL for step4 (defaults to the default Qdrant) |

# Upload a single file
curl -X POST "http://localhost:8011/inbox/upload" \
  -F "domain=regulations" \
  -F "files=@document.pdf"

# Upload multiple files to subpath
curl -X POST "http://localhost:8011/inbox/upload" \
  -F "domain=syllabus" \
  -F "path=2024/course_a" \
  -F "files=@doc1.pdf" -F "files=@doc2.docx"

Response: { uploaded: string[], errors: string[] }. After successful upload, pipeline runs in background; step4 uses collection_name = domain.
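The constraints in the table above (domain character set, extension whitelist, 100 MB per file) can be pre-checked client-side before uploading. A sketch with hypothetical helper names; the backend remains the source of truth:

```python
import re

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".xls", ".pptx", ".txt"}
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB per file
DOMAIN_RE = re.compile(r"^[a-z0-9_-]+$")  # only a-z, 0-9, _, -

def validate_upload(domain: str, filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation errors (empty means the upload should pass)."""
    errors = []
    if not DOMAIN_RE.match(domain):
        errors.append("domain may only contain a-z, 0-9, _ and -")
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        errors.append("file exceeds 100 MB")
    return errors

print(validate_upload("regulations", "doc.pdf", 1024))  # []
print(validate_upload("Bad Domain", "notes.md", 1024))  # two errors
```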

GET /inbox/domains

Returns { domains: string[] }: the list of top-level folders in 000_inbox.

GET /inbox/list

Query params: domain (optional), path (optional).

Response: without domain: { domains[], files[], folders[] }. With domain: { domain, path, folders[], files[] }; each file has name, size, mtime.

Admin API

Requires Bearer token.

  • GET /admin/users: list users and message stats (for Q&A)
  • DELETE /admin/users/{username}/messages: delete a user's message history

Admission Agent: Example for AI Portal

/admission_agent/v1 is an example agent that demonstrates how to build an AI agent for AI Portal to consume. LakeFlow handles the data pipeline (documents → inbox → embedding → Qdrant); this agent exposes a compatible API so AI Portal can connect and use the data.

Use case: Upload admission/enrollment documents to the Data Lake, run the pipeline into the Admission collection, then register this agent in AI Portal. Users can then ask questions via AI Portal; the agent uses semantic search + LLM (RAG) to answer.

You can implement similar agents for other domains (regulations, syllabus, etc.) by replicating this pattern. API shape matches Research Agent (/metadata, /data, /ask).

Endpoints

  • GET /admission_agent/v1/metadata: agent metadata (name, description, capabilities). May be public; AI Portal uses it to discover the agent.
  • GET /admission_agent/v1/data: list of data sources from the Admission collection. Requires a Bearer token.
  • POST /admission_agent/v1/ask: RAG Q&A over Admission documents. Body: { prompt: string, session_id?: string, model_id?: string, user?: string, context?: object }. Only prompt is required. Requires a Bearer token.
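Since only prompt is required, a client can build the /ask body by dropping unset optional fields. A hypothetical helper using the field names listed above:

```python
from typing import Optional

def build_ask_payload(prompt: str,
                      session_id: Optional[str] = None,
                      model_id: Optional[str] = None,
                      user: Optional[str] = None,
                      context: Optional[dict] = None) -> dict:
    """Build a POST /admission_agent/v1/ask body; only prompt is mandatory."""
    payload = {"prompt": prompt}
    optional = {"session_id": session_id, "model_id": model_id,
                "user": user, "context": context}
    # Omit fields the caller didn't set so the body stays minimal
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload

print(build_ask_payload("What are the tuition fees?"))
# {'prompt': 'What are the tuition fees?'}
```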

Requirements

  • LLM_BASE_URL (Ollama): for embedding and chat completion
  • Data in the Qdrant collection Admission: ingest documents via the LakeFlow pipeline (upload a domain, then run step0 through step4 with collection_name=Admission)

Example: register in AI Portal

Provide AI Portal with the agent base URL (e.g. http://your-backend:8011/admission_agent/v1). AI Portal will call /metadata to display the agent, then /ask (with user token) for questions.

Qdrant proxy

Proxy to Qdrant REST API. Query collections and points without direct Qdrant access.

  • GET /qdrant/collections: list collections
  • GET /qdrant/collections/{name}: collection info
  • GET /qdrant/collections/{name}/points: get points (scroll; supports limit, offset)
  • POST /qdrant/collections/{name}/filter: filter points (body: filter conditions)

Supports a qdrant_url parameter to point at a different Qdrant instance (multi-Qdrant setups). See Swagger for request body details.
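Assuming the proxy forwards the body in Qdrant's standard filter format (must conditions with exact-match values), a filter body could be assembled like this (hypothetical helper):

```python
def match_filter(**conditions: str) -> dict:
    """Build a Qdrant-style filter body matching exact payload values.
    Assumes the proxy forwards the body to Qdrant's filter API unchanged."""
    return {
        "filter": {
            "must": [
                {"key": key, "match": {"value": value}}
                for key, value in conditions.items()
            ]
        }
    }

body = match_filter(source="regulations.pdf")
# POST this to /qdrant/collections/{name}/filter
print(body)
```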

Python integration example

Use requests to call APIs from Python:

import requests

BASE = "http://localhost:8011"

# 1. Semantic search (no auth)
r = requests.post(f"{BASE}/search/semantic", json={
    "query": "admission regulations",
    "top_k": 5,
    "collection_name": "lakeflow_chunks"
})
results = r.json()["results"]

# 2. Q&A (login first)
login = requests.post(f"{BASE}/auth/login", json={
    "username": "admin", "password": "admin123"
})
token = login.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}
qa = requests.post(f"{BASE}/search/qa", json={
    "question": "Admission requirements?", "top_k": 5
}, headers=headers)
print(qa.json()["answer"])

# 3. Upload + auto pipeline
with open("doc.pdf", "rb") as f:
    r = requests.post(f"{BASE}/inbox/upload",
        data={"domain": "regulations"},
        files={"files": ("doc.pdf", f, "application/pdf")}
    )
print(r.json())  # {"uploaded": ["doc.pdf"], "errors": []}