Backend API
LakeFlow backend is a FastAPI application. Base URL: http://localhost:8011 (dev) or deployment URL.
Interactive docs: Swagger UI /docs, ReDoc /redoc. Use Bearer token from POST /auth/login for auth-required endpoints.
Routes overview
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check. Response: {"status": "ok"}. Used as a liveness probe. |
| /auth | various | Login (POST /auth/login), token issuance, GET /auth/me |
| /search | various | Embed (POST /embed), semantic search (POST /semantic), Q&A (POST /qa) |
| /pipeline | various | Run pipeline steps step0→step4 (GET /folders/{step}, POST /run/{step}) |
| /system | various | Data Lake path: GET/POST /data-path |
| /qdrant | various | Qdrant proxy: collections, points, filter |
| /inbox | various | Upload (POST /upload), GET /domains, GET /list |
| /admin | various | Users, delete messages |
| /admission_agent/v1 | various | Example agent for AI Portal. See the Admission Agent section. |
Auth
Demo mechanism: the username/password pair is hard-coded. The issued JWT is used for protected endpoints (Q&A, Admin).
POST /auth/login
Request: { username: string, password: string }
Response: { access_token: string } (a JWT, valid for 24 hours)
# Example: login and save token
TOKEN=$(curl -s -X POST "http://localhost:8011/auth/login" \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"admin123"}' | jq -r '.access_token')
GET /auth/me
Requires header Authorization: Bearer <token>. Returns { username: string }.
Endpoints requiring Bearer token: POST /search/qa, GET /admin/users, DELETE /admin/users/{username}/messages, GET /admission_agent/v1/metadata, POST /admission_agent/v1/ask, GET /admission_agent/v1/data.
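In Python, the same login-then-call flow can be sketched as follows (helper names like `login` and `auth_headers` are illustrative, not part of LakeFlow):

```python
import json
import urllib.request

BASE = "http://localhost:8011"

def auth_headers(token: str) -> dict:
    """Headers for Bearer-protected endpoints such as POST /search/qa."""
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

def login(username: str, password: str, base: str = BASE) -> str:
    """POST /auth/login and return the access_token (valid for 24 hours)."""
    req = urllib.request.Request(
        f"{base}/auth/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["access_token"]
```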
curl -X POST "http://localhost:8011/search/qa" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"question":"What is the enrollment quota?","top_k":5}'
Search APIs
All search endpoints use EMBED_MODEL (served by Ollama). Make sure the model is pulled first (ollama pull qwen3-embedding:8b).
POST /search/embed
Convert text to vector. Same model as semantic search and step3 embedding.
| Field | Type | Description |
|---|---|---|
| text | string | Text to embed (required) |
curl -X POST "http://localhost:8011/search/embed" \
-H "Content-Type: application/json" \
-d '{"text":"University admission regulations"}'
Response: { text, vector, embedding, dim }. vector and embedding are identical; dim depends on the model (e.g. qwen3-embedding:8b → 1024).
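Since semantic search ranks by cosine similarity, two vectors returned by /search/embed can be compared client-side with the same metric. A minimal sketch (pure Python; no server needed once the vectors are in hand):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity: the metric the semantic-search scores are based on."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```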
POST /search/semantic
Semantic search via Qdrant. Returns chunks by cosine similarity.
| Field | Type | Default |
|---|---|---|
| query | string | required |
| top_k | int | 5 |
| collection_name | string? | lakeflow_chunks |
| score_threshold | float? | none |
| qdrant_url | string? | default Qdrant |
Response: { query, results: [{ id, score, file_hash, chunk_id, section_id, text, token_estimate, source }] }
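The results array is easy to post-process client-side; as a sketch, keeping only the best score per source file (field names follow the response shape above, the helper itself is hypothetical):

```python
def best_score_per_source(results: list, min_score: float = 0.0) -> dict:
    """Collapse semantic-search hits to the highest-scoring hit per source."""
    best = {}
    for hit in results:
        if hit["score"] >= min_score:
            src = hit["source"]
            if src not in best or hit["score"] > best[src]:
                best[src] = hit["score"]
    return best
```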
curl -X POST "http://localhost:8011/search/semantic" \
-H "Content-Type: application/json" \
-d '{"query":"admission requirements","top_k":5,"collection_name":"lakeflow_chunks"}'
POST /search/qa
RAG Q&A: retrieve context via semantic search, then have an LLM (Ollama/OpenAI) generate the answer. Requires a Bearer token.
| Field | Type | Default |
|---|---|---|
| question | string | required |
| top_k | int | 5 |
| temperature | float | 0.7 |
| collection_name | string? | none |
| score_threshold | float? | none |
| qdrant_url | string? | none |
Response: { question, answer, contexts, model_used, debug_info }. debug_info contains steps_completed, curl_embed, curl_search, curl_complete for debugging.
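debug_info makes it easy to sanity-check a response programmatically; a small sketch (the check itself is an assumption, not a LakeFlow helper):

```python
def qa_succeeded(resp: dict) -> bool:
    """Heuristic: an answer exists and the RAG pipeline completed some steps."""
    steps = resp.get("debug_info", {}).get("steps_completed", [])
    return bool(resp.get("answer")) and len(steps) > 0
```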
Pipeline API
Runs individual pipeline steps (step0→step4). Each step is a subprocess executing a Python script, with a 1-hour timeout.
GET /pipeline/embed-models
List of models for step3. Returns { models: string[], default: string }, taken from EMBED_MODEL_OPTIONS or a built-in default list.
GET /pipeline/folders/{step}
Lists the folders that can be processed for the given step. step0: domains in the inbox; step1: file_hash values; step2/3/4: domain or file_hash.
Response: { step, folders: string[] }
POST /pipeline/run/{step}
Run one pipeline step. Body (optional):
- only_folders – run only on these folders (domain or file_hash)
- force_rerun – run again even if already processed
- embed_model – step3 only: Ollama model (e.g. qwen3-embedding:8b, nomic-embed-text)
- collection_name – step4 only: Qdrant collection name (default lakeflow_chunks)
- qdrant_url – step4 only: Qdrant URL (e.g. http://host:6333)
# Run all
curl -X POST "http://localhost:8011/pipeline/run/step0" -H "Content-Type: application/json" -d '{}'
# regulations domain only, force rerun
curl -X POST "http://localhost:8011/pipeline/run/step3" -H "Content-Type: application/json" \
-d '{"only_folders":["regulations"],"force_rerun":true,"embed_model":"nomic-embed-text"}'
Response: { step, script, returncode, stdout, stderr }. returncode=0 means success.
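The per-step endpoint makes it straightforward to script a full run. A sketch that stops at the first nonzero returncode; `post` is an injected callable (e.g. a thin wrapper around requests.post), so the flow can be exercised without a live backend:

```python
def run_pipeline(post, only_folders=None, force_rerun=False):
    """Call POST /pipeline/run/stepN for N in 0..4; stop on the first failure."""
    body = {}
    if only_folders:
        body["only_folders"] = only_folders
    if force_rerun:
        body["force_rerun"] = True
    results = []
    for n in range(5):
        resp = post(f"/pipeline/run/step{n}", body)
        results.append((f"step{n}", resp["returncode"]))
        if resp["returncode"] != 0:
            break  # a failed step means later steps would work on stale data
    return results
```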
System API
- GET /system/health-detail – backend status plus Qdrant connection. Returns backend, qdrant_connected, qdrant_error, qdrant_url.
- GET /system/config – runtime config (no secrets). Used by the System Settings UI.
- GET /system/zones-status – per-zone status (exists, file_count). Returns zones[], all_zones_exist.
- POST /system/create-zones – create missing zones in the current path. Idempotent.
- GET /system/data-path – returns { data_base_path: string | null } (the current LAKE_ROOT)
- POST /system/data-path – body { path: string }. Sets the Data Lake path; the path must exist and contain all 6 zones.
Inbox API
Upload files to the inbox and auto-run pipeline steps step0→step4 in the background.
POST /inbox/upload
Multipart form:
| Field | Required | Description |
|---|---|---|
| domain | Yes | Subfolder in 000_inbox (e.g. regulations, syllabus). Only a-z, 0-9, _, - |
| path | No | Subpath within the domain (e.g. folder1/folder2) |
| files | Yes | File(s) to upload. Supported: .pdf, .docx, .xlsx, .xls, .pptx, .txt. Max 100 MB per file |
| qdrant_url | No | Qdrant URL for step4 (defaults to the default Qdrant) |
# Upload a single file
curl -X POST "http://localhost:8011/inbox/upload" \
  -F "domain=regulations" \
  -F "files=@document.pdf"
# Upload multiple files to a subpath
curl -X POST "http://localhost:8011/inbox/upload" \
  -F "domain=syllabus" \
  -F "path=2024/course_a" \
  -F "files=@doc1.pdf" -F "files=@doc2.docx"
Response: { uploaded: string[], errors: string[] }. After successful upload, pipeline runs in background; step4 uses collection_name = domain.
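Because partial failures are reported per file rather than via the HTTP status, the response is worth checking explicitly; a small sketch (the helper name is hypothetical):

```python
def check_upload(resp: dict) -> list:
    """Return the uploaded file names, raising if any file was rejected."""
    if resp["errors"]:
        raise RuntimeError("upload failed: " + "; ".join(resp["errors"]))
    return resp["uploaded"]
```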
GET /inbox/domains
Returns { domains: string[] }, the list of top-level folders in 000_inbox.
GET /inbox/list
Query params: domain (optional), path (optional).
Response: without domain, { domains[], files[], folders[] }; with domain, { domain, path, folders[], files[] }, where each file entry has name, size, mtime.
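Since the listing is one level deep at a time, walking a whole domain takes recursion over folders[]. A sketch with an injectable fetcher, where get(domain, path) stands in for a GET /inbox/list call:

```python
def walk_inbox(get, domain: str, path: str = "") -> list:
    """Collect every file name under a domain by recursing into folders[]."""
    listing = get(domain, path)
    names = [f["name"] for f in listing["files"]]
    for sub in listing["folders"]:
        child = f"{path}/{sub}" if path else sub
        names.extend(walk_inbox(get, domain, child))
    return names
```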
Admin API
Requires Bearer token.
- GET /admin/users – list users and message stats (for Q&A)
- DELETE /admin/users/{username}/messages – delete a user's message history
Admission Agent β Example for AI Portal
/admission_agent/v1 is an example agent that demonstrates how to build an AI agent for AI Portal to consume. LakeFlow handles the data pipeline (documents → inbox → embedding → Qdrant); this agent exposes a compatible API so AI Portal can connect and use the data.
Use case: Upload admission/enrollment documents to the Data Lake, run the pipeline into the Admission collection, then register this agent in AI Portal. Users can then ask questions via AI Portal; the agent uses semantic search + LLM (RAG) to answer.
You can implement similar agents for other domains (regulations, syllabus, etc.) by replicating this pattern. API shape matches Research Agent (/metadata, /data, /ask).
Endpoints
- GET /admission_agent/v1/metadata – agent metadata (name, description, capabilities). May be public; AI Portal uses this to discover the agent.
- GET /admission_agent/v1/data – list of data sources from the Admission collection. Requires a Bearer token.
- POST /admission_agent/v1/ask – RAG Q&A over Admission documents. Body: { prompt: string, session_id?: string, model_id?: string, user?: string, context?: object }. Only prompt is required. Requires a Bearer token.
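A hedged sketch of assembling the /ask request body, omitting unset optional fields (the validation and helper name are assumptions, not part of the agent):

```python
def build_ask_body(prompt: str, session_id=None, model_id=None,
                   user=None, context=None) -> dict:
    """Assemble a /admission_agent/v1/ask body; only prompt is required."""
    if not prompt:
        raise ValueError("prompt is required")
    body = {"prompt": prompt}
    for key, val in (("session_id", session_id), ("model_id", model_id),
                     ("user", user), ("context", context)):
        if val is not None:
            body[key] = val
    return body
```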
Requirements
- LLM_BASE_URL (Ollama) – used for embedding and chat completion
- Data in the Qdrant collection Admission – ingest documents via the LakeFlow pipeline (domain → step0→step4 with collection_name=Admission)
Example: register in AI Portal
Provide AI Portal with the agent base URL (e.g. http://your-backend:8011/admission_agent/v1). AI Portal calls /metadata to display the agent, then /ask (with the user's token) to answer questions.
Qdrant proxy
Proxy to Qdrant REST API. Query collections and points without direct Qdrant access.
- GET /qdrant/collections – list collections
- GET /qdrant/collections/{name} – collection info
- GET /qdrant/collections/{name}/points – get points (scroll, limit, offset)
- POST /qdrant/collections/{name}/filter – filter points (body: filter conditions)

A qdrant_url parameter is supported to point at a different Qdrant instance (multi-Qdrant). See Swagger for request body details.
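As a sketch, a filter body in Qdrant's REST filter syntax that selects every chunk of one file; the file_hash payload field comes from the chunk schema shown under /search/semantic, and the exact accepted body shape is documented in Swagger:

```python
def chunks_of_file(file_hash: str, limit: int = 10) -> dict:
    """Filter body matching points whose payload file_hash equals the given hash."""
    return {
        "filter": {
            "must": [
                {"key": "file_hash", "match": {"value": file_hash}}
            ]
        },
        "limit": limit,
    }
```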
Python integration example
Use requests to call APIs from Python:
import requests
BASE = "http://localhost:8011"
# 1. Semantic search (no auth)
r = requests.post(f"{BASE}/search/semantic", json={
"query": "admission regulations",
"top_k": 5,
"collection_name": "lakeflow_chunks"
})
results = r.json()["results"]
# 2. Q&A (login first)
login = requests.post(f"{BASE}/auth/login", json={
"username": "admin", "password": "admin123"
})
token = login.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}
qa = requests.post(f"{BASE}/search/qa", json={
"question": "Admission requirements?", "top_k": 5
}, headers=headers)
print(qa.json()["answer"])
# 3. Upload + auto pipeline
with open("doc.pdf", "rb") as f:
r = requests.post(f"{BASE}/inbox/upload",
data={"domain": "regulations"},
files={"files": ("doc.pdf", f, "application/pdf")}
)
print(r.json()) # {"uploaded": ["doc.pdf"], "errors": []}