Horizon — full usage guide

The complete manual for nbot Horizon: the trace / debug / eval / train plane for nbot. This is the self-contained "how to use it" doc; the companions are API.md (terse endpoint reference), FE_WIRING.md (per-page frontend contract), and WORKFLOW.md (the suites → RL → prompt-opt loop).

1. What Horizon is

nbot is the orchestrator; it only emits a trace (.nconfig/logs/s-*.jsonl). Horizon consumes that trace and adds the evaluation + optimization plane — the LangSmith analog for nbot. It does four things:

Trace — normalize each session into a Run → Turn → Span tree you can drill.
Debug — inspect any step's raw payload and the whole prompt a model saw.
Evaluate — score nbot on curated suites of cases.
Train / optimize — turn traces into an RL (state, action, reward) dataset and train nbot's routing policy; and optimize prompts against a suite.

The invariant: nbot owns the algorithm, state/action space, reward shaping, and param format. Horizon is the environment that trains/evaluates whatever is plugged in — it never defines a foreign policy.

The throughline: a suite turns "is nbot good at X?" into a number, and that same number is the signal for both RL training and prompt optimization.

2. Install, build, run

git clone <horizon repo> && cd nbot-horizon
cmake -S . -B build && cmake --build build -j        # no system deps (FetchContent)
./build/nbot-horizon --project /path/to/nbot/project # serves :8848

Flags: --project DIR (default cwd; the dir with .nconfig/), --logs DIR (default <project>/.nconfig/logs), --nbot BASE_URL (default: discovered from <project>/.nconfig/control.json), --port N (default 8848), --cert FILE --key FILE (serve over HTTPS).

HTTPS (so an HTTPS frontend avoids mixed-content):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj /CN=localhost \
  -keyout key.pem -out cert.pem
./build/nbot-horizon --project /path --cert cert.pem --key key.pem

Expose over ngrok (TLS at ngrok's edge, no local cert):

./start.sh /path/to/project          # :8848 -> https://horizonnbot.ngrok.app
./start.sh /path/to/project 8080     # ngrok forwards the SAME port Horizon binds

The API is CORS-open (a separately-built frontend can call it from anywhere).

3. The serve prerequisite + gating

Two layers must be up:

| Layer | Check | If down |
|---|---|---|
| Horizon backend | `GET /api/health` → `{ok:true,…}` | nothing works; FE shows "backend unreachable" |
| nbot serve | `GET /api/nbot` → `{connected:true,…}` | running ops fail (see below); browsing still works |

Horizon discovers the serve from <project>/.nconfig/control.json. Start it:

cd <nbot project> && ./nbot     # the TUI also serves the control API (or: nbot serve)
curl -s $H/api/nbot | jq .connected     # must be true

Offline (no serve) works: trace, stats, RL dataset/schema, dataset CRUD, suite create/list/history, prompt version/run history. Needs a serve: suite run, RL train, prompt-opt, prompt registry read/deploy, chat. These return 502 {ok:false,error:"serve unreachable"} — treat that as "start nbot serve", not a hard error.
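
A client can separate "serve is down" from a genuine failure with a small check like the one sketched here. This is only an illustration: the helper name `needs_serve` is hypothetical, and it assumes the 502 body is valid JSON with the `ok` and `error` fields shown above.

```python
import json


def needs_serve(status: int, body: str) -> bool:
    """Return True when a response means "start nbot serve", not a hard error."""
    if status != 502:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # a 502 from a proxy with a non-JSON body is something else
    return (
        isinstance(payload, dict)
        and payload.get("ok") is False
        and payload.get("error") == "serve unreachable"
    )


print(needs_serve(502, '{"ok": false, "error": "serve unreachable"}'))
print(needs_serve(500, '{"ok": false, "error": "boom"}'))
```

A frontend could use the result to show a "start nbot serve" hint instead of an error dialog.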

All curl examples in this guide, including the connectivity check above, assume a shell variable H that holds the Horizon base URL:

H=https://horizonnbot.ngrok.app        # or http://127.0.0.1:8848

4. Sessions & traces

A Run is one session; it has Turns (one per user input); each Turn has Spans (typed events).

Run { session_id, node, started_at, ended_at, turns[], summary }
Turn{ index, ts, input, status, model, exec_mode, duration_ms,
      counts{tools,tool_errors,anomalies}, spans[] }
Span{ id, kind, name, ts, end_ts, duration_ms, status, summary, data, children[] }

curl -s $H/api/sessions | jq '.[0]'                  # list, newest-first
curl -s $H/api/sessions/s-24933/trace | jq .         # the Run tree
curl -s "$H/api/sessions/s-24933/raw?type=tool_call" # verbatim events, filtered
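
Because spans nest through `children[]`, the tree returned by `/trace` is easiest to consume with a recursive walk. The sketch below counts spans by `kind`. The helper names and the sample `kind` values are illustrative assumptions, not a list of the kinds Horizon actually emits.

```python
from collections import Counter


def iter_spans(spans):
    """Yield every span depth-first, parents before their children."""
    for span in spans:
        yield span
        yield from iter_spans(span.get("children", []))


def span_kinds(run):
    """Count spans per kind across all turns of a Run."""
    counts = Counter()
    for turn in run.get("turns", []):
        for span in iter_spans(turn.get("spans", [])):
            counts[span.get("kind", "unknown")] += 1
    return counts


# Hand-written sample in the Run shape above; the kind values are made up.
run = {"session_id": "s-demo", "turns": [{"index": 0, "spans": [
    {"id": "a", "kind": "generation",
     "children": [{"id": "b", "kind": "tool_call", "children": []}]},
    {"id": "c", "kind": "tool_call", "children": []},
]}]}
print(span_kinds(run))
```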

Whole prompt per step: a generation span carries data.prompt_full (the entire assembled prompt) + data.model + data.prompt_chars. That's how you inspect exactly what a model saw.
Realtime: GET /api/live → {session_id} of the active session, then GET /api/sessions/{id}/stream (SSE: data: {ts,type,data} per event).
Import a session captured elsewhere: POST /api/sessions/import {id, jsonl}.
A/B compare two runs: GET /api/sessions/{a}/diff/{b} → turn-aligned diff with per-field changed flags (the "did my prompt/RL change help" view).
Aggregate: GET /api/stats → counts of sessions, anomalies, tool calls / errors per tool, model runs per model.
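
The SSE stream can be decoded line by line. The following is a minimal sketch that assumes every event arrives as a single `data:` line holding one JSON object, as described above. The SSE format in general also allows multi-line data, which this does not handle. The function name and the sample event types are hypothetical.

```python
import json


def parse_sse(lines):
    """Yield decoded {ts, type, data} events from an iterable of SSE lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ":" comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


# Sample stream text; the event types here are invented for the example.
sample = [
    'data: {"ts": 1, "type": "tool_call", "data": {}}\n',
    "\n",
    ": keepalive\n",
    'data: {"ts": 2, "type": "turn_end", "data": {"ok": true}}\n',
]
events = list(parse_sse(sample))
print([e["type"] for e in events])
```

In practice `lines` would be the decoded lines of the HTTP response body from `/api/sessions/{id}/stream`.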

5. RL dataset & training

Each turn becomes one (state, action, reward) sample — the training data for nbot's router.

Schema (render tables/forms from this; don't hardcode field names):
```
curl -s $H/api/rl/schema | jq .
```
Response: { state:[fld], action:[fld], reward:[fld], reward_spec:{w_*, formula, note} }, where fld = {name, type, desc, values?[], range?, goal?} and type ∈ float|int|bool|string|enum. For example, exec_mode is an enum with values [DIRECT, TOOL_LIGHT, TOOL_FULL, DECOMPOSE].
Transitions:
```
curl -s "$H/api/rl/dataset?session=s-24933" | jq '.[0]'
```
{session, turn, ts, state{}, action{}, reward{}, reward_scalar}.
The REWARD column is reward_scalar: a signed number, or null if the turn logged no outcome. The backend computes it so the FE never has to (and never miscomputes it):
```
reward_scalar = w_success·success − w_iter·iters − w_tokens·tokens
              − w_latency·latency_ms − w_toolfail·toolfails
```
A fast success ≈ +0.66; a 150-second turn ≈ −2.3 (slow is penalized). Do not hand-combine the reward object.
Aggregate (KPI cards + "policy view"): GET /api/rl/aggregate?by=model|category|exec_mode → {summary{transitions,sessions,success_rate,mean_reward,mean_latency_ms}, groups:[{key,count,success_rate,mean_reward,mean_latency_ms}]}. The groups answer "which model/exec_mode wins for which task kind".
Export: GET /api/rl/dataset.jsonl (NDJSON, for external trainers).
Train (needs serve): POST /api/rl/train {reset?:false, save?:true} → {ok, applied, policy, stats}. Drives nbot's /v1/rl/replay; reset:true retrains from a cold prior, save:true writes the params (e.g. rl_state.bin).
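
For offline analysis of an exported dataset, the "policy view" grouping can be approximated locally. This sketch averages the backend-provided `reward_scalar` per group and never recombines the reward object. It assumes the grouping field (here `exec_mode`) lives in the `action` object and that turns with a null reward are skipped. Both are assumptions, so `/api/rl/aggregate` remains the authoritative source.

```python
from collections import defaultdict


def mean_reward_by(transitions, key="exec_mode"):
    """Mean reward_scalar per value of action[key], ignoring null rewards."""
    buckets = defaultdict(list)
    for t in transitions:
        reward = t.get("reward_scalar")
        if reward is None:
            continue  # the turn logged no outcome
        buckets[t.get("action", {}).get(key, "unknown")].append(reward)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


# Made-up transitions in the shape returned by /api/rl/dataset.
transitions = [
    {"action": {"exec_mode": "DIRECT"}, "reward_scalar": 0.5},
    {"action": {"exec_mode": "DIRECT"}, "reward_scalar": 1.5},
    {"action": {"exec_mode": "TOOL_FULL"}, "reward_scalar": -2.0},
    {"action": {"exec_mode": "TOOL_FULL"}, "reward_scalar": None},
]
print(mean_reward_by(transitions))
```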

6. Datasets

Curated, versioned sample collections (under <project>/.nconfig/datasets/).

| Call | Purpose |
|---|---|
| `GET /api/datasets` | list `[{id,name,kind,version,created_at,count}]` |
| `POST /api/datasets {name, kind}` | create (`kind`: `rl`/`case`/`turn`) |
| `GET /api/datasets/{id}` | full `{…, samples[]}` |
| `GET /api/datasets/{id}.jsonl` | NDJSON export |
| `DELETE /api/datasets/{id}` | delete (+ version snapshots) |
| `POST /api/datasets/{id}/from-session {session}` | append samples built by `kind` |
| `POST /api/datasets/import {name, kind, samples[]\|jsonl}` | create + fill |
| `POST /api/datasets/{id}/version` | snapshot to `id@vN`, bump version |

7. Suites (evaluation)

A suite is a named set of cases. Running it sends each case's prompt to the serve (one turn → one session → one trace) and scores it.

Case = { id?, prompt, expect, notes? }
expect = { success?:bool, contains?:[str], absent?:[str],
           file_exists?:str, min_file_bytes?:int, exec_mode?:str }   // all optional
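
To make the `expect` fields concrete, here is a rough sketch of how the text-based checks might be evaluated. It is not Horizon's scorer. It assumes `contains` requires every listed substring, `absent` forbids every listed substring, and matching is case-sensitive. The file checks (`file_exists`, `min_file_bytes`) are left out, and the failure labels are invented for the example.

```python
def check_expect(expect, response, success=None, exec_mode=None):
    """Return a list of failed checks; an empty list means the case passed."""
    fails = []
    if "success" in expect and expect["success"] != success:
        fails.append("success")
    for needle in expect.get("contains", []):
        if needle not in response:
            fails.append(f"contains:{needle}")
    for needle in expect.get("absent", []):
        if needle in response:
            fails.append(f"absent:{needle}")
    if "exec_mode" in expect and expect["exec_mode"] != exec_mode:
        fails.append("exec_mode")
    return fails


expect = {"success": True, "contains": ["4"], "absent": ["error"]}
print(check_expect(expect, "2 + 2 = 4", success=True))
print(check_expect(expect, "error: no answer", success=False))
```

Since every field is optional, an empty `expect` passes any response under these assumptions.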

Create + run + read the dashboard:

SID=$(curl -s -X POST $H/api/suites -d '{"name":"smoke"}' | jq -r .id)
curl -s -X POST $H/api/suites/$SID/cases --data-binary @docs/examples/smoke-suite.json
curl -s -X POST $H/api/suites/$SID/run | jq '{passed,total,results}'   # 502 if no serve
curl -s $H/api/suites/$SID/runs | jq '.[0]'                            # history

SuiteRun = {
  run_at, passed, total, rate, tool_errors,
  results: [ { case_id, session_id, passed, fails:[…], tool_errors } ]
}

Headline = passed/total (+ a pass-rate bar); …/runs = trend over time.
Each results[].session_id opens in the trace view → one trace per case.
tool_errors (per case + summed) distinguishes "the tool broke" from "nbot was wrong": a failing case with tool_errors>0 points at the tool.
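
The tool-versus-nbot triage can be automated over a SuiteRun. The sketch below sorts failing cases into two buckets using only the `passed`, `tool_errors`, and `case_id` fields shown above. The function name and the bucket names are hypothetical.

```python
def triage(suite_run):
    """Split failing cases into tool-suspect and nbot-suspect lists."""
    tool_suspect, nbot_suspect = [], []
    for result in suite_run.get("results", []):
        if result.get("passed"):
            continue
        if result.get("tool_errors", 0) > 0:
            tool_suspect.append(result["case_id"])  # a tool errored during the case
        else:
            nbot_suspect.append(result["case_id"])  # no tool error, look at the answer
    return {"tool_suspect": tool_suspect, "nbot_suspect": nbot_suspect}


# Made-up SuiteRun for illustration.
suite_run = {"passed": 1, "total": 3, "results": [
    {"case_id": "greet", "passed": True, "tool_errors": 0},
    {"case_id": "web-title", "passed": False, "tool_errors": 2},
    {"case_id": "math", "passed": False, "tool_errors": 0},
]}
print(triage(suite_run))
```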

There's a ready example in docs/examples/smoke-suite.json (greet / math / web-title / file-write).

8. Trainers

| Call | Purpose |
|---|---|
| `GET/POST /api/trainers` | list / create `{name, kind:"nbot-replay"\|"external", cmd?}` |
| `GET/DELETE /api/trainers/{id}` | view / remove |
| `POST /api/trainers/{id}/train {dataset_id, reset?, save?}` | run a `TrainJob` |
| `GET /api/trainers/{id}/jobs` | run history, newest-first |

nbot-replay — training = drive the serve's /v1/rl/replay over a dataset; the active nbot policy folds it in and saves its own params (502 if no serve).
external — a Python/offline-RL process consuming the dataset jsonl; Horizon records the hand-off (/api/datasets/{id}.jsonl), you run it out of band.
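
An external trainer would typically start by loading the NDJSON export. The following is a minimal loader sketch. It assumes each line of an `rl` dataset has the transition shape from §5 (`state`, `action`, `reward_scalar`) and drops rows whose reward is null. The sample state fields are invented.

```python
import json


def load_transitions(ndjson_text):
    """Parse NDJSON into (state, action, reward_scalar) tuples."""
    samples = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if row.get("reward_scalar") is None:
            continue  # no outcome logged, nothing to learn from
        samples.append((row["state"], row["action"], row["reward_scalar"]))
    return samples


# Hand-written sample; "prompt_len" is a hypothetical state field.
ndjson_text = "\n".join([
    '{"state": {"prompt_len": 12}, "action": {"exec_mode": "DIRECT"}, "reward_scalar": 0.5}',
    "",
    '{"state": {"prompt_len": 90}, "action": {"exec_mode": "TOOL_FULL"}, "reward_scalar": null}',
])
print(load_transitions(ndjson_text))
```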

9. Prompts & prompt optimization

9.1 Pick + inspect

curl -s $H/api/prompts | jq '.prompts[] | {id,kind,overridden}'        # selector (group by kind)
curl -s "$H/api/prompts/tool.vault?model=qwen3.6-27b" | jq .           # one unit, as qwen sees it

Each unit: {id, kind, version, overridden, default, effective} (+ effective_global, model_scoped when ?model= is passed).

default = the built-in template; effective = what's live now (override or default). Diff effective vs default when overridden.
kind says where the prompt is used: a tool's prompt, the qwen adapter's prompt, the system prompt, or a control classifier. The tool group has one unit per tool (tool.bash, tool.vault, tool.web, …) plus the whole catalog (tool.catalog.full), so you can pick a tool and see or edit exactly its prompt.
?model=<id> re-resolves the prompt as that adapter sees it (model_scoped = has its own override vs inherits global). Get model ids from GET /api/models.
The single-unit endpoint adds version_count + run_count.

9.2 Deploy / clear an override

curl -s -X POST $H/api/prompts -d '{"id":"tool.catalog.full","template":"…","scope":"","note":"tighter"}'
curl -s -X DELETE $H/api/prompts/tool.catalog.full     # revert to built-in default

scope = a model id → a per-model override (e.g. only the qwen adapter); empty = global. Override precedence: model-scope > global > built-in default. Get the adapter list for the scope picker from GET /api/models ({models:[{id,node,capacity,load,online}]} — the live pool).
Every deploy is snapshotted as a version.
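
The precedence rule (model-scope > global > built-in default) can be written as a small resolver. This is an illustration of the rule only. The function name, argument names, and returned source labels are hypothetical and do not describe how the serve stores overrides.

```python
def resolve_prompt(default, global_override=None, model_overrides=None, model=None):
    """Return (template, source) following model-scope > global > default."""
    model_overrides = model_overrides or {}
    if model is not None and model in model_overrides:
        return model_overrides[model], "model"
    if global_override is not None:
        return global_override, "global"
    return default, "default"


overrides = {"qwen3.6-27b": "qwen-specific template"}
print(resolve_prompt("built-in", "global template", overrides, model="qwen3.6-27b"))
print(resolve_prompt("built-in", "global template", overrides, model="other-model"))
print(resolve_prompt("built-in"))
```

A model without its own override inherits the global one, which matches the `model_scoped` flag described in §9.1.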

9.3 Version history + rollback

curl -s $H/api/prompts/tool.catalog.full/versions | jq .          # newest-first
curl -s -X POST $H/api/prompts/tool.catalog.full/revert -d '{"version":3}'

Each version: {version, template, source, note, chars, ts}, source ∈ manual | promptopt | revert. Revert re-deploys that version.

9.4 Optimize against a suite

curl -s -X POST $H/api/promptopt -H 'Content-Type: application/json' -d '{
  "prompt_id":"tool.catalog.full", "suite_id":"'"$SID"'", "generate":3
}' | jq '{best_label,best_rate,deployed, trials:[.trials[]|{label,rate,tool_errors}]}'
curl -s $H/api/prompts/tool.catalog.full/runs | jq '.[0]'         # run history

What it does: scores the suite with the current prompt as a baseline. Then, for each candidate (generated by the serve, or supplied via "candidates":[…]), it deploys the candidate as an override and re-runs the suite. It keeps the candidate with the highest pass-rate and deploys it, or reverts to the original prompt if the baseline wins. The run is recorded like a suite/train run.

// POST /api/promptopt response (also stored at /api/prompts/{id}/runs)
{ prompt_id, kind, suite_id, scope, deployed, best_label, best_rate,
  trials: [ { label, template, rate, passed, total, tool_errors,
              results:[ {case_id, session_id, passed, fails[], tool_errors} ] } ] }

Header shows prompt_id + kind (+ scope) — exactly what's being optimized.
tool_errors > 0 on a trial = the tool broke, not the prompt — don't tune a prompt against a broken tool. Each case session_id links to its trace.
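
The selection logic described in this section can be sketched with the suite run injected as a callable, which makes it testable without a serve. This is a simplified model, not the backend's code. It assumes a candidate must strictly beat the current best to win (so the baseline wins ties), and the trial labels are invented.

```python
def optimize(baseline_template, candidates, run_suite):
    """Pick the best template; run_suite(template) returns a pass rate in [0, 1]."""
    best_label, best_template = "baseline", baseline_template
    best_rate = run_suite(baseline_template)
    trials = [{"label": "baseline", "rate": best_rate}]
    for i, template in enumerate(candidates, start=1):
        label = f"candidate-{i}"
        rate = run_suite(template)  # stands in for: deploy override, re-run suite
        trials.append({"label": label, "rate": rate})
        if rate > best_rate:  # strict: ties keep the earlier winner
            best_label, best_template, best_rate = label, template, rate
    return {
        "best_label": best_label,
        "best_rate": best_rate,
        "deployed": best_label != "baseline",  # a baseline win means revert
        "template": best_template,
        "trials": trials,
    }


# Fake scorer with fixed pass rates per template.
rates = {"current": 0.5, "variant A": 0.75, "variant B": 0.75, "variant C": 0.25}
result = optimize("current", ["variant A", "variant B", "variant C"], rates.get)
print(result["best_label"], result["best_rate"], result["deployed"])
```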

10. Chat (drive the serve)

curl -s -X POST $H/api/nbot/run -d '{"prompt":"…","session_id":"s-x","new_session":false}'

→ {ok, session_id, response, produced[], files[]} (502 if no serve). Passing session_id sends the turn into that session (resume); new_session:true starts a fresh one; with neither, the turn goes to the current session. Pair with the SSE stream (§4) for a live view. Submissions are serialized on the serve: a turn sent to a busy session is queued.

11. End-to-end: the optimization loop

            ┌──────────────── SUITE (the eval signal) ────────────────┐
sessions ──▶ dataset ──▶ RL train (replay) ──▶ run suite (better?)     │
   │                                                                    │
   └──▶ prompt unit ──▶ prompt-opt: variants ──run suite per variant──▶ deploy best

Baseline — POST /api/suites/{id}/run → passed/total.
Train routing — POST /api/rl/train → re-run the suite → did it move?
Optimize a prompt — POST /api/promptopt {prompt_id, suite_id} → the suite is the fitness function; the winner deploys.

Same signal, three levers (model routing · decomposition · prompts): discover → change → re-score → deploy, every result linking back to a trace via session_id. Full copy-paste recipe in WORKFLOW.md.
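
The "did it move?" step amounts to comparing two SuiteRuns. A possible sketch, using only the `rate` and `results[]` fields from §7, is shown here. The function name and output keys are hypothetical, and cases present in only one of the runs are ignored.

```python
def compare_runs(before, after):
    """Report the pass-rate change and which shared cases flipped."""
    was = {r["case_id"]: r["passed"] for r in before["results"]}
    now = {r["case_id"]: r["passed"] for r in after["results"]}
    shared = sorted(was.keys() & now.keys())
    return {
        "rate_delta": after["rate"] - before["rate"],
        "fixed": [c for c in shared if not was[c] and now[c]],
        "regressed": [c for c in shared if was[c] and not now[c]],
    }


# Made-up runs: one case fixed, one regressed, net +0.25.
before = {"rate": 0.5, "results": [
    {"case_id": "greet", "passed": True}, {"case_id": "math", "passed": False},
    {"case_id": "web-title", "passed": False}, {"case_id": "file-write", "passed": True},
]}
after = {"rate": 0.75, "results": [
    {"case_id": "greet", "passed": False}, {"case_id": "math", "passed": True},
    {"case_id": "web-title", "passed": True}, {"case_id": "file-write", "passed": True},
]}
print(compare_runs(before, after))
```

Looking at the flipped cases as well as the rate helps catch a change that fixes two cases while silently breaking another.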

12. Endpoint reference (complete)

| Group | Endpoints |
|---|---|
| Meta | `GET /`, `GET /api/health` |
| Trace | `GET /api/sessions`, `/api/sessions/{id}/trace`, `/raw`, `/stream` (SSE), `/api/sessions/{a}/diff/{b}`, `POST /api/sessions/{id}/turns/{n}/rerun`, `POST /api/sessions/import`, `GET /api/live`, `GET /api/stats` |
| RL | `GET /api/rl/schema`, `/api/rl/dataset[.jsonl]`, `/api/rl/aggregate`, `POST /api/rl/train` |
| Datasets | `GET/POST /api/datasets`, `GET/DELETE /api/datasets/{id}`, `GET /api/datasets/{id}.jsonl`, `POST /api/datasets/{id}/from-session`, `POST /api/datasets/import`, `POST /api/datasets/{id}/version` |
| Suites | `GET/POST /api/suites`, `GET/DELETE /api/suites/{id}`, `POST /api/suites/{id}/cases`, `POST /api/suites/{id}/run`, `GET /api/suites/{id}/runs` |
| Trainers | `GET/POST /api/trainers`, `GET/DELETE /api/trainers/{id}`, `POST /api/trainers/{id}/train`, `GET /api/trainers/{id}/jobs` |
| Prompts | `GET /api/prompts`, `GET /api/prompts/{id}`, `POST /api/prompts`, `DELETE /api/prompts/{id}`, `GET /api/prompts/{id}/versions`, `POST /api/prompts/{id}/revert`, `GET /api/prompts/{id}/runs` |
| Prompt-opt | `POST /api/promptopt` |
| Notes | `GET/POST /api/notes`, `DELETE /api/notes/{id}` |
| Live layers | `GET /api/satellites`, `/api/agents`, `/api/research` |
| nbot bridge | `GET /api/nbot`, `GET /api/models`, `POST /api/nbot/run` |

Full request/response shapes per endpoint in API.md; per-page FE flows in FE_WIRING.md.

13. Troubleshooting

| Symptom | Cause / fix |
|---|---|
| Banner "nbot offline" / `502` on run/train/opt | No serve. `cd <project> && ./nbot`; `GET /api/nbot` must be `connected:true`. |
| Page shows mock-looking data (`TOOL_FAST`, `tools_avail`, huge rewards) | FE not bound to the real endpoints. Bind tables to `/api/rl/schema` + `/api/rl/dataset`; the REWARD column = `reward_scalar`. Real `exec_mode` ∈ `DIRECT/TOOL_LIGHT/TOOL_FULL/DECOMPOSE`. |
| Suite run returns `502` | Same cause: it needs a serve. Create/list/history work offline. |
| A suite case fails but nbot "seemed fine" | Check `tool_errors` on the case: `>0` means the tool errored, not nbot's answer. Open the case's `session_id` trace. |
| Prompt-opt "no candidates" (400) | The generator returned nothing usable; pass explicit `"candidates":[…]`. |
| CORS errors in the browser | Horizon is CORS-open and echoes requested headers; ensure you hit the right base URL/port (and ngrok domain, if used). |

Horizon is the environment; nbot owns the policy. Everything links back to a trace.
