Horizon — full usage guide
The complete manual for nbot Horizon: the trace / debug / eval / train plane
for nbot. This is the self-contained
"how to use it" doc; the companions are API.md (terse endpoint
reference), FE_WIRING.md (per-page frontend contract), and
WORKFLOW.md (the suites → RL → prompt-opt loop).
1. What Horizon is
nbot is the orchestrator; it only emits a trace (.nconfig/logs/s-*.jsonl).
Horizon consumes that trace and adds the evaluation + optimization plane —
the LangSmith analog for nbot. It does four things:
- Trace: normalize each session into a `Run → Turn → Span` tree you can drill into.
- Debug: inspect any step's raw payload and the whole prompt a model saw.
- Evaluate: score nbot on curated suites of cases.
- Train / optimize: turn traces into an RL `(state, action, reward)` dataset and train nbot's routing policy; and optimize prompts against a suite.
The invariant: nbot owns the algorithm, state/action space, reward shaping, and param format. Horizon is the environment that trains/evaluates whatever is plugged in — it never defines a foreign policy.
The throughline: a suite turns "is nbot good at X?" into a number, and that same number is the signal for both RL training and prompt optimization.
2. Install, build, run
git clone <horizon repo> && cd nbot-horizon
cmake -S . -B build && cmake --build build -j # no system deps (FetchContent)
./build/nbot-horizon --project /path/to/nbot/project # serves :8848
Flags: --project DIR (default cwd; the dir with .nconfig/), --logs DIR
(default <project>/.nconfig/logs), --nbot BASE_URL (default: discovered from
<project>/.nconfig/control.json), --port N (default 8848),
--cert FILE --key FILE (serve over HTTPS).
HTTPS (so an HTTPS frontend avoids mixed-content):
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj /CN=localhost \
-keyout key.pem -out cert.pem
./build/nbot-horizon --project /path --cert cert.pem --key key.pem
Expose over ngrok (TLS at ngrok's edge, no local cert):
./start.sh /path/to/project # :8848 -> https://horizonnbot.ngrok.app
./start.sh /path/to/project 8080 # ngrok forwards the SAME port Horizon binds
The API is CORS-open (a separately-built frontend can call it from anywhere).
3. The serve prerequisite + gating
Two layers must be up:
| Layer | Check | If down |
|---|---|---|
| Horizon backend | GET /api/health → {ok:true,…} | nothing works; FE shows "backend unreachable" |
| nbot serve | GET /api/nbot → {connected:true,…} | running ops fail (see below); browsing still works |
Horizon discovers the serve from <project>/.nconfig/control.json. Start it:
cd <nbot project> && ./nbot # the TUI also serves the control API (or: nbot serve)
curl -s $H/api/nbot | jq .connected # must be true
Offline (no serve) works: trace, stats, RL dataset/schema, dataset CRUD,
suite create/list/history, prompt version/run history.
Needs a serve: suite run, RL train, prompt-opt, prompt registry
read/deploy, chat. These return 502 {ok:false,error:"serve unreachable"} —
treat that as "start nbot serve", not a hard error.
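The gating rule above can be sketched on the client side. This is a minimal sketch, not Horizon's own client code: the function name `classify_response` and the labels it returns are hypothetical. It assumes the 502 body is exactly the JSON shape quoted above.

```python
import json


def classify_response(status: int, body: str) -> str:
    """Map an HTTP status and body to a coarse outcome.

    The labels "ok", "needs_serve" and "error" are hypothetical names
    chosen for this sketch; they are not part of the Horizon API.
    """
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}  # list endpoints return a JSON array, not an object

    # 502 + "serve unreachable" means "start nbot serve", not a hard error.
    if status == 502 and payload.get("error") == "serve unreachable":
        return "needs_serve"
    if 200 <= status < 300:
        return "ok"
    return "error"
```

A frontend could show a "start nbot serve" banner on `needs_serve` and keep the offline pages (trace, stats, datasets) usable.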
Set a shell var for the examples below:
H=https://horizonnbot.ngrok.app # or http://127.0.0.1:8848
4. Sessions & traces
A Run is one session; it has Turns (one per user input); each Turn has Spans (typed events).
Run { session_id, node, started_at, ended_at, turns[], summary }
Turn{ index, ts, input, status, model, exec_mode, duration_ms,
counts{tools,tool_errors,anomalies}, spans[] }
Span{ id, kind, name, ts, end_ts, duration_ms, status, summary, data, children[] }
kind ∈ tool | generation | routing | plan | wave | leaf | artifact | grounding | reward | anomaly | event.
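To make the nesting concrete, here is a small sketch that walks a Run and counts spans by `kind`, including nested `children`. The helper name `count_span_kinds` and the sample Run are made up for illustration; only the field names (`turns`, `spans`, `kind`, `children`) come from the shapes above.

```python
from collections import Counter


def count_span_kinds(run: dict) -> Counter:
    """Count spans per kind across all turns, descending into children."""
    counts = Counter()
    for turn in run.get("turns", []):
        stack = list(turn.get("spans") or [])
        while stack:
            span = stack.pop()
            counts[span.get("kind", "unknown")] += 1
            stack.extend(span.get("children") or [])
    return counts


# Made-up Run for illustration only; real traces carry many more fields.
sample_run = {
    "session_id": "s-example",
    "turns": [
        {"index": 0, "spans": [
            {"id": "a", "kind": "routing", "children": []},
            {"id": "b", "kind": "tool", "children": [
                {"id": "c", "kind": "generation", "children": []},
            ]},
        ]},
    ],
}
print(dict(count_span_kinds(sample_run)))
```

In practice the input would be the JSON returned by `/api/sessions/{id}/trace`.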
curl -s $H/api/sessions | jq '.[0]' # list, newest-first
curl -s $H/api/sessions/s-24933/trace | jq . # the Run tree
curl -s "$H/api/sessions/s-24933/raw?type=tool_call" # verbatim events, filtered
- Whole prompt per step: a `generation` span carries `data.prompt_full` (the entire assembled prompt) + `data.model` + `data.prompt_chars`. That's how you inspect exactly what a model saw.
- Realtime: `GET /api/live` → `{session_id}` of the active session, then `GET /api/sessions/{id}/stream` (SSE: `data: {ts,type,data}` per event).
- Import a session captured elsewhere: `POST /api/sessions/import {id, jsonl}`.
- A/B compare two runs: `GET /api/sessions/{a}/diff/{b}` → turn-aligned diff with per-field `changed` flags (the "did my prompt/RL change help" view).
- Aggregate: `GET /api/stats` → counts of sessions, anomalies, tool calls / errors per tool, model runs per model.
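For the realtime stream, a client has to turn SSE lines into events. The sketch below is one way to do that, under the assumption that each event arrives as a single `data:` line holding one JSON object (general SSE also allows multi-line data, which this does not handle). The function name `parse_sse_events` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per `data:` line; skip blanks, comments, other fields."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


# Made-up stream content for illustration.
stream = [
    'data: {"ts": 1, "type": "tool_call", "data": {}}\n',
    "\n",
    ": keepalive\n",
]
for event in parse_sse_events(stream):
    print(event["type"])
```

The same generator works over any line iterator, for example the lines of a streaming HTTP response.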
5. RL dataset & training
Each turn becomes one (state, action, reward) sample — the training data for
nbot's router.
- Schema (render tables/forms from this; don't hardcode field names): `curl -s $H/api/rl/schema | jq .` → `{ state:[fld], action:[fld], reward:[fld], reward_spec:{w_*, formula, note} }`, where `fld = {name, type, desc, values?[], range?, goal?}` and `type ∈ float|int|bool|string|enum`. `exec_mode` is an enum `[DIRECT, TOOL_LIGHT, TOOL_FULL, DECOMPOSE]`.
- Transitions: `curl -s "$H/api/rl/dataset?session=s-24933" | jq '.[0]'` → `{session, turn, ts, state{}, action{}, reward{}, reward_scalar}`.
- The REWARD column = `reward_scalar` (a signed number, `null` if the turn logged no outcome). Computed by the backend so the FE never miscomputes it: `reward_scalar = w_success·success − w_iter·iters − w_tokens·tokens − w_latency·latency_ms − w_toolfail·toolfails`. A fast success ≈ +0.66; a 150-second turn ≈ −2.3 (slow is penalized). Do not hand-combine the `reward` object.
- Aggregate (KPI cards + "policy view"): `GET /api/rl/aggregate?by=model|category|exec_mode` → `{summary{transitions,sessions,success_rate,mean_reward,mean_latency_ms}, groups:[{key,count,success_rate,mean_reward,mean_latency_ms}]}`. The `groups` answer "which model/exec_mode wins for which task kind".
- Export: `GET /api/rl/dataset.jsonl` (NDJSON, for external trainers).
- Train (needs serve): `POST /api/rl/train {reset?:false, save?:true}` → `{ok, applied, policy, stats}`. Drives nbot's `/v1/rl/replay`; `reset:true` retrains from a cold prior, `save:true` writes the params (e.g. `rl_state.bin`).
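To show what a per-group `mean_reward` means, here is a sketch that averages `reward_scalar` per group over exported transitions, skipping turns where it is `null`. It uses the backend's `reward_scalar` as-is rather than recombining the `reward` object. The helper name `mean_reward_by` is hypothetical, and reading the group key from `action.exec_mode` is an assumption about where that field lives; `/api/rl/aggregate` remains the authoritative source.

```python
from collections import defaultdict
from typing import Callable, Hashable


def mean_reward_by(rows: list, key: Callable[[dict], Hashable]) -> dict:
    """Mean reward_scalar per group; rows with a null reward are skipped."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        reward = row.get("reward_scalar")
        if reward is None:
            continue  # the turn logged no outcome
        bucket = totals[key(row)]
        bucket[0] += reward
        bucket[1] += 1
    return {group: total / n for group, (total, n) in totals.items()}


# Made-up transitions; the position of exec_mode under "action" is assumed.
rows = [
    {"action": {"exec_mode": "DIRECT"}, "reward_scalar": 0.5},
    {"action": {"exec_mode": "DIRECT"}, "reward_scalar": 1.5},
    {"action": {"exec_mode": "TOOL_FULL"}, "reward_scalar": -2.0},
    {"action": {"exec_mode": "TOOL_FULL"}, "reward_scalar": None},
]
print(mean_reward_by(rows, lambda r: r["action"]["exec_mode"]))
```

Passing the key as a function keeps the sketch independent of the exact row layout, which should be read from `/api/rl/schema` rather than hardcoded.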
6. Datasets
Curated, versioned sample collections (under <project>/.nconfig/datasets/).
| Call | Purpose |
|---|---|
| GET /api/datasets | list [{id,name,kind,version,created_at,count}] |
| POST /api/datasets {name, kind} | create (kind: rl/case/turn) |
| GET /api/datasets/{id} | full {…, samples[]} |
| GET /api/datasets/{id}.jsonl | NDJSON export |
| DELETE /api/datasets/{id} | delete (+ version snapshots) |
| POST /api/datasets/{id}/from-session {session} | append samples built by kind |
| POST /api/datasets/import {name, kind, samples[]\|jsonl} | create + fill |
| POST /api/datasets/{id}/version | snapshot to id@vN, bump version |
7. Suites (evaluation)
A suite is a named set of cases. Running it sends each case's prompt to the serve (one turn → one session → one trace) and scores it.
Case = { id?, prompt, expect, notes? }
expect = { success?:bool, contains?:[str], absent?:[str],
file_exists?:str, min_file_bytes?:int, exec_mode?:str } // all optional
Create + run + read the dashboard:
SID=$(curl -s -X POST $H/api/suites -d '{"name":"smoke"}' | jq -r .id)
curl -s -X POST $H/api/suites/$SID/cases --data-binary @docs/examples/smoke-suite.json
curl -s -X POST $H/api/suites/$SID/run | jq '{passed,total,results}' # 502 if no serve
curl -s $H/api/suites/$SID/runs | jq '.[0]' # history
SuiteRun = {
run_at, passed, total, rate, tool_errors,
results: [ { case_id, session_id, passed, fails:[…], tool_errors } ]
}
- Headline = `passed/total` (+ a pass-rate bar); `…/runs` = trend over time.
- Each `results[].session_id` opens in the trace view → one trace per case.
- `tool_errors` (per case + summed) tells "the tool broke" from "nbot was wrong": a failing case with `tool_errors>0` points at the tool.
There's a ready example in docs/examples/smoke-suite.json (greet / math /
web-title / file-write).
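The `tool_errors` rule for reading a SuiteRun can be sketched as a small triage step. The function name `triage` and the bucket names are hypothetical; the field names come from the SuiteRun shape above, and the sample run is made up.

```python
def triage(suite_run: dict) -> dict:
    """Split case ids into passed / tool_broke / nbot_wrong.

    A failing case with tool_errors > 0 points at the tool; a failing case
    with no tool errors points at nbot's answer.
    """
    buckets = {"passed": [], "tool_broke": [], "nbot_wrong": []}
    for result in suite_run.get("results", []):
        if result.get("passed"):
            buckets["passed"].append(result["case_id"])
        elif result.get("tool_errors", 0) > 0:
            buckets["tool_broke"].append(result["case_id"])
        else:
            buckets["nbot_wrong"].append(result["case_id"])
    return buckets


# Made-up SuiteRun for illustration.
sample_suite_run = {
    "passed": 1, "total": 3,
    "results": [
        {"case_id": "greet", "session_id": "s-1", "passed": True, "fails": [], "tool_errors": 0},
        {"case_id": "web-title", "session_id": "s-2", "passed": False, "fails": ["contains"], "tool_errors": 2},
        {"case_id": "math", "session_id": "s-3", "passed": False, "fails": ["contains"], "tool_errors": 0},
    ],
}
print(triage(sample_suite_run))
```

Each case id in `tool_broke` or `nbot_wrong` still maps to a `session_id`, so the next step is opening that trace.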
8. Trainers
Register the trainable things so any algorithm is trained/evaluated uniformly.
| Call | Purpose |
|---|---|
| GET/POST /api/trainers | list / create {name, kind:"nbot-replay"\|"external", cmd?} |
| GET/DELETE /api/trainers/{id} | view / remove |
| POST /api/trainers/{id}/train {dataset_id, reset?, save?} | run a TrainJob |
| GET /api/trainers/{id}/jobs | run history, newest-first |
- `nbot-replay`: training = drive the serve's `/v1/rl/replay` over a dataset; the active nbot policy folds it in and saves its own params (502 if no serve).
- `external`: a Python/offline-RL process consuming the dataset jsonl; Horizon records the hand-off (`/api/datasets/{id}.jsonl`), you run it out of band.
9. Prompts & prompt optimization
Every authored prompt in nbot is a typed unit in the registry: kind ∈
system | tool | control | adapter | notes | user. Horizon lets you pick the
exact prompt, see its content, version it, and optimize it against a suite.
9.1 Pick + inspect
curl -s $H/api/prompts | jq '.prompts[] | {id,kind,overridden}' # selector (group by kind)
curl -s "$H/api/prompts/tool.vault?model=qwen3.6-27b" | jq . # one unit, as qwen sees it
Each unit: {id, kind, version, overridden, default, effective} (+
effective_global, model_scoped when ?model= is passed).
- `default` = the built-in template; `effective` = what's live now (override or default). Diff `effective` vs `default` when `overridden`.
- `kind` is the where: a tool's prompt vs the qwen adapter's prompt vs the system prompt vs a control classifier. The `tool` group has one unit per tool (`tool.bash`, `tool.vault`, `tool.web`, …) plus the whole catalog (`tool.catalog.full`): pick a tool, see/edit exactly its prompt.
- `?model=<id>` re-resolves the prompt as that adapter sees it (`model_scoped` = has its own override vs inherits global). Get model ids from `GET /api/models`.
- The single-unit endpoint adds `version_count` + `run_count`.
9.2 Deploy / clear an override
curl -s -X POST $H/api/prompts -d '{"id":"tool.catalog.full","template":"…","scope":"","note":"tighter"}'
curl -s -X DELETE $H/api/prompts/tool.catalog.full # revert to built-in default
- `scope` = a model id → a per-model override (e.g. only the qwen adapter); empty = global. Override precedence: model-scope > global > built-in default. Get the adapter list for the scope picker from `GET /api/models` (`{models:[{id,node,capacity,load,online}]}`, the live pool).
- Every deploy is snapshotted as a version.
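The override precedence (model-scope > global > built-in default) can be written as a short resolution function. This is an illustrative sketch of the stated rule, not nbot's registry code; the name `resolve_effective` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_effective(
    default: str,
    global_override: Optional[str] = None,
    model_override: Optional[str] = None,
) -> str:
    """Pick the live template: model-scope, then global, then the default."""
    if model_override is not None:
        return model_override
    if global_override is not None:
        return global_override
    return default


print(resolve_effective("built-in", global_override="global", model_override="qwen-only"))
print(resolve_effective("built-in", global_override="global"))
print(resolve_effective("built-in"))
```

A unit with a model override corresponds to `model_scoped` being true for that `?model=` query; without one it inherits the global override or the default.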
9.3 Version history + rollback
curl -s $H/api/prompts/tool.catalog.full/versions | jq . # newest-first
curl -s -X POST $H/api/prompts/tool.catalog.full/revert -d '{"version":3}'
Each version: {version, template, source, note, chars, ts},
source ∈ manual | promptopt | revert. Revert re-deploys that version.
9.4 Optimize against a suite
curl -s -X POST $H/api/promptopt -H 'Content-Type: application/json' -d '{
"prompt_id":"tool.catalog.full", "suite_id":"'"$SID"'", "generate":3
}' | jq '{best_label,best_rate,deployed, trials:[.trials[]|{label,rate,tool_errors}]}'
curl -s $H/api/prompts/tool.catalog.full/runs | jq '.[0]' # run history
What it does: baseline-scores the suite with the current prompt, then for each
candidate (generated by the serve, or pass "candidates":[…]) deploys it as an
override → re-runs the suite → keeps the highest pass-rate → deploys the winner
(or reverts if baseline wins). Recorded like a suite/train run.
// POST /api/promptopt response (also stored at /api/prompts/{id}/runs)
{ prompt_id, kind, suite_id, scope, deployed, best_label, best_rate,
trials: [ { label, template, rate, passed, total, tool_errors,
results:[ {case_id, session_id, passed, fails[], tool_errors} ] } ] }
- Header shows `prompt_id` + `kind` (+ scope): exactly what's being optimized.
- `tool_errors > 0` on a trial = the tool broke, not the prompt; don't tune a prompt against a broken tool. Each case `session_id` links to its trace.
10. Chat (drive the serve)
curl -s -X POST $H/api/nbot/run -d '{"prompt":"…","session_id":"s-x","new_session":false}'
→ {ok, session_id, response, produced[], files[]} (502 if no serve).
session_id = send INTO that session (resume); new_session = fresh; neither =
current. Pair with the SSE stream (§4) for a live view. Submits serialize on the
serve — a turn to a busy session queues.
11. End-to-end: the optimization loop
┌──────────────── SUITE (the eval signal) ────────────────┐
sessions ──▶ dataset ──▶ RL train (replay) ──▶ run suite (better?) │
│ │
└──▶ prompt unit ──▶ prompt-opt: variants ──run suite per variant──▶ deploy best
- Baseline: `POST /api/suites/{id}/run` → `passed/total`.
- Train routing: `POST /api/rl/train` → re-run the suite → did it move?
- Optimize a prompt: `POST /api/promptopt {prompt_id, suite_id}` → the suite is the fitness function; the winner deploys.
Same signal, three levers (model routing · decomposition · prompts): discover →
change → re-score → deploy, every result linking back to a trace via
session_id. Full copy-paste recipe in WORKFLOW.md.
12. Endpoint reference (complete)
| Group | Endpoints |
|---|---|
| Meta | GET /, GET /api/health |
| Trace | GET /api/sessions, /api/sessions/{id}/trace, /raw, /stream (SSE), /api/sessions/{a}/diff/{b}, /api/sessions/{id}/turns/{n}/rerun (POST), POST /api/sessions/import, GET /api/live, GET /api/stats |
| RL | GET /api/rl/schema, /api/rl/dataset[.jsonl], /api/rl/aggregate, POST /api/rl/train |
| Datasets | GET/POST /api/datasets, GET/DELETE /api/datasets/{id}, GET /api/datasets/{id}.jsonl, POST /api/datasets/{id}/from-session, POST /api/datasets/import, POST /api/datasets/{id}/version |
| Suites | GET/POST /api/suites, GET/DELETE /api/suites/{id}, POST /api/suites/{id}/cases, POST /api/suites/{id}/run, GET /api/suites/{id}/runs |
| Trainers | GET/POST /api/trainers, GET/DELETE /api/trainers/{id}, POST /api/trainers/{id}/train, GET /api/trainers/{id}/jobs |
| Prompts | GET /api/prompts, GET /api/prompts/{id}, POST /api/prompts, DELETE /api/prompts/{id}, GET /api/prompts/{id}/versions, POST /api/prompts/{id}/revert, GET /api/prompts/{id}/runs |
| Prompt-opt | POST /api/promptopt |
| Notes | GET/POST /api/notes, DELETE /api/notes/{id} |
| Live layers | GET /api/satellites, /api/agents, /api/research |
| nbot bridge | GET /api/nbot, GET /api/models, POST /api/nbot/run |
Full request/response shapes per endpoint in API.md; per-page FE flows
in FE_WIRING.md.
13. Troubleshooting
| Symptom | Cause / fix |
|---|---|
| Banner "nbot offline" / 502 on run/train/opt | No serve. cd <project> && ./nbot; GET /api/nbot must be connected:true. |
| Page shows mock-looking data (TOOL_FAST, tools_avail, huge rewards) | FE not bound to the real endpoints. Bind tables to /api/rl/schema + /api/rl/dataset; the REWARD column = reward_scalar. Real exec_mode ∈ DIRECT/TOOL_LIGHT/TOOL_FULL/DECOMPOSE. |
| Suite run returns 502 | Same cause: it needs a serve. Create/list/history work offline. |
| A suite case fails but nbot "seemed fine" | Check tool_errors on the case — >0 means the tool errored, not nbot's answer. Open the case's session_id trace. |
| Prompt-opt "no candidates" (400) | The generator returned nothing usable; pass explicit "candidates":[…]. |
| CORS errors in the browser | Horizon is CORS-open and echoes requested headers; ensure you hit the right base URL/port (and ngrok domain, if used). |
Horizon is the environment; nbot owns the policy. Everything links back to a trace.