Horizon manual

Everything about driving NBOT Horizon — usage, the suites→RL→prompt-opt loop, the RL policy plane, the API surface, and the per-page frontend contract.

Horizon — full usage guide

The complete manual for nbot Horizon: the trace / debug / eval / train plane for nbot. This is the self-contained "how to use it" doc; the companions are API.md (terse endpoint reference), FE_WIRING.md (per-page frontend contract), and WORKFLOW.md (the suites → RL → prompt-opt loop).


1. What Horizon is

nbot is the orchestrator; it only emits a trace (.nconfig/logs/s-*.jsonl). Horizon consumes that trace and adds the evaluation + optimization plane — the LangSmith analog for nbot. It does four things:

  • Trace — normalize each session into a Run → Turn → Span tree you can drill.
  • Debug — inspect any step's raw payload and the whole prompt a model saw.
  • Evaluate — score nbot on curated suites of cases.
  • Train / optimize — turn traces into an RL (state, action, reward) dataset and train nbot's routing policy; and optimize prompts against a suite.

The invariant: nbot owns the algorithm, state/action space, reward shaping, and param format. Horizon is the environment that trains/evaluates whatever is plugged in — it never defines a foreign policy.

The throughline: a suite turns "is nbot good at X?" into a number, and that same number is the signal for both RL training and prompt optimization.


2. Install, build, run

git clone <horizon repo> && cd nbot-horizon
cmake -S . -B build && cmake --build build -j        # no system deps (FetchContent)
./build/nbot-horizon --project /path/to/nbot/project # serves :8848

Flags: --project DIR (default cwd; the dir with .nconfig/), --logs DIR (default <project>/.nconfig/logs), --nbot BASE_URL (default: discovered from <project>/.nconfig/control.json), --port N (default 8848), --cert FILE --key FILE (serve over HTTPS).

HTTPS (so an HTTPS frontend avoids mixed-content):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj /CN=localhost \
  -keyout key.pem -out cert.pem
./build/nbot-horizon --project /path --cert cert.pem --key key.pem

Expose over ngrok (TLS at ngrok's edge, no local cert):

./start.sh /path/to/project          # :8848 -> https://horizonnbot.ngrok.app
./start.sh /path/to/project 8080     # ngrok forwards the SAME port Horizon binds

The API is CORS-open (a separately-built frontend can call it from anywhere).


3. The serve prerequisite + gating

Two layers must be up:

Layer            Check                               If down
Horizon backend  GET /api/health → {ok:true,…}       nothing works; FE shows "backend unreachable"
nbot serve       GET /api/nbot → {connected:true,…}  running ops fail (see below); browsing still works

Horizon discovers the serve from <project>/.nconfig/control.json. Start it:

cd <nbot project> && ./nbot     # the TUI also serves the control API (or: nbot serve)
curl -s $H/api/nbot | jq .connected     # must be true

Offline (no serve) works: trace, stats, RL dataset/schema, dataset CRUD, suite create/list/history, prompt version/run history. Needs a serve: suite run, RL train, prompt-opt, prompt registry read/deploy, chat. These return 502 {ok:false,error:"serve unreachable"} — treat that as "start nbot serve", not a hard error.
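
The gating rule above can be expressed as a small client-side check. This is a minimal sketch, assuming the caller already has the HTTP status code and the decoded JSON body; the function name `classify_response` and its three return labels are hypothetical, not part of Horizon.

```python
def classify_response(status: int, body: dict) -> str:
    """Map a Horizon response to 'ok', 'needs_serve', or 'error' (labels are illustrative)."""
    # Per the manual: serve-dependent ops return 502 {ok:false,error:"serve unreachable"}.
    # That means "start nbot serve", not a hard failure.
    if status == 502 and body.get("error") == "serve unreachable":
        return "needs_serve"
    # Assumption: bodies without an "ok" field count as successful on a 2xx status.
    if 200 <= status < 300 and body.get("ok", True):
        return "ok"
    return "error"
```

A frontend could use the `needs_serve` label to show a "start nbot serve" hint while keeping the offline pages usable.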

Set a shell var for the examples below:

H=https://horizonnbot.ngrok.app        # or http://127.0.0.1:8848

4. Sessions & traces

A Run is one session; it has Turns (one per user input); each Turn has Spans (typed events).

Run { session_id, node, started_at, ended_at, turns[], summary }
Turn{ index, ts, input, status, model, exec_mode, duration_ms,
      counts{tools,tool_errors,anomalies}, spans[] }
Span{ id, kind, name, ts, end_ts, duration_ms, status, summary, data, children[] }

kind ∈ tool | generation | routing | plan | wave | leaf | artifact | grounding | reward | anomaly | event.
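
Because spans nest through children[], drilling a Run usually means a recursive walk. The sketch below assumes the trace JSON has already been decoded into a dict with the Run/Turn/Span fields listed above; `iter_spans` and `spans_of_kind` are hypothetical helper names.

```python
from typing import Iterator, List, Tuple


def iter_spans(run: dict) -> Iterator[Tuple[int, dict]]:
    """Yield (turn_index, span) for every span in a Run, depth-first."""
    def walk(span: dict) -> Iterator[dict]:
        yield span
        # children[] may be absent or empty on leaf spans.
        for child in span.get("children", []):
            yield from walk(child)

    for turn in run.get("turns", []):
        for span in turn.get("spans", []):
            for s in walk(span):
                yield turn["index"], s


def spans_of_kind(run: dict, kind: str) -> List[dict]:
    """Collect all spans of one kind, e.g. 'anomaly' or 'generation'."""
    return [s for _, s in iter_spans(run) if s.get("kind") == kind]
```

For example, `spans_of_kind(run, "generation")` gathers every generation span so that each one's data can be inspected in turn.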

curl -s $H/api/sessions | jq '.[0]'                  # list, newest-first
curl -s $H/api/sessions/s-24933/trace | jq .         # the Run tree
curl -s "$H/api/sessions/s-24933/raw?type=tool_call" # verbatim events, filtered
  • Whole prompt per step: a generation span carries data.prompt_full (the entire assembled prompt) + data.model + data.prompt_chars. That's how you inspect exactly what a model saw.
  • Realtime: GET /api/live → {session_id} of the active session, then GET /api/sessions/{id}/stream (SSE: data: {ts,type,data} per event).
  • Import a session captured elsewhere: POST /api/sessions/import {id, jsonl}.
  • A/B compare two runs: GET /api/sessions/{a}/diff/{b} → turn-aligned diff with per-field changed flags (the "did my prompt/RL change help" view).
  • Aggregate: GET /api/stats → counts of sessions, anomalies, tool calls / errors per tool, model runs per model.
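
The realtime stream delivers one `data:` line per event, so a client can decode it line by line. This is a minimal sketch, assuming each event's JSON fits on a single `data:` line (as the format above suggests) and that the caller supplies the lines from whatever HTTP client it uses; `parse_sse` is a hypothetical name, and multi-line SSE events are deliberately not handled.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded {ts, type, data} event per SSE `data:` line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, ":" comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)
```

Keeping the parser independent of the transport means the same function works on a live response iterator or on a saved capture.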

5. RL dataset & training

Each turn becomes one (state, action, reward) sample — the training data for nbot's router.

  • Schema (render tables/forms from this; don't hardcode field names):
    curl -s $H/api/rl/schema | jq .
    
    { state:[fld], action:[fld], reward:[fld], reward_spec:{w_*, formula, note} }, fld = {name, type, desc, values?[], range?, goal?}, type ∈ float|int|bool|string|enum. exec_mode is an enum [DIRECT, TOOL_LIGHT, TOOL_FULL, DECOMPOSE].
  • Transitions:
    curl -s "$H/api/rl/dataset?session=s-24933" | jq '.[0]'
    
    {session, turn, ts, state{}, action{}, reward{}, reward_scalar}.
  • The REWARD column = reward_scalar (a signed number, null if the turn logged no outcome). Computed by the backend so the FE never miscomputes it:
    reward_scalar = w_success·success − w_iter·iters − w_tokens·tokens
                  − w_latency·latency_ms − w_toolfail·toolfails
    
    A fast success ≈ +0.66; a 150-second turn ≈ −2.3 (slow is penalized). Do not hand-combine the reward object.
  • Aggregate (KPI cards + "policy view"): GET /api/rl/aggregate?by=model|category|exec_mode → {summary{transitions,sessions,success_rate,mean_reward,mean_latency_ms}, groups:[{key,count,success_rate,mean_reward,mean_latency_ms}]}. The groups answer "which model/exec_mode wins for which task kind".
  • Export: GET /api/rl/dataset.jsonl (NDJSON, for external trainers).
  • Train (needs serve): POST /api/rl/train {reset?:false, save?:true} → {ok, applied, policy, stats}. Drives nbot's /v1/rl/replay; reset:true retrains from a cold prior, save:true writes the params (e.g. rl_state.bin).
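
An external trainer that consumes the NDJSON export has to cope with reward_scalar being null. The sketch below reads the export text and summarizes rewards while using only the backend-computed reward_scalar, never the raw reward object. The helper names `load_ndjson` and `summarize_rewards` are hypothetical, and the sample fields are the ones listed under Transitions.

```python
import json
from typing import List, Optional


def load_ndjson(text: str) -> List[dict]:
    """Decode an NDJSON export: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize_rewards(transitions: List[dict]) -> dict:
    """Count transitions and average reward_scalar, skipping turns with no outcome (null)."""
    scored = [t["reward_scalar"] for t in transitions
              if t.get("reward_scalar") is not None]
    mean: Optional[float] = sum(scored) / len(scored) if scored else None
    return {"transitions": len(transitions), "scored": len(scored), "mean_reward": mean}
```

For dashboards, GET /api/rl/aggregate already returns mean_reward; a local summary like this is only useful when working from an exported file.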

6. Datasets

Curated, versioned sample collections (under <project>/.nconfig/datasets/).

Call                                                     Purpose
GET /api/datasets                                        list [{id,name,kind,version,created_at,count}]
POST /api/datasets {name, kind}                          create (kind: rl/case/turn)
GET /api/datasets/{id}                                   full {…, samples[]}
GET /api/datasets/{id}.jsonl                             NDJSON export
DELETE /api/datasets/{id}                                delete (+ version snapshots)
POST /api/datasets/{id}/from-session {session}           append samples built by kind
POST /api/datasets/import {name, kind, samples[]|jsonl}  create + fill
POST /api/datasets/{id}/version                          snapshot to id@vN, bump version

7. Suites (evaluation)

A suite is a named set of cases. Running it sends each case's prompt to the serve (one turn → one session → one trace) and scores it.

Case = { id?, prompt, expect, notes? }
expect = { success?:bool, contains?:[str], absent?:[str],
           file_exists?:str, min_file_bytes?:int, exec_mode?:str }   // all optional
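
To make the expect fields concrete, here is one plausible reading of them as a checker. This is an illustrative sketch only: the manual does not show Horizon's actual scoring code, so the matching rules (plain substring checks, exact exec_mode equality), the function name `check_expect`, and the failure message wording are all assumptions. file_exists and min_file_bytes are omitted because they depend on the project filesystem.

```python
from typing import List


def check_expect(expect: dict, response: str, success: bool, exec_mode: str) -> List[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    fails: List[str] = []
    # Every expect field is optional, so only check the ones that are present.
    if "success" in expect and expect["success"] != success:
        fails.append(f"success: expected {expect['success']}, got {success}")
    for needle in expect.get("contains", []):
        if needle not in response:  # assumption: case-sensitive substring match
            fails.append(f"missing: {needle!r}")
    for needle in expect.get("absent", []):
        if needle in response:
            fails.append(f"unexpected: {needle!r}")
    if "exec_mode" in expect and expect["exec_mode"] != exec_mode:
        fails.append(f"exec_mode: expected {expect['exec_mode']}, got {exec_mode}")
    return fails
```

An empty expect object passes trivially under this reading, which matches the "all optional" note above.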

Create + run + read the dashboard:

SID=$(curl -s -X POST $H/api/suites -d '{"name":"smoke"}' | jq -r .id)
curl -s -X POST $H/api/suites/$SID/cases --data-binary @docs/examples/smoke-suite.json
curl -s -X POST $H/api/suites/$SID/run | jq '{passed,total,results}'   # 502 if no serve
curl -s $H/api/suites/$SID/runs | jq '.[0]'                            # history
SuiteRun = {
  run_at, passed, total, rate, tool_errors,
  results: [ { case_id, session_id, passed, fails:[], tool_errors } ]
}
  • Headline = passed/total (+ a pass-rate bar); …/runs = trend over time.
  • Each results[].session_id opens in the trace view → one trace per case.
  • tool_errors (per case + summed) tells "the tool broke" from "nbot was wrong": a failing case with tool_errors>0 points at the tool.
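
The tool_errors rule lends itself to a quick triage pass over a SuiteRun. A minimal sketch, assuming the SuiteRun shape shown above; `triage` and the two bucket names are hypothetical.

```python
from typing import Dict, List


def triage(suite_run: dict) -> Dict[str, List[str]]:
    """Split failing cases into 'tool broke' vs 'nbot was wrong' using tool_errors."""
    tool_broke: List[str] = []
    nbot_wrong: List[str] = []
    for result in suite_run["results"]:
        if result["passed"]:
            continue
        # A failing case with tool_errors > 0 points at the tool, not at nbot's answer.
        if result.get("tool_errors", 0) > 0:
            tool_broke.append(result["case_id"])
        else:
            nbot_wrong.append(result["case_id"])
    return {"tool_broke": tool_broke, "nbot_wrong": nbot_wrong}
```

Each case_id in either bucket still maps to a session_id in the original results, so the trace for that case is one lookup away.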

There's a ready example in docs/examples/smoke-suite.json (greet / math / web-title / file-write).


8. Trainers

Register the trainable things so any algorithm is trained/evaluated uniformly.

Call                                                       Purpose
GET/POST /api/trainers                                     list / create {name, kind:"nbot-replay"|"external", cmd?}
GET/DELETE /api/trainers/{id}                              view / remove
POST /api/trainers/{id}/train {dataset_id, reset?, save?}  run a TrainJob
GET /api/trainers/{id}/jobs                                run history, newest-first
  • nbot-replay — training = drive the serve's /v1/rl/replay over a dataset; the active nbot policy folds it in and saves its own params (502 if no serve).
  • external — a Python/offline-RL process consuming the dataset jsonl; Horizon records the hand-off (/api/datasets/{id}.jsonl), you run it out of band.

9. Prompts & prompt optimization

Every authored prompt in nbot is a typed unit in the registry: kind ∈ system | tool | control | adapter | notes | user. Horizon lets you pick the exact prompt, see its content, version it, and optimize it against a suite.

9.1 Pick + inspect

curl -s $H/api/prompts | jq '.prompts[] | {id,kind,overridden}'        # selector (group by kind)
curl -s "$H/api/prompts/tool.vault?model=qwen3.6-27b" | jq .           # one unit, as qwen sees it

Each unit: {id, kind, version, overridden, default, effective} (+ effective_global, model_scoped when ?model= is passed).

  • default = the built-in template; effective = what's live now (override or default). Diff effective vs default when overridden.
  • kind is the where: a tool's prompt vs the qwen adapter's prompt vs the system prompt vs a control classifier. The tool group has one unit per tool (tool.bash, tool.vault, tool.web, …) plus the whole catalog (tool.catalog.full) — pick a tool, see/edit exactly its prompt.
  • ?model=<id> re-resolves the prompt as that adapter sees it (model_scoped = has its own override vs inherits global). Get model ids from GET /api/models.
  • The single-unit endpoint adds version_count + run_count.

9.2 Deploy / clear an override

curl -s -X POST $H/api/prompts -d '{"id":"tool.catalog.full","template":"…","scope":"","note":"tighter"}'
curl -s -X DELETE $H/api/prompts/tool.catalog.full     # revert to built-in default
  • scope = a model id → a per-model override (e.g. only the qwen adapter); empty = global. Override precedence: model-scope > global > built-in default. Get the adapter list for the scope picker from GET /api/models ({models:[{id,node,capacity,load,online}]} — the live pool).
  • Every deploy is snapshotted as a version.
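
The override precedence (model-scope > global > built-in default) can be sketched as a plain lookup. This models the rule only; it is not nbot's resolver, and `resolve_prompt` and its parameters are hypothetical names.

```python
from typing import Dict, Optional, Tuple


def resolve_prompt(default: str,
                   global_override: Optional[str] = None,
                   model_overrides: Optional[Dict[str, str]] = None,
                   model: Optional[str] = None) -> Tuple[str, str]:
    """Return (effective_template, source), where source is 'model', 'global', or 'default'."""
    model_overrides = model_overrides or {}
    # 1. A per-model override wins when resolving for that model.
    if model is not None and model in model_overrides:
        return model_overrides[model], "model"
    # 2. Otherwise fall back to the global override, if one is deployed.
    if global_override is not None:
        return global_override, "global"
    # 3. Otherwise the built-in default is what is live.
    return default, "default"
```

Resolving the same unit with and without a model id mirrors what ?model=<id> does on the single-unit endpoint: a model with no override of its own inherits the global result.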

9.3 Version history + rollback

curl -s $H/api/prompts/tool.catalog.full/versions | jq .          # newest-first
curl -s -X POST $H/api/prompts/tool.catalog.full/revert -d '{"version":3}'

Each version: {version, template, source, note, chars, ts}, source ∈ manual | promptopt | revert. Revert re-deploys that version.

9.4 Optimize against a suite

curl -s -X POST $H/api/promptopt -H 'Content-Type: application/json' -d '{
  "prompt_id":"tool.catalog.full", "suite_id":"'"$SID"'", "generate":3
}' | jq '{best_label,best_rate,deployed, trials:[.trials[]|{label,rate,tool_errors}]}'
curl -s $H/api/prompts/tool.catalog.full/runs | jq '.[0]'         # run history

What it does: baseline-scores the suite with the current prompt, then for each candidate (generated by the serve, or pass "candidates":[…]) deploys it as an override → re-runs the suite → keeps the highest pass-rate → deploys the winner (or reverts if baseline wins). Recorded like a suite/train run.

// POST /api/promptopt response (also stored at /api/prompts/{id}/runs)
{ prompt_id, kind, suite_id, scope, deployed, best_label, best_rate,
  trials: [ { label, template, rate, passed, total, tool_errors,
              results:[ {case_id, session_id, passed, fails[], tool_errors} ] } ] }
  • Header shows prompt_id + kind (+ scope) — exactly what's being optimized.
  • tool_errors > 0 on a trial = the tool broke, not the prompt — don't tune a prompt against a broken tool. Each case session_id links to its trace.
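
The selection step (keep the highest pass-rate, revert if the baseline wins) can be sketched as follows. Two points are assumptions rather than documented behavior: a candidate must strictly beat the baseline to be deployed, so ties keep the current prompt, and the baseline is reported under the label "baseline". `pick_winner` is a hypothetical name.

```python
from typing import List


def pick_winner(baseline_rate: float, trials: List[dict]) -> dict:
    """Choose the best trial by pass-rate; keep the baseline unless a candidate beats it."""
    best_label, best_rate = "baseline", baseline_rate  # hypothetical label
    for trial in trials:
        # Strict '>' is an assumption: on a tie the earlier (or baseline) prompt stays.
        if trial["rate"] > best_rate:
            best_label, best_rate = trial["label"], trial["rate"]
    return {
        "best_label": best_label,
        "best_rate": best_rate,
        "deployed": best_label != "baseline",
    }
```

Before trusting a winner, it is still worth checking its tool_errors, since a trial scored against a broken tool says little about the prompt.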

10. Chat (drive the serve)

curl -s -X POST $H/api/nbot/run -d '{"prompt":"…","session_id":"s-x","new_session":false}'

Response: {ok, session_id, response, produced[], files[]} (502 if no serve). Session targeting: session_id sends the turn into that session (resume); new_session:true starts a fresh session; with neither, the turn goes to the current session. Pair with the SSE stream (§4) for a live view. Submits serialize on the serve, so a turn sent to a busy session queues.
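
The three session-targeting modes map onto the request body like this. A minimal sketch: `run_payload` is a hypothetical helper, and rejecting session_id together with new_session=True is a client-side guard assumed here, not a documented server rule.

```python
from typing import Optional


def run_payload(prompt: str, session_id: Optional[str] = None,
                new_session: bool = False) -> dict:
    """Build the JSON body for POST /api/nbot/run."""
    if session_id is not None and new_session:
        # Assumption: asking to resume and to start fresh at once is a caller mistake.
        raise ValueError("pass session_id or new_session=True, not both")
    body = {"prompt": prompt}
    if session_id is not None:
        body["session_id"] = session_id  # resume: send into that session
    if new_session:
        body["new_session"] = True       # start a fresh session
    return body                          # neither key: the current session
```

Serializing the result with json.dumps gives a body suitable for the curl -d argument shown above.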


11. End-to-end: the optimization loop

            ┌──────────────── SUITE (the eval signal) ────────────────┐
sessions ──▶ dataset ──▶ RL train (replay) ──▶ run suite (better?)     │
   │                                                                    │
   └──▶ prompt unit ──▶ prompt-opt: variants ──run suite per variant──▶ deploy best
  1. Baseline: POST /api/suites/{id}/run → passed/total.
  2. Train routing: POST /api/rl/train → re-run the suite → did it move?
  3. Optimize a prompt: POST /api/promptopt {prompt_id, suite_id} → the suite is the fitness function; the winner deploys.

Same signal, three levers (model routing · decomposition · prompts): discover → change → re-score → deploy, every result linking back to a trace via session_id. Full copy-paste recipe in WORKFLOW.md.
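
The three steps can be strung together in a few lines. This sketch takes a `post(path, body)` callable as a parameter so that it stays independent of any HTTP library; `post`, `optimization_loop`, and the keys of the returned summary are hypothetical, while the endpoint paths and the passed/total, best_rate, and deployed fields come from the sections above.

```python
from typing import Callable, Optional


def optimization_loop(post: Callable[[str, Optional[dict]], dict],
                      suite_id: str, prompt_id: str) -> dict:
    """Baseline the suite, train routing, re-score, then optimize one prompt."""
    def rate(run: dict) -> float:
        return run["passed"] / run["total"] if run["total"] else 0.0

    # 1. Baseline score.
    baseline = post(f"/api/suites/{suite_id}/run", None)
    # 2. Train routing, then re-run the same suite to see whether the number moved.
    post("/api/rl/train", {"reset": False, "save": True})
    after_rl = post(f"/api/suites/{suite_id}/run", None)
    # 3. Prompt optimization uses the same suite as its fitness function.
    opt = post("/api/promptopt", {"prompt_id": prompt_id, "suite_id": suite_id})
    return {
        "baseline_rate": rate(baseline),
        "after_rl_rate": rate(after_rl),
        "rl_helped": rate(after_rl) > rate(baseline),
        "promptopt_best_rate": opt["best_rate"],
        "promptopt_deployed": opt["deployed"],
    }
```

Every call in this loop needs a running serve, so a real `post` implementation should surface the 502 "serve unreachable" case rather than swallow it.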


12. Endpoint reference (complete)

Group        Endpoints
Meta         GET /, GET /api/health
Trace        GET /api/sessions, /api/sessions/{id}/trace, /raw, /stream (SSE), /api/sessions/{a}/diff/{b}, /api/sessions/{id}/turns/{n}/rerun (POST), POST /api/sessions/import, GET /api/live, GET /api/stats
RL           GET /api/rl/schema, /api/rl/dataset[.jsonl], /api/rl/aggregate, POST /api/rl/train
Datasets     GET/POST /api/datasets, GET/DELETE /api/datasets/{id}, GET /api/datasets/{id}.jsonl, POST /api/datasets/{id}/from-session, POST /api/datasets/import, POST /api/datasets/{id}/version
Suites       GET/POST /api/suites, GET/DELETE /api/suites/{id}, POST /api/suites/{id}/cases, POST /api/suites/{id}/run, GET /api/suites/{id}/runs
Trainers     GET/POST /api/trainers, GET/DELETE /api/trainers/{id}, POST /api/trainers/{id}/train, GET /api/trainers/{id}/jobs
Prompts      GET /api/prompts, GET /api/prompts/{id}, POST /api/prompts, DELETE /api/prompts/{id}, GET /api/prompts/{id}/versions, POST /api/prompts/{id}/revert, GET /api/prompts/{id}/runs
Prompt-opt   POST /api/promptopt
Notes        GET/POST /api/notes, DELETE /api/notes/{id}
Live layers  GET /api/satellites, /api/agents, /api/research
nbot bridge  GET /api/nbot, GET /api/models, POST /api/nbot/run

Full request/response shapes per endpoint in API.md; per-page FE flows in FE_WIRING.md.


13. Troubleshooting

Symptom                                                              Cause / fix
Banner "nbot offline" / 502 on run/train/opt                         No serve. cd <project> && ./nbot; GET /api/nbot must be connected:true.
Page shows mock-looking data (TOOL_FAST, tools_avail, huge rewards)  FE not bound to the real endpoints. Bind tables to /api/rl/schema + /api/rl/dataset; the REWARD column = reward_scalar. Real exec_mode ∈ DIRECT/TOOL_LIGHT/TOOL_FULL/DECOMPOSE.
Suite run returns 502                                                Same: needs a serve. Create/list/history work offline.
A suite case fails but nbot "seemed fine"                            Check tool_errors on the case: >0 means the tool errored, not nbot's answer. Open the case's session_id trace.
Prompt-opt "no candidates" (400)                                     The generator returned nothing usable; pass explicit "candidates":[…].
CORS errors in the browser                                           Horizon is CORS-open and echoes requested headers; ensure you hit the right base URL/port (and ngrok domain, if used).

Horizon is the environment; nbot owns the policy. Everything links back to a trace.