Introduction

kanade — 奏 is an endpoint management system for Windows fleets. It gives an operator a single CLI / SPA to run scripts, install software, gather inventory, and stream live perf data from hundreds of PCs at once. The pieces:

Component	What it is
kanade-agent	Service that runs on each managed PC. Subscribes to NATS, executes commands, ships results.
kanade-backend	HTTP API + projector. Persists state, serves the SPA, exposes operator endpoints.
kanade-client	Optional Tauri desktop app. End-user-facing surface.
NATS server	Message broker for command fan-out + result aggregation. The agent talks NATS-only; the backend reads NATS too.
kanade CLI	Operator-facing command line: publish binaries, fire jobs, query state.

This site covers two audiences:

Operators running a kanade fleet — how to update each component without ssh-ing into endpoints (see Agent-mediated updates).
Developers writing PowerShell jobs the agent will execute — what works, what doesn't, what changed in recent agent versions (see Writing scripts for the agent).

The detailed protocol / on-wire spec lives at Spec (the legacy single-page document; will be split into chapters as the docs site fills out).

Developer Quickstart

This guide gets you up and running with a local kanade development environment.

1. Prerequisites

Before starting, ensure you have the following installed on your Windows machine:

Rust toolchain (stable channel)
cargo-make (run cargo install --force cargo-make)
bun (for SPA dependency management and build execution)
gsudo (for local service deployment tests)
nats-server (runnable from PATH)

2. One-Time Setup

Run the following command at the workspace root to register the git pre-push hooks and install agent skills defined in apm.yml:

cargo make setup

3. Launching the Dev Sandbox

You can spin up a fully isolated, multi-component development stack on your local host using a single command:

cargo make dev

This task runs the following services concurrently in a loopback sandbox:

nats-dev: Unauthenticated NATS broker listening on port 4223.
backend-dev: Dev API server listening on port 8081 with auth disabled.
agent-dev: Local dev agent talking to the dev NATS broker on 4223.
web-dev: Vite dev server for the React SPA listening on http://localhost:5173.

Press Ctrl+C to tear down all components cleanly.

4. Multi-Agent Fleet Simulation

To debug behavior that only shows up when managing multiple machines (e.g. concurrent execution result projection or ID collisions), you can launch a multi-agent sandbox:

cargo make dev-fleet

This spawns the NATS broker, backend, and SPA, plus three separate dev agents with independent IDs (dev-pc-1, dev-pc-2, dev-pc-3) and isolated state databases.

5. Local Deploy Testing

If you want to test the full lifecycle of installing components as Windows services (mirroring production environments), use the local deployment scripts:

# Installs CLI, agent, backend, and NATS services locally via gsudo elevation
cargo make local-deploy

After deployment, you can verify and interact with the real Windows services. Use the following task to stop and cleanly delete the services when finished:

cargo make local-undeploy

System Architecture

kanade is designed to manage hundreds of Windows endpoints concurrently, safely, and asynchronously.

Component Topology

The system consists of five main components, coordinated through an event-driven pub/sub structure:

graph TD
    subgraph Operator Session
        CLI[kanade CLI]
        SPA[React SPA]
    end

    subgraph Server Infrastructure
        Backend[kanade-backend]
        NATS[NATS Broker / JetStream]
    end

    subgraph Windows Endpoints
        Agent1[kanade-agent PC-1]
        Agent2[kanade-agent PC-2]
        Client[kanade-client Tauri App]
    end

    CLI -->|Command / Query API| Backend
    SPA -->|REST / WebSockets| Backend
    Backend <-->|State/PubSub| NATS
    Agent1 <-->|NATS-only Connection| NATS
    Agent2 <-->|NATS-only Connection| NATS
    Client <-->|Tauri IPC| Agent1

1. kanade-agent

A high-performance Windows service running on each managed host.

Role: Core executor.
Communication: Establishes an outbound-only NATS connection. It does not open any inbound ports, making it firewall-friendly.
Capabilities: Launches secure, isolated PowerShell subprocesses, inventories hardware/software specs, streams live performance data (CPU, RSS memory, disk I/O), and manages local packages.

2. kanade-backend

The central HTTP API and projection server.

Role: Coordinates commands, processes incoming telemetry, and hosts the operator Web interface.
State management: Persists events, activity logs, and status records in a local SQLite database.
Projector pattern: Subscribes to the NATS command-response stream, parses incoming payloads, and projects them into state tables in real time.
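
As a rough illustration of the projector pattern, the sketch below folds result messages into a SQLite table. The table name `results`, its columns, and the JSON payload shape are assumptions made for this example; they are not kanade's actual schema or wire format.

```python
import json
import sqlite3


def open_state_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) projection table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " exec_id TEXT NOT NULL,"
        " pc_id TEXT NOT NULL,"
        " exit_code INTEGER,"
        " stdout TEXT,"
        " PRIMARY KEY (exec_id, pc_id))"
    )
    return db


def project_result(db: sqlite3.Connection, payload: bytes) -> None:
    """Parse one result message and upsert it into the state table.

    INSERT OR REPLACE keeps the projection idempotent, so a message
    that is delivered twice does not produce a duplicate row.
    """
    msg = json.loads(payload)
    db.execute(
        "INSERT OR REPLACE INTO results (exec_id, pc_id, exit_code, stdout)"
        " VALUES (?, ?, ?, ?)",
        (msg["exec_id"], msg["pc_id"], msg["exit_code"], msg.get("stdout", "")),
    )
    db.commit()
```

Keying the row on (exec_id, pc_id) is one plausible way to make replays harmless; the real projector may use a different key.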

3. NATS Broker (with JetStream)

The message transport layer of the entire fleet.

Role: Lightweight, high-throughput message broker.
JetStream: Persists command streams and job registrations, and provides file storage through NATS Object Store buckets (used to distribute packages and agent scripts).
Isolation: Decouples the backend from the agents. If the backend is offline or restarting, agents keep executing jobs. If the broker itself is unreachable, agents queue results in a local outbox and push them once the connection resumes.

4. kanade-client

An optional Tauri desktop application running in the logged-in user's desktop session on endpoints.

Role: Provides end-user interaction (e.g., prompt dialogs, notifications, or a user-facing dashboard).
Communication: Shares state with the local kanade-agent via secured local IPC mechanisms.

5. kanade CLI

The primary command-line tool for operators.

Role: Packages and publishes software updates, submits and executes job manifests, and queries live fleet inventory from the command line.

Security & Reliability Design

Outbound-only Connections

Agents communicate with the NATS broker only through outbound TCP connections that they initiate. No inbound firewall ports need to be opened on endpoints, which reduces exposure to lateral movement and external port scanning.

Agent Job Sandboxing

When executing scripts, the agent stages commands in %ProgramData%\Kanade\agent-scripts and executes them using customized launcher templates. Administrators can enforce identity configurations via job manifests, specifying run_as: system (for elevated system management) or run_as: user (to run safely under the active user's credentials with restricted directory ACLs).

Operations overview

Day-2 operations in kanade fall into two flows:

Direct install — drop binaries + config on a fresh host and register the Windows service. Used to bootstrap the first agent, the initial backend, and the NATS server. Scripts: scripts/deploy/agent.ps1, scripts/deploy/backend.ps1, scripts/deploy/nats.ps1. Run manually on the target host.
Agent-mediated update — once an agent is running, the agent itself can install / update other components on its own host without ssh / RDP. The operator publishes binaries + script bodies to the broker, then fires a job; the agent fetches, verifies, swaps, and restarts services. This is the bulk of day-2 operations.

The agent-mediated flow has the same shape regardless of what you're updating:

operator host ─► kanade CLI ─► NATS broker ─► agent (on target host)
                    │                              │
                    ├── publish binary ────────────► fetches from
                    │   to OBJECT_APP_PACKAGES        OBJECT_APP_PACKAGES
                    ├── publish script ────────────► fetches from
                    │   to OBJECT_SCRIPTS             OBJECT_SCRIPTS
                    ├── register / update job ─────► reads job manifest
                    │                                 from `jobs` KV
                    └── exec job ──────────────────► PowerShell child
                                                      runs the script

Component-specific guides:

kanade-backend — the HTTP / projector binary
kanade-client — the Tauri end-user app
NATS server — the broker itself (yes, you can update the broker over the broker)
kanade-agent itself — the agent self-update path (different from the other three; it uses a dedicated rollout bucket, not the generic OBJECT_APP_PACKAGES + script pair)

Installation and Deployment

This section details how to bootstrap kanade components as native Windows services in production or staging environments.

Deployment Model

Production hosts and target endpoints run kanade components as background Windows services. This ensures high availability and automatic startup.

Service Name	Binary	Config Source	Typical Target
KanadeNats	`nats-server.exe`	Hardened Registry / Registry-baked CLI flags	Central server
KanadeBackend	`kanade-backend.exe`	Hardened Registry / Config file	Central server
KanadeAgent	`kanade-agent.exe`	Hardened Registry / Local state DB	Managed endpoints

1. Prerequisites

Host OS: Windows 10/11 or Windows Server 2016+.
gsudo: Needed to run elevated installations from a standard user shell. Alternatively, run the commands from an Administrator PowerShell prompt.
Network Routing: Managed endpoints must be able to reach the NATS server port (default 4222) over TCP.

2. Setting Up the NATS Server (Broker)

The NATS server acts as the messaging core.

Stage the deployment bundle using scripts/build-release.ps1 -Roles nats.

Deploy the service with elevation:

# Elevated PowerShell prompt
& "dist\nats\deploy-nats.ps1" -NatsToken "your-secure-nats-token" -Recreate

This installs the KanadeNats service, configures it to run under the local system account, sets up JetStream data directories, and locks down the secure authorization token in the Windows registry.

3. Deploying the Backend API & SPA

The backend manages operator connections and processes event logs.

Stage the backend binaries and React SPA bundle using scripts/build-release.ps1 -Roles backend.

Deploy the service:

# Elevated PowerShell prompt
& "dist\backend\deploy-backend.ps1" `
    -NatsToken "your-secure-nats-token" `
    -StaticToken "your-operator-spa-bearer-token" `
    -ForceConfig -Recreate

-NatsToken: Token the backend presents when connecting to the NATS server.
-StaticToken: Defines the API bearer token required for operator CLI/SPA logins.

The deployment script registers the KanadeBackend Windows service, sets the appropriate ACLs, and verifies the endpoint.

4. Installing the Agent on Target Endpoints

Install the agent on every endpoint PC that you want to manage.

Stage the agent bundle using scripts/build-release.ps1 -Roles agent.
Copy the contents of the dist/agent folder to the target PC.

On the target PC, run the installer:

# Elevated PowerShell prompt
& ".\deploy-agent.ps1" -NatsToken "your-secure-nats-token" -ForceConfig -Recreate

The script:

Places kanade-agent.exe into its destination directory.
Secures the configuration and NATS token in the Windows registry path (HKLM:\SOFTWARE\Kanade\agent).
Registers and starts the KanadeAgent service.

Once the service is active, the agent establishes an outbound NATS connection, subscribes to command streams, and reports its online heartbeat back to the fleet backend.

Agent-mediated updates

The agent is the universal installer. Once it's running on a target host, the operator never needs to touch the host directly to update any other component — including the backend it talks to, the broker that carries its messages, and the agent itself.

This chapter has one page per component: kanade-backend, kanade-client, NATS server, and kanade-agent itself.

Common machinery used by all of them:

Bucket / Stream	Purpose
`OBJECT_APP_PACKAGES`	Generic binary storage (backend, client, NATS server, …). Keyed by `<name>/<version>`.
`OBJECT_SCRIPTS`	PowerShell script bodies referenced by manifests via `script_object`. Keyed by `<name>/<version>`.
`OBJECT_AGENT_RELEASES`	Agent binaries only. Separate from `OBJECT_APP_PACKAGES` because agent rollout has its own watcher / target_version flow.
`agent_config` (KV)	Layered config — global / per-group / per-PC. `target_version` lives here.
`jobs` (KV)	Job catalog. Each entry is a manifest the operator can `exec`.

The CLI surface:

Command	What it does
`kanade app publish <name> <version> <file>`	Upload to `OBJECT_APP_PACKAGES`.
`kanade script publish <name> <version> <file>`	Upload to `OBJECT_SCRIPTS`.
`kanade job create <yaml>`	Upsert a job manifest into the `jobs` KV.
`kanade exec <job-id> --pcs <pc> [--pcs <pc> …]`	Fire a registered job at a set of PCs.
`kanade agent publish <file>`	Upload an agent binary (version extracted from PE VERSIONINFO).
`kanade agent rollout <version> --pc | --group | --global`	Flip `target_version` on the chosen scope; agents pick it up via their self-update watcher.

Updating kanade-backend

The backend lives on one (or more) of your managed hosts as a Windows service. Agent-mediated update means: an agent running on that host stops the service, swaps the binary, starts it back up — while the operator never logs into the host.

End-to-end flow

┌── operator host ──────────────────────────────────────────┐
│  1. build kanade-backend.exe                              │
│  2. kanade app publish kanade-backend <v> <exe>           │
│  3. edit deploy-backend.ps1 (set $AgentSource* knobs)     │
│  4. kanade script publish deploy-backend <v> <edited.ps1> │
│  5. kanade job create install-kanade-backend.yaml         │
│  6. kanade exec install-kanade-backend --pcs <host>       │
└────────────────────────────────────────────────────────────┘
                      │
                      ▼
┌── target host (running kanade-agent as LocalSystem) ──────┐
│  • agent receives the Command on commands.pc.<host>       │
│  • fetches deploy-backend.ps1 from OBJECT_SCRIPTS         │
│    sha-verifies it (`script_object` machinery, #214)      │
│  • stages it under                                        │
│    C:\ProgramData\Kanade\agent-scripts\<UUID>\            │
│    kanade-<UUID>.ps1                                      │
│  • runs `powershell -File <launcher>` (PR #230 fix)       │
│  • launcher invokes the user script via `& '...'`         │
│    so [CmdletBinding()] / param() headers parse           │
│  • script downloads kanade-backend.exe from               │
│    OBJECT_APP_PACKAGES (via /api/app-packages/…)          │
│    sha-verifies it (separate hash, on the exe itself)     │
│  • Stop-Service KanadeBackend                             │
│  • copy exe over C:\Program Files\Kanade\…                │
│  • Start-Service KanadeBackend                            │
│  • exit 0 — result published to NATS                      │
└────────────────────────────────────────────────────────────┘

The two sha checks are intentional: the script body's hash is verified by the agent before execution (script integrity); the binary's hash is verified by the script before the swap (binary integrity, defined by the operator in $AgentSourceSha256).
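
The binary-integrity half of that design can be sketched in a few lines. This is only an illustration of "verify before you swap"; the function names are hypothetical and the real check lives in the PowerShell deploy script.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the lowercase hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(downloaded: Path, expected_hex: str) -> None:
    """Raise before anything is swapped if the digest does not match."""
    actual = sha256_hex(downloaded)
    expected = expected_hex.strip().lower()
    if actual != expected:
        raise ValueError(f"sha256 mismatch: expected={expected} actual={actual}")
```

Because the check raises before the service is stopped, a mismatch leaves the existing install untouched.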

Step-by-step

1. Build kanade-backend

cargo build --release -p kanade-backend

Output: target/release/kanade-backend.exe.

2. Publish the binary

kanade app publish kanade-backend 0.43.0 target/release/kanade-backend.exe

This uploads the binary to OBJECT_APP_PACKAGES/kanade-backend/0.43.0 and prints the sha-256 digest. Copy the digest — you'll need its lowercase-hex form for the script.

3. Edit `scripts/deploy/backend.ps1`

Make a local copy. Set the four $AgentSource* knobs at the top:

$AgentSourceUrl       = 'http://kanade-backend.example.com:8080'
$AgentSourceVersion   = '0.43.0'
$AgentSourceSha256    = '<lowercase hex of kanade-backend.exe>'
$AgentSourceAuthToken = '<bearer for the backend HTTP API>'

Leave the rest of the script alone — those knobs are how the script knows it's running in "agent mode" (downloading from the backend) vs the manual-install mode (local folder of files).

The $AgentSourceSha256 is the hex form of Get-FileHash kanade-backend.exe -Algorithm SHA256.

If you only have the base64url form printed by kanade app publish, decode it. The base64 from the CLI is URL-safe and may be unpadded, so the PowerShell snippet needs to re-pad before FromBase64String accepts it:

$b64 = '<paste the SHA-256= value here, without the SHA-256= prefix>'
$b64 = $b64.Replace('-', '+').Replace('_', '/')
if ($b64.Length % 4) { $b64 += '=' * (4 - $b64.Length % 4) }
[BitConverter]::ToString([Convert]::FromBase64String($b64)).Replace('-', '').ToLowerInvariant()

Or in Python:

python -c "import base64; print(base64.urlsafe_b64decode('<b64>' + '=' * (-len('<b64>') % 4)).hex())"

The $AgentSourceAuthToken is required as of the live test on 2026-05-26 — the backend's /api/app-packages/<name>/<ver> endpoint returns HTTP 401 without it. Leave empty only for no-auth lab setups.

4. Publish the edited script

kanade script publish deploy-backend 0.43.0 .\deploy-backend.edited.ps1

Upload goes to OBJECT_SCRIPTS/deploy-backend/0.43.0.

5. Register / update the job

configs/jobs/installers/install-kanade-backend.yaml in the repo is the template. Edit version: + script_object: to point at the version you just published, then upsert:

id: install-kanade-backend
version: 0.43.0
execute:
  shell: powershell
  script_object: deploy-backend/0.43.0
  timeout: 300s
  run_as: system
require_approval: true

kanade job create jobs\install-kanade-backend.yaml

run_as: system is required: Stop-Service / Start-Service / sc.exe all need admin. The agent already runs as LocalSystem in production.

6. Fire it

kanade exec install-kanade-backend --pcs <backend-host>

The CLI returns an exec_id immediately. The actual install happens asynchronously on the target.

7. Verify

Query the backend's results endpoint (or watch the SPA Activity view):

curl -H "Authorization: Bearer <token>" `
  "http://<backend>/api/results?limit=5"

Look for your exec_id with exit_code: 0 and a stdout that ends with kanade-backend <new-version>.
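
If you script this check, it might look like the sketch below. It assumes the endpoint's response has already been parsed into a list of dicts with `exec_id`, `exit_code`, and `stdout` keys; the real response shape may differ, and both function names are hypothetical.

```python
from typing import Iterable, Optional


def find_result(results: Iterable[dict], exec_id: str) -> Optional[dict]:
    """Return the first row whose exec_id matches, or None if absent."""
    return next((row for row in results if row.get("exec_id") == exec_id), None)


def install_succeeded(row: dict, new_version: str) -> bool:
    """True when the row has exit code 0 and stdout ends with the version line."""
    stdout = (row.get("stdout") or "").rstrip()
    return row.get("exit_code") == 0 and stdout.endswith(
        f"kanade-backend {new_version}"
    )
```

A missing row usually just means the job has not finished yet, so callers would poll rather than treat `None` as a failure.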

What can go wrong

Symptom	Cause	Fix
`[CmdletBinding()]` / `param()` parse error in stderr	Agent older than 0.42.2 (running `-Command` mode)	Upgrade the agent first via `kanade agent rollout` (see agent self-update).
`Start-BitsTransfer : HTTP status 401`	`$AgentSourceAuthToken` empty but backend requires auth	Set it.
`Start-BitsTransfer : The transfer encountered an error` / job state `TransientError`	BITS service not running, or target machine's WinHTTP can't reach `$AgentSourceUrl`	`Get-Service BITS`; check WinHTTP proxy with `netsh winhttp show proxy` (BITS uses WinHTTP, not IE/WinINet).
`sha256 mismatch — expected=<x> actual=<y>`	Hash in script doesn't match the published binary	Re-publish or recompute the hash. The script aborts BEFORE the swap, so the existing install is intact.
Job runs but kanade-backend doesn't come back up	Service-failure / config drift on target	Read `C:\ProgramData\Kanade\log\backend.*.log` on the target. The agent can fetch it via `kanade logs <pc>` (when implemented) or you can pull the file directly.

Updating kanade-client

The Tauri desktop client is shipped to endpoints the same way as the backend: binary in OBJECT_APP_PACKAGES, script in OBJECT_SCRIPTS, job in jobs KV. The shape mirrors backend updates — only the script content and package name differ.

What's different from backend updates

Aspect	kanade-backend	kanade-client
Service to (re)start	`KanadeBackend` (Windows service)	None — the client is launched by the user
Install location	`%ProgramFiles%\Kanade\kanade-backend.exe`	`%ProgramFiles%\Kanade\kanade-client.exe`
Script in repo	`scripts/deploy/backend.ps1`	`configs/jobs/installers/scripts/install-kanade-client.ps1` (lives in the manifest's script_file path)
Manifest file ref	`script_object: deploy-backend/<v>`	`script_file: scripts/install-kanade-client.ps1` (relative to the manifest YAML; inlined at `kanade job create`)
Atomic swap pattern	Stop service → copy → start service	Stage to `<exe>.new` → `Move-Item` → drop `<exe>.old`
Inventory projection	None (the backend reports its own version)	`inventory:` block emits per-PC client version into the SPA Inventory page

Both shapes are supported: script_object (referenced by its <name>/<version> key in OBJECT_SCRIPTS; the agent fetches and sha-verifies it on demand) and script_file (script body inlined into the manifest at kanade job create time). The client manifest uses script_file for historical reasons; the backend manifest uses script_object because it was rewritten to test the Object Store path.
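
The stage-then-rename swap listed for the client in the table above can be sketched as follows. The exact order of the renames is an assumption for illustration; the real logic is in the PowerShell installer, and `swap_in_staged` is a hypothetical name.

```python
import os
from pathlib import Path


def swap_in_staged(exe: Path) -> None:
    """Promote <exe>.new to <exe>, parking the previous binary as <exe>.old."""
    staged = exe.with_name(exe.name + ".new")
    previous = exe.with_name(exe.name + ".old")
    if not staged.exists():
        raise FileNotFoundError(staged)
    # Clear leftovers from an earlier, interrupted attempt.
    previous.unlink(missing_ok=True)
    if exe.exists():
        # Keep the old binary around until the new one is in place.
        os.replace(exe, previous)
    os.replace(staged, exe)
    # The swap succeeded, so the parked copy is no longer needed.
    previous.unlink(missing_ok=True)
```

Staging next to the destination keeps both renames on the same volume, which is what lets each rename complete as a single step.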

Step-by-step

1. Build kanade-client

cargo build --release -p kanade-client

Output: target/release/kanade-client.exe.

2. Publish the binary

kanade app publish kanade-client 0.42.0 target/release/kanade-client.exe

3. Edit `configs/jobs/installers/scripts/install-kanade-client.ps1`

Set the three knobs at the top:

$BackendBase    = 'http://kanade-backend.example.com:8080'
$Version        = '0.42.0'
$ExpectedSha256 = '<lowercase hex of kanade-client.exe>'

Also set $ClientSourceAuthToken to the backend's bearer token when auth is enabled (the same token used for the rest of /api/*). Leave it blank for dev / smoke-test setups where the /api/app-packages/kanade-client/<v> route is unauthenticated. It mirrors the $AgentSourceAuthToken knob in scripts/deploy/backend.ps1.

4. Register / update the job

configs/jobs/installers/install-kanade-client.yaml:

id: install-kanade-client
version: 0.42.0
execute:
  shell: powershell
  script_file: scripts/install-kanade-client.ps1   # body inlined at `job create` (relative to the manifest YAML)
  timeout: 180s
  run_as: system

require_approval: true

inventory:
  display:
    - { field: version, label: Version }
    - { field: path,    label: Install path }
  summary:
    - { field: version, label: Client version }

kanade job create jobs\install-kanade-client.yaml

The inventory: block tells the projector that the script's stdout is a single JSON blob whose version / path fields populate the SPA's Inventory page. Operators can spot stragglers from a fleet-wide table — no ssh needed.
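
To make the mapping concrete, here is a small sketch of how a single JSON blob on stdout could be turned into labelled inventory values. The `DISPLAY` pairs mirror the manifest above; the function itself is hypothetical and is not the projector's real code.

```python
import json

# (field, label) pairs copied from the manifest's inventory.display list.
DISPLAY = [("version", "Version"), ("path", "Install path")]


def inventory_row(stdout: str, fields=DISPLAY) -> dict:
    """Parse the script's stdout as one JSON object and pick the display fields."""
    blob = json.loads(stdout)
    return {label: blob.get(field) for field, label in fields}
```

For this to work, the install script has to print exactly one JSON object and nothing else on stdout; any extra log lines would make the parse fail.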

5. Fire it

kanade exec install-kanade-client --pcs <host> [--pcs <host> …]

Or against a group:

kanade exec install-kanade-client --groups office

6. Verify in the SPA

Open the SPA Inventory page (or query /api/inventory?app=kanade-client) and confirm the target hosts report the new version.

Updating NATS server

Updating the broker on a managed host is the most interesting case because the agent talks to the broker over the broker — stopping NATS means losing the agent's connection mid-job. The machinery handles this with two mechanisms working together:

Reconnect. The agent's NATS client reconnects automatically on broker restart. No human intervention needed.
Outbox. Job results produced while the broker is down are queued under %ProgramData%\Kanade\outbox\ and replayed once the connection comes back. The result row reaches the backend as soon as the new NATS server is up.

So the flow looks the same as for backend updates: the script stops the service, swaps the binary, and starts it again, while the agent transparently rides out the broker gap.
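
The outbox idea can be illustrated with a minimal file-backed queue. This is a sketch under assumptions: the file naming, the JSON encoding, and the `Outbox` class are invented for the example and do not describe the agent's actual on-disk format.

```python
import json
from pathlib import Path
from typing import Callable


class Outbox:
    """Durable FIFO of result records, one JSON file per record."""

    def __init__(self, root: Path) -> None:
        self.root = root
        root.mkdir(parents=True, exist_ok=True)
        existing = [int(p.stem) for p in root.glob("*.json")]
        self._next = max(existing, default=0) + 1

    def enqueue(self, result: dict) -> Path:
        """Write to a temp file, then rename, so no half-written record is seen."""
        final = self.root / f"{self._next:012d}.json"
        self._next += 1
        tmp = final.with_name(final.name + ".tmp")
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(final)
        return final

    def drain(self, publish: Callable[[dict], None]) -> int:
        """Replay records oldest first; a record is deleted only after publish returns."""
        sent = 0
        for path in sorted(self.root.glob("*.json")):
            publish(json.loads(path.read_text(encoding="utf-8")))
            path.unlink()
            sent += 1
        return sent
```

If `publish` raises because the broker is still down, the record stays on disk and the next `drain` call picks up where this one stopped.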

Caveats specific to NATS updates

Concern	Reality
Will the result row be lost?	No — outbox persists it across the broker outage and drains on reconnect.
Can I update from the SPA?	Yes, same as any job — `kanade exec install-kanade-nats --pcs <broker-host>`.
What if NATS doesn't come back up?	The result will sit in the outbox indefinitely. Operators should monitor `outbox/` on the broker host as a leading indicator.
What if the new NATS version is incompatible (JetStream upgrade etc.)?	Roll a single canary first (`--pcs <one-broker>`), watch outbox + backend health, then roll out fleet-wide. The 5-min cache TTL for SPA queries means you'll see the canary's state within a few minutes.

Manual install (bootstrap)

For the very first install — when there's no agent on the broker host yet — use the direct workflow:

.\scripts\build-release.ps1 -Roles nats       # fetches nats-server.exe
                                              # from github.com/nats-io/nats-server/releases
.\scripts\deploy\nats.ps1 -NatsToken '<token>'

This installs nats-server.exe to %ProgramFiles%\Kanade\ and nats-server.conf to %ProgramData%\Kanade\config\ (with ACL hardened to SYSTEM + Administrators because the bearer token lives in plaintext), registers the KanadeNats Windows service, opens TCP 4222 (broker) + 8222 (monitoring HTTP), and starts the service.

Agent-mediated update (steady state)

scripts/deploy/nats.ps1 ships the $AgentSource* knobs (#234), so the broker can be upgraded through the fleet — no RDP to the broker host.

1. Build / fetch nats-server.exe

Either:

.\scripts\build-release.ps1 -Roles nats   # fetches the binary

…or download it directly from github.com/nats-io/nats-server/releases.

2. Publish the binary

kanade app publish nats-server 2.10.20 .\nats-server.exe

3. Edit `scripts/deploy/nats.ps1`

Make a local copy and set the $AgentSource* knobs at the top. The pattern matches scripts/deploy/backend.ps1:

$AgentSourceUrl       = 'http://kanade-backend.example.com:8080'
$AgentSourceVersion   = '2.10.20'
$AgentSourceSha256    = '<lowercase hex of nats-server.exe>'
$AgentSourceAuthToken = '<bearer for the backend HTTP API>'

4. Publish + register + exec

kanade script publish deploy-nats 2.10.20 .\deploy-nats.edited.ps1
kanade job create jobs\install-kanade-nats.yaml
kanade exec install-kanade-nats --pcs <broker-host>

The job manifest ships at configs/jobs/installers/install-kanade-nats.yaml:

id: install-kanade-nats
version: 2.10.20
execute:
  shell: powershell
  script_object: deploy-nats/2.10.20
  timeout: 300s
  run_as: system
require_approval: true

5. Verify

After the broker comes back, the outbox drains and you'll see the result row in /api/results. Confirm the new NATS version via the broker's monitoring endpoint:

curl http://<broker>:8222/varz | python -m json.tool | rg version

Why we don't need a separate "broker update" mechanism

Earlier designs considered a dedicated bootstrap channel (parallel NATS link the agent uses just for broker updates) to avoid the self-update-over-broker chicken-and-egg. The outbox + reconnect pair makes that unnecessary: the result is "merely delayed", not "lost". One transport, one mental model.

Updating kanade-agent itself

Agent self-update is the only update path that doesn't use OBJECT_APP_PACKAGES + a script_object job. It has dedicated machinery because the agent has to swap its own running binary without ssh, which is a tighter loop than the generic install jobs.

Mechanism

Bucket / Key	Purpose
`OBJECT_AGENT_RELEASES`	Agent binaries, keyed by `<version>`. Separate from `OBJECT_APP_PACKAGES` so the rollout watcher only fires on agent updates.
`agent_config.<scope>.target_version`	The version each scope (global / group / pc) should be on. Watched by the agent's `self_update` loop.

Flow:

1. agent.self_update watches agent_config for target_version
2. If target_version != my agent_version:
   a. Pull `OBJECT_AGENT_RELEASES/<target_version>` to <exe>.new
   b. Sha-verify against the bucket's recorded digest
   c. Atomic swap: <exe> ← <exe>.new (via SCM stop/start)
   d. New binary boots, watcher arms again, loop closes
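
Steps 1 and 2 of the flow above boil down to "find my target, compare it with what I am running". The sketch below shows one way to resolve a layered value. The precedence (PC over group over global), the dict layout, and the function names are assumptions for illustration; the source only states that the config is layered.

```python
from typing import Iterable, Optional


def resolve_target_version(
    config: dict, pc_id: str, groups: Iterable[str]
) -> Optional[str]:
    """Return the most specific target_version: PC, then group, then global."""
    pc_value = config.get("pcs", {}).get(pc_id, {}).get("target_version")
    if pc_value:
        return pc_value
    for group in groups:
        group_value = config.get("groups", {}).get(group, {}).get("target_version")
        if group_value:
            return group_value
    return config.get("global", {}).get("target_version")


def needs_update(running_version: str, target_version: Optional[str]) -> bool:
    """An update is due only when a target is set and differs from what runs now."""
    return target_version is not None and target_version != running_version
```

Note that the comparison is plain inequality rather than "newer than", which is what allows a rollback to an earlier version to use the same path.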

The rollout watcher has to survive a cold broker (e.g. agent and broker boot at the same time after a host reboot). Pre-#226 a permanent Err(_) => return; on the first get_object_store call killed the watcher forever; the agent would never self-update on that boot. Post-#226 the watcher retries with backoff until the broker is reachable.
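
The post-#226 behaviour amounts to retrying with backoff instead of returning on the first error. A minimal sketch, assuming exponential backoff with a cap (the source does not specify the backoff curve, and the agent itself is not written in Python):

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    *,
    base: float = 1.0,
    cap: float = 60.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op until it succeeds, doubling the wait after each ConnectionError."""
    delay = base
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, cap)
```

With `max_attempts=None` the loop never gives up, which matches the goal of surviving a broker that boots later than the agent.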

Step-by-step

1. Build the agent

cargo build --release -p kanade-agent

Output: target/release/kanade-agent.exe.

2. Publish

kanade agent publish target/release/kanade-agent.exe

The CLI extracts the version from the PE VERSIONINFO resource — no --version flag, no chance of a label / binary mismatch.

3. Roll out

Pick a scope. Start with one canary host:

kanade agent rollout 0.42.2 --pcs canary-01

Watch via ping:

kanade ping canary-01     # agent_version should flip to 0.42.2
                          # within a few seconds

If happy, widen:

kanade agent rollout 0.42.2 --groups office --jitter 5m
# or fleet-wide
kanade agent rollout 0.42.2 --global --jitter 30m

--jitter spreads the actual swap moment across a window so a wide fan-out doesn't hammer the OS service manager on every host at once. Recommended for fleets ≥ 100 hosts.
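
Conceptually, each agent picks its own delay somewhere inside the window. The sketch below assumes a uniform distribution and a simple `<number><s|m|h>` duration syntax; only the `5m` and `30m` forms appear in this guide, so the other units and both function names are hypothetical.

```python
import random
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_window(text: str) -> int:
    """Convert a duration such as '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def jitter_delay(window: str, rng=random) -> float:
    """Pick a delay in seconds, uniformly within [0, window]."""
    return rng.uniform(0, parse_window(window))
```

With 100 hosts and a 5-minute window, uniform jitter averages roughly one swap every three seconds instead of 100 swaps at once.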

4. Verify

kanade agent current
# → target_version = 0.42.2 (global)

Then a fleet-wide spot-check via the SPA Agents page (or /api/agents): the agent_version column should converge to the new version within jitter + ~30s heartbeat cadence.

What can go wrong

Symptom	Cause	Fix
`kanade agent rollout` says "version not in OBJECT_AGENT_RELEASES"	Typo or wrong scope	Re-check with `kanade agent current` and `kanade jetstream object list agent_releases`.
`kanade ping <host>` still shows the old version after several minutes	Agent didn't self-update — either the watcher's dead (pre-#226 agent) or the host can't reach the broker	Check `%ProgramData%\Kanade\log\agent.*.log` on the target. If self_update is silent (no "checking target_version" log lines), the agent is too old; bootstrap manually with `deploy-agent.ps1`.
Agent flaps: starts, immediately exits with `exit_code: 1`	The new binary is bad on this host (config drift, missing dep, etc.). SCM's failure-actions restart it, it crashes again — observable in Event Viewer as a Service Control Manager error cluster	Roll back: `kanade agent rollout <prev-version> --pcs <host>`. The host will swap back at the next watcher tick.

Why a separate bucket / scope?

OBJECT_APP_PACKAGES is a generic blob store keyed by <name>/<version>. The agent rollout pattern needs:

A watcher that fires only on agent changes (cheap KV watch on one specific key, not a poll over a bucket of many names).
A "current target" semantic per scope, not just "all known versions" — agent_config.<scope>.target_version IS the answer to "what should I be running" without the agent enumerating.
Operator UX (kanade agent publish / rollout) that's divergent enough from kanade app publish to warrant its own subcommand tree.

So agents get OBJECT_AGENT_RELEASES + a layered config KV; the other components share OBJECT_APP_PACKAGES + per-app jobs.

Removing kanade from a host (undeploy)

Production rollback path. When a host needs to come off kanade — because a rollout broke something, the host is being decommissioned, or you just want a clean slate to re-install from — there's one undeploy script per component, mirroring the deploy script that put it there.

Component	Deploy	Undeploy
Agent	`scripts/deploy/agent.ps1`	`scripts/undeploy/agent.ps1`
Backend	`scripts/deploy/backend.ps1`	`scripts/undeploy/backend.ps1`
NATS server	`scripts/deploy/nats.ps1`	`scripts/undeploy/nats.ps1`
Client (Tauri)	`configs/jobs/installers/scripts/install-kanade-client.ps1` (agent-driven)	`scripts/undeploy/client.ps1`

All four are admin-only and idempotent — safe to re-run after a partial uninstall, safe to run when the component is already gone (each step logs "not present, skipping" and moves on).
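
Idempotence here mostly means "check before you remove". A minimal sketch of one such step, with a hypothetical helper name and log wording modelled on the message quoted above:

```python
import shutil
from pathlib import Path
from typing import Callable


def remove_if_present(path: Path, log: Callable[[str], None] = print) -> bool:
    """Remove a file or directory tree; log and skip if it is already gone."""
    if not path.exists():
        log(f"{path}: not present, skipping")
        return False
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()
    log(f"{path}: removed")
    return True
```

Running the same step twice is harmless: the second call finds nothing, logs the skip, and returns `False`.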

Default posture: safe

Run with no flags and the script:

Stops the Windows service.
Unregisters it from SCM (waits for the entry to actually disappear, so a re-deploy doesn't race a pending removal).
Removes the installed binary from %ProgramFiles%\Kanade\, including any half-completed <exe>.new / <exe>.old swap artefacts.
Removes any inbound firewall rule the deploy script created (pass -KeepFirewall to skip this — useful when an external WAF / Group Policy owns the rule).
Keeps everything under %ProgramData%\Kanade\ (config, logs, JetStream data, SQLite DB, …) so forensics / rollback / re-deploy can proceed without losing state.
Keeps registry-stored secrets at HKLM:\SOFTWARE\kanade\<role>\*.

That's enough for the common case: "this host's kanade is misbehaving, get it off without destroying state".

`-Purge`: destructive cleanup

On top of the safe default, -Purge also:

Removes the component's exclusive entries under %ProgramData%\Kanade\. Crucially, only the component's own files — agent / backend / NATS share the same root and each script avoids touching the others' files.
Removes the matching HKLM:\SOFTWARE\kanade\<role>\* key (unless -KeepSecrets is also passed — useful when multiple components share the same bearer).

Component	What `-Purge` removes
Agent	`config\agent.toml`, `logs\agent.*.log`, `outbox\`, `HKLM:\SOFTWARE\kanade\agent\`
Backend	`config\backend.toml`, `data\.db` (SQLite — historical results / inventory wiped), `logs\backend.*.log`, `HKLM:\SOFTWARE\kanade\backend\`
NATS	`config\nats-server.conf`, `nats\` (JetStream — all KV / Object Store / streams wiped), `logs\nats*.log`
Client	Nothing extra (no per-user state yet)

⚠️ scripts\undeploy\nats.ps1 -Purge and scripts\undeploy\backend.ps1 -Purge are the dangerous ones. The first wipes the fleet's entire JetStream state (agent_releases, app_packages, scripts, jobs, agent_config, results stream); the second wipes the projector's historical SQLite. Both are unrecoverable without out-of-band backups. The scripts print a loud banner before running.
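Because a purge is unrecoverable without an out-of-band backup, one option is to take a file-level copy of the state root after the safe-default undeploy has stopped the service and before re-running with -Purge. The helper below is a minimal sketch under that assumption; the function name and the destination layout are hypothetical, and it has not been validated as a restore path for JetStream or SQLite.

```powershell
# Hypothetical helper: copy the kanade state root to a timestamped folder.
# Run it only while the services are stopped, so no file is mid-write.
function Backup-KanadeState {
    param(
        [Parameter(Mandatory)] [string]$StateRoot,       # normally %ProgramData%\Kanade
        [Parameter(Mandatory)] [string]$DestinationDir   # e.g. a share or removable disk
    )
    if (-not (Test-Path -LiteralPath $StateRoot)) {
        throw "State root not found: $StateRoot"
    }
    $null = New-Item -ItemType Directory -Path $DestinationDir -Force
    $stamp  = Get-Date -Format 'yyyyMMdd-HHmmss'
    $target = Join-Path $DestinationDir "kanade-state-$stamp"
    # The target does not exist yet, so Copy-Item creates it with the root's contents.
    Copy-Item -LiteralPath $StateRoot -Destination $target -Recurse
    $target
}
```

A typical sequence would be: the default undeploy, then `Backup-KanadeState -StateRoot "$env:ProgramData\Kanade" -DestinationDir D:\backups`, then the same undeploy script with -Purge.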

Rollback recipes

Bad rollout on one canary host

# On the canary, as Admin:
.\scripts\undeploy\agent.ps1            # safe default
# kanade is now off the host. Re-deploy when ready:
.\scripts\deploy\agent.ps1 -SourceDir C:\path\to\prev-version

Decommission a host permanently

.\scripts\undeploy\agent.ps1 -Purge

Wipe a dev box for a clean re-install

.\scripts\undeploy\agent.ps1 -Purge
.\scripts\undeploy\backend.ps1 -Purge   # ⚠️ SQLite gone
.\scripts\undeploy\nats.ps1 -Purge      # ⚠️ JetStream gone
.\scripts\undeploy\client.ps1
# Now nothing about kanade exists on the box.

Rebuild a single bad service without touching state

.\scripts\undeploy\backend.ps1          # safe default: SQLite intact
.\scripts\deploy\backend.ps1 -Recreate  # fresh service registration, same data

What undeploy does NOT do

It doesn't notify the rest of the fleet that this host has gone away — the backend will keep listing it under "agents" until its heartbeat ages out (/api/agents staleness threshold). If you want it removed from the SPA immediately, delete the row via the backend API after undeploy.
It doesn't roll back the deployed binary to a previous version. "Roll back" in this script's vocabulary means "remove entirely"; if you want to swap to an older version, re-run the matching scripts/deploy/<component>.ps1 against a folder containing the older binary (as in the canary recipe above, which passes -SourceDir).
It doesn't touch NATS-side state when you remove the agent — the agent's target_version entry under agent_config.pcs.<pc> stays in the KV. Clean those up server-side with kanade jetstream kv del agent_config pcs.<pc>.target_version if needed.

Broker sizing & scaling

kanade runs on a single NATS + JetStream broker. As the fleet grows (hundreds → thousands of agents) the broker, not the agents, is the scaling bottleneck. This page covers the per-agent footprint, what the #512 work changed, the single-node limits to watch, and how to capture real numbers during a scale-up so you can decide whether to add a startup splay or grow/cluster the broker.

Per-agent consumer footprint

Each running agent holds a handful of JetStream consumers — one ordered push consumer per KV watch, plus its durable command-replay consumer:

Consumer	Source
`agent_config` watch (key-filtered)	`config_supervisor`
`agent_groups` watch (membership → effective config)	`config_supervisor`
`agent_groups` watch (membership → subscriptions)	`groups.rs`
`schedules` watch	`local_scheduler`
`jobs` watch	`local_scheduler`
`fleet_config` freeze watch (single key)	`local_scheduler`
`EXEC` durable replay	`command_replay`

That is ~7 consumers per agent. The count is roughly fixed per agent — it does not shrink with key-filtering — so the broker-side total grows linearly with the fleet:

Fleet size	~Consumers on the broker
15	~105
500	~3,500
3,000	~21,000
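The table is plain multiplication, which makes it easy to re-run for your own target fleet size. A small sketch (the function name is hypothetical; the factor of 7 is the per-agent count listed above):

```powershell
# Estimate broker-side consumers from fleet size, using ~7 consumers per agent.
function Get-BrokerConsumerEstimate {
    param(
        [Parameter(Mandatory)] [int]$FleetSize,
        [int]$ConsumersPerAgent = 7
    )
    $FleetSize * $ConsumersPerAgent
}

# Reproduce the rows of the table above.
15, 500, 3000 | ForEach-Object {
    '{0,6} agents -> ~{1} consumers' -f $_, (Get-BrokerConsumerEstimate -FleetSize $_)
}
```

If a later change merges or adds a watch, pass a different -ConsumersPerAgent instead of editing the function.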

What v0.43.96 changed (and what it did not)

#512 (shipped in v0.43.96) attacked the super-linear costs, not the consumer count:

#832 — agent_config key-filtered watch. The agent watches only global, pcs.<self>, and its groups.<g> keys instead of the whole bucket (watch_all). This removes two blow-ups:
- Per-PC write fan-out: a write to one pcs.<id> no longer reaches all N agents (it used to, with N−1 classifying-and-dropping it).
- Reconnect re-sync storm: re-sync is now 3–5 direct gets per agent instead of a keys() + per-key walk over the whole bucket — aggregate O(N²) → O(N).
#839 — single-key freeze watch. fleet_config holds only KEY_FREEZE, so the freeze watcher uses watch(KEY_FREEZE) instead of watch_all + a client-side filter.

What is unchanged: the ~7-consumers-per-agent footprint, and the schedules / jobs watch_all (those are a shared catalog every agent must evaluate for targeting — removing them needs server-side targeting, a separate change). So at 3,000 agents you still provision for ~21,000 consumers and the connection/consumer-create burst of a synchronised reconnect.
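To make the key-filtered watch concrete, the sketch below lists the agent_config keys a single agent would care about under #832. It only illustrates the key shapes named above (global, pcs.<self>, groups.<g>); the real logic lives in the Rust agent, and the function name here is hypothetical.

```powershell
# Illustration only: which agent_config keys one agent watches after #832.
function Get-WatchedConfigKeys {
    param(
        [Parameter(Mandatory)] [string]$PcId,
        [string[]]$Groups = @()
    )
    $keys = @('global', "pcs.$PcId")
    foreach ($group in $Groups) {
        $keys += "groups.$group"
    }
    $keys
}

# An agent in two groups re-syncs with four direct gets, however large the bucket is.
Get-WatchedConfigKeys -PcId 'pc-0042' -Groups 'finance', 'tokyo'
```

The per-agent key count depends only on group membership, which is why the aggregate re-sync cost grows with N rather than with N times the bucket size.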

Single-node limits and the reconnect herd

Two things to size for on a single-node JetStream:

Steady-state footprint: the ~21,000 consumers at 3,000 agents live in the JetStream meta (Raft) layer and cost memory + file handles. Size the broker host's RAM and the JetStream max_file_store / max_memory_store limits accordingly, and watch the meta layer's health.
The reconnect herd: when many agents reconnect at the same instant (broker restart, the morning power-on wave, a network event hitting many PCs), they re-establish connections and recreate their consumers in a burst. Key-filtering already cut each agent's re-sync to a few cheap gets, so the dangerous O(N²) read-storm is gone, but the connection + consumer-create burst is still O(N) and synchronised.

Note that random, unsynchronised reconnects (one laptop's wifi flap) are not a herd — only fleet-wide synchronised events are.

Levers, in order of preference

Broker sizing first. Give the single node enough RAM / file limits for the steady-state consumer count at your target N. This is the primary lever; everything else is secondary.
Startup splay, only if measured. A deterministic per-PC delay (hash(pc_id)) before the reconnect re-sync would smear the consumer-create / connection burst across a window. It is not in the product yet, by design: the key-filtered watch (#832) already removed the quadratic term, and async-nats' reconnect backoff + nats_retry's ±25% jitter already spread the burst somewhat. A splay also adds latency to every single reconnect (including herd-less blips), so it is a net cost unless a herd actually stresses the broker. Decide from data (next section): if a synchronised event shows consumer-create latency, connection backlog, or JetStream API errors, add the splay, gated to the first sync after a Disconnected → Connected transition and capped at a few seconds.
Reduce consumers per agent. Fold watches where possible (the freeze watch is already a single key; the two agent_groups watches are a candidate to merge) and keep KV history shallow (agent_config / agent_groups are at history: 1).
Cluster JetStream. Beyond what one node can hold, move to a JetStream cluster. This is the last resort and the biggest change.
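For reference, a deterministic per-PC splay of the kind described in the second lever could look like the sketch below. It is not product code: the agent is written in Rust and ships no splay today, and the choice of SHA-256, the function name, and the 3-second default window are all assumptions made for illustration.

```powershell
# Sketch: map a pc_id to a stable delay inside a fixed window.
function Get-StartupSplayMs {
    param(
        [Parameter(Mandatory)] [string]$PcId,
        [ValidateRange(1, 60000)] [int]$WindowMs = 3000   # assumed cap of a few seconds
    )
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $digest = $sha.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($PcId))
    } finally {
        $sha.Dispose()
    }
    # Use the first four digest bytes as an unsigned integer, folded into the window.
    $bucket = [System.BitConverter]::ToUInt32($digest, 0)
    [int]([uint64]$bucket % [uint64]$WindowMs)
}

# The same pc_id always yields the same delay; different ids spread across the window.
'pc-0001', 'pc-0002', 'pc-0003' | ForEach-Object {
    '{0}: {1} ms' -f $_, (Get-StartupSplayMs -PcId $_)
}
```

Hashing the id, rather than drawing a random number, keeps each PC's slot stable across reconnects, so a fleet-wide event lands in the same smeared pattern every time and is easier to reason about in the collected samples.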

Measuring at scale (retro-analysis)

You usually cannot watch a production broker live during a ramp. Capture the numbers instead, with the bundled collect: job, and review the bundle afterwards.

Run it on the backend / NATS host, ideally during or right after a synchronised reconnect event (the herd moment is what decides the splay question):

kanade exec collect-broker-health --pcs <backend-host-id>

The job (configs/jobs/collect-broker-health.yaml) samples the broker over ~3 minutes and uploads a bundle to OBJECT_COLLECTIONS; download it from the SPA Collect page (or hand the zip to your reviewer). It is read-only and needs zero pre-setup: it reads NATS' unauthenticated HTTP monitoring port (default 8222 — the same /jsz endpoint kanade already curls for jetstream status), so no nats CLI on the SYSTEM PATH and no token are required. Tune the window with KANADE_BH_SAMPLES / KANADE_BH_INTERVAL_SEC (and KANADE_BH_MON_PORT if the broker's http_port differs) on the target if needed.

The bundle contains:

Time series (connz-*.json, jsz-*.json) — connection count (/connz) and JetStream consumer count (/jsz) per sample. A spike here at the herd moment is the splay signal.
Consumer footprint (jsz-full.json) — /jsz?consumers=true&streams=true: the full consumer list; confirms the ~7N total and which streams hold them.
Server health/resources (varz.json, healthz.json) — /varz (memory, CPU, connections, slow consumers) and /healthz status.
Log tails (redacted) — backend and nats-server.

What to look for

Smooth consumer/connection counts across the samples, healthy /healthz, head-room on mem/cpu in /varz → the broker absorbed the event; no splay needed, just keep sizing ahead of N.
Spiky connection backlog or consumer-create latency at the event, JetStream API errors, or mem/cpu pegged → the herd is real → add the startup splay (lever 2) and/or grow the broker.
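When reviewing an unpacked bundle, a couple of small helpers can turn the time-series files into a min/max summary. This is a sketch under two assumptions: the sample files hold the raw monitoring responses, and the counts are read from the `consumers` field of /jsz and the `num_connections` field of /connz. The function names are hypothetical.

```powershell
# Load the per-sample JSON files of one series from an unpacked bundle.
function Get-BundleSamples {
    param(
        [Parameter(Mandatory)] [string]$BundleDir,
        [Parameter(Mandatory)] [string]$Filter        # 'jsz-*.json' or 'connz-*.json'
    )
    Get-ChildItem -Path $BundleDir -Filter $Filter |
        Where-Object { $_.Name -ne 'jsz-full.json' } |   # full dump, not a time-series sample
        Sort-Object Name |
        ForEach-Object { Get-Content -LiteralPath $_.FullName -Raw | ConvertFrom-Json }
}

# Summarise how far one numeric field moved across the samples.
function Measure-SampleSpread {
    param(
        [Parameter(Mandatory)] [object[]]$Samples,
        [Parameter(Mandatory)] [string]$Property
    )
    $stats = $Samples | Measure-Object -Property $Property -Minimum -Maximum
    [pscustomobject]@{
        Property = $Property
        Samples  = $stats.Count
        Min      = $stats.Minimum
        Max      = $stats.Maximum
        Spread   = $stats.Maximum - $stats.Minimum
    }
}
```

For example, `Measure-SampleSpread -Samples (Get-BundleSamples -BundleDir .\bundle -Filter 'jsz-*.json') -Property consumers` summarises the consumer count, and the same call with 'connz-*.json' and num_connections covers connections. A small spread matches the "smooth" case above; a large one at the herd moment is the signal worth a closer look.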

The #828 downgrade-flap regression is checked separately from the OBS_EVENTS agent_update timeline (via the backend API), not by this job — that data is already queryable fleet-wide.

Writing scripts for the agent

PowerShell scripts the agent will run are almost normal .ps1 files. This page collects the gotchas that aren't obvious from the script source alone.

The agent stages scripts on disk and runs them via `-File`

As of PR #230 (agent version 0.42.0+), the agent:

Writes your script body to a temp .ps1 under %ProgramData%\Kanade\agent-scripts\<UUID>\kanade-<UUID>.ps1 (Windows) or $TMPDIR/kanade-agent-<UUID>/kanade-<UUID>.ps1 (non-Windows dev only).
Writes a launcher .ps1 next to it that sets UTF-8 console encoding then & '<your-script>' @args.
Spawns powershell -NoProfile -NonInteractive -ExecutionPolicy Bypass -File <launcher>.

This means your script:

Can have [CmdletBinding()] and param(...) at the top. The call-operator boundary in the launcher gives your script its own scope where those headers are valid.
Should not rely on $PSCommandPath matching the operator's source path — it'll be the staged temp file.
Should not write to $PSScriptRoot (see next section).

Pre-0.42.0 agents used powershell -Command "<body>", which parses the body as a command-line expression and rejects [CmdletBinding()] as a syntax error. If you see "Unexpected token '[CmdletBinding()]'" in stderr, the host's agent is too old — upgrade it (see agent self-update).
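A minimal script that relies on the `-File` staging might look like this. The parameter is hypothetical and has a default, so the script also runs when no arguments are passed; how arguments reach `@args` from a manifest is not covered here.

```powershell
[CmdletBinding()]
param(
    # Hypothetical parameter; the default keeps the script runnable with no arguments.
    [string]$Name = [System.Environment]::MachineName
)

$ErrorActionPreference = 'Stop'

# $PSCommandPath points at the staged temp copy, not at the operator's source file.
[Console]::Error.WriteLine("running from: $PSCommandPath")

Write-Output "hello from $Name"
```

On a pre-0.42.0 agent this same body would fail at the first line with the "Unexpected token" error described above, which makes it a handy smoke test for agent age.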

`$PSScriptRoot` is read-only for `run_as: user`

When run_as: user, the child process runs as the logged-in user, not as the LocalSystem agent that wrote the staged file (system_gui still runs as LocalSystem; see the identity table). The staging directory inherits its ACL from %ProgramData%, which grants users Read & Execute but not Modify.

That means:

# OK from any run_as
Get-ChildItem $PSScriptRoot              # list contents
Get-Content   $PSScriptRoot\anything     # read

# Fails under run_as: user (access denied)
New-Item    -ItemType File -Path $PSScriptRoot\out.txt
Set-Content -Path $PSScriptRoot\log.log -Value 'started'

Write to $env:TEMP, $env:LOCALAPPDATA, or an absolute path under the user's profile instead. Even for run_as: system (where SYSTEM can write to its own staged dir), the directory is cleaned up when the script exits, so writing siblings is fragile either way.
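One way to follow that advice is to give each run its own scratch directory under the temp path and clean it up yourself. A minimal sketch, with a hypothetical directory prefix and file name:

```powershell
# Per-run scratch directory under the current identity's temp path.
# On Windows, GetTempPath() resolves to the same location as $env:TEMP.
$workDir = Join-Path ([System.IO.Path]::GetTempPath()) ('kanade-job-' + [guid]::NewGuid().ToString('N'))
$null = New-Item -ItemType Directory -Path $workDir

try {
    $logPath = Join-Path $workDir 'job.log'
    Set-Content -LiteralPath $logPath -Value "started $(Get-Date -Format o)" -Encoding UTF8
    # ... real work that reads and writes under $workDir goes here ...
} finally {
    # Runs on normal exit and on terminating errors, but not when the agent hard-kills the process.
    Remove-Item -LiteralPath $workDir -Recurse -Force -ErrorAction SilentlyContinue
}
```

The `$null =` in front of New-Item keeps the created directory object out of stdout, which matters for inventory scripts.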

Identity table

`run_as:` (manifest)	Child identity	Reads `$PSScriptRoot`	Writes `$PSScriptRoot`	Has admin
`system` (default)	LocalSystem	✓	✓ but pointless (GC'd)	yes
`user`	Logged-in user	✓	✗ access denied	no
`system_gui`	LocalSystem, in user session	✓	✓ but pointless (GC'd)	yes

system_gui is the "PsExec -i -s" pattern — admin privilege but visible in the user's desktop session (useful for GUI tools that need both elevation and an interactive window).

stdout vs Write-Host

The backend's result projector reads stdout as the script's output. If your manifest has an inventory: block, stdout is parsed as a single JSON blob.

Do NOT use Write-Host for progress chatter in an inventory script. Contrary to a common assumption, Write-Host does NOT stay on a separate host stream once the agent runs your script with stdout redirected — its output bleeds INTO the captured stdout. That extra text breaks the projector's single-JSON-blob parse (serde_json::from_str over the whole stdout, in crates/kanade-backend/src/projector/results.rs::upsert_inventory), and your inventory fact is silently dropped (the backend logs stdout was not JSON).

Send progress chatter to stderr via [Console]::Error.WriteLine(...). stderr is captured into the result's separate stderr field, which the projector ignores, so stdout stays a single clean JSON line.

[Console]::Error.WriteLine("Downloading...")  # → stderr (logged, ignored by projector)
Write-Output ($obj | ConvertTo-Json)          # → stdout (the ONE JSON line, parsed)

Keep stdout to exactly one Write-Output — anything else on stdout (a stray Write-Host, Write-Output, or a cmdlet's pipeline output) that isn't the expected JSON will fail the inventory parse. See configs/jobs/installers/scripts/install-kanade-client.ps1 for a worked example.
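Putting those rules together, an inventory script body could be shaped like the sketch below. The fact names (hostname, ps_version, collected_at) are hypothetical and not taken from any kanade schema; the point is the stream discipline, not the payload.

```powershell
$ErrorActionPreference = 'Stop'

# Progress goes to stderr, never to stdout.
[Console]::Error.WriteLine('collecting facts...')

# Discard output from cmdlets that would otherwise leak into the pipeline (stdout).
$scratch = Join-Path ([System.IO.Path]::GetTempPath()) 'kanade-inventory-scratch'
$null = New-Item -ItemType Directory -Path $scratch -Force

# Hypothetical facts; replace with whatever your inventory: block expects.
$facts = [ordered]@{
    hostname     = [System.Environment]::MachineName
    ps_version   = $PSVersionTable.PSVersion.ToString()
    collected_at = (Get-Date).ToUniversalTime().ToString('o')
}

# -Compress keeps the JSON on a single line; this is the only write to stdout.
Write-Output ($facts | ConvertTo-Json -Compress)
```

A quick local check before shipping is to run the script with stdout redirected to a file and confirm the file parses with ConvertFrom-Json.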

UTF-8 by default

The launcher sets [Console]::OutputEncoding = UTF-8 and $OutputEncoding = UTF-8 before invoking your script, so any stdout / stderr you produce is UTF-8 regardless of the host's system codepage. Operator-shipped scripts with Japanese, German, Korean, or Chinese strings show up correctly in the SPA Activity view without per-host workarounds.

If you explicitly need OEM / CP932 / Shift-JIS output (e.g. calling a legacy CLI that ignores $OutputEncoding), set it yourself in the script after the launcher prelude has run — your assignment takes precedence.
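If only one legacy call needs a different code page, it is usually safer to switch encoding around that call and restore it afterwards, so the rest of the script keeps emitting UTF-8. A sketch for Windows PowerShell 5.1, using code page 932 as the example and `cmd.exe /c ver` as a stand-in for the legacy CLI (both are illustrative choices):

```powershell
# Remember the encoding the launcher prelude configured (UTF-8).
$previous = [Console]::OutputEncoding

try {
    # Code page 932 (Shift-JIS) for the duration of the legacy call only.
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(932)
    $legacyOutput = & cmd.exe /c ver     # stand-in for a CLI that ignores $OutputEncoding
} finally {
    # Restore UTF-8 so everything written after this point stays consistent.
    [Console]::OutputEncoding = $previous
}

# $legacyOutput now holds decoded strings and can be re-emitted as UTF-8.
[Console]::Error.WriteLine(($legacyOutput -join ' '))
```

Capturing the output into a variable and re-emitting it after the restore avoids mixing two encodings in the same stdout stream.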

Native command exit codes

If your script ends with a successful native command run, the overall exit is 0 — that's PowerShell's default. If a native command fails ($LASTEXITCODE -ne 0) and you DON'T handle it, PowerShell still exits 0 — $ErrorActionPreference = 'Stop' does not save you here.

Windows PowerShell 5.1 (the default on Windows endpoints — and what the agent's powershell.exe resolves to) treats native command non-zero exits as non-terminating regardless of $ErrorActionPreference. PowerShell 7.3+ adds $PSNativeCommandUseErrorActionPreference = $true which makes them terminating, but that's not available in the deployment target. Always check $LASTEXITCODE explicitly.

The agent does NOT auto-propagate $LASTEXITCODE either — that would exit nonzero even when your script handled the native error gracefully. If you want the script's exit code to reflect a specific native call, propagate it yourself:

& git pull
if ($LASTEXITCODE -ne 0) { throw "git pull failed with exit code $LASTEXITCODE" }
# or, if you want the exact native code propagated:
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }

throw is usually preferable because it produces a clean PowerShell error record (which the trap { … break } cleanup pattern can intercept) and exits non-zero. exit $LASTEXITCODE is right when the caller cares about the exact code.
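If a script makes several native calls, wrapping the check in a helper keeps it from being forgotten. The function below is a sketch with a hypothetical name; it works on Windows PowerShell 5.1 and does not depend on $PSNativeCommandUseErrorActionPreference.

```powershell
# Run a native command and throw if it exits non-zero.
function Invoke-Native {
    param(
        [Parameter(Mandatory)] [string]$FilePath,
        [string[]]$ArgumentList = @()
    )
    # Clear any stale value so a command that never started is not mistaken for success.
    $global:LASTEXITCODE = $null
    & $FilePath @ArgumentList
    if ($LASTEXITCODE -ne 0) {
        throw "$FilePath exited with code $LASTEXITCODE"
    }
}

# Usage: the second call is only reached when the first one exited 0.
# Invoke-Native -FilePath 'git' -ArgumentList 'fetch', '--all'
# Invoke-Native -FilePath 'git' -ArgumentList 'pull'
```

The command's own output still flows to the pipeline, so in an inventory script capture it (`$out = Invoke-Native ...`) or redirect it instead of letting it reach stdout.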

Timeouts

The manifest's timeout: is enforced by the agent. When it fires, the agent calls child.kill() on the PowerShell process — no graceful shutdown, no trap, no finally. Plan for it:

Budget the script to finish within about 60% of the timeout (timeout * 0.6) and keep the rest as headroom.
Use trap { ... ; break } for cleanup of resources that need explicit release (staging dirs, lock files) — trap fires on terminating errors, NOT on the agent's kill. Don't rely on it for the timeout case.
If you need cooperative cancellation, poll a sentinel file or a registry value and exit early. The agent has no way to send the script a graceful "wrap up" signal.
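The budget and the sentinel check can be combined in one loop that tests both before each unit of work. This is a sketch: the function name and the sentinel path convention are hypothetical, and -TimeoutSec has to be kept in step with the manifest's timeout: by hand.

```powershell
# Process work items until done, a sentinel file appears, or 60% of the timeout is used.
function Invoke-BudgetedWork {
    param(
        [Parameter(Mandatory)] [object[]]$WorkItems,
        [Parameter(Mandatory)] [scriptblock]$Action,
        [Parameter(Mandatory)] [int]$TimeoutSec,        # mirror of the manifest timeout:
        [Parameter(Mandatory)] [string]$SentinelPath    # hypothetical "please stop" file
    )
    $deadline  = (Get-Date).AddSeconds($TimeoutSec * 0.6)
    $completed = 0
    foreach ($item in $WorkItems) {
        if (Test-Path -LiteralPath $SentinelPath) {
            [Console]::Error.WriteLine('stop requested via sentinel, exiting early')
            break
        }
        if ((Get-Date) -ge $deadline) {
            [Console]::Error.WriteLine('time budget exhausted, exiting early')
            break
        }
        & $Action $item | Out-Null   # keep the action's output off stdout
        $completed++
    }
    $completed   # number of items that were fully processed
}
```

Checking between items only helps if each item is short relative to the timeout; a single long-running call inside the action still ends with the agent's hard kill.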

Killing a running job

kanade kill <exec_id> publishes a kill message the agent subscribes to. On receipt, the agent calls child.kill() — same hard-kill as the timeout path. Operators get an immediate result row marked Killed with whatever stdout / stderr the agent managed to capture before termination.

Developer Workflow and Contribution

This document outlines the standard workflows, lint/test requirements, and VCS branching guidelines for contributors working on the kanade codebase.

1. Quality Gates (Pre-Push & CI)

The local test suite must be green before you push or submit PRs. This matches the automated checks running on GitHub Actions.

# Run formatting check, clippy checks, target tests, and cargo lock checks
cargo make check

FMT & Clippy: We maintain a strict zero-warning policy. Do not sprinkle #[allow(clippy::...)] unless there is a strong architectural justification.
TDD (Test-Driven Development): Follow Kent Beck's TDD methodology. Write failing tests first to define the what, then implement the code to satisfy them.

2. Worktree Management with `renri`

For isolated feature development, we use renri to manage lightweight repository worktrees. This prevents staging pollution, keeps your main checkout clean, and allows you to switch tasks instantly.

Why `renri`?

In co-located Git and Jujutsu (jj) environments, managing worktrees manually can get complex. renri simplifies this by automatically wrapping VCS-specific worktree creation (favoring jj when configured) and cleanups.

Common Commands

# Create an isolated worktree (uses Jujutsu by default if present)
renri add feat/your-awesome-feature

# Force a Git-native worktree (bypassing jj)
renri --vcs git add feat/your-awesome-feature

# Clean up and delete a worktree after merging
renri remove feat/your-awesome-feature

# Garbage-collect and prune stale or broken worktrees
renri prune

Note: Worktree creation automatically invokes the cargo-make on-add hook to fetch remote refs and bootstrap APM configurations immediately.

3. Co-located Jujutsu (jj) & Git Workflow

Our development environment is configured with co-located Git and Jujutsu. We prefer jj for local version control because its commit model is safe to experiment with: conflicts are recorded in commits instead of blocking an operation.

Guidelines

No Direct Push to main: All changes must land via a Pull Request.
Branch/Bookmark Naming:
- feat/... for new features.
- fix/... for bugs.
- chore/... for infrastructure, dependency bumps, or releases.
Commit Messages: Write commit messages, PR titles, and bodies in English.
Version Bumps: Release version bumps are managed exclusively via PRs on main and automated tagging pipelines. Never run git tag manually.

4. Documentation Policy

Documentation must stay in lock-step with code changes. Whenever you add or modify features:

Update docstrings and comments explaining the why (avoid comments restating how).
Update the relevant book pages (written in English under book/src/).
Synchronize localization catalogs by running the translation template generator.

Spec (legacy single-page)

The full protocol / on-wire spec hasn't been migrated into the book yet. The authoritative source is the single-file version in the repo:

docs/SPEC.md on GitHub

Splitting it into chapters under this section is a follow-up once the rest of the operator / developer guides settle.

Configuration Reference

kanade services rely on structured configurations loaded from TOML files, environment variables, or registry paths.

1. Agent Configuration

The agent searches for its configuration via the KANADE_AGENT_CONFIG environment variable or falls back to native paths.

Dev Configuration (`configs/agent.dev.toml`)

# Dev configuration schema
[agent]
id = "dev-pc"
nats_url = "nats://localhost:4223"
data_dir = "target/dev-data/agent"

[log]
level = "debug"
file = "target/dev-data/agent/logs/agent.log"

Configuration Parameters

Field	Type	Description	Environment Override
`agent.id`	String	Unique hardware identifier (`pc_id`).	`KANADE_DEV_AGENT_ID` (templated)
`agent.nats_url`	String	Network address of the NATS broker.	`KANADE_NATS_URL`
`agent.data_dir`	Path	Root path to cache outbox scripts, state database, and local completions.	`KANADE_AGENT_DATA_DIR`
`log.level`	String	Logging verbosity (`error`, `warn`, `info`, `debug`, `trace`).	`RUST_LOG`
`log.file`	Path	Filepath destination for rolling logs.	-

2. Backend Configuration

The backend reads its configuration from the file named by KANADE_BACKEND_CONFIG, or falls back to built-in defaults.

Dev Configuration (`configs/backend.dev.toml`)

[backend]
listen_addr = "127.0.0.1:8081"
nats_url = "nats://localhost:4223"
database_url = "sqlite://target/dev-data/backend/state.db"

[auth]
# Auth settings

Configuration Parameters

Field	Type	Description	Environment Override
`backend.listen_addr`	String	Network bind address for HTTP/WebSocket traffic.	`KANADE_BIND_ADDR`
`backend.nats_url`	String	Target NATS broker URL.	`KANADE_NATS_URL`
`backend.database_url`	String	SQLite database connection string.	`DATABASE_URL`
`auth.disable`	Boolean	Set to true to disable operator token validation (dev environment only).	`KANADE_AUTH_DISABLE`

3. Windows Registry Integration

In production environments, security-sensitive tokens (like NATS client tokens and administrative API bearer tokens) are stored in ACL-protected Windows Registry keys rather than in plaintext files.

Key Paths

Agent Settings: HKLM:\SOFTWARE\Kanade\agent
Backend Settings: HKLM:\SOFTWARE\Kanade\backend

These registry paths are protected with local ACL configurations, allowing read permissions strictly to SYSTEM and designated operators.
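To check what is stored and who can read it without printing any secret, an elevated session can list just the value names and the ACL entries. A sketch using the agent key path given above; the value names themselves are whatever your deployment wrote, and none are assumed here.

```powershell
# Inspect a kanade registry key: value names and ACL only, never the secret values.
$keyPath = 'HKLM:\SOFTWARE\Kanade\agent'

if (Test-Path -Path $keyPath) {
    # Names of the values stored under the key.
    (Get-Item -Path $keyPath).Property

    # Who may read or change the key.
    (Get-Acl -Path $keyPath).Access |
        Select-Object IdentityReference, RegistryRights, AccessControlType
} else {
    [Console]::Error.WriteLine("not present: $keyPath")
}
```

Swap the path for HKLM:\SOFTWARE\Kanade\backend to inspect the backend's key the same way.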