Testing and Evals

OpenCandle has four validation layers:

  • deterministic unit tests for pure logic, request-understanding fixtures, providers with mocked fetch, and GUI state helpers
  • focused end-to-end tests for CLI, credential flows, and live provider/tool behavior
  • browser smoke tests for the local GUI
  • evals that run full OpenCandle sessions through the shared agent harness and score product behavior, task selection, evidence use, or competitive performance

Baseline Checks

Run these before treating a checkout as healthy:

npm test
npm run gui:web:build
npm run docs:site:build

npm test runs the Vitest suite. Unit tests should be fixture-backed and should not call live APIs.

End-to-End Tool Tests

npm run test:e2e
npm run test:e2e:cli
npm run test:e2e:credential-prompt
npm run test:e2e:credential-snooze
npm run test:e2e:credential-soft-fallback
npm run test:e2e:credential-per-workflow-cap

npm run test:e2e intentionally hits live APIs through focused tool checks. The provider matrix is broader and also live:

npm run test:e2e:providers

Only run these when live network/API behavior is part of the validation goal.

Eval Commands

OpenCandle separates deterministic tests from opt-in evals because evals may depend on model credentials, live data, local agent CLIs, or longer-running traces.

npm run test:evals
npm run test:evals:usually
npm run eval:router-live
npm run test:evals:product
npm run test:evals:competitive

| Command | What it runs | When to use it |
| --- | --- | --- |
| npm run test:evals | Vitest eval cases under tests/evals/cases/**/*.eval.ts | Deterministic or semi-deterministic scoring cases that should run as a suite. |
| npm run test:evals:usually | Same Vitest eval suite with EVAL_TIER=usually | The common eval tier when you want the usual subset rather than every case. |
| npm run eval:router-live | tests/scripts/run-live-router-eval.ts against request-understanding fixtures with a live model | Opt-in task-selection quality check. Requires live model credentials and compares live output to fixture expectations. |
| npm run test:evals:product | tests/scripts/run-product-evals.ts | Full-session product evals over curated finance prompts, using the OpenCandle harness and rubric-style dimensions. |
| npm run test:evals:competitive | tests/scripts/run-competitive-finance-eval.ts | Competitive finance benchmark against generic no-tool Claude, Codex, and Gemini baselines. See Benchmarking. |

Eval reports are written under tests/evals/runs/ when a runner produces a JSON report. Treat those run files as local evidence, not committed documentation.

Product Evals

Product evals run curated prompts through runOpenCandleSession() and score the resulting trace for investigation fit, tool usage, directness, evidence use, risk framing, horizon fit, and honest handling of missing data.

Prompt families currently include:

  • single_asset
  • compare_assets
  • portfolio
  • options
  • sentiment
  • macro
  • education

Run all product evals:

npm run test:evals:product

Useful environment variables:

  • PRODUCT_EVAL_CASE: run one case by id, such as compare-assets-aapl-msft-6mo.
  • PRODUCT_EVAL_FAMILY: run one family, such as portfolio or macro.
  • PRODUCT_EVAL_LIMIT: run only the first N selected cases.

Example:

PRODUCT_EVAL_FAMILY=options PRODUCT_EVAL_LIMIT=1 npm run test:evals:product
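One way to picture how these variables narrow a run is as successive filters followed by a cap. The TypeScript below is a minimal sketch under that assumption, not the runner's actual code: the `ProductEvalCase` type, the `selectCases` function, and the two `options-demo-*` ids are hypothetical. Only the variable names and the `compare-assets-aapl-msft-6mo` id come from this page.

```typescript
// Hypothetical case shape; the real runner's types may differ.
interface ProductEvalCase {
  id: string;
  family: string;
}

type Env = Record<string, string | undefined>;

// Assumed behavior: every filter that is set applies, and the limit
// is taken last, over the cases that survived the filters.
function selectCases(cases: ProductEvalCase[], env: Env): ProductEvalCase[] {
  let selected = cases;
  if (env.PRODUCT_EVAL_CASE) {
    selected = selected.filter((c) => c.id === env.PRODUCT_EVAL_CASE);
  }
  if (env.PRODUCT_EVAL_FAMILY) {
    selected = selected.filter((c) => c.family === env.PRODUCT_EVAL_FAMILY);
  }
  // An unset or non-numeric limit parses to NaN and is ignored.
  const limit = Number.parseInt(env.PRODUCT_EVAL_LIMIT ?? "", 10);
  if (Number.isInteger(limit) && limit > 0) {
    selected = selected.slice(0, limit);
  }
  return selected;
}

const demoCases: ProductEvalCase[] = [
  { id: "compare-assets-aapl-msft-6mo", family: "compare_assets" },
  { id: "options-demo-a", family: "options" }, // made-up id
  { id: "options-demo-b", family: "options" }, // made-up id
];

const picked = selectCases(demoCases, {
  PRODUCT_EVAL_FAMILY: "options",
  PRODUCT_EVAL_LIMIT: "1",
});
console.log(picked.map((c) => c.id));
```

Under these assumptions the limit counts selected cases, which matches the "first N selected cases" wording above.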

Each run writes a timestamped *_product-evals.json report under tests/evals/runs/.

GUI Browser Smoke

Run the GUI in one terminal:

npm run gui

Then run the browser smoke test in another terminal:

npm run test:gui:browser

Set OPENCANDLE_GUI_URL to target a non-default local URL. GUI smoke testing should cover desktop and mobile widths when UI behavior changes.

For visual or GUI behavior changes, also build the web bundle:

npm --workspace @opencandle/gui-web run build

At minimum, exercise prompts that render stock quotes, quote comparisons, options chains, SEC filings, macro/FRED data, and news/search, and confirm that the matching tool cards and the financial context panel render from saved session state.

Agent Harness

The file-based harness lets another coding agent drive OpenCandle as a simulated user and inspect the resulting trace.

npx tsx tests/harness/cli.ts run --prompt "What is AAPL trading at?" --ipc /tmp/oc-test &
npx tsx tests/harness/cli.ts wait --ipc /tmp/oc-test
npx tsx tests/harness/cli.ts trace --ipc /tmp/oc-test

If the run pauses to ask a question, answer it with the same --ipc path:

npx tsx tests/harness/cli.ts answer --ipc /tmp/oc-test --value "Moderate"

The final trace.json includes tool calls, results, interactions, final text, duration, and OpenCandle custom entries such as workflow dispatch, request-understanding output, disclaimers, and degradation notes.
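When another agent inspects a trace, it usually wants a compact summary first. The sketch below shows one way to do that in TypeScript. The field names (`toolCalls`, `finalText`, `durationMs`, `customEntries`), the `"degradation"` entry type, and the `get_quote` tool name are all assumptions for illustration; check an actual trace.json for the real schema before relying on any of them.

```typescript
// Hypothetical trace shape: field names are illustrative, not the harness schema.
interface TraceSketch {
  toolCalls: { name: string }[];
  finalText: string;
  durationMs: number;
  customEntries: { type: string; note?: string }[];
}

interface TraceSummary {
  toolCallCounts: Record<string, number>;
  degradationNotes: string[];
  hasFinalText: boolean;
}

// Count tool calls per tool name and collect any degradation notes.
function summarizeTrace(trace: TraceSketch): TraceSummary {
  const toolCallCounts: Record<string, number> = {};
  for (const call of trace.toolCalls) {
    toolCallCounts[call.name] = (toolCallCounts[call.name] ?? 0) + 1;
  }
  const degradationNotes = trace.customEntries
    .filter((entry) => entry.type === "degradation")
    .map((entry) => entry.note ?? "");
  return {
    toolCallCounts,
    degradationNotes,
    hasFinalText: trace.finalText.trim().length > 0,
  };
}

// Synthetic trace used only to exercise the sketch.
const sampleTrace: TraceSketch = {
  toolCalls: [{ name: "get_quote" }, { name: "get_quote" }],
  finalText: "Synthetic final answer.",
  durationMs: 1200,
  customEntries: [{ type: "degradation", note: "synthetic note" }],
};

const traceSummary = summarizeTrace(sampleTrace);
console.log(traceSummary);
```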

Request-Understanding Fixtures

Request-understanding fixtures live in tests/fixtures/router/ and are included in npm test.

Use them when changing:

  • src/routing/router-prompt.ts
  • src/routing/router.ts
  • task-selection model choice
  • multi-turn context handling
  • preference extraction or slot resolution

The live fixture eval is opt-in:

npm run eval:router-live

It uses OPENCANDLE_ROUTER_PROVIDER and OPENCANDLE_ROUTER_MODEL when set. The defaults are anthropic and claude-haiku-4-5, so the run needs Anthropic credentials unless you override those variables to point at another provider and model.
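The override-with-defaults behavior described above can be sketched in a few lines of TypeScript. The `resolveRouterConfig` function and `RouterConfig` type are hypothetical names, and treating an empty string as unset is an assumption. The variable names and default values are the ones stated on this page.

```typescript
interface RouterConfig {
  provider: string;
  model: string;
}

// Hypothetical helper: fall back to the documented defaults when a
// variable is unset (assumption: an empty string also counts as unset).
function resolveRouterConfig(env: Record<string, string | undefined>): RouterConfig {
  return {
    provider: env.OPENCANDLE_ROUTER_PROVIDER || "anthropic",
    model: env.OPENCANDLE_ROUTER_MODEL || "claude-haiku-4-5",
  };
}

const defaultRouterConfig = resolveRouterConfig({});
const overriddenRouterConfig = resolveRouterConfig({
  OPENCANDLE_ROUTER_PROVIDER: "example-provider", // placeholder value
  OPENCANDLE_ROUTER_MODEL: "example-model", // placeholder value
});
console.log(defaultRouterConfig, overriddenRouterConfig);
```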

Treat task-selection mismatches as regressions even when the aggregate pass rate looks acceptable.
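To make that rule concrete, the sketch below contrasts an aggregate pass rate with a per-fixture check. It is illustrative TypeScript only: `RouterFixtureResult`, `findMismatches`, `passRate`, `assertNoMismatches`, and the task and fixture names are hypothetical, not part of the live eval script.

```typescript
// Hypothetical result record for one fixture compared against live output.
interface RouterFixtureResult {
  fixtureId: string;
  expectedTask: string;
  actualTask: string;
}

function findMismatches(results: RouterFixtureResult[]): RouterFixtureResult[] {
  return results.filter((r) => r.expectedTask !== r.actualTask);
}

function passRate(results: RouterFixtureResult[]): number {
  if (results.length === 0) return 1;
  return (results.length - findMismatches(results).length) / results.length;
}

// Fail on any mismatch instead of thresholding on the aggregate rate.
function assertNoMismatches(results: RouterFixtureResult[]): void {
  const mismatches = findMismatches(results);
  if (mismatches.length > 0) {
    const ids = mismatches.map((m) => m.fixtureId).join(", ");
    throw new Error(`task-selection regression in: ${ids}`);
  }
}

// Nine matching fixtures and one mismatch (all names are made up).
const routerResults: RouterFixtureResult[] = [];
for (let i = 0; i < 9; i++) {
  routerResults.push({ fixtureId: `demo-${i}`, expectedTask: "task_a", actualTask: "task_a" });
}
routerResults.push({ fixtureId: "demo-9", expectedTask: "task_a", actualTask: "task_b" });

console.log(passRate(routerResults)); // high aggregate rate, yet one fixture regressed
try {
  assertNoMismatches(routerResults);
} catch (err) {
  console.log((err as Error).message);
}
```

A threshold on `passRate` alone would let the single mismatch through, which is the situation the rule above is meant to prevent.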

Test Data Rules

  • Mock globalThis.fetch in unit tests.
  • Store response fixtures under tests/fixtures/<provider>/.
  • Do not commit real account balances, names, or exact holdings in fixtures.
  • Preserve classification-relevant signal such as tickers, horizons, and risk phrasing.
  • Keep live API checks out of the default unit test path.
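The first rule can be illustrated with a small TypeScript sketch that swaps globalThis.fetch for a fixture-backed stub and restores it afterwards. Everything named here is an assumption for illustration: `fetchLastPrice`, `withMockedFetch`, the `provider.invalid` URL, and the `{ price }` response shape are placeholders, not OpenCandle code. In the Vitest suite, `vi.stubGlobal("fetch", ...)` is another way to get the same effect. The sketch assumes a runtime with global `fetch` and `Response`, such as Node 18 or later.

```typescript
// Hypothetical provider helper; the URL and response shape are placeholders.
async function fetchLastPrice(symbol: string): Promise<number> {
  const res = await fetch(
    `https://provider.invalid/quote?symbol=${encodeURIComponent(symbol)}`,
  );
  if (!res.ok) throw new Error(`quote request failed with status ${res.status}`);
  const body = (await res.json()) as { price: number };
  return body.price;
}

// Replace globalThis.fetch with a stub that always returns the fixture,
// run the callback, then restore the original even if the callback throws.
async function withMockedFetch<T>(fixture: unknown, run: () => Promise<T>): Promise<T> {
  const original = globalThis.fetch;
  globalThis.fetch = (async () =>
    new Response(JSON.stringify(fixture), {
      status: 200,
      headers: { "content-type": "application/json" },
    })) as typeof fetch;
  try {
    return await run();
  } finally {
    globalThis.fetch = original;
  }
}

// Synthetic fixture: no real account data, but the ticker signal is kept.
const quoteFixture = { price: 123.45 };
const mockedPrice = withMockedFetch(quoteFixture, () => fetchLastPrice("AAPL"));
mockedPrice.then((price) => console.log(price));
```

Because the stub never touches the network, a test built this way stays inside the default unit test path.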