Testing and Evals
OpenCandle has four validation layers:
- deterministic unit tests for pure logic, request-understanding fixtures, providers with mocked fetch, and GUI state helpers
- focused end-to-end tests for CLI, credential flows, and live provider/tool behavior
- browser smoke tests for the local GUI
- evals that run full OpenCandle sessions through the shared agent harness and score product behavior, task selection, evidence use, or competitive performance
Baseline Checks
Run these before treating a checkout as healthy:
npm test
npm run gui:web:build
npm run docs:site:build
npm test runs the Vitest suite. Unit tests should be fixture-backed and should not call live APIs.
End-to-End Tool Tests
npm run test:e2e
npm run test:e2e:cli
npm run test:e2e:credential-prompt
npm run test:e2e:credential-snooze
npm run test:e2e:credential-soft-fallback
npm run test:e2e:credential-per-workflow-cap
npm run test:e2e intentionally hits live APIs through focused tool checks. The provider matrix is broader and also live:
npm run test:e2e:providers
Only run these when live network/API behavior is part of the validation goal.
Eval Commands
OpenCandle separates deterministic tests from opt-in evals because evals may depend on model credentials, live data, local agent CLIs, or longer-running traces.
npm run test:evals
npm run test:evals:usually
npm run eval:router-live
npm run test:evals:product
npm run test:evals:competitive
| Command | What it runs | When to use it |
|---|---|---|
| npm run test:evals | Vitest eval cases under tests/evals/cases/**/*.eval.ts | Deterministic or semi-deterministic scoring cases that should run as a suite. |
| npm run test:evals:usually | Same Vitest eval suite with EVAL_TIER=usually | The common eval tier when you want the usual subset rather than every case. |
| npm run eval:router-live | tests/scripts/run-live-router-eval.ts against request-understanding fixtures with a live model | Opt-in task-selection quality check. Requires live model credentials and compares live output to fixture expectations. |
| npm run test:evals:product | tests/scripts/run-product-evals.ts | Full-session product evals over curated finance prompts, using the OpenCandle harness and rubric-style dimensions. |
| npm run test:evals:competitive | tests/scripts/run-competitive-finance-eval.ts | Competitive finance benchmark against generic no-tool Claude, Codex, and Gemini baselines. See Benchmarking. |
Eval reports are written under tests/evals/runs/ when a runner produces a JSON report. Treat those run files as local evidence, not committed documentation.
Product Evals
Product evals run curated prompts through runOpenCandleSession() and score the resulting trace for investigation fit, tool usage, directness, evidence use, risk framing, horizon fit, and honest handling of missing data.
Prompt families currently include:
- single_asset
- compare_assets
- portfolio
- options
- sentiment
- macro
- education
Run all product evals:
npm run test:evals:product
Useful environment variables:
- PRODUCT_EVAL_CASE: run one case by id, such as compare-assets-aapl-msft-6mo.
- PRODUCT_EVAL_FAMILY: run one family, such as portfolio or macro.
- PRODUCT_EVAL_LIMIT: run only the first N selected cases.
Example:
PRODUCT_EVAL_FAMILY=options PRODUCT_EVAL_LIMIT=1 npm run test:evals:product
Each run writes a timestamped *_product-evals.json report under tests/evals/runs/.
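The way the three PRODUCT_EVAL_* variables narrow a run can be sketched as a small pure function. This is a minimal illustration, not the actual runner code: the ProductEvalCase shape and the selectCases helper are hypothetical, and it assumes the case and family filters combine and that the limit is applied last, as the "first N selected cases" wording suggests.

```typescript
// Hypothetical case shape; the real product eval cases likely carry more fields.
interface ProductEvalCase {
  id: string;
  family: string;
}

type EnvLike = Record<string, string | undefined>;

// Narrow the case list using the documented environment variables.
function selectCases(cases: ProductEvalCase[], env: EnvLike): ProductEvalCase[] {
  const caseId = env.PRODUCT_EVAL_CASE;
  const family = env.PRODUCT_EVAL_FAMILY;
  const limit = Number.parseInt(env.PRODUCT_EVAL_LIMIT ?? "", 10);

  let selected = cases;
  // Assumption: when both filters are set, a case must match both.
  if (caseId) selected = selected.filter((c) => c.id === caseId);
  if (family) selected = selected.filter((c) => c.family === family);
  // Assumption: a missing, non-numeric, or non-positive limit means "no limit".
  if (Number.isInteger(limit) && limit > 0) selected = selected.slice(0, limit);
  return selected;
}
```

Passing the environment as an argument instead of reading process.env directly keeps the selection logic testable without touching the real environment.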
GUI Browser Smoke
Run the GUI in one terminal:
npm run gui
Then run the browser smoke in another terminal:
npm run test:gui:browser
Set OPENCANDLE_GUI_URL to target a non-default local URL. GUI smoke testing should cover desktop and mobile widths when UI behavior changes.
For visual or GUI behavior changes, also build the web bundle:
npm --workspace @opencandle/gui-web run build
At minimum, exercise prompts that render stock quotes, quote comparison, options chains, SEC filings, macro/FRED data, and news/search so the matching tool cards and financial context panel render from saved session state.
Agent Harness
The file-based harness lets another coding agent drive OpenCandle as a simulated user and inspect the resulting trace.
npx tsx tests/harness/cli.ts run --prompt "What is AAPL trading at?" --ipc /tmp/oc-test &
npx tsx tests/harness/cli.ts wait --ipc /tmp/oc-test
npx tsx tests/harness/cli.ts trace --ipc /tmp/oc-test
If the run asks a question:
npx tsx tests/harness/cli.ts answer --ipc /tmp/oc-test --value "Moderate"
The final trace.json includes tool calls, results, interactions, final text, duration, and OpenCandle custom entries such as workflow dispatch, request-understanding output, disclaimers, and degradation notes.
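A driving agent will usually want to reduce the trace to a few checks. The sketch below shows one way to do that. The HarnessTrace shape and its field names (toolCalls, finalText, durationMs) are assumptions made for illustration; check them against a real trace.json before relying on them.

```typescript
// Hypothetical trace shape for illustration only; the real trace.json
// field names may differ and should be checked against an actual run.
interface ToolCallEntry {
  name: string;
  ok: boolean;
}

interface HarnessTrace {
  toolCalls: ToolCallEntry[];
  finalText: string;
  durationMs: number;
}

interface TraceSummary {
  toolNames: string[];
  failedTools: string[];
  hasFinalText: boolean;
}

// Collapse a trace into the distinct tools used, the distinct tools that
// failed at least once, and whether the run produced any final text.
function summarizeTrace(trace: HarnessTrace): TraceSummary {
  const toolNames = Array.from(new Set(trace.toolCalls.map((c) => c.name)));
  const failedTools = Array.from(
    new Set(trace.toolCalls.filter((c) => !c.ok).map((c) => c.name)),
  );
  return {
    toolNames,
    failedTools,
    hasFinalText: trace.finalText.trim().length > 0,
  };
}
```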
Request-Understanding Fixtures
Request-understanding fixtures live in tests/fixtures/router/ and are included in npm test.
Use them when changing:
- src/routing/router-prompt.ts
- src/routing/router.ts
- task-selection model choice
- multi-turn context handling
- preference extraction or slot resolution
The live fixture eval is opt-in:
npm run eval:router-live
It uses OPENCANDLE_ROUTER_PROVIDER and OPENCANDLE_ROUTER_MODEL when set. Defaults are anthropic and claude-haiku-4-5, so it requires matching live model credentials unless you override those env vars.
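The override-or-default behavior described above can be sketched as follows. The resolveRouterConfig helper and RouterConfig type are hypothetical names, and treating an empty string as unset is an assumption. Only the two variable names and the two default values come from this page.

```typescript
interface RouterConfig {
  provider: string;
  model: string;
}

// Resolve the live router eval settings from an environment-like object.
// Assumption: an empty string is treated the same as an unset variable.
function resolveRouterConfig(env: Record<string, string | undefined>): RouterConfig {
  return {
    provider: env.OPENCANDLE_ROUTER_PROVIDER || "anthropic",
    model: env.OPENCANDLE_ROUTER_MODEL || "claude-haiku-4-5",
  };
}
```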
Treat task-selection mismatches as regressions even when the aggregate pass rate looks acceptable.
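One way to enforce that rule is to report every mismatch individually instead of gating on a pass-rate threshold. The sketch below assumes a hypothetical FixtureResult shape and helper name. It does not describe how the live router eval actually reports results.

```typescript
// Hypothetical per-fixture result; field names are illustrative.
interface FixtureResult {
  fixtureId: string;
  expectedTask: string;
  actualTask: string;
}

interface RouterEvalVerdict {
  passRate: number;
  mismatches: FixtureResult[];
  ok: boolean;
}

// A run is only "ok" when there are zero mismatches, regardless of pass rate.
function evaluateRouterResults(results: FixtureResult[]): RouterEvalVerdict {
  const mismatches = results.filter((r) => r.expectedTask !== r.actualTask);
  const passRate =
    results.length === 0 ? 1 : (results.length - mismatches.length) / results.length;
  return { passRate, mismatches, ok: mismatches.length === 0 };
}
```

With this shape, a run that passes three of four fixtures still reports ok as false and names the failing fixture, so a single changed task selection cannot hide behind the aggregate.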
Test Data Rules
- Mock globalThis.fetch in unit tests.
- Store response fixtures under tests/fixtures/<provider>/.
- Do not commit real account balances, names, or exact holdings in fixtures.
- Preserve classification-relevant signal such as tickers, horizons, and risk phrasing.
- Keep live API checks out of the default unit test path.
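The first two rules can be illustrated with a standalone sketch that swaps globalThis.fetch for a fixture-backed stub and restores it afterwards. Everything named here is hypothetical: fetchPrice stands in for a provider function, the URL is a placeholder, and the inline quoteFixture stands in for a file under tests/fixtures/<provider>/. Inside the Vitest suite, Vitest's own mocking helpers would likely be used instead of manual assignment. The sketch assumes a runtime with global fetch and Response, such as Node 18 or later.

```typescript
// Hypothetical fixture; a real one would be loaded from tests/fixtures/<provider>/.
const quoteFixture = { symbol: "AAPL", price: 123.45 };

// Hypothetical provider function under test: fetches a quote and returns its price.
async function fetchPrice(symbol: string): Promise<number> {
  const res = await fetch(`https://example.invalid/quote?symbol=${symbol}`);
  if (!res.ok) throw new Error(`quote request failed: ${res.status}`);
  const body = (await res.json()) as { price: number };
  return body.price;
}

// Replace globalThis.fetch with a stub that always returns the fixture as JSON,
// run the callback, then restore the original fetch even if the callback throws.
async function withFixtureFetch<T>(
  fixture: unknown,
  run: () => Promise<T>,
): Promise<{ result: T; urls: string[] }> {
  const original = globalThis.fetch;
  const urls: string[] = [];
  globalThis.fetch = (async (input: unknown) => {
    // Records the request target; assumes callers pass a string or URL.
    urls.push(String(input));
    return new Response(JSON.stringify(fixture), {
      status: 200,
      headers: { "content-type": "application/json" },
    });
  }) as typeof fetch;
  try {
    return { result: await run(), urls };
  } finally {
    globalThis.fetch = original;
  }
}
```

Recording the requested URLs lets a test assert that the provider built the expected request without any network access.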