Evals & Testing

Run automated evaluations against agent sessions using assertion checks, LLM judges, and human scoring. Compare runs, set thresholds, and gate deployments.

import { Invariance } from '@invariance/sdk';

Prerequisites: Initialization, Tracing & Emit

Overview

The eval system lets you define test suites containing assertion-based and judge-based cases, then run them against agent sessions or trace data.

Assertion cases use 7 built-in checks: toHaveNoErrors, toContainAction, toHaveCount, toHaveCountAtLeast, toHaveDurationBelow, toHaveChainIntegrity, and toHaveAnomalyScoreBelow. These run deterministically against trace nodes.
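The deterministic contract behind two of these checks can be sketched in plain TypeScript. The TraceNode shape and the function bodies below are illustrative assumptions, not the SDK's internals:

```typescript
// TraceNode is an assumed minimal shape for illustration only.
interface TraceNode {
  type: string;           // e.g. 'action', 'llm_call', 'error'
  duration_ms?: number;
}

// toHaveNoErrors: passes when no node in the trace is an error.
function toHaveNoErrors(nodes: TraceNode[]): boolean {
  return !nodes.some((n) => n.type === 'error');
}

// toHaveCountAtLeast: passes when at least `min` nodes match a type.
function toHaveCountAtLeast(nodes: TraceNode[], type: string, min: number): boolean {
  return nodes.filter((n) => n.type === type).length >= min;
}

const nodes: TraceNode[] = [
  { type: 'action', duration_ms: 120 },
  { type: 'llm_call', duration_ms: 900 },
];

toHaveNoErrors(nodes);                  // true: no error nodes
toHaveCountAtLeast(nodes, 'action', 2); // false: only one action node
```

Because these checks are pure functions of the trace, the same session always produces the same verdict, which is what makes assertion cases safe for CI.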

Judge cases use an LLM (Claude or GPT) or human annotator to score agent behavior against custom criteria. Judge scoring is powered by Scorers — reusable configurations that define the rubric, model, and prompt.

Runs produce pass_rate and avg_score metrics. You can compare two runs to find regressions and improvements, and set thresholds with webhook callbacks to gate deployments.

Architecture

Evals are managed through the REST API. Suites contain cases, runs execute cases against sessions, and results are persisted. The runner loads trace nodes, applies assertions or calls the judge, and computes aggregates. Thresholds are checked post-run and can trigger webhooks.
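A rough sketch of the aggregation step (the CaseResult shape and formula here are assumptions based on the metrics described above, not the actual runner code):

```typescript
// Assumed per-case result shape: assertion cases may carry no score.
interface CaseResult {
  passed: boolean;
  score: number | null;
}

// Reduce per-case results to the run-level aggregates.
function aggregate(results: CaseResult[]): { pass_rate: number; avg_score: number | null } {
  const pass_rate = results.filter((r) => r.passed).length / results.length;
  const scored = results
    .map((r) => r.score)
    .filter((s): s is number => s !== null);
  const avg_score = scored.length > 0
    ? scored.reduce((sum, s) => sum + s, 0) / scored.length
    : null;
  return { pass_rate, avg_score };
}

const agg = aggregate([
  { passed: true, score: 0.9 },
  { passed: false, score: 0.4 },
  { passed: true, score: null },
]);
// agg.pass_rate ≈ 0.667, agg.avg_score ≈ 0.65
```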

Important Notes

Provider API keys required for judge evals
LLM judge cases require ANTHROPIC_API_KEY or OPENAI_API_KEY. Set EVAL_JUDGE_MOCK=true for deterministic testing without API calls.
Use thresholds for CI gates
Create thresholds with webhook URLs to automatically block deployments when eval metrics drop below acceptable levels.
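Besides server-side webhooks, the same gate can be applied client-side in a CI step. A minimal sketch, assuming a hypothetical MIN_PASS_RATE setting:

```typescript
// MIN_PASS_RATE is a hypothetical project-specific setting.
const MIN_PASS_RATE = 0.9;

// Returns true when a completed run clears the bar; anything else blocks.
function gate(run: { status: string; pass_rate: number | null }): boolean {
  if (run.status !== 'completed' || run.pass_rate === null) return false;
  return run.pass_rate >= MIN_PASS_RATE;
}

gate({ status: 'completed', pass_rate: 0.95 }); // true: deploy allowed
gate({ status: 'failed', pass_rate: null });    // false: blocked
```

Exiting non-zero when gate returns false is enough to fail most CI pipelines.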

Quick Example

Create suite, add cases, run, and compare:
const inv = Invariance.init({ apiKey: process.env.INVARIANCE_API_KEY! });

// Create an eval suite
const suite = await inv.evals.createSuite({
  name: 'Agent Quality',
  description: 'Core quality checks',
  agent_id: 'my-agent',
});

// Add an assertion case
await inv.evals.createCase(suite.id, {
  name: 'No errors in trace',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

// Add a judge case (requires a scorer)
await inv.evals.createCase(suite.id, {
  name: 'Response quality',
  type: 'judge',
  scorer_id: 'scorer-id-here',
});

// Run the eval
const run = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
  target: { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
});

console.log(run.pass_rate, run.avg_score);

Type Definitions

EvalSuite
interface EvalSuite {
  id: string;
  name: string;
  description: string | null;
  agent_id: string | null;
  config: Record<string, unknown>;
  case_count?: number;
  latest_pass_rate?: number | null;
  created_at: string;
  updated_at: string;
}
An evaluation suite containing test cases.
EvalCase
interface EvalCase {
  id: string;
  suite_id: string;
  name: string;
  type: 'assertion' | 'judge';
  assertion_config: Record<string, unknown> | null;
  judge_config: Record<string, unknown> | null;
  weight: number;
  scorer_id?: string | null;
  created_at: string;
}
A test case within a suite — either assertion or judge type.
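The weight field hints at weighted aggregation. The docs don't state the exact formula, so the following weighted-average helper is purely illustrative:

```typescript
// Illustrative only: the docs state cases carry a weight (default 1)
// but not the exact aggregation formula.
function weightedScore(cases: { score: number; weight: number }[]): number {
  const totalWeight = cases.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = cases.reduce((sum, c) => sum + c.score * c.weight, 0);
  return weightedSum / totalWeight;
}

weightedScore([
  { score: 1.0, weight: 2 },
  { score: 0.5, weight: 1 },
]); // (2 * 1.0 + 1 * 0.5) / 3 ≈ 0.833
```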
EvalRun
interface EvalRun {
  id: string;
  suite_id: string;
  agent_id: string;
  version_label: string | null;
  status: 'running' | 'completed' | 'failed';
  total_cases: number;
  passed_cases: number;
  failed_cases: number;
  pass_rate: number | null;
  avg_score: number | null;
  duration_ms: number | null;
  metadata: Record<string, unknown>;
  started_at: string;
  completed_at: string | null;
}
An eval run with aggregate metrics.
ProviderTarget
interface ProviderTarget {
  provider: 'anthropic' | 'openai';
  model: string;
  api_key_env?: string;
  base_url_env?: string;
}
LLM provider configuration for judge-based eval cases.
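A sketch of how api_key_env resolution could work. The default env-var names match the Configuration section, but the resolver itself (and the gpt-4o model string) is an illustration, not SDK code:

```typescript
// Default env-var names per provider (from the Configuration section).
const DEFAULT_KEY_ENV: Record<'anthropic' | 'openai', string> = {
  anthropic: 'ANTHROPIC_API_KEY',
  openai: 'OPENAI_API_KEY',
};

interface ProviderTarget {
  provider: 'anthropic' | 'openai';
  model: string;
  api_key_env?: string;
}

// api_key_env, when set, overrides the provider default.
function resolveApiKey(
  target: ProviderTarget,
  env: Record<string, string | undefined>,
): string | undefined {
  return env[target.api_key_env ?? DEFAULT_KEY_ENV[target.provider]];
}

const key = resolveApiKey(
  { provider: 'openai', model: 'gpt-4o', api_key_env: 'MY_PROXY_KEY' },
  { MY_PROXY_KEY: 'sk-proxy', OPENAI_API_KEY: 'sk-default' },
);
// key === 'sk-proxy': the custom env var wins over the provider default
```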
EvalCompareResult
interface EvalCompareResult {
  run_a: EvalRun;
  run_b: EvalRun;
  overall_delta: { pass_rate: number; avg_score: number };
  per_case: Array<{
    case_id: string;
    case_name: string;
    a_passed: boolean;
    b_passed: boolean;
    delta: number | null;
  }>;
  regressions: number;
  improvements: number;
}
Comparison between two eval runs showing deltas and regressions.
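The regressions and improvements counts follow directly from per_case; a sketch of the tally (the server computes these for you in EvalCompareResult):

```typescript
// Only the pass flags are needed to tally regressions and improvements.
interface PerCase {
  a_passed: boolean;
  b_passed: boolean;
}

function tally(perCase: PerCase[]): { regressions: number; improvements: number } {
  return {
    // passed in run A but not in run B
    regressions: perCase.filter((c) => c.a_passed && !c.b_passed).length,
    // failed in run A but passed in run B
    improvements: perCase.filter((c) => !c.a_passed && c.b_passed).length,
  };
}

const t = tally([
  { a_passed: true, b_passed: false },
  { a_passed: false, b_passed: true },
  { a_passed: true, b_passed: true },
]);
// t.regressions === 1, t.improvements === 1
```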

API Reference

evals.listSuites
List all eval suites, optionally filtered by agent.
async listSuites(opts?: { agent_id?: string }): Promise<EvalSuite[]>
Parameters
  agent_id (string): Filter suites by agent ID
Returns: Promise<EvalSuite[]>
evals.createSuite
Create a new eval suite.
async createSuite(body: CreateEvalSuiteBody): Promise<EvalSuite>
Parameters
  name (string): Suite name
  description (string): Suite description
  agent_id (string): Scope to agent
Returns: Promise<EvalSuite>
evals.getSuite
Get a single eval suite by ID.
async getSuite(id: string): Promise<EvalSuite>
Parameters
  id (string): Suite ID
Returns: Promise<EvalSuite>
evals.createCase
Add a test case to a suite. Cases can be assertion-based (deterministic checks) or judge-based (LLM/human scoring).
async createCase(suiteId: string, body: CreateEvalCaseBody): Promise<EvalCase>
Parameters
  suiteId (string): Suite to add the case to
  name (string): Case name
  type ('assertion' | 'judge'): Case type
  assertion_config ({ check: string; params?: Record<string, unknown> }): Config for assertion cases
  scorer_id (string): Scorer ID for judge cases
  weight (number): Weight for scoring (default 1)
Returns: Promise<EvalCase>
evals.triggerRun
Execute an eval suite against agent sessions. Returns run results with pass_rate and avg_score.
async triggerRun(suiteId: string, body: RunEvalBody): Promise<EvalRun>
Parameters
  suiteId (string): Suite to run
  agent_id (string): Agent to evaluate
  version_label (string): Version label for tracking
  target (ProviderTarget): LLM provider config for judge cases
Returns: Promise<EvalRun>
evals.getRun
Get a run with its per-case results.
async getRun(id: string): Promise<EvalRun & { results: EvalCaseResult[] }>
Parameters
  id (string): Run ID
Returns: Promise<EvalRun & { results: EvalCaseResult[] }>
evals.compare
Compare two eval runs. Shows overall delta, per-case changes, regressions, and improvements.
async compare(suiteId: string, runA: string, runB: string): Promise<EvalCompareResult>
Parameters
  suiteId (string): Suite ID
  runA (string): First run ID
  runB (string): Second run ID
Returns: Promise<EvalCompareResult>
evals.createThreshold
Set a minimum metric threshold with optional webhook callback.
async createThreshold(body: CreateEvalThresholdBody): Promise<EvalThreshold>
Parameters
  suite_id (string): Suite ID
  metric ('pass_rate' | 'avg_score'): Metric to threshold
  min_value (number): Minimum acceptable value (0-1)
  webhook_url (string): URL to call when threshold is breached
Returns: Promise<EvalThreshold>
evals.listRuns
List eval runs with optional filters.
async listRuns(opts?: { suite_id?: string; agent_id?: string; status?: string }): Promise<EvalRun[]>
Parameters
  suite_id (string): Filter by suite
  agent_id (string): Filter by agent
  status (string): Filter by run status
Returns: Promise<EvalRun[]>

Walkthrough

End-to-end eval workflow
1. Create a suite

Create an eval suite scoped to your agent.
const suite = await inv.evals.createSuite({
  name: 'Release Quality',
  agent_id: 'my-agent',
});
2. Add assertion cases

Add deterministic checks.
await inv.evals.createCase(suite.id, {
  name: 'No errors',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

await inv.evals.createCase(suite.id, {
  name: 'Duration under 30s',
  type: 'assertion',
  assertion_config: { check: 'toHaveDurationBelow', params: { max_ms: 30000 } },
});
3. Add a judge case

Add an LLM-scored quality check.
const scorer = await inv.scorers.create({
  name: 'Helpfulness',
  type: 'llm',
  config: {
    prompt: 'Rate the helpfulness of the agent response',
    criteria: ['accuracy', 'completeness', 'clarity'],
    model: 'claude-sonnet-4-20250514',
  },
});

await inv.evals.createCase(suite.id, {
  name: 'Helpfulness score',
  type: 'judge',
  scorer_id: scorer.id,
});
4. Run and compare

Run the suite and compare across versions.
const runA = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
});

const runB = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v2.0',
});

const diff = await inv.evals.compare(suite.id, runA.id, runB.id);
console.log('Regressions:', diff.regressions);
5. Set a threshold

Gate deployments on minimum pass rate.
await inv.evals.createThreshold({
  suite_id: suite.id,
  metric: 'pass_rate',
  min_value: 0.9,
  webhook_url: 'https://hooks.example.com/deploy-gate',
});

Configuration

ANTHROPIC_API_KEY (env var; default: none): Required for Anthropic/Claude judge evals
OPENAI_API_KEY (env var; default: none): Required for OpenAI/GPT judge evals
EVAL_JUDGE_MOCK (env var; default: false): Set to "true" for deterministic mock scoring (no API calls)
api_key_env (ProviderTarget field; default: provider default): Custom env var name for the API key
base_url_env (ProviderTarget field; default: provider default): Custom env var for the base URL (OpenAI-compatible endpoints)

Error Handling

Error handling pattern:
try {
  const run = await inv.evals.triggerRun(suiteId, {
    agent_id: 'my-agent',
  });
} catch (err) {
  // API_ERROR if suite not found or agent has no sessions
  console.error((err as Error).message);
}

Use Cases

  • Run assertion checks (no errors, chain integrity, duration limits) against every agent session
  • Score agent output quality with LLM judges using custom rubrics
  • Compare eval runs across agent versions to catch regressions before deployment
  • Set pass_rate and avg_score thresholds with webhook alerts for CI/CD gates
  • Track human review scores for subjective quality metrics