Evals & Testing

Run automated evaluations against agent sessions using assertion checks, LLM judges, and human scoring. Compare runs, set thresholds, and gate deployments.

import { Invariance } from '@invariance/sdk';

Prerequisites: Initialization, Tracing & Emit

Overview

The eval system lets you define test suites containing assertion-based and judge-based cases, then run them against agent sessions or trace data.

Assertion cases use 7 built-in checks: toHaveNoErrors, toContainAction, toHaveCount, toHaveCountAtLeast, toHaveDurationBelow, toHaveChainIntegrity, and toHaveAnomalyScoreBelow. These run deterministically against trace nodes.
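The deterministic contract behind two of these checks can be sketched in plain TypeScript. The TraceNode shape and the function bodies below are illustrative assumptions, not the SDK's internals:

```typescript
// TraceNode is an assumed minimal shape for illustration only.
interface TraceNode {
  type: string;           // e.g. 'action', 'llm_call', 'error'
  duration_ms?: number;
}

// toHaveNoErrors: passes when no node in the trace is an error.
function toHaveNoErrors(nodes: TraceNode[]): boolean {
  return !nodes.some((n) => n.type === 'error');
}

// toHaveCountAtLeast: passes when at least `min` nodes match a type.
function toHaveCountAtLeast(nodes: TraceNode[], type: string, min: number): boolean {
  return nodes.filter((n) => n.type === type).length >= min;
}

const nodes: TraceNode[] = [
  { type: 'action', duration_ms: 120 },
  { type: 'llm_call', duration_ms: 900 },
];

toHaveNoErrors(nodes);                  // true: no error nodes
toHaveCountAtLeast(nodes, 'action', 2); // false: only one action node
```

Because these checks are pure functions of the trace, the same session always produces the same verdict, which is what makes assertion cases safe for CI.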

Judge cases use an LLM (Claude or GPT) or human annotator to score agent behavior against custom criteria. Judge scoring is powered by Scorers — reusable configurations that define the rubric, model, and prompt.

Runs produce pass_rate and avg_score metrics. You can compare two runs to find regressions and improvements, and set thresholds with webhook callbacks to gate deployments.

Architecture

Evals are managed through the REST API. Suites contain cases, runs execute cases against sessions, and results are persisted. The runner loads trace nodes, applies assertions or calls the judge, and computes aggregates. Thresholds are checked post-run and can trigger webhooks.
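A rough sketch of the aggregation step (the CaseResult shape and formula here are assumptions based on the metrics described above, not the actual runner code):

```typescript
// Assumed per-case result shape: assertion cases may carry no score.
interface CaseResult {
  passed: boolean;
  score: number | null;
}

// Reduce per-case results to the run-level aggregates.
function aggregate(results: CaseResult[]): { pass_rate: number; avg_score: number | null } {
  const pass_rate = results.filter((r) => r.passed).length / results.length;
  const scored = results
    .map((r) => r.score)
    .filter((s): s is number => s !== null);
  const avg_score = scored.length > 0
    ? scored.reduce((sum, s) => sum + s, 0) / scored.length
    : null;
  return { pass_rate, avg_score };
}

const agg = aggregate([
  { passed: true, score: 0.9 },
  { passed: false, score: 0.4 },
  { passed: true, score: null },
]);
// agg.pass_rate ≈ 0.667, agg.avg_score ≈ 0.65
```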

Important Notes

Provider API keys required for judge evals
LLM judge cases require ANTHROPIC_API_KEY or OPENAI_API_KEY. Set EVAL_JUDGE_MOCK=true for deterministic testing without API calls.
Use thresholds for CI gates
Create thresholds with webhook URLs to automatically block deployments when eval metrics drop below acceptable levels.
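Besides server-side webhooks, the same gate can be applied client-side in a CI step. A minimal sketch, assuming a hypothetical MIN_PASS_RATE setting:

```typescript
// MIN_PASS_RATE is a hypothetical project-specific setting.
const MIN_PASS_RATE = 0.9;

// Returns true when a completed run clears the bar; anything else blocks.
function gate(run: { status: string; pass_rate: number | null }): boolean {
  if (run.status !== 'completed' || run.pass_rate === null) return false;
  return run.pass_rate >= MIN_PASS_RATE;
}

gate({ status: 'completed', pass_rate: 0.95 }); // true: deploy allowed
gate({ status: 'failed', pass_rate: null });    // false: blocked
```

Exiting non-zero when gate returns false is enough to fail most CI pipelines.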

Quick Example

Create suite, add cases, run, and compare:
const inv = Invariance.init({ apiKey: process.env.INVARIANCE_API_KEY! });

// Create an eval suite
const suite = await inv.evals.createSuite({
  name: 'Agent Quality',
  description: 'Core quality checks',
  agent_id: 'my-agent',
});

// Add an assertion case
await inv.evals.createCase(suite.id, {
  name: 'No errors in trace',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

// Add a judge case (requires a scorer)
await inv.evals.createCase(suite.id, {
  name: 'Response quality',
  type: 'judge',
  scorer_id: 'scorer-id-here',
});

// Run the eval
const run = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
  target: { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
});

console.log(run.pass_rate, run.avg_score);

Type Definitions

EvalSuite
interface EvalSuite {
  id: string;
  name: string;
  description: string | null;
  agent_id: string | null;
  config: Record<string, unknown>;
  case_count?: number;
  latest_pass_rate?: number | null;
  created_at: string;
  updated_at: string;
}
An evaluation suite containing test cases.
EvalCase
interface EvalCase {
  id: string;
  suite_id: string;
  name: string;
  type: 'assertion' | 'judge';
  assertion_config: Record<string, unknown> | null;
  judge_config: Record<string, unknown> | null;
  weight: number;
  scorer_id?: string | null;
  created_at: string;
}
A test case within a suite — either assertion or judge type.
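The weight field hints at weighted aggregation. The docs don't state the exact formula, so the following weighted-average helper is purely illustrative:

```typescript
// Illustrative only: the docs state cases carry a weight (default 1)
// but not the exact aggregation formula.
function weightedScore(cases: { score: number; weight: number }[]): number {
  const totalWeight = cases.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = cases.reduce((sum, c) => sum + c.score * c.weight, 0);
  return weightedSum / totalWeight;
}

weightedScore([
  { score: 1.0, weight: 2 },
  { score: 0.5, weight: 1 },
]); // (2 * 1.0 + 1 * 0.5) / 3 ≈ 0.833
```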
EvalRun
interface EvalRun {
  id: string;
  suite_id: string;
  agent_id: string;
  version_label: string | null;
  status: 'running' | 'completed' | 'failed';
  total_cases: number;
  passed_cases: number;
  failed_cases: number;
  pass_rate: number | null;
  avg_score: number | null;
  duration_ms: number | null;
  metadata: Record<string, unknown>;
  started_at: string;
  completed_at: string | null;
}
An eval run with aggregate metrics.
ProviderTarget
interface ProviderTarget {
  provider: 'anthropic' | 'openai';
  model: string;
  api_key_env?: string;
  base_url_env?: string;
}
LLM provider configuration for judge-based eval cases.
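A sketch of how api_key_env resolution could work. The default env-var names match the Configuration section, but the resolver itself (and the gpt-4o model string) is an illustration, not SDK code:

```typescript
// Default env-var names per provider (from the Configuration section).
const DEFAULT_KEY_ENV: Record<'anthropic' | 'openai', string> = {
  anthropic: 'ANTHROPIC_API_KEY',
  openai: 'OPENAI_API_KEY',
};

interface ProviderTarget {
  provider: 'anthropic' | 'openai';
  model: string;
  api_key_env?: string;
}

// api_key_env, when set, overrides the provider default.
function resolveApiKey(
  target: ProviderTarget,
  env: Record<string, string | undefined>,
): string | undefined {
  return env[target.api_key_env ?? DEFAULT_KEY_ENV[target.provider]];
}

const key = resolveApiKey(
  { provider: 'openai', model: 'gpt-4o', api_key_env: 'MY_PROXY_KEY' },
  { MY_PROXY_KEY: 'sk-proxy', OPENAI_API_KEY: 'sk-default' },
);
// key === 'sk-proxy': the custom env var wins over the provider default
```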
EvalCompareResult
interface EvalCompareResult {
  run_a: EvalRun;
  run_b: EvalRun;
  overall_delta: { pass_rate: number; avg_score: number };
  per_case: Array<{
    case_id: string;
    case_name: string;
    a_passed: boolean;
    b_passed: boolean;
    delta: number | null;
  }>;
  regressions: number;
  improvements: number;
}
Comparison between two eval runs showing deltas and regressions.
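The regressions and improvements counts follow directly from per_case; a sketch of the tally (the server computes these for you in EvalCompareResult):

```typescript
// Only the pass flags are needed to tally regressions and improvements.
interface PerCase {
  a_passed: boolean;
  b_passed: boolean;
}

function tally(perCase: PerCase[]): { regressions: number; improvements: number } {
  return {
    // passed in run A but not in run B
    regressions: perCase.filter((c) => c.a_passed && !c.b_passed).length,
    // failed in run A but passed in run B
    improvements: perCase.filter((c) => !c.a_passed && c.b_passed).length,
  };
}

const t = tally([
  { a_passed: true, b_passed: false },
  { a_passed: false, b_passed: true },
  { a_passed: true, b_passed: true },
]);
// t.regressions === 1, t.improvements === 1
```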

API Reference

evals.listSuites
List all eval suites, optionally filtered by agent.
async listSuites(opts?: { agent_id?: string }): Promise<EvalSuite[]>
Parameters
  agent_id (string): Filter suites by agent ID
Returns: Promise<EvalSuite[]>
evals.createSuite
Create a new eval suite.
async createSuite(body: CreateEvalSuiteBody): Promise<EvalSuite>
Parameters
  name (string): Suite name
  description (string): Suite description
  agent_id (string): Scope to agent
Returns: Promise<EvalSuite>
evals.getSuite
Get a single eval suite by ID.
async getSuite(id: string): Promise<EvalSuite>
Parameters
  id (string): Suite ID
Returns: Promise<EvalSuite>
evals.createCase
Add a test case to a suite. Cases can be assertion-based (deterministic checks) or judge-based (LLM/human scoring).
async createCase(suiteId: string, body: CreateEvalCaseBody): Promise<EvalCase>
Parameters
  suiteId (string): Suite to add the case to
  name (string): Case name
  type ('assertion' | 'judge'): Case type
  assertion_config ({ check: string; params?: Record<string, unknown> }): Config for assertion cases
  scorer_id (string): Scorer ID for judge cases
  weight (number): Weight for scoring (default 1)
Returns: Promise<EvalCase>
evals.triggerRun
Execute an eval suite against agent sessions. Returns run results with pass_rate and avg_score.
async triggerRun(suiteId: string, body: RunEvalBody): Promise<EvalRun>
Parameters
  suiteId (string): Suite to run
  agent_id (string): Agent to evaluate
  version_label (string): Version label for tracking
  target (ProviderTarget): LLM provider config for judge cases
Returns: Promise<EvalRun>
evals.getRun
Get a run with its per-case results.
async getRun(id: string): Promise<EvalRun & { results: EvalCaseResult[] }>
Parameters
  id (string): Run ID
Returns: Promise<EvalRun & { results: EvalCaseResult[] }>
evals.compare
Compare two eval runs. Shows overall delta, per-case changes, regressions, and improvements.
async compare(suiteId: string, runA: string, runB: string): Promise<EvalCompareResult>
Parameters
  suiteId (string): Suite ID
  runA (string): First run ID
  runB (string): Second run ID
Returns: Promise<EvalCompareResult>
evals.createThreshold
Set a minimum metric threshold with optional webhook callback.
async createThreshold(body: CreateEvalThresholdBody): Promise<EvalThreshold>
Parameters
  suite_id (string): Suite ID
  metric ('pass_rate' | 'avg_score'): Metric to threshold
  min_value (number): Minimum acceptable value (0-1)
  webhook_url (string): URL to call when threshold is breached
Returns: Promise<EvalThreshold>
evals.listRuns
List eval runs with optional filters.
async listRuns(opts?: { suite_id?: string; agent_id?: string; status?: string }): Promise<EvalRun[]>
Parameters
  suite_id (string): Filter by suite
  agent_id (string): Filter by agent
  status (string): Filter by run status
Returns: Promise<EvalRun[]>

Walkthrough

End-to-end eval workflow
1. Create a suite

Create an eval suite scoped to your agent.
const suite = await inv.evals.createSuite({
  name: 'Release Quality',
  agent_id: 'my-agent',
});
2. Add assertion cases

Add deterministic checks.
await inv.evals.createCase(suite.id, {
  name: 'No errors',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

await inv.evals.createCase(suite.id, {
  name: 'Duration under 30s',
  type: 'assertion',
  assertion_config: { check: 'toHaveDurationBelow', params: { max_ms: 30000 } },
});
3. Add a judge case

Add an LLM-scored quality check.
const scorer = await inv.scorers.create({
  name: 'Helpfulness',
  type: 'llm',
  config: {
    prompt: 'Rate the helpfulness of the agent response',
    criteria: ['accuracy', 'completeness', 'clarity'],
    model: 'claude-sonnet-4-20250514',
  },
});

await inv.evals.createCase(suite.id, {
  name: 'Helpfulness score',
  type: 'judge',
  scorer_id: scorer.id,
});
4. Run and compare

Run the suite and compare across versions.
const runA = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
});

const runB = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v2.0',
});

const diff = await inv.evals.compare(suite.id, runA.id, runB.id);
console.log('Regressions:', diff.regressions);
5. Set a threshold

Gate deployments on minimum pass rate.
await inv.evals.createThreshold({
  suite_id: suite.id,
  metric: 'pass_rate',
  min_value: 0.9,
  webhook_url: 'https://hooks.example.com/deploy-gate',
});

Configuration

ANTHROPIC_API_KEY (env var; default: none): Required for Anthropic/Claude judge evals
OPENAI_API_KEY (env var; default: none): Required for OpenAI/GPT judge evals
EVAL_JUDGE_MOCK (env var; default: false): Set to "true" for deterministic mock scoring (no API calls)
api_key_env (ProviderTarget field; default: provider default): Custom env var name for the API key
base_url_env (ProviderTarget field; default: provider default): Custom env var for the base URL (OpenAI-compatible endpoints)

Error Handling

Error handling pattern:
try {
  const run = await inv.evals.triggerRun(suiteId, {
    agent_id: 'my-agent',
  });
} catch (err) {
  // API_ERROR if suite not found or agent has no sessions
  console.error((err as Error).message);
}

Use Cases

  • Run assertion checks (no errors, chain integrity, duration limits) against every agent session
  • Score agent output quality with LLM judges using custom rubrics
  • Compare eval runs across agent versions to catch regressions before deployment
  • Set pass_rate and avg_score thresholds with webhook alerts for CI/CD gates
  • Track human review scores for subjective quality metrics