Run automated evaluations against agent sessions using assertion checks, LLM judges, and human scoring. Compare runs, set thresholds, and gate deployments.
The eval system lets you define test suites containing assertion-based and judge-based cases, then run them against agent sessions or trace data.
Assertion cases use 7 built-in checks: toHaveNoErrors, toContainAction, toHaveCount, toHaveCountAtLeast, toHaveDurationBelow, toHaveChainIntegrity, and toHaveAnomalyScoreBelow. These run deterministically against trace nodes.
Judge cases use an LLM (Claude or GPT) or human annotator to score agent behavior against custom criteria. Judge scoring is powered by Scorers — reusable configurations that define the rubric, model, and prompt.
Runs produce pass_rate and avg_score metrics. You can compare two runs to find regressions and improvements, and set thresholds with webhook callbacks to gate deployments.
Evals are managed through the REST API. Suites contain cases, runs execute cases against sessions, and results are persisted. The runner loads trace nodes, applies assertions or calls the judge, and computes aggregates. Thresholds are checked post-run and can trigger webhooks.
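Since everything is exposed over REST, other stacks can drive evals with plain HTTP. A minimal sketch of suite creation; the host, endpoint path, and auth header are assumptions for illustration, not documented routes:

```ts
// Hypothetical raw-HTTP equivalent of creating a suite. The URL and
// header below are placeholders; consult the API reference for real routes.
const res = await fetch('https://api.invariance.example/v1/evals/suites', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.INVARIANCE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ name: 'Agent Quality', agent_id: 'my-agent' }),
});
const suite = await res.json();
```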
Quick start:

```ts
import { Invariance } from '@invariance/sdk'; // import path assumed

const inv = Invariance.init({ apiKey: process.env.INVARIANCE_API_KEY! });

// Create an eval suite
const suite = await inv.evals.createSuite({
  name: 'Agent Quality',
  description: 'Core quality checks',
  agent_id: 'my-agent',
});

// Add an assertion case
await inv.evals.createCase(suite.id, {
  name: 'No errors in trace',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

// Add a judge case (requires a scorer)
await inv.evals.createCase(suite.id, {
  name: 'Response quality',
  type: 'judge',
  scorer_id: 'scorer-id-here',
});

// Run the eval
const run = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
  target: { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
});
console.log(run.pass_rate, run.avg_score);
```
The core types:

```ts
interface EvalSuite {
  id: string;
  name: string;
  description: string | null;
  agent_id: string | null;
  config: Record<string, unknown>;
  case_count?: number;
  latest_pass_rate?: number | null;
  created_at: string;
  updated_at: string;
}

interface EvalCase {
  id: string;
  suite_id: string;
  name: string;
  type: 'assertion' | 'judge';
  assertion_config: Record<string, unknown> | null;
  judge_config: Record<string, unknown> | null;
  weight: number;
  scorer_id?: string | null;
  created_at: string;
}

interface EvalRun {
  id: string;
  suite_id: string;
  agent_id: string;
  version_label: string | null;
  status: 'running' | 'completed' | 'failed';
  total_cases: number;
  passed_cases: number;
  failed_cases: number;
  pass_rate: number | null;
  avg_score: number | null;
  duration_ms: number | null;
  metadata: Record<string, unknown>;
  started_at: string;
  completed_at: string | null;
}

interface ProviderTarget {
  provider: 'anthropic' | 'openai';
  model: string;
  api_key_env?: string;
  base_url_env?: string;
}

interface EvalCompareResult {
  run_a: EvalRun;
  run_b: EvalRun;
  overall_delta: { pass_rate: number; avg_score: number };
  per_case: Array<{
    case_id: string;
    case_name: string;
    a_passed: boolean;
    b_passed: boolean;
    delta: number | null;
  }>;
  regressions: number;
  improvements: number;
}
```

Create a suite and add assertion cases:

```ts
const suite = await inv.evals.createSuite({
  name: 'Release Quality',
  agent_id: 'my-agent',
});

await inv.evals.createCase(suite.id, {
  name: 'No errors',
  type: 'assertion',
  assertion_config: { check: 'toHaveNoErrors' },
});

await inv.evals.createCase(suite.id, {
  name: 'Duration under 30s',
  type: 'assertion',
  assertion_config: { check: 'toHaveDurationBelow', params: { max_ms: 30000 } },
});
```
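The remaining built-in checks (toContainAction, toHaveCount, toHaveCountAtLeast, toHaveChainIntegrity, toHaveAnomalyScoreBelow) take the same `assertion_config` shape. A sketch of plausible configurations; the `params` keys used here (`action`, `max_score`) are illustrative guesses, since only `toHaveNoErrors` and `toHaveDurationBelow` appear with concrete configs on this page:

```ts
// Param names below are illustrative assumptions, not confirmed API.
await inv.evals.createCase(suite.id, {
  name: 'Uses the search tool',
  type: 'assertion',
  assertion_config: { check: 'toContainAction', params: { action: 'search' } },
});

await inv.evals.createCase(suite.id, {
  name: 'Anomaly score stays low',
  type: 'assertion',
  assertion_config: { check: 'toHaveAnomalyScoreBelow', params: { max_score: 0.5 } },
});
```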
For judge cases, first create a scorer, then reference it from the case:

```ts
const scorer = await inv.scorers.create({
  name: 'Helpfulness',
  type: 'llm',
  config: {
    prompt: 'Rate the helpfulness of the agent response',
    criteria: ['accuracy', 'completeness', 'clarity'],
    model: 'claude-sonnet-4-20250514',
  },
});

await inv.evals.createCase(suite.id, {
  name: 'Helpfulness score',
  type: 'judge',
  scorer_id: scorer.id,
});
```
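The overview also mentions human annotators as judges. Assuming the scorer `type` field accepts a value like `'human'` (this page only demonstrates `'llm'`, so treat this as a sketch):

```ts
// 'human' as a scorer type and its config fields are assumptions;
// only type: 'llm' is shown elsewhere on this page.
const humanScorer = await inv.scorers.create({
  name: 'Human review',
  type: 'human',
  config: { criteria: ['tone', 'safety'] },
});

await inv.evals.createCase(suite.id, {
  name: 'Human-reviewed quality',
  type: 'judge',
  scorer_id: humanScorer.id,
});
```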
Trigger two runs against different versions and compare them:

```ts
const runA = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v1.0',
});
const runB = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  version_label: 'v2.0',
});

const diff = await inv.evals.compare(suite.id, runA.id, runB.id);
console.log('Regressions:', diff.regressions);
```
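Because `compare` returns per-case results (see `EvalCompareResult` above), you can list exactly which cases regressed; here a regression is read as passing in run A but failing in run B:

```ts
// Surface the names of cases that went from passing to failing.
const regressed = diff.per_case.filter((c) => c.a_passed && !c.b_passed);
for (const c of regressed) {
  console.log(`Regressed: ${c.case_name} (delta: ${c.delta ?? 'n/a'})`);
}
```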
Thresholds are checked after each run; attach a webhook to gate deployments:

```ts
await inv.evals.createThreshold({
  suite_id: suite.id,
  metric: 'pass_rate',
  min_value: 0.9,
  webhook_url: 'https://hooks.example.com/deploy-gate',
});
```
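On the other side of the gate, the webhook target is just an HTTP endpoint. A minimal receiver sketch using Node's built-in http module; the payload fields checked here are assumptions, since this section doesn't document the webhook body:

```ts
import { createServer } from 'node:http';

// Payload fields (metric, value, passed) are assumed for illustration;
// the real webhook body may differ.
createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    const event = JSON.parse(body);
    if (event.passed === false) {
      console.log(`Blocking deploy: ${event.metric} = ${event.value}`);
    }
    res.writeHead(204).end();
  });
}).listen(8080);
```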
Configuration:

| Option | Type | Default | Description |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | env var | (none) | Required for Anthropic/Claude judge evals |
| `OPENAI_API_KEY` | env var | (none) | Required for OpenAI/GPT judge evals |
| `EVAL_JUDGE_MOCK` | env var | `false` | Set to `"true"` for deterministic mock scoring (no API calls) |
| `api_key_env` | `ProviderTarget` field | provider default | Custom env var name for the API key |
| `base_url_env` | `ProviderTarget` field | provider default | Custom env var name for the base URL (OpenAI-compatible endpoints) |
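The `api_key_env` and `base_url_env` fields let a `ProviderTarget` read credentials from env vars you name yourself, which is how you point judge calls at an OpenAI-compatible endpoint. For example (the env var names are arbitrary, and `gpt-4o` is just a sample model):

```ts
// Route judge calls through an OpenAI-compatible endpoint.
// MY_PROXY_KEY and MY_PROXY_URL must be set in the runner's environment.
const run = await inv.evals.triggerRun(suite.id, {
  agent_id: 'my-agent',
  target: {
    provider: 'openai',
    model: 'gpt-4o',
    api_key_env: 'MY_PROXY_KEY',
    base_url_env: 'MY_PROXY_URL',
  },
});
```

In CI, setting `EVAL_JUDGE_MOCK=true` skips provider calls entirely and returns deterministic scores.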
Runs can fail at the API level, for example when the suite doesn't exist or the agent has no sessions to evaluate:

```ts
try {
  const run = await inv.evals.triggerRun(suiteId, {
    agent_id: 'my-agent',
  });
} catch (err) {
  // API_ERROR if suite not found or agent has no sessions
  console.error((err as Error).message);
}
```