Run evaluations
The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
Backward compatibility: All new features (multi-provider support, capture mode, output formats) work with existing evaluation suites. No code changes required.
Basic usage
Run all evaluations in the current directory:
```bash
arcade evals .
```

The command searches for files starting with eval_ and ending with .py.
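For orientation, a minimal evaluation file might look like the sketch below. Only the eval_*.py filename and the @tool_eval() decorator are requirements stated on this page; the arcade_evals import path, the EvalSuite/ExpectedToolCall/BinaryCritic names, the parameter names, and the weather tool are assumptions to adapt to your own toolkit and installed version.

```python
# eval_weather.py -- hypothetical sketch; adapt imports and names to your toolkit.
# Only the eval_*.py filename and the @tool_eval() decorator are requirements
# stated on this page; the rest is assumed from the arcade_evals package.
from arcade_evals import BinaryCritic, EvalSuite, ExpectedToolCall, tool_eval

from my_toolkit.tools import get_weather  # hypothetical tool under test


@tool_eval()
def weather_eval_suite() -> EvalSuite:
    suite = EvalSuite(
        name="Weather toolkit",
        system_message="You are a helpful weather assistant.",
    )
    suite.add_case(
        name="Get weather for city",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[
            ExpectedToolCall(func=get_weather, args={"location": "Seattle", "units": "metric"}),
        ],
        critics=[
            # Critic weights sum to 1.0; each critic contributes up to its weight.
            BinaryCritic(critic_field="location", weight=0.7),
            BinaryCritic(critic_field="units", weight=0.3),
        ],
    )
    return suite
```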
Show detailed results with critic feedback:
```bash
arcade evals . --details
```

Filter to show only failures:

```bash
arcade evals . --only-failed
```

Multi-provider support
Single provider with default model
Use OpenAI with default model (gpt-4o):
```bash
export OPENAI_API_KEY=sk-...
arcade evals .
```

Use Anthropic with default model (claude-sonnet-4-5-20250929):
```bash
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic
```

Specific models
Specify one or more models for a provider:
```bash
arcade evals . --use-provider "openai:gpt-4o,gpt-4o-mini"
```

Multiple providers
Compare performance across providers (space-separated):
```bash
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```

When you specify multiple models, results show side-by-side comparisons.
API keys
API keys are resolved in the following order:
| Priority | Source | Format |
|---|---|---|
| 1 | Explicit flag | --api-key provider:key (can repeat) |
| 2 | Environment variables | OPENAI_API_KEY, ANTHROPIC_API_KEY |
| 3 | .env file | OPENAI_API_KEY=..., ANTHROPIC_API_KEY=... |
Create a .env file in your directory to avoid setting keys in every terminal session.
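For example, a .env file next to your evaluation files could contain placeholder entries like these (replace with your real keys):

```bash
# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```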
Examples:
```bash
# Single provider
arcade evals . --api-key openai:sk-...

# Multiple providers
arcade evals . \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```

Capture mode
Record calls without scoring to bootstrap test expectations:
```bash
arcade evals . --capture --output captures/baseline.json
```

Include conversation in captured output:

```bash
arcade evals . --capture --include-context --output captures/detailed.json
```

Capture mode is useful for:
- Creating initial test expectations
- Debugging model behavior
- Understanding call patterns
See Capture mode for details.
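To turn a capture into initial expectations, it can help to first see what was recorded. The sketch below is a generic inspection script: the captures/baseline.json path matches the example above, and because the capture schema is not documented on this page, it only reports the file's top-level structure rather than assuming field names.

```python
# inspect_capture.py -- generic inspection of a capture file produced by
# `arcade evals . --capture --output captures/baseline.json`.
# The capture schema is not documented here, so this sketch only reports the
# top-level structure instead of assuming specific field names.
import json
from pathlib import Path

data = json.loads(Path("captures/baseline.json").read_text())

if isinstance(data, list):
    print(f"{len(data)} captured entries")
    if data:
        print("Keys on the first entry:", sorted(data[0]))
elif isinstance(data, dict):
    print("Top-level keys:", sorted(data))
else:
    print("Unexpected top-level type:", type(data).__name__)
```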
Output formats
Save results to files
Specify output files with extensions; the format is auto-detected:
```bash
# Single format
arcade evals . --output results.md

# Multiple formats
arcade evals . --output results.md --output results.html --output results.json

# All formats (no extension)
arcade evals . --output results
```

Available formats
| Extension | Format | Description |
|---|---|---|
| .txt | Plain text | Pytest-style output |
| .md | Markdown | Tables and collapsible sections |
| .html | HTML | Interactive report |
| .json | JSON | Structured data for programmatic use |
| (none) | All formats | Generates all four formats |
Command options
Quick reference
| Flag | Short | Purpose | Example |
|---|---|---|---|
| --use-provider | -p | Select provider/model | -p "openai:gpt-4o" |
| --api-key | -k | Provider API key | -k openai:sk-... |
| --capture | - | Record without scoring | --capture |
| --details | -d | Show critic feedback | --details |
| --only-failed | -f | Filter failures | --only-failed |
| --output | -o | Output file(s) | -o results.md |
| --include-context | - | Add messages to output | --include-context |
| --max-concurrent | -c | Parallel limit | -c 10 |
| --debug | - | Debug info | --debug |
--use-provider, -p
Specify provider(s) and model(s) (space-separated):
--use-provider "<provider>[:<models>] [<provider2>[:<models2>]]"Supported providers:
- openai (default: gpt-4o)
- anthropic (default: claude-sonnet-4-5-20250929)
Anthropic model names include date stamps. Check Anthropic’s model documentation for the latest model versions.
Examples:
```bash
# Default model for provider
arcade evals . -p anthropic

# Specific model
arcade evals . -p "openai:gpt-4o-mini"

# Multiple models from same provider
arcade evals . -p "openai:gpt-4o,gpt-4o-mini"

# Multiple providers (space-separated)
arcade evals . -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929"
```

--api-key, -k
Provide API keys explicitly (repeatable):

```bash
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...
```

--capture
Enable capture mode to record calls without scoring:
```bash
arcade evals . --capture
```

--include-context
Include system messages and conversation history in output:
```bash
arcade evals . --include-context --output results.md
```

--output, -o
Specify output file(s). Format is auto-detected from extension:
```bash
# Single format
arcade evals . -o results.md

# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html

# All formats (no extension)
arcade evals . -o results
```

--details, -d
Show detailed results including critic feedback:
```bash
arcade evals . --details
```

--only-failed, -f
Show only failed test cases:
```bash
arcade evals . --only-failed
```

--max-concurrent, -c
Set maximum concurrent evaluations:
```bash
arcade evals . --max-concurrent 10
```

Default is 1 concurrent evaluation.
--debug
Show debug information for troubleshooting:
```bash
arcade evals . --debug
```

Displays detailed error traces and connection information.
Understanding results
Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.
Summary format
Results show overall performance:
```
Summary -- Total: 5 -- Passed: 4 -- Failed: 1
```

How flags affect output:
- --details: Adds per-critic breakdown for each case
- --only-failed: Filters to show only failed cases (summary shows original totals)
- --include-context: Includes system messages and conversation history
- Multiple models: Switches to comparison table format
- Comparative tracks: Shows side-by-side track comparison
Case results
Each case displays status and score:
```
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65
```

Detailed feedback
Use --details to see critic-level analysis:
```
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30
```

Each critic reports its score out of its weight, and the case score is the sum across critics (here 0.00 + 0.30 = 0.30 out of a possible 1.00).

Multi-model results
When using multiple models, results show comparison tables:
```
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED
```

Advanced usage
High concurrency for fast execution
Increase concurrent evaluations:
```bash
arcade evals . --max-concurrent 20
```

High concurrency may hit API rate limits. Start with the default (1) and increase gradually.
Save comprehensive results
Generate all formats with full details:
```bash
arcade evals . --details --include-context --output results
```

This creates:
- results.txt
- results.md
- results.html
- results.json
Troubleshooting
Missing dependencies
If you see ImportError: MCP SDK is required, install the full package:
```bash
pip install 'arcade-mcp[evals]'
```

For Anthropic support:
```bash
pip install anthropic
```

Tool name mismatches
Tool names are normalized (dots become underscores). Check your tool definitions if you see unexpected names.
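To predict the evaluated name from a definition, you can mirror the documented rule locally; the helper and the example tool name below are purely illustrative.

```python
# Illustrative helper mirroring the documented normalization rule
# ("dots become underscores"); the tool name below is hypothetical.
def normalized_tool_name(name: str) -> str:
    return name.replace(".", "_")


print(normalized_tool_name("Weather.GetForecast"))  # -> Weather_GetForecast
```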
API rate limits
Reduce --max-concurrent value:
```bash
arcade evals . --max-concurrent 2
```

No evaluation files found
Ensure your evaluation files:
- Start with eval_
- End with .py
- Contain functions decorated with @tool_eval()
Next steps
- Explore capture mode for recording calls
- Learn about comparative evaluations for comparing sources