BREVITY
Test Your API.
Deterministic evaluation for LLMs and multimodal models—on your own keys. No opinions. No subjectivity. Just reproducible, scientific scoring.
EPHEMERAL EXECUTION
Your API keys never touch our database. Every benchmark runs with complete isolation and automatic cleanup.
RAM-Only Passthrough
API keys are held only in memory during execution. Never stored on disk or in any database.
Automatic Log Redaction
All sensitive data is automatically masked in logs. Bearer tokens and API keys are never exposed.
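As a rough illustration of how this kind of masking can work, here is a minimal sketch built on Python's standard logging module. The patterns, the [REDACTED] placeholder, and the names redact and RedactingFilter are assumptions made for this example, not the product's actual implementation; real key formats vary by provider.

```python
import logging
import re

# Hypothetical patterns: a Bearer token, and a key starting with "sk-".
# Each pattern keeps its prefix (group 1) and masks the secret part.
_PATTERNS = [
    re.compile(r"(Bearer\s+)[A-Za-z0-9._\-]+"),
    re.compile(r"\b(sk-)[A-Za-z0-9_\-]{8,}"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a credential with a placeholder."""
    for pattern in _PATTERNS:
        text = pattern.sub(lambda m: m.group(1) + "[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that masks credentials before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as args are masked too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) would apply the masking to every record that handler writes.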
Reproducibility Receipts
Get a cryptographic receipt for every run, including the suite hash, decoding profile, and backend version.
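One way such a receipt could be built is to hash a canonical serialization of the run's inputs. The sketch below assumes SHA-256 and canonical JSON. The function name make_receipt and the receipt_id field are hypothetical; only the three listed fields come from the description above.

```python
import hashlib
import json


def make_receipt(suite_bytes: bytes, decoding_profile: dict, backend_version: str) -> dict:
    """Build a receipt whose ID changes if any recorded input changes."""
    body = {
        "suite_hash": hashlib.sha256(suite_bytes).hexdigest(),
        "decoding_profile": decoding_profile,
        "backend_version": backend_version,
    }
    # Canonical form: sorted keys and fixed separators, so the same
    # content always serializes to the same bytes.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    receipt_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**body, "receipt_id": receipt_id}
```

Two runs with the same suite, decoding profile, and backend version produce the same receipt_id, which is what makes a result checkable later.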
HOW IT WORKS
From API key to comprehensive benchmark report in minutes.
Paste API Key
Select your provider and paste your API key. It stays in your browser—never stored.
Probe & Select
We verify capabilities like vision and strict JSON. Then you pick your model.
Run Benchmark
Watch tests execute in real time with live metrics streaming to your dashboard.
Get Report
Receive a comprehensive score breakdown, charts, and an exportable PDF certificate.
COMPREHENSIVE EVALUATION
Seven metrics, multiple tracks, and deterministic evaluation across every dimension that matters.
Logic & Reasoning
Tests for logical inference, trap questions, and needle-in-haystack retrieval.
Vision Capabilities
OCR extraction, object counting, and visual reasoning tests with real images.
Code Generation
Bug fixes, algorithm implementation, and code explanation tasks.
Format Compliance
Strict JSON generation, schema validation, and structured output tests.
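To make "strict" concrete, here is a minimal sketch of a deterministic pass/fail check using only the standard library. The function name check_strict_json and the field-to-type schema format are assumptions for illustration; a production checker would more likely use a full JSON Schema validator.

```python
import json


def check_strict_json(raw: str, required: dict) -> bool:
    """Return True only if raw is a JSON object with exactly the required
    fields, each holding a value of the expected Python type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Any surrounding prose or code-fence wrapper fails the parse.
        return False
    if not isinstance(data, dict):
        return False
    if set(data) != set(required):
        # Missing or extra fields both count as failures.
        return False
    return all(isinstance(data[key], expected) for key, expected in required.items())
```

For example, a reply of Sure! {"answer": 4} fails because the whole output must parse as JSON, not just a substring of it.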
Latency Tracking
P50 and P95 latency measurements with efficiency-weighted scoring.
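For readers unfamiliar with the terms, P50 is the median latency and P95 is the value that 95% of requests stay at or below. The sketch below uses the nearest-rank definition, which is one common convention and an assumption here; the efficiency weighting is not specified in this text, so it is left out.

```python
def percentile(latencies_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(latencies_ms)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


samples = [120.0, 95.0, 310.0, 101.0, 98.0]
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
```

With these five samples, p50 is 101.0 and p95 is 310.0: a single slow request dominates the tail while leaving the median untouched, which is why both figures are reported.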
Cost Analysis
Per-test and total cost tracking based on token usage.
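Token-based cost is simple arithmetic: tokens multiplied by a per-token price, summed over input and output. The sketch below assumes prices quoted in USD per million tokens. The Usage class, the function names, and the prices in the usage note are hypothetical and not tied to any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts reported for one test."""
    input_tokens: int
    output_tokens: int


def cost_usd(usage: Usage, input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one test, given prices in USD per million tokens."""
    return (
        usage.input_tokens * input_price_per_mtok
        + usage.output_tokens * output_price_per_mtok
    ) / 1_000_000


def total_cost_usd(usages: list[Usage], input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Total cost of a run: the sum of its per-test costs."""
    return sum(cost_usd(u, input_price_per_mtok, output_price_per_mtok) for u in usages)
```

At illustrative prices of 2 USD per million input tokens and 8 USD per million output tokens, a test using 1,000 input and 500 output tokens costs about 0.006 USD.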
Robustness Testing
Perturbed prompt variants with injected noise test how stable a model's answers are.
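To show what a deterministic noise injection could look like, here is a minimal sketch that swaps adjacent letters at a fixed rate using a seeded random generator. The typo-style noise, the function name perturb, and its parameters are assumptions for illustration; the actual perturbation types are not described in this text.

```python
import random


def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent letter pairs with probability `rate`.

    A fixed seed makes the same prompt always yield the same variant,
    so a perturbed run stays reproducible.
    """
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so no letter moves twice
        else:
            i += 1
    return "".join(chars)
```

Comparing a model's score on the original prompt with its score on perturb(prompt, seed=s) for several seeds gives a simple stability signal.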
Multi-Provider
Support for OpenAI, Anthropic, Gemini, and OpenAI-compatible APIs.
SCIENTIFIC SCORING
Every run produces a comprehensive metric vector for objective comparison.
READY TO BENCHMARK YOUR MODELS?
No account required. Just your API key and 3 minutes to get your first benchmark report.
Start Now — It's Free