v2.0 — Vision + Multi-Provider Support

BREVITY

Test Your API.

Deterministic evaluation for LLMs and multimodal models—on your own keys. No opinions. No subjectivity. Just reproducible, scientific scoring.

75+ Models Benchmarked
200+ Tests per Suite
7 Metrics Tracked
5+ Providers Supported
Privacy Guarantee

EPHEMERAL EXECUTION

Your API keys never touch our database. Every benchmark runs with complete isolation and automatic cleanup.

RAM-Only Passthrough

API keys are held only in memory during execution. Never stored on disk or in any database.

Automatic Log Redaction

All sensitive data is automatically masked in logs. Bearer tokens and API keys are never exposed.
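A redaction pass like this can be sketched in a few lines. This is an illustrative example only: the exact patterns the product matches are not public, so the regexes and mask text below are assumptions.

```python
import re

# Hypothetical redaction filter applied before any log line is written.
# Patterns are illustrative, not the product's actual rule set.
PATTERNS = [
    re.compile(r"Bearer\s+\S+"),          # Authorization: Bearer <token>
    re.compile(r"sk-[A-Za-z0-9_-]{8,}"),  # OpenAI-style secret keys
]

def redact(line: str) -> str:
    """Replace every matched secret with a fixed mask."""
    for pattern in PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line

print(redact("Authorization: Bearer sk-abc123deadbeef"))
# -> Authorization: [REDACTED]
```

Because the mask is applied at write time, a secret never reaches the log sink in the first place.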

Reproducibility Receipts

Get cryptographic receipts for every run including suite hash, decoding profile, and backend version.
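Conceptually, a receipt is a hash over a canonical serialization of everything that determines the run. The field names and JSON canonicalization below are assumptions, a minimal sketch rather than the product's actual receipt format.

```python
import hashlib
import json

def make_receipt(suite: dict, decoding: dict, backend_version: str) -> dict:
    """Build a reproducibility receipt: hash the suite definition in a
    canonical form so the same suite always yields the same hash."""
    canonical = json.dumps(suite, sort_keys=True, separators=(",", ":"))
    return {
        "suite_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "decoding_profile": decoding,
        "backend_version": backend_version,
    }

receipt = make_receipt(
    suite={"tests": ["logic_01", "vision_03"]},
    decoding={"temperature": 0, "top_p": 1.0},
    backend_version="2.0.0",
)
```

Sorting keys before hashing matters: two semantically identical suite definitions serialized in different key orders would otherwise produce different hashes.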

HOW IT WORKS

From API key to comprehensive benchmark report in minutes.

01

Paste API Key

Select your provider and paste your API key. It stays in your browser—never stored.

02

Probe & Select

We verify capabilities like vision and strict JSON. Then you pick your model.

03

Run Benchmark

Watch tests execute in real time, with live metrics streaming to your dashboard.
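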

04

Get Report

Receive a comprehensive score breakdown, charts, and exportable PDF certificate.

COMPREHENSIVE EVALUATION

Seven metrics, multiple tracks, and deterministic evaluation across every dimension that matters.

Logic & Reasoning

Tests for logical inference, trap questions, and needle-in-haystack retrieval.

Vision Capabilities

OCR extraction, object counting, and visual reasoning tests with real images.

Code Generation

Bug fixes, algorithm implementation, and code explanation tasks.

Format Compliance

Strict JSON generation, schema validation, and structured output tests.
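A strict-JSON check is easy to make deterministic: either the reply parses as a single object with exactly the required keys, or the test fails. The required key names below are hypothetical, chosen only to illustrate the shape of such a check.

```python
import json

# Hypothetical schema for one format-compliance test: the reply must be a
# JSON object with exactly these keys, nothing more and nothing less.
REQUIRED_KEYS = {"answer", "confidence"}

def is_strict_json(reply: str) -> bool:
    """Return True only for a parseable JSON object with the exact key set."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS
```

There is no partial credit and no judge model involved, which is what keeps the score reproducible across runs.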

Latency Tracking

P50 and P95 latency measurements with efficiency-weighted scoring.
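P50 and P95 are percentiles over the per-test latency samples. A minimal sketch using the nearest-rank method (one common convention; the product's exact interpolation rule is not specified here):

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(q/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Example per-test latencies in milliseconds (made-up numbers).
latencies = [120, 340, 310, 980, 300, 290, 355, 410, 330, 365]
p50 = percentile(latencies, 50)  # typical request
p95 = percentile(latencies, 95)  # tail latency
```

P50 summarizes the typical request, while P95 exposes the slow tail that averages hide, which is why both are reported.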

Cost Analysis

Per-test and total cost tracking based on token usage.

Robustness Testing

Perturbation variants test model stability with noise injection.
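A perturbation variant can be as simple as seeded character-level noise, so the "noisy" prompt is itself reproducible. The rate and substitution scheme below are assumptions for illustration:

```python
import random

def perturb(text: str, rate: float = 0.05, seed: int = 42) -> str:
    """Replace a small fraction of letters with random ones.
    A fixed seed makes the perturbed variant identical across runs."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)
```

A robust model should give the same answer on the clean and perturbed prompts; the robustness score measures how often it does.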

Multi-Provider

Support for OpenAI, Anthropic, Gemini, and OpenAI-compatible APIs.

SCIENTIFIC SCORING

Every run produces a comprehensive metric vector for objective comparison.

results/run_abc123

Global Score: 87.4 / 100
Accuracy: 94.2% · Robustness: 89.1% · P50 Latency: 342ms · Cost: $0.024

[Radar chart: Accuracy · Robustness · Format · Speed · Cost]

Logic (24 tests): 91.2
Coding (18 tests): 88.5
Format (12 tests): 95.0
Vision (8 tests): 82.1

READY TO BENCHMARK YOUR MODELS?

No account required. Just your API key and 3 minutes to get your first benchmark report.

Start Now — It's Free