v2.0 — Vision + Multi-Provider Support

BREVITY

Test Your API.

Deterministic evaluation for LLMs and multimodal models—on your own keys. No opinions. No subjectivity. Just reproducible, scientific scoring.

75+ Models Benchmarked
200+ Tests per Suite
7 Metrics Tracked
5+ Providers Supported
Privacy Guarantee

EPHEMERAL EXECUTION

Your API keys never touch our database. Every benchmark runs with complete isolation and automatic cleanup.

RAM-Only Passthrough

API keys are held only in memory during execution. Never stored on disk or in any database.

Automatic Log Redaction

All sensitive data is automatically masked in logs. Bearer tokens and API keys are never exposed.
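A redaction pass like this can be sketched in a few lines. This is an illustrative example only: the exact patterns the product matches are not public, so the regexes and mask text below are assumptions.

```python
import re

# Hypothetical redaction filter applied before any log line is written.
# Patterns are illustrative, not the product's actual rule set.
PATTERNS = [
    re.compile(r"Bearer\s+\S+"),          # Authorization: Bearer <token>
    re.compile(r"sk-[A-Za-z0-9_-]{8,}"),  # OpenAI-style secret keys
]

def redact(line: str) -> str:
    """Replace every matched secret with a fixed mask."""
    for pattern in PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line

print(redact("Authorization: Bearer sk-abc123deadbeef"))
# -> Authorization: [REDACTED]
```

Because the mask is applied at write time, a secret never reaches the log sink in the first place.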

Reproducibility Receipts

Get cryptographic receipts for every run including suite hash, decoding profile, and backend version.
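Conceptually, a receipt is a hash over a canonical serialization of everything that determines the run. The field names and JSON canonicalization below are assumptions, a minimal sketch rather than the product's actual receipt format.

```python
import hashlib
import json

def make_receipt(suite: dict, decoding: dict, backend_version: str) -> dict:
    """Build a reproducibility receipt: hash the suite definition in a
    canonical form so the same suite always yields the same hash."""
    canonical = json.dumps(suite, sort_keys=True, separators=(",", ":"))
    return {
        "suite_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "decoding_profile": decoding,
        "backend_version": backend_version,
    }

receipt = make_receipt(
    suite={"tests": ["logic_01", "vision_03"]},
    decoding={"temperature": 0, "top_p": 1.0},
    backend_version="2.0.0",
)
```

Sorting keys before hashing matters: two semantically identical suite definitions serialized in different key orders would otherwise produce different hashes.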

HOW IT WORKS

From API key to comprehensive benchmark report in minutes.

01

Paste API Key

Select your provider and paste your API key. It stays in your browser—never stored.

02

Probe & Select

We verify capabilities like vision and strict JSON. Then you pick your model.

03

Run Benchmark

Watch tests execute in real time, with live metrics streaming to your dashboard.
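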

04

Get Report

Receive a comprehensive score breakdown, charts, and exportable PDF certificate.

COMPREHENSIVE EVALUATION

Seven metrics, multiple tracks, and deterministic evaluation across every dimension that matters.

Logic & Reasoning

Tests for logical inference, trap questions, and needle-in-haystack retrieval.

Vision Capabilities

OCR extraction, object counting, and visual reasoning tests with real images.

Code Generation

Bug fixes, algorithm implementation, and code explanation tasks.

Format Compliance

Strict JSON generation, schema validation, and structured output tests.
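A strict-JSON check is easy to make deterministic: either the reply parses as a single object with exactly the required keys, or the test fails. The required key names below are hypothetical, chosen only to illustrate the shape of such a check.

```python
import json

# Hypothetical schema for one format-compliance test: the reply must be a
# JSON object with exactly these keys, nothing more and nothing less.
REQUIRED_KEYS = {"answer", "confidence"}

def is_strict_json(reply: str) -> bool:
    """Return True only for a parseable JSON object with the exact key set."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS
```

There is no partial credit and no judge model involved, which is what keeps the score reproducible across runs.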

Latency Tracking

P50 and P95 latency measurements with efficiency-weighted scoring.
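P50 and P95 are percentiles over the per-test latency samples. A minimal sketch using the nearest-rank method (one common convention; the product's exact interpolation rule is not specified here):

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(q/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Example per-test latencies in milliseconds (made-up numbers).
latencies = [120, 340, 310, 980, 300, 290, 355, 410, 330, 365]
p50 = percentile(latencies, 50)  # typical request
p95 = percentile(latencies, 95)  # tail latency
```

P50 summarizes the typical request, while P95 exposes the slow tail that averages hide, which is why both are reported.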

Cost Analysis

Per-test and total cost tracking based on token usage.

Robustness Testing

Perturbation variants test model stability with noise injection.
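A perturbation variant can be as simple as seeded character-level noise, so the "noisy" prompt is itself reproducible. The rate and substitution scheme below are assumptions for illustration:

```python
import random

def perturb(text: str, rate: float = 0.05, seed: int = 42) -> str:
    """Replace a small fraction of letters with random ones.
    A fixed seed makes the perturbed variant identical across runs."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)
```

A robust model should give the same answer on the clean and perturbed prompts; the robustness score measures how often it does.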

Multi-Provider

Support for OpenAI, Anthropic, Gemini, and OpenAI-compatible APIs.

SCIENTIFIC SCORING

Every run produces a comprehensive metric vector for objective comparison.

results/run_abc123

Global Score: 87.4 / 100
Accuracy: 94.2% · Robustness: 89.1% · P50 Latency: 342ms · Cost: $0.024

[Radar chart: Accuracy · Robustness · Format · Speed · Cost]

Logic (24 tests): 91.2
Coding (18 tests): 88.5
Format (12 tests): 95.0
Vision (8 tests): 82.1

READY TO BENCHMARK YOUR MODELS?

No account required. Just your API key and 3 minutes to get your first benchmark report.

Start Now — It's Free