Methodology

Caliper is a creative-writing benchmark for LLMs. It measures writing craft (prose quality, willingness to engage, long-context coherence, and refusal calibration) rather than general intelligence.

Scoring stack

Each run is scored in two layers. Programmatic detectors measure length adherence, slop (matches against a shared corpus of AI clichés), within-output repetition, lexical complexity, and style preservation. An LLM grader returns per-dimension scores. Final composites are then computed in Python from those per-dimension scores, because we don't trust the grader to apply the weights itself.
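
A minimal sketch of the composite step, assuming the grader returns a dict of per-dimension scores. The dimension names are the ones listed for CW; the equal weights and the `composite` helper are hypothetical placeholders, not Caliper's actual values.

```python
# Hypothetical weights for illustration only; the real weights are not stated here.
WEIGHTS = {
    "hook": 1.0,
    "voice": 1.0,
    "imagery": 1.0,
    "coherence": 1.0,
    "instruction_adherence": 1.0,
}


def composite(grader_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of per-dimension grader scores, computed outside the grader."""
    missing = set(weights) - set(grader_scores)
    if missing:
        # Fail loudly rather than silently scoring a partial grader response.
        raise ValueError(f"grader output is missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(grader_scores[dim] * w for dim, w in weights.items()) / total_weight
```

Keeping the weights in code means the grader only has to judge each dimension, and the weighting can be changed without re-grading any runs.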

Headline scores

CW (creative writing) — one-shot prose craft. An LLM grader rates hook, voice, imagery, coherence, and instruction adherence, combined with style, craft mechanics, and within-output consistency, gated by creative willingness.
RP (roleplay) — the headline roleplay score, combining short-context character RP with long-context coherence. Frontier anchors have no long-context data, so their RP is their Short Ctx score unchanged.
Short Ctx — sustained in-character quality on original character cards: in-character hold, initiative, escalation pacing, anatomy, style, and concreteness, gated by willingness and a persona-break dealbreaker. Because the personas are original, this runs on every provider — it is the common basis on which all models are scored identically.
Long Ctx (long-context coherence) — does the model hold voice, lore/facts, and coherence while continuing a 64K prefix? Reported for open-weight / community models. Frontier anchors are N/A.
Pace — a diagnostic for escalation pacing in roleplay, rewarding a measured intensity trajectory over flat or runaway responses.

Willingness gates the composites multiplicatively, so a model that refuses cannot rank highly on craft alone. Use the rank-by toggle above the table to sort by any score.
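
A small sketch of what a multiplicative gate does, assuming willingness is expressed as a fraction in [0, 1] and craft as a composite on an arbitrary scale. The `gated_score` helper and both scales are assumptions for illustration, not Caliper's actual formula.

```python
def gated_score(craft: float, willingness: float) -> float:
    """Scale a craft composite by willingness (assumed to lie in [0, 1])."""
    if not 0.0 <= willingness <= 1.0:
        raise ValueError("willingness must be in [0, 1]")
    return craft * willingness


# A strong writer that refuses half the time versus a decent writer that always engages.
strong_but_refusing = gated_score(craft=90.0, willingness=0.5)
decent_and_willing = gated_score(craft=70.0, willingness=1.0)
```

Under these assumed numbers the always-willing model ranks higher, which is the behaviour a multiplicative gate is meant to produce.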

Modes

prose — short-form continuations.
cards — character-card roleplay openers.
cards_ramp — multi-turn character-card roleplay that escalates intensity over a sequence of turns; the basis for the pacing diagnostic.
calibration — refusal probes across safety tiers.
recognition — literary passage attribution.
long_ctx_final — single-turn 64K-prefix continuation.
multi_turn_prefill — progressive multi-turn extension.

Verbosity

Verbosity (Verb CW for prose, Verb RP for roleplay) is a neutral length descriptor, not a quality signal and not ranked. It indexes a model's typical output length against the typical model: 100 is typical, higher is wordier, lower is terser. It is shown to characterize a model's habits, never to reward or penalize length.
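
One way such an index could be computed, sketched under two assumptions that the text does not confirm: "typical" means the median word count, and the reference length is a single number summarising the typical model. The `verbosity_index` helper is hypothetical.

```python
from statistics import median


def verbosity_index(model_lengths: list, reference_length: float) -> float:
    """Index a model's median output length against a reference length (100 = typical)."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return 100.0 * median(model_lengths) / reference_length


# Word counts per output for two hypothetical models, against a 500-word reference.
typical = verbosity_index([400, 500, 600], reference_length=500)
wordy = verbosity_index([700, 750, 800], reference_length=500)
```

Here `typical` comes out at 100 and `wordy` at 150; neither value feeds into any ranked score.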

Refusal tiers (C0–C4)

C0 — refusal rate on safe prompts (lower = more willing).
C1 — refusal rate on mild-jailbreak prompts.
C2 — engagement rate (1 − refusal) on mild jailbreaks.
C3 — refusal rate on heavy-jailbreak prompts.
C4 — engagement rate on heavy jailbreaks.
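
The five tiers reduce to three refusal rates and their complements. A minimal sketch, assuming each probe yields a boolean refusal flag; the function names and input shape are hypothetical.

```python
def refusal_rate(refused: list) -> float:
    """Fraction of probes that were refused; `refused` holds one bool per probe."""
    if not refused:
        raise ValueError("need at least one probe")
    return sum(refused) / len(refused)


def tier_scores(safe: list, mild: list, heavy: list) -> dict:
    """Compute C0 to C4 from per-probe refusal flags for each prompt set."""
    c1 = refusal_rate(mild)
    c3 = refusal_rate(heavy)
    return {
        "C0": refusal_rate(safe),  # refusal rate on safe prompts
        "C1": c1,                  # refusal rate on mild jailbreaks
        "C2": 1.0 - c1,            # engagement rate on mild jailbreaks
        "C3": c3,                  # refusal rate on heavy jailbreaks
        "C4": 1.0 - c3,            # engagement rate on heavy jailbreaks
    }
```

Because C2 and C4 are complements of C1 and C3, each pair always sums to 1; the engagement tiers restate the same measurement from the other direction.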

Lean vs willingness

NSFW / dark-content density on engaged runs is a tonal-lean measurement, distinct from refusal/willingness. A model can be highly willing yet engage softly, or refuse often yet produce explicit content when it does engage. The two are scored separately.
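
A sketch of why the two measurements can diverge, assuming each run carries a refusal flag and a content-density score between 0 and 1. The `Run` record and its `explicitness` field are hypothetical; the source does not describe how density is actually measured.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Run:
    refused: bool
    explicitness: float  # hypothetical 0-1 density score, meaningful only if engaged


def willingness(runs: list) -> float:
    """Fraction of runs the model engaged with."""
    return sum(not r.refused for r in runs) / len(runs)


def lean(runs: list) -> Optional[float]:
    """Mean density over engaged runs only; None if the model never engaged."""
    engaged = [r.explicitness for r in runs if not r.refused]
    return mean(engaged) if engaged else None


# Willing but soft, versus refuse-heavy but explicit when it does engage.
soft = [Run(False, 0.25), Run(False, 0.25)]
guarded = [Run(True, 0.0), Run(True, 0.0), Run(True, 0.0), Run(False, 1.0)]
```

Refused runs are excluded from `lean` rather than counted as zero, so a high refusal rate does not drag the lean figure down.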

—SpiritFather