Caliper is a creative-writing benchmark for LLMs focused on writing craft — prose quality, willingness, long-context coherence, and refusal calibration — rather than general intelligence.
Each run is scored by a mix of programmatic detectors (length adherence, slop matching against a shared corpus of AI clichés, within-output repetition, lexical complexity, style preservation) and an LLM grader that returns per-dimension scores. Final composites are computed in Python from the grader's per-dimension outputs, because we don't trust the grader to do the weighting itself.
Willingness gates the composites multiplicatively — a model that refuses cannot rank highly on craft alone. Use the rank-by toggle above the table to sort by any score.
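As a rough illustration of the multiplicative gate described above, here is a minimal sketch. The dimension names, the weights, the 0-10 grader scale, and the function names are all assumptions made for the example; the actual Caliper dimensions and weights are not given here.

```python
# Hypothetical dimension names and weights (assumptions, not Caliper's real values).
WEIGHTS = {"prose_quality": 0.5, "coherence": 0.3, "style": 0.2}


def craft_score(dim_scores):
    """Weighted mean of the grader's per-dimension scores (assumed 0-10 scale)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS) / total_weight


def gated_composite(dim_scores, willingness):
    """Scale craft by willingness in [0, 1], so refusals pull the composite down."""
    if not 0.0 <= willingness <= 1.0:
        raise ValueError("willingness must lie in [0, 1]")
    return craft_score(dim_scores) * willingness


scores = {"prose_quality": 8, "coherence": 6, "style": 10}
always_willing = gated_composite(scores, 1.0)
half_refusing = gated_composite(scores, 0.5)
always_refusing = gated_composite(scores, 0.0)
```

Under these assumptions, a model that refuses every prompt gets a composite of zero no matter how high its craft scores are, and a model that refuses half the time keeps only half of its craft score.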
prose: short-form continuations.
cards: character-card roleplay openers.
cards_ramp: multi-turn character-card roleplay that escalates intensity over a sequence of turns; the basis for the pacing diagnostic.
calibration: refusal probes across safety tiers.
recognition: literary passage attribution.
long_ctx_final: single-turn 64K-prefix continuation.
multi_turn_prefill: progressive multi-turn extension.

Verbosity (Verb CW for prose, Verb RP for roleplay) is a neutral length descriptor, not a quality signal and not ranked. It indexes a model's typical output length against the typical model: 100 is typical, higher is wordier, lower is terser. It is shown to characterize a model's habits, never to reward or penalize length.
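One way to read the verbosity index described above is as a ratio of typical lengths scaled to 100. The sketch below assumes that "typical" means the median word count and that the reference is the median of all models' medians; the exact statistic is not specified here, and the function name and sample data are hypothetical.

```python
from statistics import median


def verbosity_index(model_lengths, lengths_by_model):
    """Index one model's typical length against the typical model (100 = typical).

    Assumption: "typical" is the median word count; the reference value is
    the median of the per-model medians.
    """
    model_typical = median(model_lengths)
    field_typical = median(median(v) for v in lengths_by_model.values())
    return 100.0 * model_typical / field_typical


# Hypothetical word counts per output for three models.
lengths = {
    "model_a": [100, 200, 300],
    "model_b": [400, 400, 400],
    "model_c": [50, 100, 150],
}
index_a = verbosity_index(lengths["model_a"], lengths)
index_b = verbosity_index(lengths["model_b"], lengths)
index_c = verbosity_index(lengths["model_c"], lengths)
```

With this sample data, model_a sits at the field median and indexes at 100, model_b is twice as wordy, and model_c is half as wordy. The index only describes length; nothing in it feeds a ranking.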
C0: refusal rate on safe prompts (lower = more willing).
C1: refusal rate on mild-jailbreak prompts.
C2: engagement rate (1 − refusal) on mild jailbreaks.
C3: refusal rate on heavy-jailbreak prompts.
C4: engagement rate on heavy jailbreaks.

NSFW / dark-content density on engaged runs is a tonal-lean measurement, distinct from refusal/willingness. A model can be highly willing yet produce soft engagement, or refuse often yet produce explicit content when it does engage. These are scored separately.
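The relationship between the C0 to C4 calibration metrics can be sketched as follows. This assumes each run has already been labeled as refused or engaged (a boolean per run); how refusals are detected is not covered here, and the function names are hypothetical.

```python
def refusal_rate(refused_flags):
    """Fraction of runs flagged as refusals (True = refused)."""
    if not refused_flags:
        raise ValueError("need at least one run")
    return sum(refused_flags) / len(refused_flags)


def calibration_metrics(safe, mild, heavy):
    """Compute C0-C4 from per-run refusal flags for each prompt tier."""
    c0 = refusal_rate(safe)
    c1 = refusal_rate(mild)
    c3 = refusal_rate(heavy)
    return {
        "C0": c0,        # refusal rate on safe prompts
        "C1": c1,        # refusal rate on mild jailbreaks
        "C2": 1.0 - c1,  # engagement rate on mild jailbreaks
        "C3": c3,        # refusal rate on heavy jailbreaks
        "C4": 1.0 - c3,  # engagement rate on heavy jailbreaks
    }


metrics = calibration_metrics(
    safe=[False, False, False, False],
    mild=[True, False, False, False],
    heavy=[True, True, False, False],
)
```

C2 and C4 carry no information beyond C1 and C3; they are the same measurements stated from the engagement side.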
—SpiritFather