Caliper Bench Leaderboard
A creative-writing benchmark for LLMs · prose craft, style, willingness
69 models · generated 2026-06-03 08:30 UTC
↑ higher is better · ↓ lower is better
C0–C3: refusal rate · C2/C4: engagement rate · EngD: harm density on engaged refusable runs