New benchmark

Choose participant and judge models, then run it.

Name

Participant models

openai/gpt-5.2openai/gpt-5-minianthropic/claude-3.5-sonnetgoogle/gemini-3-pro-previewgoogle/gemini-3-flash-previewx-ai/grok-4.1-fastdeepseek/deepseek-v3.2moonshotai/kimi-k2-thinkinganthropic/claude-sonnet-4.5xiaomi/mimo-v2-flash:freeminimax/minimax-m2.1

Selected: 0

Judge models

openai/gpt-5.2openai/gpt-5-minianthropic/claude-3.5-sonnetgoogle/gemini-3-pro-previewgoogle/gemini-3-flash-previewx-ai/grok-4.1-fastdeepseek/deepseek-v3.2moonshotai/kimi-k2-thinkinganthropic/claude-sonnet-4.5xiaomi/mimo-v2-flash:freeminimax/minimax-m2.1

Selected: 0

Grading mode

Pairwise runs every model against every other model for each judge.

Participant system prompt (optional)

Judge system prompt (optional)

Participant user prompt (required)

Judges will see this prompt plus participant outputs (blind to model identity).