New benchmark

Choose participant and judge models, then run the benchmark.

Participant models
Selected: 0
Judge models
Selected: 0
Pairwise mode runs every participant model against every other participant model, once for each judge.
Judges will see this prompt plus participant outputs (blind to model identity).
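
As a rough illustration of how many comparisons a pairwise run produces, the sketch below lists one matchup per judge for each pair of participants. It assumes each pair is compared once per judge (unordered pairs); if both presentation orders are run, the count doubles. The function name and model names are hypothetical, not part of this tool.

```python
from itertools import combinations


def pairwise_matchups(participants, judges):
    """Return (judge, model_a, model_b) for every unordered participant pair, per judge.

    Assumption: a pair is compared once per judge, not once per ordering.
    """
    return [
        (judge, a, b)
        for judge in judges
        for a, b in combinations(participants, 2)
    ]


# Hypothetical selections, for illustration only.
participants = ["model-a", "model-b", "model-c"]
judges = ["judge-x", "judge-y"]

matchups = pairwise_matchups(participants, judges)
for judge, a, b in matchups:
    print(f"{judge}: {a} vs {b}")
```

With n participants and j judges, this yields j * n * (n - 1) / 2 comparisons, so the number of runs grows quadratically as participants are added.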