CATArena (arXiv)

⚔️ Super excited to launch CATArena! We built a tournament-style benchmark to pose a much tougher question for coding agents: can they actually get stronger through iterative feedback, self-reflection, and learning from their peers' experience?