Open eval leaderboard + CI gate for autonomous coding agents (solve, score, trace).
Verification confirms publisher identity (repo ownership), not code safety. The security scan covers known CVEs and suspicious install scripts — it cannot prove the absence of malicious code.
An open, always-on leaderboard and CI gate for autonomous coding agents: every patch runs in a sandbox, every run has a public trace, and every regression fails the build.

▶ Live leaderboard: forgejudge.ahmedhobeishy.tech · playground · methodology · model swap · MCP registry

Current numbers (hidden-test = the agent never sees the failing test; $0 free tier; same harness, swap the model; 18 tasks × 3 seeds = 54 runs/model, 162 total):

| Model | pass@1 | pass@3 |

The score rises with the better…
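To make the run counts and the two score columns concrete, here is a minimal sketch of how pass@1 and pass@3 could be computed from 18 tasks × 3 seeds, plus a simple regression gate. It assumes pass@1 is the fraction of individual runs that pass and pass@3 is the fraction of tasks solved by at least one of the three seeds; the listing does not spell out the exact formulas. The function names, the `(task, seed, passed)` record shape, and the sample data are all hypothetical, not the project's actual harness.

```python
from collections import defaultdict


def pass_at_1(runs):
    """Fraction of individual runs that passed (assumed definition)."""
    return sum(1 for _task, _seed, ok in runs if ok) / len(runs)


def pass_at_3(runs):
    """Fraction of tasks solved by at least one seed (assumed definition)."""
    by_task = defaultdict(list)
    for task, _seed, ok in runs:
        by_task[task].append(ok)
    return sum(1 for results in by_task.values() if any(results)) / len(by_task)


def gate(current, baseline, tolerance=0.0):
    """Return a process exit code: 1 if the score regressed, else 0."""
    return 1 if current < baseline - tolerance else 0


# Hypothetical data for one model: 18 tasks x 3 seeds = 54 runs.
# Tasks 0-5 pass on every seed, tasks 6-11 pass only on seed 0,
# tasks 12-17 never pass. Purely illustrative, not real results.
runs = [
    (task, seed, task < 6 or (task < 12 and seed == 0))
    for task in range(18)
    for seed in range(3)
]

if __name__ == "__main__":
    print(f"runs: {len(runs)}")
    print(f"pass@1: {pass_at_1(runs):.3f}")
    print(f"pass@3: {pass_at_3(runs):.3f}")
    # A CI job could exit with this code to fail the build on a regression.
    print(f"gate exit code: {gate(pass_at_1(runs), baseline=0.40)}")
```

In a CI job, the value returned by `gate` would be passed to `sys.exit`, so a score below the stored baseline fails the build. The 162-run total in the listing is consistent with three models at 54 runs each.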