BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring
...The suite provides a common JSON task format and an evaluation harness so that research groups can contribute new tasks and reproduce results consistently. It emphasizes robustness analysis (scale trends, calibration, and areas where models systematically fail) to guide model development beyond raw accuracy. BIG-bench is as much a community process as a dataset: it encourages open sharing of tasks and findings to keep evaluations fresh and comprehensive.
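
To make the idea of a shared JSON task format concrete, here is a minimal sketch in Python. The field names (name, description, keywords, metrics, examples, input, target) are modelled on the BIG-bench JSON task schema, but they should be checked against the project's own documentation before use. The task content, the exact_match_score function, and toy_model are hypothetical stand-ins written for illustration; they are not the actual BIG-bench evaluation harness.

```python
import json

# A toy task definition. The field names are assumed to mirror the
# BIG-bench JSON schema; the task itself is invented for this example.
task_json = """
{
  "name": "toy_arithmetic",
  "description": "Answer simple addition questions.",
  "keywords": ["arithmetic"],
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "What is 2 + 3?", "target": "5"},
    {"input": "What is 10 + 7?", "target": "17"}
  ]
}
"""


def exact_match_score(task, model_fn):
    """Return the fraction of examples where the model output equals the target.

    Hypothetical scorer: a simplified stand-in for an exact-string-match metric.
    """
    examples = task["examples"]
    correct = sum(
        1 for ex in examples if model_fn(ex["input"]).strip() == ex["target"]
    )
    return correct / len(examples)


def toy_model(prompt):
    """Hypothetical stand-in for a language model: always answers "5"."""
    return "5"


task = json.loads(task_json)
score = exact_match_score(task, toy_model)
print(score)  # 0.5: the toy model gets one of the two examples right
```

Because the task is plain data and the scorer takes any callable, the same file can be scored against different models, which is the kind of reproducibility a common format is meant to support.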