[EVAL] Big-Bench Extra Hard (BBEH) #600

lewtun · 2025-03-03T15:31:56Z

Evaluation short description

Google has released BBEH to compensate for the saturation of BBH in the latest generation of LLMs. Overall, it looks like a good benchmark for probing reasoning capabilities.

Evaluation metadata

Provide all available

Paper url: https://arxiv.org/pdf/2502.19187
Github url: https://github.com/google-deepmind/bbeh
Dataset url:

lewtun added the "new task" label on Mar 3, 2025
