FlagOpen/FlagEval

FlagEval evaluation platform

FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models. It covers more than 800 open-source and closed-source models from around the world and spans more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.

🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| FlagEval | General‑purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |

🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75‑85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3‑5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large‑scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147 k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI‑24 Chinese Multimodal Multi‑type Question benchmark (3 603 exam‑style Q&A) | https://github.com/flageval-baai/CMMU |

📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |

🔭 Roadmap (2025‑2026)

  1. Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
  2. Community Challenges: quarterly leaderboard sprints to surface emerging research directions.

🤝 Contributing

We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.


📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.


🛡️ License

This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.


Maintained by the FlagEval team · Last updated: 2025‑04‑23
