FlagEval evaluation platform

FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models. It covers more than 800 open-source and closed-source models from around the world and assesses more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, across five major tasks and four categories of metrics.

🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| FlagEval | General‑purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |

🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75‑85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3‑5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large‑scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI‑24 Chinese Multimodal Multi‑type Question benchmark (3,603 exam‑style Q&A) | https://github.com/flageval-baai/CMMU |

📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |

🔭 Roadmap (2025‑2026)

Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
Community Challenges: quarterly leaderboard sprints to surface emerging research directions.
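
To make the "regression alerts" idea in the Continuous Benchmarking item concrete, here is a minimal sketch of how a nightly run could be compared against a stored baseline. Everything in it is an assumption for illustration: `Regression`, `find_regressions`, the `tolerance` parameter, and the score dictionaries are hypothetical names and are not part of FlagEval or FlagScale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    """One benchmark whose score dropped by more than the allowed tolerance."""
    benchmark: str
    baseline: float
    current: float

    @property
    def drop(self) -> float:
        # Positive value: how many points the score fell below the baseline.
        return self.baseline - self.current


def find_regressions(baseline, current, tolerance=0.5):
    """Return benchmarks whose score fell more than `tolerance` points.

    `baseline` and `current` map benchmark name -> score (hypothetical format).
    Benchmarks missing from `current` are skipped rather than flagged.
    """
    regressions = []
    for name, base_score in sorted(baseline.items()):
        if name not in current:
            continue
        if base_score - current[name] > tolerance:
            regressions.append(Regression(name, base_score, current[name]))
    return regressions


if __name__ == "__main__":
    # Placeholder benchmark names and scores, not real results.
    baseline_scores = {"benchmark_a": 80.0, "benchmark_b": 70.0}
    nightly_scores = {"benchmark_a": 78.0, "benchmark_b": 69.8}
    for reg in find_regressions(baseline_scores, nightly_scores):
        print(f"ALERT {reg.benchmark}: {reg.baseline} -> {reg.current}")
```

A fixed tolerance is the simplest possible policy; a real pipeline would likely account for run-to-run variance per benchmark before raising an alert.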

🤝 Contributing

We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.

📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.

🛡️ License

This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.

Maintained by the FlagEval team · Last updated: 2025‑04‑23
