Question: How to run the ITAS algorithm for benchmarks besides mt_bench and arena_hard_auto, for example gsm8k? #6

Open
oandreeva-nv opened this issue Feb 4, 2025 · 1 comment

oandreeva-nv commented Feb 4, 2025

[Note: edited for clarification]

Dear authors,

I was trying to run the ITAS algorithm on the GSM8K benchmark to get a task-specific ARCHON architecture. Unfortunately, I'm stuck because the benchmark appears to be unsupported.

I can see that the scripts provided under benchmarks/ and benchmarks/gsm8k can generate and evaluate answers.
However, the itas_algorithm script in the current release seems to support only "mt_bench" and "arena_hard_auto":

if self.search_config["benchmark"] in ["mt_bench", "arena_hard_auto"]:

Please let me know if I'm wrong, and what steps are necessary to get a task-specific ARCHON architecture.

My intuition is that I need to add a GSM8K entry to the question map used in power_ranker:

QUESTION_MAP = {
    "arena_hard_auto": "archon/benchmarks/arena_hard_auto/arena_questions.jsonl",
    "mt_bench": "archon/benchmarksmt_bench/FastChat/fastchat/llm_judge/data/mt_bench/question.jsonl",
}

as well as add some logic to compare each generated answer against the correct one. Is my intuition correct? Do you plan to update the code with this logic by any chance?
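
For the comparison step, a minimal sketch might look like the following. It relies on the GSM8K convention that each reference solution ends with a line of the form `#### <number>`, and it assumes (as a heuristic, not something Archon does) that the last number in the generated text is the model's final answer. The function names are hypothetical and are not part of the Archon codebase.

```python
import re
from typing import Optional

# GSM8K reference solutions end with a line of the form "#### <number>".
_REFERENCE_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")
# Any integer or decimal, optionally with thousands separators.
_NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_float(text: str) -> float:
    """Convert a matched number to float, dropping thousands separators."""
    return float(text.replace(",", ""))


def reference_answer(solution: str) -> Optional[float]:
    """Extract the final answer from a GSM8K reference solution."""
    match = _REFERENCE_RE.search(solution)
    return _to_float(match.group(1)) if match else None


def predicted_answer(generation: str) -> Optional[float]:
    """Heuristic (assumption): treat the last number in the output as the answer."""
    numbers = _NUMBER_RE.findall(generation)
    return _to_float(numbers[-1]) if numbers else None


def is_correct(generation: str, solution: str, tol: float = 1e-6) -> bool:
    """Return True when the predicted and reference answers agree within tol."""
    ref = reference_answer(solution)
    pred = predicted_answer(generation)
    if ref is None or pred is None:
        return False
    return abs(pred - ref) <= tol
```

Usage would be along the lines of `is_correct(model_output, gsm8k_row["answer"])`, where `gsm8k_row` stands for one record of the dataset. A stricter variant could require the model to emit its answer in a fixed format instead of taking the last number.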

Thanks in advance!

shloknatarajan (Collaborator) commented

You're correct; at this point in time, Archon only supports arena_hard_auto and mt_bench for sampling. The bulk of the work is getting Power Ranker to support new benchmarks, since ITAS relies on Power Ranker to decide which configurations work best. We don't have a timeline for supporting other benchmarks, but it is on the agenda. If you do implement this for your own use case and put up a PR, that would definitely help us get the integration working sooner.
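
For a benchmark with reference answers, such as GSM8K, one possible way to decide which configurations work best is to rank them by accuracy. The sketch below is only an illustration of that idea under stated assumptions: it is not Archon's Power Ranker, `rank_by_accuracy` is a hypothetical helper, and the configuration names are made up.

```python
from typing import Dict, List, Tuple


def rank_by_accuracy(results: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Rank candidate configurations by fraction of correct answers.

    `results` maps a configuration name to one boolean per benchmark
    question (True when that configuration answered correctly).
    """
    scored = []
    for name, outcomes in results.items():
        # An empty outcome list scores 0.0 rather than dividing by zero.
        accuracy = sum(outcomes) / len(outcomes) if outcomes else 0.0
        scored.append((name, accuracy))
    # Highest accuracy first; ties are broken by name so the order is stable.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical per-question outcomes for three candidate configurations.
    example = {
        "config_a": [True, False, True, True],
        "config_b": [True, True, True, True],
        "config_c": [],
    }
    for name, accuracy in rank_by_accuracy(example):
        print(f"{name}: {accuracy:.2f}")
```

A search procedure could then keep the top-ranked configurations at each step, though how this would plug into ITAS depends on the actual Power Ranker interface.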
