
removed hf token from cpu based example #464

Open
nirrozenbaum wants to merge 4 commits into main

Conversation

@nirrozenbaum (Contributor)

Trying to run the example without configuring the HF token seems to work, so this part can be removed.

CC @danehans
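
For reference, a minimal sketch of the env section once the token entry is dropped (an illustration only; it assumes nothing else in the manifest changes, and the PORT entry is the one visible in the diff further down):

# Sketch: env section of the CPU deployment after removing HUGGING_FACE_HUB_TOKEN.
# Assumes no other env entries are needed for the public model used in the example.
env:
- name: PORT
  value: "8000"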

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirrozenbaum
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from danehans and Jeffwan on March 9, 2025, 10:41
@k8s-ci-robot added the cncf-cla: yes and size/S labels on Mar 9, 2025
netlify bot commented Mar 9, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: c74605d
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67d29f2f2c062d00086934e7
😎 Deploy Preview: https://deploy-preview-464--gateway-api-inference-extension.netlify.app

@nirrozenbaum mentioned this pull request on Mar 9, 2025
@@ -29,11 +29,6 @@ spec:
env:
- name: PORT
value: "8000"
- name: HUGGING_FACE_HUB_TOKEN
Contributor

can we set memory and cpu requirements? what is the minimum amount of memory needed to load the model?

@nirrozenbaum (Contributor, Author) commented Mar 10, 2025

can we set memory and cpu requirements? what is the minimum amount of memory needed to load the model?

@ahg-g I've tested the CPU and memory utilization while loading the model and while running inference.
For loading the model, 1.5 CPUs was enough (a bit slow, but overall it seems fine). The problem was that even the simplest inference request, like "say hi", takes forever.

I tested this simple "say hi" inference request and found the following:
When I didn't limit the CPU usage, it used 13 CPUs and I got an answer within 1 second.
When I limited the CPU usage to 5 CPUs, I got an answer in 58 seconds.

With fewer than 5 CPUs I was waiting forever on a very simple inference request, which doesn't seem reasonable.

If I had to decide on a number I would put no less than 8, since we do want to show reasonable performance and not inference that takes forever.

Let me know what you think and I'll push this change.
I made a few more configuration changes, but this is the most important one.

@nirrozenbaum (Contributor, Author)

Adding more numbers to get a better picture. Using 8 CPUs I see the following:
a simple "say hi" inference request took 29 seconds.
The curl command that is written in the quickstart guide (with San Francisco) took almost 3 minutes.
So maybe even 8 is not enough.

Collaborator

Can we document this as a comment near the resources definition? This is likely useful for a first-timer.

@nirrozenbaum (Contributor, Author)

Document the measurements, or the required CPU/memory I've added?

Collaborator

Both, I think; documenting the why behind the required resources might help folks make informed decisions in the future.

@nirrozenbaum (Contributor, Author)

@kfswain I've pushed a commit with more documentation

@ahg-g (Contributor) commented Mar 9, 2025

I tried the deployment-cpu.yaml; the container crashloops and I am getting the following error:

":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 03-09 20:41:57 cpu.py:39] Cannot use None backend on CPU.
INFO 03-09 20:41:57 cpu.py:40] Using Torch SDPA backend.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

@nirrozenbaum (Contributor, Author) commented Mar 10, 2025

@ahg-g I pushed a commit with CPU and memory requirements.
I've done a lot of trial and error, and this setup seems to give reasonable response times. Of course, the more memory and CPU we allocate the better the performance we can get, but these values seem good enough to me.
BTW, we can reduce the number of replicas from 3 to 2 to lower the requirements, but no doubt it has to be a strong cluster and not a small one.

memory: "9000Mi"
requests:
cpu: "12"
memory: "9000Mi"
Contributor

Do we need the adapter-loader initContainer? We removed that for the GPU deployment.

@nirrozenbaum (Contributor, Author)

When removing the init container, I'm getting this error about a missing adapter:

INFO 03-13 07:42:00 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 03-13 07:42:02 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.30 seconds
INFO 03-13 07:42:02 api_server.py:756] Using supplied chat template:
INFO 03-13 07:42:02 api_server.py:756] None
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 879, in run_server
    await init_app_state(engine_client, model_config, app.state, args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 765, in init_app_state
    await state.openai_serving_models.init_static_loras()
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_models.py", line 96, in init_static_loras
    raise ValueError(load_result.message)
ValueError: Loading lora tweet-summary-0 failed: No adapter found for /adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811
Nirs-MBP-2:gateway-api-inference-extension nirro$ 

Contributor

The solution is to have the adapters in the flags above point directly to HF; right now they point to the volume that is created and populated by the sidecar, which is not necessary.

Contributor

Please see the GPU deployment as an example.
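
For illustration, pointing the LoRA flags straight at Hugging Face might look roughly like the sketch below. The adapter name tweet-summary-0 and the repo ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora are taken from the error output above, and the sketch assumes the vLLM version in the CPU image can resolve a Hugging Face repo ID as a LoRA path, as suggested for the GPU deployment:

# Sketch: container args referencing the adapter by its Hugging Face repo ID
# instead of the /adapters volume populated by the adapter-loader initContainer.
args:
- "--enable-lora"
- "--lora-modules"
- "tweet-summary-0=ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora"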

@ahg-g (Contributor) left a comment

I am still getting the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

@k8s-ci-robot added the needs-rebase and size/M labels and removed the size/S label on Mar 13, 2025
@nirrozenbaum (Contributor, Author) commented Mar 13, 2025

I am still getting the following error:

RuntimeError: Engine process failed to start. See stack trace for the root cause.

That's interesting. I'm failing to understand from this log what the issue could be.
@kfswain @danehans @liu-cong can you guys try to deploy the cpu-deployment from this PR (which contains the CPU and memory specifications) in your cluster and report whether it works for you?

Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
@k8s-ci-robot removed the needs-rebase label on Mar 13, 2025
@nirrozenbaum (Contributor, Author)

@ahg-g I've confirmed that this deployment works on two additional clusters.
Can you please give it another try and verify that you used the deployment version from this PR?
In addition, it would be very helpful if someone else could try to run it.

@nirrozenbaum (Contributor, Author)

And BTW, on another cluster I used, I was able to get reasonable response times for inference requests using only 4 CPUs per pod.

@ahg-g (Contributor) commented Mar 17, 2025

Update: this worked for me, including the removal of the HF token. As I mentioned in the other PR, the issue was the CPU model: I was using AMD, but when switching to Intel it worked. We need to document that somewhere or make the image multi-arch. Thanks @nirrozenbaum!
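
If we go the documentation route, one purely illustrative way to encode the constraint in the manifest (assuming node-feature-discovery is installed, so the CPU vendor label exists on nodes) would be a nodeSelector along these lines:

# Sketch: schedule the CPU deployment onto Intel nodes only. Assumes
# node-feature-discovery publishes the cpu-model.vendor_id label on nodes.
nodeSelector:
  feature.node.kubernetes.io/cpu-model.vendor_id: "Intel"

Alternatively, making the image work across CPU vendors and architectures, as suggested above, avoids the scheduling constraint altogether.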
