[API server] handle logs request in coroutine #5366

Open

aylei wants to merge 9 commits into master
Conversation

@aylei aylei commented Apr 25, 2025

Closes #4767

This PR includes the minimal changes that move /logs handling to a coroutine:

  1. introduce a coroutine context, which handles cancellation, log redirection and env var overrides;
  2. run /logs in uvicorn's event loop;

Though the task is now executed directly in the uvicorn process, we still maintain a request record for log requests to keep the behavior consistent: the user can still cancel a log request with sky api cancel and retrieve the logs again with sky api logs.

Follow-ups:

Benchmark

  1. Command: python tests/load_tests/test_load_on_server.py -n 100 --apis tail_logs -c kubernetes, under low server concurrency on a 1c2g machine (1 long worker + 2 short workers):
```
# This PR
All requests completed in 16.20 seconds

----------------------------------------------------------------------------------------------------
Kind                 Count    Total(s)   Avg(s)     Min(s)     Max(s)     P95(s)     P99(s)
----------------------------------------------------------------------------------------------------
API /tail_logs       100      1642.00    16.42      16.20      18.26      17.25      18.26

# Master
All requests completed in 229.30 seconds

Latency Statistics:
----------------------------------------------------------------------------------------------------
Kind                 Count    Total(s)   Avg(s)     Min(s)     Max(s)     P95(s)     P99(s)
----------------------------------------------------------------------------------------------------
API /tail_logs       100      11675.38   116.75     3.31       229.30     218.19     227.03
```

There is a 7x improvement on average. The bottleneck in this PR is that each log task runs in a dedicated thread and there is only 1 uvicorn worker process; GIL contention keeps the 100 log threads from being fully concurrent.

  2. Command: python tests/load_tests/test_load_on_server.py -n 100 --apis tail_logs -c aws, under unlimited concurrency in local mode (burstable workers) on a 4c16g machine:
```
# This PR
All requests completed in 56.22 seconds

Latency Statistics:
----------------------------------------------------------------------------------------------------
Kind                 Count    Total(s)   Avg(s)     Min(s)     Max(s)     P95(s)     P99(s)
----------------------------------------------------------------------------------------------------
API /tail_logs       100      5367.31    53.67      53.45      56.06      53.76      56.04

# Master
All requests completed in 90.30 seconds

Latency Statistics:
----------------------------------------------------------------------------------------------------
Kind                 Count    Total(s)   Avg(s)     Min(s)     Max(s)     P95(s)     P99(s)
----------------------------------------------------------------------------------------------------
API /tail_logs       100      7838.55    78.39      43.78      90.20      90.00      90.10
```

Resources:

```
# This PR
PEAK USAGE:
Peak CPU: 100.0%
Peak Memory: 1.50GB (11.8%)
Memory Delta: 0.6GB
Peak Short Executor Memory: 0.16GB
Peak Short Executor Memory Average: 0.16GB
Peak Long Executor Memory: 0.00GB
Peak Long Executor Memory Average: 0.00GB

# Master
PEAK USAGE:
Peak CPU: 100.0%
Peak Memory: 7.92GB (53.7%)
Memory Delta: 6.7GB
Peak Short Executor Memory: 0.20GB
Peak Short Executor Memory Average: 0.18GB
Peak Long Executor Memory: 0.19GB
Peak Long Executor Memory Average: 0.19GB
```

About 10x memory efficiency. However, the test found that tailing logs on an AWS instance is significantly slower than on a Kubernetes instance (I switched the benchmark env to AWS EC2 for accurate resource usage accounting). This might be related to the extra RPCs/CPU cycles in the AWS code path; I leave this as a follow-up since it is not actually relevant to this PR.

Tests

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@aylei aylei changed the title [API server] handle logs request in event loop [API server] handle logs request in coroutine Apr 27, 2025
@aylei aylei marked this pull request as ready for review April 27, 2025 12:50
@aylei
Collaborator Author

aylei commented Apr 27, 2025

/smoke-test -k test_minimal

@aylei aylei requested a review from Michaelvll April 27, 2025 13:06
@aylei
Collaborator Author

aylei commented Apr 28, 2025

/smoke-test

@Michaelvll Michaelvll requested a review from cg505 April 28, 2025 18:57
Collaborator

@cg505 cg505 left a comment

Wow, awesome work @aylei! Left a handful of comments, mostly questions and requests for clarifying comments.

```
break


def cancellation_guard(func):
```
Collaborator

Can we make sure this still works with the type checker? That is, it doesn't erase the types of functions it is applied to. We have this issue with some decorators.

Collaborator Author

@aylei aylei Apr 29, 2025

Yeah, I will take a look.


```
background_tasks.add_task(cancel_task)
```
Collaborator

This is pretty confusing - wouldn't this immediately cancel the task? Is this deferred for some reason? Can we leave a comment?

Collaborator

From reading the BackgroundTasks doc, I see that "background tasks" don't actually start until after the request is finished. (Contrary to how it sounds, that they would immediately begin in the background.) Still, could we clarify in a comment?

Collaborator Author

Sure, good catch!

Comment on lines 34 to 38
```
ctx = context.get()
if ctx is not None:
    v = ctx.getenv(self.env_var, str(self.default))
else:
    v = os.getenv(self.env_var, str(self.default))
```
Collaborator

Maybe we can make this entire block into a common helper function.
Also, can we check other uses of getenv? I'm a bit concerned especially about the config env vars.

Comment on lines +48 to +54
```
futs.append(
    pool.apply_async(pipe,
                     (proc.stdout, ctx.output_stream(sys.stdout))))
if proc.stderr is not None:
    futs.append(
        pool.apply_async(pipe,
                         (proc.stderr, ctx.output_stream(sys.stderr))))
```
Collaborator

Shouldn't we update process_subprocess_stream? Seems like both that code and this code are reading from proc.stdout / proc.stderr, which I'd expect to cause issues. No?

Collaborator Author

Good catch! It works coincidentally: the code path of sky logs always sets process_stream=False, so process_subprocess_stream and pipe_and_wait_process are never called at the same time in the current code base. I did not update process_subprocess_stream because it can only run in the main thread now:

```
# Do not launch a thread for stdout as the rich.status does not
```

I think we can assert process_stream=False when ctx is not None for now, and refine process_subprocess_stream when we broaden the usage of the async context.

```
@@ -390,6 +393,114 @@ def _request_execution_wrapper(request_id: str,
    logger.info(f'Request {request_id} finished')


async def execute_request(request: api_requests.Request):
```
Collaborator

Can we rename to clarify that this uses a coroutine rather than the typical request submission?

Suggested change:

```diff
-async def execute_request(request: api_requests.Request):
+async def execute_request_coroutine(request: api_requests.Request):
```

Collaborator Author

The async keyword implies the same semantics as "coroutine" here, and there would be a lint failure if execute_request were not called with await or asyncio primitives. For requests executed in the process executor, we use schedule_request. So maybe it is okay to keep the current name for brevity, wdyt?

Collaborator

I still would like some distinction. If not in the name, in a comment, but I think the name is better.

My concern is not that this function must be run in a coroutine. As you say, that's clear from the async keyword. It's more about clarifying the type of request (a "coroutine" request vs a normal executor-based request).

The main concern: If I am new to skypilot and look at this function, I might expect that all requests are going to be executed using this function. In fact, only coroutine requests use this. For other requests, there is a totally different code path that the executor uses to execute the request. I may not realize that the executor does not run requests using coroutines.

```
    request.log_path.touch()
    return request


def schedule_request(
```
Collaborator

TODO: refactor schedule_request to take a Request, for consistency with the prepare_request / execute_request flow. That way we can remove all the args and move them/the docstring to prepare_request

Collaborator Author

Agreed! I considered the same approach but postponed it to keep this PR focused. Follow-up: #5434

Development

Successfully merging this pull request may close these issues.

[API Server] optimize the cost of sky logs on server