Skip to content

Latest commit

 

History

History
90 lines (69 loc) · 7.68 KB

telemetry.md

File metadata and controls

90 lines (69 loc) · 7.68 KB

Telemetry

Clockwork's controller outputs telemetry about client requests, and about actions sent to workers.

Location

Telemetry is output by the controller process.

The location of the telemetry files is configured by the CLOCKWORK_LOG_DIR environment variable, which should point to a directory where the controller can write its output request logs. Be aware that for long experiments, these files can be GB large.

If CLOCKWORK_LOG_DIR is not specified or if it is left blank, Clockwork's controller will write its telemetry files to /local. If /local does not exist on your machine or is not writable, the controller will not write any telemetry files.

CLOCKWORK_LOG_DIR must already exist; Clockwork will not create the directory for you.

Upon startup, the controller will print the location it is writing its request and action logs to.

Requests and Actions

Clockwork's controller outputs telemetry for requests and for actions.

Requests are individual inferences sent by clients to Clockwork. A request specifies a model and provides an input payload; it will receive an output payload or an error.

Actions are commands sent by Clockwork's controllers to its workers. There are several kinds of action: Load, Infer, and Evict are the main ones. There is no 1-to-1 correspondence from requests to actions; an Infer actions can potentially batch the input of multiple requests.

Files

clockwork_request_log.tsv is a tab-separated file containing measurements for every client request received by Clockwork's controller. The first row contains column headers.

clockwork_action_log.tsv is a tab-separated file containing measurements for every action sent by clockwork's controller to workers. The first row contains column headers.

Request Log

The following snippet is taken from a clockwork_request_log.tsv:

t       request_id      result  user_id model_id        slo_factor      latency deadline        deadline_met    arrival_count   departure_count is_coldstart
1598832443793706643     68      0       357     30      0       11428945        100000000       1       0       2       1
1598832443793770631     330     0       115     9       0       9725697 100000000       1       0       2       1
1598832443794996383     506     0       357     30      0       3124478 100000000       1       2       2       0
1598832443795068092     511     0       115     9       0       2827707 100000000       1       2       2       0
  • t the arrival timestamp of the request at the controller in nanoseconds
  • request_id a unique ID for this request, generated by the controller
  • result the outcome of the request. 0 indicates success; 4 indicates a timeout; all other values indicate an error. The other values correspond to error codes from worker_api.h
  • user_id a user ID supplied by the client.
  • model_id a model ID supplied by the client.
  • slo_factor this is a "backdoor" API used for experimentation; it is mostly zero. The client can specify a positive value, which then acts as a scale factor for this request's SLO/deadline (relative to its batchsize-1 latency).
  • latency the request latency measured by the controller, from request arrival at the network layer, to request departure at the network layer.
  • deadline the SLO/deadline actually used by this request
  • deadline_met only true if result=0 and latency<=deadline
  • arrival_count when the request arrived at the controller, the number of workers with this model loaded
  • departure_count when the request departs the controller, the number of workers with this model loaded
  • is_coldstart only true if arrival_count=0

Action Log

The following snippet is taken from a clockwork_action_log.tsv:

t       action_id       action_type     status  worker_id       gpu_id  model_id        batch_size      expected_exec_duration  worker_exec_duration    expected_exec_complete  worker_exec_complete    expected_gpu_clock      worker_gpu_clock_before worker_gpu_clock        worker_copy_output_complete     worker_action_received  worker_result_sent      controller_result_enqueue       controller_action_duration      goodput requests_queued copies_loaded
1598832443789472787     2       1       0       9       1       0       1       5637000 5669663 6604169 6830646 0       0       0       0       170122  6840551 7251604 7263700 5669663 2       0
1598832443789454455     0       1       0       8       0       0       1       5637000 5672895 6605935 6826713 0       0       0       0       410550  6836940 7244669 7276903 5672895 2       0
1598832443789437665     7       1       0       3       0       1       1       5637000 5677567 6594394 6818735 0       0       0       0       325569  6829798 7183274 7297271 5677567 1       0
1598832443789463641     4       1       0       6       0       0       1       5637000 5668928 6595782 6839861 0       0       0       0       237703  6850289 7228631 7251874 5668928 2       0
1
  • t the time the action was sent by the controller to a worker
  • action_id a unique ID for this action
  • action_type 1 for Load Weights; 2 for Infer; 3 for Evict Weights. Other action types can be found in worker_api.h.
  • status the outcome of the action. 0 indicates success; all other values indicate an error. Error codes correspond to those listed in worker_api.h.
  • worker_id the worker this action is sent to
  • gpu_id the GPU on ths worker this action is sent to
  • model_id the model this action is for
  • batch_size for infer actions only
  • expected_exec_duration the worker-side action duration predicted by the controller
  • worker_exec_duration the actual worker-side action duration measured by the worker
  • expected_exec_complete the controller's predicted time when the action will complete; this incorporates the expected_exec_duration as well as any other outstanding actions. This value is relative to t.
  • worker_exec_complete the actual worker-side action completion time. This value is relative to t.
  • expected_gpu_clock the GPU clock speed anticipated by the controller.
  • worker_gpu_clock_before the GPU clock speed before the action begins executing.
  • worker_gpu_clock the GPU clock speed after the action finishes executing.
  • worker_copy_output_complete the worker-side time when inference has completed and output has been copied back to host. This value is relative to t and is adjusted to account for clock skew.
  • worker_action_received the worker-side time when the action finished receiving over the network. This value is relative to t and is adjusted to account for clock skew.
  • worker_result_sent the worker-side time when the result began sending over the network. This value is relative to t and is adjusted to account for clock skew.
  • controller_result_enqueue the controller-side time when the result was received by the worker.
  • controller_action_duration the end-to-end action duration as measured by the controller
  • goodput Used for infer actions. A goodput of 1 indicates an infer action completed in time for a request's deadline. A goodput of 0 indicates the request timed out before the action completed. Fractional values are possible due to batched inputs.
  • requests_queued the number of requests queued on the controller for this model, at the time the action was initiated.
  • copies_loaded the number of workers with this model already loaded into GPU memory, at the time the action was initiated.