Clockwork's controller outputs telemetry about client requests, and about actions sent to workers.
Telemetry is output by the controller
process.
The location of the telemetry files is configured by the CLOCKWORK_LOG_DIR
environment variable, which should point to a directory where the controller can write its output request logs. Be aware that for long experiments, these files can be GB large.
If CLOCKWORK_LOG_DIR
is not specified or if it is left blank, Clockwork's controller will write its telemetry files to /local
. If /local
does not exist on your machine or is not writable, the controller will not write any telemetry files.
CLOCKWORK_LOG_DIR
must already exist; Clockwork will not create the directory for you.
Upon startup, the controller will print the location it is writing its request and action logs to.
Clockwork's controller outputs telemetry for requests and for actions.
Requests are individual inferences sent by clients to Clockwork. A request specifies a model and provides an input payload; it will receive an output payload or an error.
Actions are commands sent by Clockwork's controllers to its workers. There are several kinds of action: Load
, Infer
, and Evict
are the main ones. There is no 1-to-1 correspondence from requests to actions; an Infer
actions can potentially batch the input of multiple requests.
clockwork_request_log.tsv
is a tab-separated file containing measurements for every client request received by Clockwork's controller. The first row contains column headers.
clockwork_action_log.tsv
is a tab-separated file containing measurements for every action sent by clockwork's controller to workers. The first row contains column headers.
The following snippet is taken from a clockwork_request_log.tsv
:
t request_id result user_id model_id slo_factor latency deadline deadline_met arrival_count departure_count is_coldstart
1598832443793706643 68 0 357 30 0 11428945 100000000 1 0 2 1
1598832443793770631 330 0 115 9 0 9725697 100000000 1 0 2 1
1598832443794996383 506 0 357 30 0 3124478 100000000 1 2 2 0
1598832443795068092 511 0 115 9 0 2827707 100000000 1 2 2 0
t
the arrival timestamp of the request at the controller in nanosecondsrequest_id
a unique ID for this request, generated by the controllerresult
the outcome of the request.0
indicates success;4
indicates a timeout; all other values indicate an error. The other values correspond to error codes fromworker_api.h
user_id
a user ID supplied by the client.model_id
a model ID supplied by the client.slo_factor
this is a "backdoor" API used for experimentation; it is mostly zero. The client can specify a positive value, which then acts as a scale factor for this request's SLO/deadline (relative to its batchsize-1 latency).latency
the request latency measured by the controller, from request arrival at the network layer, to request departure at the network layer.deadline
the SLO/deadline actually used by this requestdeadline_met
only true ifresult=0
andlatency<=deadline
arrival_count
when the request arrived at the controller, the number of workers with this model loadeddeparture_count
when the request departs the controller, the number of workers with this model loadedis_coldstart
only true ifarrival_count=0
The following snippet is taken from a clockwork_action_log.tsv
:
t action_id action_type status worker_id gpu_id model_id batch_size expected_exec_duration worker_exec_duration expected_exec_complete worker_exec_complete expected_gpu_clock worker_gpu_clock_before worker_gpu_clock worker_copy_output_complete worker_action_received worker_result_sent controller_result_enqueue controller_action_duration goodput requests_queued copies_loaded
1598832443789472787 2 1 0 9 1 0 1 5637000 5669663 6604169 6830646 0 0 0 0 170122 6840551 7251604 7263700 5669663 2 0
1598832443789454455 0 1 0 8 0 0 1 5637000 5672895 6605935 6826713 0 0 0 0 410550 6836940 7244669 7276903 5672895 2 0
1598832443789437665 7 1 0 3 0 1 1 5637000 5677567 6594394 6818735 0 0 0 0 325569 6829798 7183274 7297271 5677567 1 0
1598832443789463641 4 1 0 6 0 0 1 5637000 5668928 6595782 6839861 0 0 0 0 237703 6850289 7228631 7251874 5668928 2 0
1
t
the time the action was sent by the controller to a workeraction_id
a unique ID for this actionaction_type
1
for Load Weights;2
for Infer;3
for Evict Weights. Other action types can be found inworker_api.h
.status
the outcome of the action.0
indicates success; all other values indicate an error. Error codes correspond to those listed inworker_api.h
.worker_id
the worker this action is sent togpu_id
the GPU on ths worker this action is sent tomodel_id
the model this action is forbatch_size
for infer actions onlyexpected_exec_duration
the worker-side action duration predicted by the controllerworker_exec_duration
the actual worker-side action duration measured by the workerexpected_exec_complete
the controller's predicted time when the action will complete; this incorporates theexpected_exec_duration
as well as any other outstanding actions. This value is relative tot
.worker_exec_complete
the actual worker-side action completion time. This value is relative tot
.expected_gpu_clock
the GPU clock speed anticipated by the controller.worker_gpu_clock_before
the GPU clock speed before the action begins executing.worker_gpu_clock
the GPU clock speed after the action finishes executing.worker_copy_output_complete
the worker-side time when inference has completed and output has been copied back to host. This value is relative tot
and is adjusted to account for clock skew.worker_action_received
the worker-side time when the action finished receiving over the network. This value is relative tot
and is adjusted to account for clock skew.worker_result_sent
the worker-side time when the result began sending over the network. This value is relative tot
and is adjusted to account for clock skew.controller_result_enqueue
the controller-side time when the result was received by the worker.controller_action_duration
the end-to-end action duration as measured by the controllergoodput
Used for infer actions. A goodput of 1 indicates an infer action completed in time for a request's deadline. A goodput of 0 indicates the request timed out before the action completed. Fractional values are possible due to batched inputs.requests_queued
the number of requests queued on the controller for this model, at the time the action was initiated.copies_loaded
the number of workers with this model already loaded into GPU memory, at the time the action was initiated.