---
layout: docs
page_title: Garbage collection
description: |-
  Nomad garbage collects Access Control List (ACL) tokens, allocations, deployments, encryption root keys, evaluations, jobs, nodes, plugins, and Container Storage Interface (CSI) volumes. Learn about server-side and client-side garbage collection processes, including configuration and triggers.
---

# Garbage collection

Nomad garbage collection is not the same as garbage collection in a programming
language, but the motivation behind its design is similar: garbage collection
frees up memory allocated for objects that the scheduler no longer references or
needs. Nomad only garbage collects objects that are in a terminal state and only
after a delay to allow inspection or debugging.

Nomad runs garbage collection processes on servers and on client nodes. You may
also manually trigger garbage collection on the server.

Nomad garbage collects the following objects:

- [ACL token](#configuration)
- [Allocation](#client-side-garbage-collection)
- [CSI Plugin](#configuration)
- [Deployment](#configuration)
- [Encryption root key](#configuration)
- [Evaluation](#configuration)
- [Job](#configuration)
- [Node](#configuration)
- [Volume](#configuration)

## Cascading garbage collection

Nomad's scheduled garbage collection processes generally handle each resource
type independently. However, there is an implicit cascading relationship because
of how objects reference each other. In practice, when Nomad garbage collects a
higher-level object, Nomad also removes the object's associated sub-objects to
prevent orphaned objects.

For example, garbage collecting a job also causes Nomad to drop all of that
job's remaining evaluations, deployments, and allocation records from the state
store. Nomad removes those objects either as part of the job garbage collection
process or through each object's own garbage collection process running
immediately afterward. Nomad's scheduled garbage collection processes only
collect objects that have been terminal for at least the specified time
threshold and are no longer needed for future scheduling decisions. Note that
when you force garbage collection by running the `nomad system gc` command,
Nomad ignores the specified time threshold.

## Server-side garbage collection

The Nomad server leader runs periodic garbage collection processes that remove
objects marked for garbage collection from memory. Nomad automatically marks
some objects, such as evaluations, for garbage collection. You may also mark
jobs for garbage collection manually by running the `nomad system gc` command,
which triggers the garbage collection process immediately.

### Configuration

These settings govern garbage collection behavior on the server nodes. For
objects without a configurable interval setting, review the intervals in the
[`config.go`
file](https://github.com/hashicorp/nomad/blob/b11619010e1c83488e14e2785569e515b2769062/nomad/config.go#L564).

| Object | Interval | Threshold |
|---|---|---|
| **ACL token** | 5 minutes | [`acl_token_gc_threshold`](/nomad/docs/configuration/server#acl_token_gc_threshold)<br/>Default: 1 hour |
| **CSI Plugin** | 5 minutes | [`csi_plugin_gc_threshold`](/nomad/docs/configuration/server#csi_plugin_gc_threshold)<br/>Default: 1 hour |
| **Deployment** | 5 minutes | [`deployment_gc_threshold`](/nomad/docs/configuration/server#deployment_gc_threshold)<br/>Default: 1 hour |
| **Encryption root key** | [`root_key_gc_interval`](/nomad/docs/configuration/server#root_key_gc_interval)<br/>Default: 10 minutes | [`root_key_gc_threshold`](/nomad/docs/configuration/server#root_key_gc_threshold)<br/>Default: 1 hour |
| **Evaluation** | 5 minutes | [`eval_gc_threshold`](/nomad/docs/configuration/server#eval_gc_threshold)<br/>Default: 1 hour |
| **Evaluation, batch** | 5 minutes | [`batch_eval_gc_threshold`](/nomad/docs/configuration/server#batch_eval_gc_threshold)<br/>Default: 24 hours |
| **Job** | [`job_gc_interval`](/nomad/docs/configuration/server#job_gc_interval)<br/>Default: 5 minutes | [`job_gc_threshold`](/nomad/docs/configuration/server#job_gc_threshold)<br/>Default: 4 hours |
| **Node** | 5 minutes | [`node_gc_threshold`](/nomad/docs/configuration/server#node_gc_threshold)<br/>Default: 24 hours |
| **Volume** | [`csi_volume_claim_gc_interval`](/nomad/docs/configuration/server#csi_volume_claim_gc_interval)<br/>Default: 5 minutes | [`csi_volume_claim_gc_threshold`](/nomad/docs/configuration/server#csi_volume_claim_gc_threshold)<br/>Default: 1 hour |

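For example, a `server` block that adjusts several of these settings might look
like the following minimal sketch. All parameters come from the table above; the
values are illustrative, not recommendations:

```hcl
server {
  enabled = true

  # Collect terminal jobs every 30 minutes, once they have been
  # terminal for at least 12 hours.
  job_gc_interval  = "30m"
  job_gc_threshold = "12h"

  # Keep terminal deployments and evaluations around for 2 hours.
  deployment_gc_threshold = "2h"
  eval_gc_threshold       = "2h"

  # Batch evaluations are often inspected long after completion,
  # so retain them for 48 hours.
  batch_eval_gc_threshold = "48h"

  # Remove terminal nodes after 6 hours instead of the 24-hour default.
  node_gc_threshold = "6h"
}
```
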
### Triggers

The server garbage collection processes wake up at configured intervals to scan
for any expired or terminal objects to permanently delete, provided the object's
time in a terminal state exceeds its garbage collection threshold. For example,
a job's default garbage collection threshold is four hours, so the job must be
in a terminal state for at least four hours before the garbage collection
process permanently deletes the job and its dependent objects.

When you force garbage collection by manually running the `nomad system gc`
command, you are telling the garbage collection process to ignore thresholds and
immediately purge all terminal objects on all servers and clients.

## Client-side garbage collection

On each client node, Nomad must clean up resources from terminated allocations
to free disk and memory on the machine.

### Configuration

These settings govern allocation garbage collection behavior on each client node.

| Parameter | Default | Description |
|---|---|---|
| [`gc_interval`](/nomad/docs/configuration/client#gc_interval) | 1 minute | Interval at which Nomad attempts to garbage collect terminal allocation directories |
| [`gc_disk_usage_threshold`](/nomad/docs/configuration/client#gc_disk_usage_threshold) | 80 | Disk usage percent which Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_inode_usage_threshold`](/nomad/docs/configuration/client#gc_inode_usage_threshold) | 70 | Inode usage percent which Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_max_allocs`](/nomad/docs/configuration/client#gc_max_allocs) | 50 | Maximum number of allocations which a client tracks before triggering garbage collection of terminal allocations |
| [`gc_parallel_destroys`](/nomad/docs/configuration/client#gc_parallel_destroys) | 2 | Maximum number of parallel destroys allowed by the garbage collector |

Refer to the [client block in agent configuration
reference](/nomad/docs/configuration/client) for complete parameter descriptions
and examples.

Note that there is no time-based retention setting for allocations. Unlike jobs
or evaluations, you cannot specify how long to retain terminal allocations
before garbage collection. As soon as an allocation is terminal, it becomes
eligible for cleanup if the configured thresholds demand it.

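For example, a `client` block that tunes these settings might look like the
following minimal sketch. All parameters come from the table above; the values
are illustrative, not recommendations:

```hcl
client {
  enabled = true

  # Check for terminal allocation directories every 5 minutes
  # instead of the 1-minute default.
  gc_interval = "5m"

  # Start evicting terminal allocations once disk usage reaches 70%
  # or inode usage reaches 60%.
  gc_disk_usage_threshold  = 70
  gc_inode_usage_threshold = 60

  # Track at most 30 allocations before collecting terminal ones,
  # destroying up to 4 allocation directories in parallel.
  gc_max_allocs        = 30
  gc_parallel_destroys = 4
}
```
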
### Triggers

Nomad's client runs allocation garbage collection based on these triggers:

- Scheduled interval

  The garbage collection process launches a ticker based on the configured
  `gc_interval`. On each tick, the garbage collection process checks whether it
  needs to remove terminal allocations.

- Terminal state

  When an allocation transitions to a terminal state, Nomad marks the
  allocation for garbage collection and then signals the garbage collection
  process to run immediately.

- Allocation placement

  Nomad may preemptively run garbage collection to make room for new
  allocations. The client garbage collects older, terminal allocations if
  adding new allocations would exceed the `gc_max_allocs` limit.

- Forced garbage collection

  When you force garbage collection by running the `nomad system gc` command,
  the garbage collection process removes all terminal objects on all servers
  and clients, ignoring thresholds.

Nomad does not continuously monitor disk or inode usage to trigger garbage
collection. Instead, Nomad only checks disk and inode thresholds when one of the
aforementioned triggers invokes the garbage collection process. The
`gc_inode_usage_threshold` and `gc_disk_usage_threshold` values do not trigger
garbage collection; rather, those values influence how the garbage collector
behaves during a collection run.

### Allocation selection

When the garbage collection process runs, Nomad destroys as many terminal
allocations as needed to meet the resource thresholds. The client maintains a
priority queue of terminal allocations ordered by the time they were marked
terminal, oldest first.

The process repeatedly evicts allocations from the queue until the conditions
are back within configured limits. Specifically, the garbage collection loop
checks, in order:

1. If disk usage exceeds the `gc_disk_usage_threshold` value
1. If inode usage exceeds the `gc_inode_usage_threshold` value
1. If the allocation count exceeds the `gc_max_allocs` value

If any one of these conditions is true, the garbage collector selects the
oldest terminal allocation for removal.

After deleting one allocation, the loop re-checks the metrics and continues
removing the next-oldest allocation until all thresholds are satisfied or until
there are no more terminal allocations. This means that in a single run, the
garbage collection process may remove multiple allocations back-to-back if the
node was far over the limits. The evictions happen in termination-time order,
oldest completed allocations first.

If a node's usage and allocation count are under the limits, a normal garbage
collection cycle does not remove any allocations. In other words, periodic and
event-driven garbage collection does not delete allocations just because they
are finished; there has to be pressure or a limit reached. The exception is when
an administrative command or server-side removal triggers client-side garbage
collection. Aside from that forced scenario, the default behavior is
threshold-driven: Nomad leaves allocations on disk until it needs to reclaim
space because disk, inode, or count limits are hit.

### Task driver resources garbage collection

Most task drivers do not have their own garbage collection process. When an
allocation is terminal, the client garbage collection process communicates with
the task driver to ensure that the task's resources have been cleaned up. Note
that the Docker task driver periodically cleans up its own resources. Refer to
the [Docker task driver plugin options](/nomad/docs/drivers/docker#gc) for
details.

When a task has configured restart attempts and the task fails, the Nomad client
attempts an in-place task restart within the same allocation. The task driver
starts a new process or container for the task. If the task continues to fail
and exceeds the configured restart attempts, Nomad terminates the task and marks
the allocation as terminal. The task driver then cleans up its resources, such
as a Docker container or cgroups. When the garbage collection process runs, it
verifies that the task driver cleanup is complete before deleting the
allocation. If a task driver fails to clean up properly, Nomad logs errors but
continues the garbage collection process. Task driver cleanup failures can
delay when the allocation's resources are actually freed. For instance, if
volumes are not detached, Nomad may not fully reclaim disk space until you fix
the issue.

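For context, the restart behavior described in the previous paragraph comes from
the jobspec `restart` block. The following is a minimal sketch; the job, group,
task, and image names are illustrative:

```hcl
job "example" {
  group "web" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.27"
      }

      # Restart a failing task in place up to 2 times within a
      # 30-minute window. With mode = "fail", Nomad stops retrying
      # after the attempts are exhausted and marks the allocation
      # terminal, which makes it eligible for garbage collection.
      restart {
        attempts = 2
        interval = "30m"
        delay    = "15s"
        mode     = "fail"
      }
    }
  }
}
```
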
## Resources

- [Nomad's internal garbage collection and optimization discovery during the
  Nomad Bench project blog post](https://www.hashicorp.com/en/blog/nomad-garbage-collection-optimization-discovery-during-nomad-bench)
- Configuration

  - [client block in agent configuration](/nomad/docs/configuration/client)
  - [server block in agent configuration](/nomad/docs/configuration/server)

- [`nomad system gc` command reference](/nomad/docs/commands/system/gc)
- [System HTTP API Force GC](/nomad/api-docs/system#force-gc)