Commit 4075b0b

Docs: Add garbage collection page (#25715)

* add garbage collection page
* finish client; add resources section
* finish server section; task driver section
* add front matter description
* fix typos
* Address Tim's feedback

1 parent a4dd1c9 commit 4075b0b

2 files changed: +218 -0

---
layout: docs
page_title: Garbage collection
description: |-
  Nomad garbage collects Access Control List (ACL) tokens, allocations, deployments, encryption root keys, evaluations, jobs, nodes, plugins, and Container Storage Interface (CSI) volumes. Learn about server-side and client-side garbage collection processes, including configuration and triggers.
---

# Garbage collection

Nomad garbage collection is not the same as garbage collection in a programming
language, but the motivation behind its design is similar: garbage collection
frees up memory allocated for objects that the scheduler no longer references
or needs. Nomad only garbage collects objects that are in a terminal state, and
only after a delay that allows for inspection or debugging.

Nomad runs garbage collection processes on servers and on client nodes. You may
also manually trigger garbage collection on the server.

Nomad garbage collects the following objects:

- [ACL token](#configuration)
- [Allocation](#client-side-garbage-collection)
- [CSI Plugin](#configuration)
- [Deployment](#configuration)
- [Encryption root key](#configuration)
- [Evaluation](#configuration)
- [Job](#configuration)
- [Node](#configuration)
- [Volume](#configuration)

## Cascading garbage collection

Nomad's scheduled garbage collection processes generally handle each resource
type independently. However, there is an implicit cascading relationship because
of how objects reference each other. In practice, when Nomad garbage collects a
higher-level object, Nomad also removes the object's associated sub-objects to
prevent orphaned objects.

For example, garbage collecting a job also causes Nomad to drop all of that
job's remaining evaluations, deployments, and allocation records from state.
Nomad garbage collects those objects either as part of the job garbage
collection process or through each object's own garbage collection process
running immediately afterward. Nomad's scheduled garbage collection processes
only garbage collect objects after they have been terminal for at least the
specified time threshold and are no longer needed for future scheduling
decisions. Note that when you force garbage collection by running the
`nomad system gc` command, Nomad ignores the specified time threshold.

## Server-side garbage collection

The Nomad server leader starts periodic garbage collection processes that clean
objects marked for garbage collection from memory. Nomad automatically marks
some objects, like evaluations, for garbage collection. Alternatively, you may
manually mark jobs for garbage collection by running `nomad system gc`, which
runs the garbage collection process.

### Configuration

These settings govern garbage collection behavior on the server nodes. For
objects without a configurable interval setting, you may review the intervals
in the [`config.go`
file](https://github.com/hashicorp/nomad/blob/b11619010e1c83488e14e2785569e515b2769062/nomad/config.go#L564).

| Object | Interval | Threshold |
|---|---|---|
| **ACL token** | 5 minutes | [`acl_token_gc_threshold`](/nomad/docs/configuration/server#acl_token_gc_threshold)<br/>Default: 1 hour |
| **CSI Plugin** | 5 minutes | [`csi_plugin_gc_threshold`](/nomad/docs/configuration/server#csi_plugin_gc_threshold)<br/>Default: 1 hour |
| **Deployment** | 5 minutes | [`deployment_gc_threshold`](/nomad/docs/configuration/server#deployment_gc_threshold)<br/>Default: 1 hour |
| **Encryption root key** | [`root_key_gc_interval`](/nomad/docs/configuration/server#root_key_gc_interval)<br/>Default: 10 minutes | [`root_key_gc_threshold`](/nomad/docs/configuration/server#root_key_gc_threshold)<br/>Default: 1 hour |
| **Evaluation** | 5 minutes | [`eval_gc_threshold`](/nomad/docs/configuration/server#eval_gc_threshold)<br/>Default: 1 hour |
| **Evaluation, batch** | 5 minutes | [`batch_eval_gc_threshold`](/nomad/docs/configuration/server#batch_eval_gc_threshold)<br/>Default: 24 hours |
| **Job** | [`job_gc_interval`](/nomad/docs/configuration/server#job_gc_interval)<br/>Default: 5 minutes | [`job_gc_threshold`](/nomad/docs/configuration/server#job_gc_threshold)<br/>Default: 4 hours |
| **Node** | 5 minutes | [`node_gc_threshold`](/nomad/docs/configuration/server#node_gc_threshold)<br/>Default: 24 hours |
| **Volume** | [`csi_volume_claim_gc_interval`](/nomad/docs/configuration/server#csi_volume_claim_gc_interval)<br/>Default: 5 minutes | [`csi_volume_claim_gc_threshold`](/nomad/docs/configuration/server#csi_volume_claim_gc_threshold)<br/>Default: 1 hour |
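
For example, a server that should retain terminal jobs longer while collecting
evaluations and deployments more aggressively might tune these thresholds in
its agent configuration. The following is a minimal sketch; the values are
illustrative, not recommendations:

```hcl
# Agent configuration: server-side garbage collection tuning (illustrative).
server {
  enabled = true

  # Scan for collectable jobs every 10 minutes, but only collect jobs
  # that have been terminal for at least 24 hours.
  job_gc_interval  = "10m"
  job_gc_threshold = "24h"

  # Collect terminal evaluations and deployments after 30 minutes.
  eval_gc_threshold       = "30m"
  deployment_gc_threshold = "30m"

  # Remove nodes that have been down for at least 48 hours.
  node_gc_threshold = "48h"
}
```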

### Triggers

The server garbage collection processes wake up at configured intervals to scan
for any expired or terminal objects to permanently delete, provided the
object's time in a terminal state exceeds its garbage collection threshold. For
example, a job's default garbage collection threshold is four hours, so the job
must be in a terminal state for at least four hours before the garbage
collection process permanently deletes the job and its dependent objects.

When you force garbage collection by manually running the `nomad system gc`
command, you are telling the garbage collection process to ignore thresholds
and immediately purge all terminal objects on all servers and clients.

## Client-side garbage collection

On each client node, Nomad must clean up resources from terminated allocations
to free disk and memory on the machine.

### Configuration

These settings govern allocation garbage collection behavior on each client
node.

| Parameter | Default | Description |
| -------- | ------- | ------------- |
| [`gc_interval`](/nomad/docs/configuration/client#gc_interval) | 1 minute | Interval at which Nomad attempts to garbage collect terminal allocation directories |
| [`gc_disk_usage_threshold`](/nomad/docs/configuration/client#gc_disk_usage_threshold) | 80 | Disk usage percent that Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_inode_usage_threshold`](/nomad/docs/configuration/client#gc_inode_usage_threshold) | 70 | Inode usage percent that Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_max_allocs`](/nomad/docs/configuration/client#gc_max_allocs) | 50 | Maximum number of allocations a client tracks before triggering garbage collection of terminal allocations |
| [`gc_parallel_destroys`](/nomad/docs/configuration/client#gc_parallel_destroys) | 2 | Maximum number of parallel destroys allowed by the garbage collector |

Refer to the [client block in agent configuration
reference](/nomad/docs/configuration/client) for complete parameter
descriptions and examples.
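
For example, to make a client check for collectable allocations less often but
tolerate less disk pressure, you might set values like the following. This is a
minimal sketch; the values are illustrative, not recommendations:

```hcl
# Agent configuration: client-side garbage collection tuning (illustrative).
client {
  enabled = true

  # Check for collectable terminal allocations every 5 minutes.
  gc_interval = "5m"

  # Start collecting terminal allocations once disk usage reaches 75%
  # or inode usage reaches 60%.
  gc_disk_usage_threshold  = 75
  gc_inode_usage_threshold = 60

  # Track at most 100 allocations before collecting terminal ones, and
  # destroy up to 4 allocation directories in parallel.
  gc_max_allocs        = 100
  gc_parallel_destroys = 4
}
```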

Note that there is no time-based retention setting for allocations. Unlike jobs
or evaluations, you cannot specify how long to retain terminal allocations
before Nomad garbage collects them. As soon as an allocation is terminal, it
becomes eligible for cleanup if the configured thresholds demand it.

### Triggers

Nomad's client runs allocation garbage collection based on these triggers:

- Scheduled interval

  The garbage collection process launches a ticker based on the configured
  `gc_interval`. On each tick, the garbage collection process checks whether it
  needs to remove terminal allocations.

- Terminal state

  When an allocation transitions to a terminal state, Nomad marks the
  allocation for garbage collection and then signals the garbage collection
  process to run immediately.

- Allocation placement

  Nomad may preemptively run garbage collection to make room for new
  allocations. The client garbage collects older, terminal allocations if
  adding new allocations would exceed the `gc_max_allocs` limit.

- Forced garbage collection

  When you force garbage collection by running the `nomad system gc` command,
  the garbage collection process removes all terminal objects on all servers
  and clients, ignoring thresholds.

Nomad does not continuously monitor disk or inode usage to trigger garbage
collection. Instead, Nomad only checks disk and inode thresholds when one of
the aforementioned triggers invokes the garbage collection process. The
`gc_inode_usage_threshold` and `gc_disk_usage_threshold` values do not trigger
garbage collection; rather, those values influence how the garbage collector
behaves during a collection run.

### Allocation selection

When the garbage collection process runs, Nomad destroys as many finished
allocations as needed to meet the resource thresholds. The client maintains a
priority queue of terminal allocations ordered by the time they were marked
finished, oldest first.

The process repeatedly evicts allocations from the queue until conditions are
back within configured limits. Specifically, the garbage collection loop
checks, in order:

1. Whether disk usage exceeds the `gc_disk_usage_threshold` value
1. Whether inode usage exceeds the `gc_inode_usage_threshold` value
1. Whether the count of allocations exceeds the `gc_max_allocs` value

If any one of these conditions is true, the garbage collector selects the
oldest finished allocation for removal.

After deleting one allocation, the loop re-checks the metrics and continues
removing the next-oldest allocation until all thresholds are satisfied or until
there are no more terminal allocations. This means that in a single run, the
garbage collector may remove multiple allocations back-to-back if the node was
far over the limits. Evictions happen in termination-time order, oldest
completed allocations first.

If the node's usage and allocation count are under the limits, a normal garbage
collection cycle does not remove any allocations. In other words, periodic and
event-driven garbage collection does not delete allocations just because they
are finished; there must be resource pressure or a limit reached. The exception
is when an administrative command or server-side removal triggers client-side
garbage collection. Aside from that forced scenario, the default behavior is
threshold-driven: Nomad leaves allocations on disk until it needs to reclaim
them due to space, inode, or count limits being hit.

### Task driver resources garbage collection

Most task drivers do not have their own garbage collection process. When an
allocation is terminal, the client garbage collection process communicates with
the task driver to ensure that the task's resources have been cleaned up. Note
that the Docker task driver periodically cleans up its own resources. Refer to
the [Docker task driver plugin
options](https://developer.hashicorp.com/nomad/docs/drivers/docker#gc) for
details.
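
As an illustration, the Docker driver's own cleanup is tuned in its plugin
block rather than in the client's garbage collection settings. A minimal
sketch, using the cleanup options documented for the driver, with illustrative
values:

```hcl
# Agent configuration: Docker driver cleanup (illustrative values).
plugin "docker" {
  config {
    gc {
      # Remove a task's container when the task exits.
      container = true

      # Remove unused images, but only after they have gone unused
      # for three minutes.
      image       = true
      image_delay = "3m"
    }
  }
}
```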

When a task has configured restart attempts and the task fails, the Nomad
client attempts an in-place task restart within the same allocation. The task
driver starts a new process or container for the task. If the task continues to
fail and exceeds the configured restart attempts, Nomad terminates the task and
marks the allocation as terminal. The task driver then cleans up its resources,
such as a Docker container or cgroups. When the garbage collection process
runs, it makes sure that the task driver cleanup is done before deleting the
allocation. If a task driver fails to clean up properly, Nomad logs errors but
continues the garbage collection process. Task driver cleanup failures can
influence when the allocation's resources are truly freed. For instance, if
volumes are not detached, disk space might not be fully reclaimed until the
issue is fixed.
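
The restart budget that drives this behavior comes from the job specification's
`restart` block. A minimal sketch with illustrative values:

```hcl
# Job specification: restart policy (illustrative values).
group "example" {
  # After two failed attempts within the 30-minute interval,
  # mode = "fail" stops restarting the task; Nomad then marks the
  # allocation terminal, making it eligible for garbage collection.
  restart {
    attempts = 2
    interval = "30m"
    delay    = "15s"
    mode     = "fail"
  }
}
```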

## Resources

- [Nomad's internal garbage collection and optimization discovery during the
  Nomad Bench project blog
  post](https://www.hashicorp.com/en/blog/nomad-garbage-collection-optimization-discovery-during-nomad-bench)
- Configuration
  - [client Block in Agent Configuration](/nomad/docs/configuration/client)
  - [server Block in Agent Configuration](/nomad/docs/configuration/server)
- [`nomad system gc` command reference](/nomad/docs/commands/system/gc)
- [System HTTP API Force GC](/nomad/api-docs/system#force-gc)

website/data/docs-nav-data.json (+4)

```diff
@@ -2393,6 +2393,10 @@
       {
         "title": "IPv6 Support",
         "path": "operations/ipv6-support"
+      },
+      {
+        "title": "Garbage Collection",
+        "path": "operations/garbage-collection"
       }
     ]
   },
```
