Hardening Garage for production
Getting Garage running is an afternoon. Running it so one bad day stays small is the part almost nobody writes about. Every hardening control, ranked S to F from live production, failures kept in.
Plenty of guides get Garage running. Install the binary, set an rpc_secret, put a reverse proxy in front of port 3900, and you have a working S3 endpoint. Far fewer cover the next question: how do you run it so one bad day stays small?
We run Garage in production behind Storm Buckets, our multi-tenant S3 service. A fair amount of what's below is what serving production taught us.
We also contribute to Garage upstream. When a fix belongs in the engine, that's where we put it: idempotent bulk deletes and CORS for local-alias buckets are both merged into Garage, and the build we run is the public one.
How to read this
Everything is ranked by one question: how much worse is your day if you skip it?
- S tier: skip any one of these and the rest is decoration.
- A, B, C: real controls, descending by how much they actually buy you.
- D tier: theater. Things that feel like security without delivering it.
- F tier: footguns. Things that quietly open a hole. A few of them ship in configs people call "production-ready."
Most of these shrink your blast radius, how far a single compromise can reach. A few buy something else, like detection or durability, and those are marked where they show up, so you never mistake a logging tool for a wall.
Start at the top. If all you ever do is the S tier, you're already ahead of most Garage boxes running today.
S tier
These three decide how far a single compromise can travel before it hits something that stops it. Everything in the lower tiers assumes they are already in place.
S1. Run the container runtime rootless
The default Docker install runs its daemon as root. Every container it launches descends from a root process, and membership in the docker group is root-equivalent: anyone in that group can ask the daemon to mount the host filesystem and hand back a root shell. So a single container escape, or one compromised account that happens to be in that group, compromises the entire host.
Rootless mode moves the daemon and its containers under an unprivileged user. A container escape lands the attacker as that user, with that user's permissions and nothing else. Host root, the other services on the box, the rest of the disk: all still out of reach. That is the S-tier blast-radius argument in a single move.
Rootless is not free, and two things need sorting the first time:
- Privileged ports. A rootless daemon cannot bind 80 or 443 by default, so we don't ask it to. A host-native reverse proxy, running as its own dedicated user, owns the privileged ports; the containers bind high ports on loopback and the proxy forwards to them. That also buys a second, separate blast radius at the edge. (Full treatment in S3.)
- A group that should not be there. Installing the Docker packages can create a docker group as a side effect even when nothing uses it. Rootless never touches it, but it sits on the box as a latent escalation path. We verify the security property, that no account is a member, rather than trusting the group is simply absent.
One more thing: rootless state belongs to the user that owns it, not under /etc/. Keep the daemon's config, sockets, and data in the unprivileged user's home, and run it as a systemd user service with lingering enabled so it survives logout and returns on boot.
What to do: run the runtime rootless under a dedicated unprivileged user, put a host-native reverse proxy in front for the privileged ports, and confirm no human account sits in a docker group. If you change nothing else in this list, change this.
Read more: Docker's official rootless mode docs, https://docs.docker.com/engine/security/rootless/
Check Your Box:
docker info -f '{{.SecurityOptions}}'
# want the list to include: name=rootless
getent group docker
# want: no output, or a line that ends in an empty member list (docker:x:<gid>:)
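The group check can test the property itself (no members) rather than the group's absence. A minimal sketch, assuming the standard getent group line format (name:password:gid:members); the helper name docker_group_members is hypothetical, not part of any tool:

```shell
# Hypothetical helper: print the supplementary members of a group, one per
# line, given `getent group <name>` output on stdin
# (format: name:passwd:gid:user1,user2).
# Prints nothing when the group is absent or has no members.
docker_group_members() {
  awk -F: 'NF >= 4 && $4 != "" {
    n = split($4, m, ",")
    for (i = 1; i <= n; i++) print m[i]
  }'
}

# Usage on a live box:
#   getent group docker | docker_group_members
#   # want: no output
# Caveat: an account can also hold the group as its primary GID in
# /etc/passwd, which this line format does not show.
```

Any name it prints is an account that can talk to a rootful daemon if one is ever started on the box.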
S2. Keep admin authority off the internet-facing box
The admin token is the credential you least want exposed. It is not scoped to one account: whoever holds Garage's master admin_token can list every bucket, mint keys against any of them, and delete data cluster-wide. So it should live with the one component that uses it, be readable by nothing else, and be reachable from nothing on the public path.
The admin GUI is where it bites self-hosters: garage-webui and the like read admin_token straight out of garage.toml, ship with no auth by default, and the usual advice stops at "put a reverse proxy in front for TLS," which leaves cluster-wide authority one missing password from the internet. If you run one, turn its auth on and keep it off the public path.
If a public app you wrote fronts your storage, go further: the app holds no admin token at all. When it needs a privileged action, creating a bucket, rotating a key, it sends an authenticated request to a component on the storage node that holds the credential locally and calls the admin API over loopback. The internet-facing tier can ask for an action. It cannot perform one, and there is nothing in its config worth stealing.
Our version is Storm Pulse, an outbound-only agent that holds the credential and runs a fixed, whitelisted command set. The pattern can take many forms, a small privileged helper behind a local socket, a worker pulling from a queue, but what matters is the same: the public tier can ask, and something off the public path does the work.
Keeping authority off the public surface is not free:
- More parts, and an allowlist to maintain. A non-public component now holds the token and runs a fixed set of operations on request, so you enumerate exactly what the public app can trigger. More moving pieces than handing over the token, and that is the point: a compromised app can only ask for what is on the list.
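The allowlist idea can be sketched as a small dispatcher. This is a hypothetical illustration, not Storm Pulse: the operation names, the name validation, and the dry-run echo are all assumptions, and the two garage subcommands shown (bucket create, key create) should be checked against the CLI of the version you run.

```shell
# Hypothetical broker dispatcher: maps a requested operation to one fixed
# command and refuses everything else. It prints the command instead of
# running it, so the sketch is safe to try anywhere.
dispatch() {
  op=$1
  name=$2
  # Accept only conservative names (lowercase letters, digits, hyphens, no
  # leading hyphen) so a request cannot smuggle in flags or metacharacters.
  case $name in
    ''|-*|*[!a-z0-9-]*) echo "rejected: bad name" >&2; return 2 ;;
  esac
  case $op in
    create-bucket) echo "garage bucket create $name" ;;
    create-key)    echo "garage key create $name" ;;
    *)             echo "rejected: $op is not on the allowlist" >&2; return 1 ;;
  esac
}

# Usage:
#   dispatch create-bucket photos   # prints: garage bucket create photos
#   dispatch delete-bucket photos   # rejected, exit status 1
```

The design choice is that the public tier only ever supplies an operation name and an argument; the mapping to a real command lives on the side that holds the token.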
What to do: keep the admin token off the internet-facing component entirely; have a non-public, on-box component hold it and execute a fixed set of operations on request; and if you hold an admin token at all, scope it, give it an expiry, and store it as a tight-permission file rather than inline in any config the public tier can read.
Read more: Garage's admin API reference, scoped and expiring tokens, https://garagehq.deuxfleurs.fr/documentation/reference-manual/admin-api/
Check Your Box:
TOKEN=~/garage/etc/secrets/admin_token # the admin token file your broker reads
stat -c '%U %a' "$TOKEN"
# want: owned by the broker's user, mode 600
Anything looser fails: a world-readable token leaves every process on the node one cat from cluster-wide admin, and a group-readable one does the same for every account in that group.
S3. Keep your S3 to yourself
Most server software binds 0.0.0.0 by default: every interface, reachable by anything that can route to the box. Garage is no exception. Its S3 API on 3900, its web endpoint on 3902, and its admin API on 3903 will all listen on every interface unless you tell them not to. If those ports are publicly reachable, clients hit them raw: no TLS, no access log, no place to apply limits, and the open socket is attack surface on its own.
So bind every one of them to loopback, and put a single reverse proxy on the public ports. The proxy is the only thing listening on 0.0.0.0:443. It terminates TLS, records the real client IP, and forwards to the loopback services behind it. One front door, and everything else is unreachable from outside the box.
The payoff is the blast radius. The only internet-facing listener is now a proxy holding no data and no credentials, so compromising it gets you a forwarder, storage untouched behind it. S3, web, and admin all answer only to localhost.
This is not free, and two things bite:
- The config can say loopback while the socket does not. This is the one that got us. Our compose said 127.0.0.1:3900 and the rootless network backend published on 0.0.0.0 anyway, an upstream quirk we only caught by reading the live socket. The firewall was holding, but the bind was wrong while the config swore it was right. Check the running socket, not the file that describes it.
- The proxy inherits the jobs the services used to do. Once everything sits behind it, the proxy is where TLS terminates and where the real client IP gets recorded and forwarded. Move the listeners to loopback without setting that up and you have hidden the services and blinded your own logs at the same time. The proxy has to do the work the direct connection used to.
One more thing: bind all of them, not just the obvious one. It is easy to loopback the headline S3 port and leave the web, admin, or metrics endpoints on 0.0.0.0 because nobody thinks about them. Those are the ones that get found. Loop the whole set.
What to do: bind every service endpoint (S3, web, admin, and any metrics or health endpoint) to loopback or a unix socket; run exactly one reverse proxy on the public ports; have it terminate TLS and forward the real client IP; and verify the live sockets, not the config, because the two can disagree.
Read more: Garage's reverse-proxy cookbook, https://garagehq.deuxfleurs.fr/documentation/cookbook/reverse-proxy/
Check Your Box:
ss -tln | grep -E '(0\.0\.0\.0|\*|\[::\]):(3900|3902|3903)'
# want: no output
No output is the goal: nothing answering on all interfaces. Any line here is a service exposed to the whole internet, firewall permitting. A clean result confirms the bind only, so check the public side separately with an off-box probe to the port.
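The grep above only catches wildcard binds. A slightly stricter sketch, assuming the usual ss -tln column layout (state in the first column, local address in the fourth), flags any listener on the three Garage ports that is not on a loopback address, so a bind to one specific public IP shows up too. The function name is hypothetical:

```shell
# Hypothetical check: reads `ss -tln` output on stdin and prints the local
# address of every listener on 3900, 3902 or 3903 that is not bound to
# loopback (127.x.x.x or [::1]). No output means all three are loopback-only.
exposed_listeners() {
  awk '
    $1 == "LISTEN" {
      addr = $4
      port = addr
      sub(/.*:/, "", port)                      # text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (port ~ /^(3900|3902|3903)$/ && host !~ /^127\./ && host != "[::1]")
        print addr
    }'
}

# Usage on a live box:
#   ss -tln | exposed_listeners
#   # want: no output
```

Like the grep, this confirms the bind only; the off-box probe is still the check for what the internet can reach.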
A tier
With the S tier in place, a single compromise is boxed in: stuck on one host, no admin credential to grab, nothing exposed to the open internet. The A tier works inside those walls, shrinking what an attacker already on the box can reach and making sure you notice they got there. Ranked the same way, most to least.
A1. Keep secrets in their own files, not in the config
Garage's config holds two secrets worth protecting: the rpc_secret, the shared key any node uses to join the cluster and read or write all your data, and the admin_token from S2, the cluster-wide admin credential. Inline in garage.toml, each is exposed to everything that can read the file, and the config gets read by more than you'd think: the CLI needs it, an admin GUI mounts it, a careless commit lands it in git. The secret's blast radius becomes the config's.
Point at a file instead: rpc_secret_file, admin_token_file, metrics_token_file. The secret sits in its own file at tight permissions while the config stays as readable as it needs to be. Garage enforces this, refusing to start if a secret file is world-readable.
- Don't disable the guard. GARAGE_ALLOW_WORLD_READABLE_SECRETS=true turns that check off. If Garage won't start, fix the file mode, not the check.
What to do: move rpc_secret, admin_token, and metrics_token into separate files referenced by the *_file options, each owned by the garage user at mode 600 and kept out of version control.
Read more: Garage's configuration reference, https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/
Check Your Box:
CONFIG=~/garage/etc/garage.toml # your garage config file
grep -nE '^[[:space:]]*(rpc_secret|admin_token|metrics_token)[[:space:]]*=' "$CONFIG"
# want: no output
No output means no secret sits inline, only the _file forms. Any line is a secret in plaintext, and the file it points at still needs mode 600.
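Creating the secret file with tight permissions from the start avoids a window where it sits world-readable. A minimal sketch, assuming GNU coreutils; the helper name and the example path are hypothetical, and the 32 random bytes as hex are the shape openssl rand -hex 32 would produce:

```shell
# Hypothetical helper: write 32 random bytes (64 hex characters) to a file
# that is mode 600 before any secret material lands in it.
write_secret() {
  (
    umask 077                          # new files get no group/other bits
    : > "$1" && chmod 600 "$1" &&      # also tightens a pre-existing file
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n' >> "$1"
  )
}

# Usage (path is an assumption; point rpc_secret_file at the result):
#   write_secret ~/garage/etc/secrets/rpc_secret
#   stat -c '%a' ~/garage/etc/secrets/rpc_secret   # want: 600
```

The subshell keeps the umask change from leaking into the rest of the session.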
A2. Default-deny the firewall, and treat RPC as the one exception
A box should accept only the traffic it exists to serve and refuse the rest by default. After S3 that allow list is short: 443 (plus 80 to redirect) to the reverse proxy, SSH from wherever you administer the box, nothing else. Default-deny is what makes it hold. A debug port you forgot, a service that binds wide on its next restart, an endpoint you add later, all stay shut until you open them deliberately.
- RPC is the exception, so scope it tight. Port 3901 can't live on loopback in a multi-node cluster: the nodes reach each other on it, so it's the one Garage port that has to listen on a routable interface. Let the firewall allow 3901 only from your other nodes' addresses, never the open internet. The rpc_secret proves membership, but an exposed port is attack surface on its own, so don't lean on the secret alone. (One node, no peers: 3901 needs no external opening at all.)
What to do: set the firewall to default-deny inbound, allow only 80/443 to the proxy and SSH from your admin source, and restrict 3901 to your peer nodes' addresses, or leave it closed entirely on a single node. Deny everything else.
Read more: Garage's cluster deployment cookbook, where the RPC port and inter-node networking are covered, https://garagehq.deuxfleurs.fr/documentation/cookbook/real-world/
Check Your Box:
sudo ufw status verbose | grep -i 'default:'
# want: deny (incoming)
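The rule set described above is short enough to generate. A minimal sketch, assuming ufw is the firewall front end (as in the check above) and SSH on port 22; the function name and the addresses in the usage line are hypothetical. It only prints the commands so they can be reviewed before anything changes:

```shell
# Hypothetical generator: prints the ufw commands for a default-deny box.
# Arguments: the admin source address, then zero or more peer node addresses.
# Nothing is executed; review the output, then run it with sudo yourself.
firewall_plan() {
  admin_src=$1
  shift
  echo "ufw default deny incoming"
  echo "ufw allow 80/tcp"
  echo "ufw allow 443/tcp"
  echo "ufw allow from $admin_src to any port 22 proto tcp"
  for peer in "$@"; do
    # RPC is opened per peer, never to the world.
    echo "ufw allow from $peer to any port 3901 proto tcp"
  done
}

# Usage:
#   firewall_plan 203.0.113.10 10.0.0.2 10.0.0.3   # three-node cluster
#   firewall_plan 203.0.113.10                     # single node: no 3901 rule
```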
A3. Give every key the least access it needs
Garage has no IAM and no bucket policies. Access is one flat grant: per key, per bucket, three bits, read / write / owner. Whatever you grant a key is what it reaches if it leaks, and keys leak the way secrets do, into a repo, a client config, a log. So each grant's scope is a blast radius you're choosing.
The reflex works against you. Almost every Garage tutorial, the official quick-start included, grants --read --write --owner in one line and moves on. Most workloads need less: a backup target needs write, an asset reader needs read, and almost nothing that isn't administering the bucket needs owner.
owner is not "rw plus convenience." The owner bit lets a key change the bucket's own configuration: its website setting, its CORS, its permission grants. A leaked owner key can flip a private bucket into a public website or loosen who else can reach it, well past reading and writing objects. Reserve owner for keys that actually administer the bucket.
What to do: grant the minimum bits per key per bucket (read for consumers, read+write for writers, owner only for administration), scope each key to the buckets it needs rather than reusing one broad key, and revoke excess with garage bucket deny.
Read more: Garage's quick-start, which covers bucket allow / deny and the permission bits, https://garagehq.deuxfleurs.fr/documentation/quick-start/
Check Your Box:
KEY=my-app-key # an access key id or name
garage key info "$KEY"
# want: Authorized buckets shows only what it needs, at the lowest permission level
garage key info lists every bucket the key reaches with its R/W/O flags. The list should hold only the buckets that key needs, each at the lowest permission level that works, with O only where the key administers the bucket. An RWO key for a read-only workload is the over-grant; pare it back with garage bucket deny.
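One way to make the narrow grant the path of least resistance is to grant by role rather than by flag. A minimal sketch: the role names and the helper are hypothetical, and the printed command follows the garage bucket allow form used in the quick-start, which is worth confirming against your version:

```shell
# Hypothetical helper: prints the `garage bucket allow` command for a role,
# so the flags are chosen once instead of typed from habit.
# Roles (our naming): reader, writer, admin.
grant_cmd() {
  role=$1
  bucket=$2
  key=$3
  case $role in
    reader) flags="--read" ;;
    writer) flags="--read --write" ;;
    admin)  flags="--read --write --owner" ;;
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  echo "garage bucket allow $flags $bucket --key $key"
}

# Usage:
#   grant_cmd reader assets cdn-key
#   grant_cmd admin assets ops-key
```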
B tier
S and A handle the compromise that lands on the box: how far it travels, what it reaches, whether you notice. B is what's left once that holds, the controls that protect the data itself, from the host it lives on, from a bad disk day, and from code you never wrote.
B1. Encrypt client-side if you need protection from the operator
Everything in S and A is internal hardening, aimed at shrinking what a compromise reaches. None of it hides your data from the operator. Whoever has root on the box can read what sits on it, and binding to loopback never changes that. The only real protection from the operator is keeping the key off the host: encrypt client-side before upload, so Garage only ever stores ciphertext, or use SSE-C, where you pass the key per request and Garage discards it after. Garage deliberately holds no keys for you, because a key the server holds is one the operator can read. If that threat is in your model, encrypt before the bytes arrive. We hold ourselves to the same line.
B2. Know exactly what keeps your data durable [durability]
Durability is not a property of the filesystem you happened to get. It's the specific guarantees you can name: how many zones hold a copy (replication factor), whether metadata survives an unclean shutdown (metadata fsync plus scheduled snapshots), and whether a real backup exists off the cluster. We learned this the hard way. We committed to ZFS to protect Garage's metadata, the provider delivered LVM with ext4, a layout ZFS couldn't be layered onto, and the fix was to move durability up the stack, with metadata_fsync = true and a scheduled metadata snapshot, instead of forcing the wrong filesystem onto the box. RAID and replication are not backups either: a deleted or overwritten object is gone from every copy at once. Name your guarantees, and keep a backup that isn't the cluster.
B3. Pin and scan what you run [supply chain]
A tag is a movable label. :latest is the obvious case, but any tag, even a pinned-looking :v2.3.0, is a pointer the publisher can repoint, which makes every pull a standing grant to run code you've never reviewed. Only an image digest names exact bytes that can't be moved. Pin to the digest, scan what you pull for known CVEs, and record the versions actually running so you can answer "what's on the box" without guessing. The cost is that updates stop arriving for free, which is the point: you bump the digest on purpose, after a look.
We scan with Grype, an open-source scanner that matches a pulled image against public CVE feeds. Any scanner that fails loudly on a known-vulnerable package does the job; what matters is that nothing reaches the box unscanned.
Pin the digest, scan it, write the version down.
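A deploy script can refuse anything that is not pinned. A minimal sketch, assuming sha256 digests; the function name and the example image names in the usage lines are hypothetical:

```shell
# Hypothetical check: succeeds only if an image reference is pinned by digest,
# i.e. it ends in @sha256: followed by 64 lowercase hex characters.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Usage in a deploy script ($IMAGE is whatever reference you are about to pull):
#   is_digest_pinned "$IMAGE" || { echo "refusing unpinned image: $IMAGE" >&2; exit 1; }
# A tag-only reference such as example/garage:v2.3.0 fails the check.
```

The check says nothing about what is inside the image; it only guarantees that the bytes you scanned are the bytes you run.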
C tier
Real, but lower-leverage: controls that don't move your blast radius much yet still earn their place. Hygiene, upkeep, and the detection that tells you when something got through. Skipping one of these won't sink you the way an S-tier miss will, but a box without them is one you can't see into and can't keep current.
C1. Ship an audit log, and prune it [detection]
You can't respond to what you never saw. Your reverse proxy's access log and Garage's own API logs are the trail of which key touched what, and when, so keep a copy off the box where a compromise can't erase its own tracks. Then prune it: an access log is full of client IPs, which is personal information you now hold, so set a retention window and keep to it. Enough to investigate, no longer than you need.
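The retention window is easiest to keep when a scheduled job enforces it. A minimal sketch, assuming the shipped copies are plain files whose names contain .log and that your find supports -delete (GNU and BSD find do); the helper name, directory, and 30-day figure are assumptions:

```shell
# Hypothetical retention helper: delete log files last modified more than
# N days ago. Prints each path it removes so the pruning leaves a trace.
prune_logs() {
  dir=$1
  days=$2
  find "$dir" -type f -name '*.log*' -mtime +"$days" -print -delete
}

# Usage, e.g. from a daily timer on the log host:
#   prune_logs /var/log/garage-archive 30
```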
C2. Keep the box patched
Hardening is a snapshot. The box drifts out of it as new CVEs land against packages that were clean the day you set it up. Turn on automatic security updates (unattended-upgrades on Debian/Ubuntu) so the known holes close without you babysitting them, and reboot on a schedule to pick up the kernel fixes.
C3. Slow the brute force at the edge [detection]
Anything with SSH on the public internet gets a steady drip of credential-guessing within hours. Key-only SSH is the real defense: passwords off, root login off. fail2ban is the cheap layer on top, banning an address after a few failures so the slow guessers give up and stop filling your logs. It won't stop a targeted attacker, but it clears the background radiation so the log lines that matter stay legible.
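Whether passwords and root login are really off is a question for the running daemon's effective settings, not the file you edited. A minimal sketch, assuming the lowercase "key value" lines that sshd -T prints; the function name is hypothetical:

```shell
# Hypothetical check: reads effective sshd settings (as printed by `sshd -T`)
# on stdin and prints any setting that leaves password or root login open.
# No output means both are set to "no". Values such as prohibit-password
# still allow root to log in with a key, so they are flagged too.
ssh_login_gaps() {
  awk '
    $1 == "passwordauthentication" && $2 != "no" { print $1, $2 }
    $1 == "permitrootlogin"        && $2 != "no" { print $1, $2 }
  '
}

# Usage on a live box:
#   sudo sshd -T | ssh_login_gaps
#   # want: no output
```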
C4. Set CORS for browser apps, and gate access with SigV4
CORS feels like a security control and isn't one. It's a browser policy: it tells a browser which origins may make a cross-origin request, nothing more. On a signed S3 endpoint the real gate is SigV4: a private object needs a valid signature, and a cross-origin page can't produce one, so a permissive CORS rule never exposes it. Public objects are already public. Set CORS wide enough for your browser apps and no wider, and let the signature do the gating.
D tier
Theater: controls that feel like security and don't change what an attacker already on your box can reach. Most are real engineering for a different problem. They land in D only when they're sold as the thing they aren't.
Full-disk encryption, against a running box. LUKS and friends protect data on a powered-off or physically stolen disk. On a running server the volume is already unlocked and mounted, so root, the operator, and anything that takes root read straight through it as plaintext. FDE is real protection against a stolen or decommissioned drive and none at all against access to the live machine. If your threat is the operator or a host compromise, that's B1, not this.
RAID as a security control. RAID survives a dead disk, and that's all it does. An attacker, a bad write, a fat-fingered delete, ransomware: RAID faithfully replicates every one of them across all the disks at once. It's a durability and availability layer (B2), worth having, and it is neither a security control nor a backup. Counting it as either is the mistake.
Security through obscurity. SSH on port 2222, the admin panel at a secret path, a stack you won't disclose. None of it stops anyone who looks: scanners find the port in seconds, and the secret URL leaks the first time it lands in a log or a Referer header. Obscurity can cut drive-by noise, which is fine, as long as you never log it in the security column. And it cuts against you here specifically: the sovereignty case rests on a stack you can inspect, and you can't audit what you've hidden from yourself.
Trusting the LAN. "It's only reachable from the internal network" assumes the perimeter holds and everything inside it is friendly. One compromised container or one pivoted host turns that soft interior wide open. It's the reason S2 and S3 bind to loopback and dispatch through an allowlist rather than trust a network segment to stay clean: the enforceable boundary is the process and the request it will accept, and a subnet isn't one.
F tier
Footguns: configs that quietly open the hole while everything looks fine. The dangerous ones ship in setups people call production-ready.
Secrets inline in garage.toml. The inverse of A1. An inline rpc_secret or admin_token inherits the config's whole readership: the CLI, a mounted admin GUI, the careless git add. The file you treat as configuration is now a credential, and it leaks like one. Use the *_file forms.
GARAGE_ALLOW_WORLD_READABLE_SECRETS=true. Garage refuses to start when a secret file is world-readable. This env var turns that refusal off. It gets set to make a startup error disappear, then ships that way, silencing the one check that would have caught a 644 on your cluster key. Fix the mode, never the check.
Rootful Docker punching through the firewall. The rootful daemon writes its own iptables rules and bypasses ufw, so -p 0.0.0.0:3900:3900 is reachable from the internet while ufw status reads locked down. You believe you're default-deny; Docker opened the port underneath you. Rootless (S1) sidesteps it entirely. If you stay rootful, know that your published ports ignore the firewall.
s3_api.root_domain that swallows your endpoint host. This one bites availability, not security, but it's the Garage config trap that cost us the most. If the S3 endpoint host is itself a subdomain of root_domain, Garage reads the endpoint's own label as a bucket name, and every signed call 404s while the CLI looks fine. Path-only stacks don't need root_domain; leave it unset.
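The trap can be caught before deploy by comparing the two hostnames. A minimal sketch of the suffix rule as described above; Garage's exact matching may differ in detail, and the function name and example hosts are hypothetical:

```shell
# Hypothetical check: succeeds if the endpoint host would be read as
# "<bucket><root_domain>", i.e. it is a subdomain of root_domain.
endpoint_shadowed() {
  host=$1
  root=$2
  case $root in
    .*) ;;                       # already has the leading dot
    *)  root=".$root" ;;         # normalise to a leading dot
  esac
  case $host in
    ?*"$root") return 0 ;;       # at least one label in front of root_domain
    *)         return 1 ;;
  esac
}

# Usage:
#   endpoint_shadowed s3.example.com .example.com      # shadowed: "s3" reads as a bucket
#   endpoint_shadowed s3.example.com .s3.example.com   # fine: buckets live under the endpoint
```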
Everything here ranks by one thing because everything here does one thing: it shrinks what a single compromise reaches, or it doesn't. The smaller that reachable surface, the shorter this list needs to be, which is why simplicity is itself a control. Fewer moving parts, fewer privileged components, a stack you can read end to end. You can only secure a system you can actually see.
None of it is set-and-forget. A box you hardened in May is not hardened in November: a CVE lands, a config drifts, a restart rebinds a socket to 0.0.0.0. The initial pass ages out, and the re-check is what keeps the claim true. That is what the Check Your Box commands are for. Run them again next quarter, not just today.
If you're running Garage yourself, take the list top down and re-run the checks on a cadence. If you'd rather not run object storage at all, that's what we built Storm Buckets for. Sign up and try it out.