kivikakk.ee

conclusion: patch OpenSSH!

John Goerzen’s Easily Using SSH with FIDO2/U2F Hardware Security Keys came up yesterday, and I thought it was a good time to fix my mess of private keys. I already own a YubiKey 5C Nano, which sits in my laptop at all times, as well as a 5C NFC, which I hoped I could use with both my phone (NFC) and tablet (USB-C) for SSH when needed.

The ideal was to drop all non-SK keys, and move to using agent forwarding exclusively when authenticating between hosts — rarely needed, but nice for some remote-to-remote scps or git+ssh pushes. (Agent forwarding is traditionally frowned upon, since someone who has or gains access to your VPS can use your socket to get your agent to auth things, but that issue is greatly reduced when user presence is verified on each use, viz. requiring you to touch your key.)

Turns out all was pretty much that easy! Just two minor hiccups:

  • Termius on iOS supports FIDO2 keys, no payment required (despite what some search results say; it looks like this was a paid-only feature during beta, but isn’t any more). Non-resident Ed25519 keys work very well over NFC on iPhone, but not over USB-C on iPad. The only reference I can find is this from their “ideas” page:

     Unfortunately, iPads and iPhones with USB-C cannot be compatible with OpenSSH-generated FIDO2-based keys. Please generate new FIDO2-based keys in the Termius app. These keys are supported in OpenSSH and all Termius apps.

    Upon testing, Termius generates a non-resident ECDSA key, and that works just great. So, in the end, I have three private keys: an Ed25519 for the 5C Nano, and an Ed25519 and an ECDSA for the 5C NFC, for use over NFC and USB-C respectively.

  • The OpenSSH bundled in macOS (at time of writing, OpenSSH_9.9p2, LibreSSL 3.3.6) doesn’t support the use of these keys. I haven’t checked whether it’s non-resident SKs specifically or what, or whether it’s the version or just a matter of what support is compiled in.

    NixOS/nix-darwin 25.05 carries an OpenSSH_10.0p2, OpenSSL 3.4.1 11 Feb 2025, and it does!

    Using agent forwarding without losing what’s left of one’s humanity implies getting your ssh-agent setup working nicely. How?
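For reference, generating a non-resident SK key is a one-liner per key. A sketch — the filename and comment here are mine, and the ECDSA key for USB-C was generated inside the Termius app rather than with ssh-keygen:

```shell
# Non-resident Ed25519 FIDO2 key; the YubiKey must be plugged in
# (and touched) both now and on every later authentication.
ssh-keygen -t ed25519-sk -f ~/.ssh/id_ed25519_sk_nano -C "yubikey-5c-nano"

# The private-key file holds only a key handle: without the physical
# YubiKey it can't authenticate anything, which is the whole point.
ssh-add ~/.ssh/id_ed25519_sk_nano
```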

I looked into a few different ways, but opted for the simplest: patching OpenSSH (!?). The thought process is as follows:

  • /System/Library/LaunchAgents/com.openssh.ssh-agent.plist will put an SSH_AUTH_SOCK in your environment, which launches the system-provided ssh-agent when first addressed.
  • This is a nice, macOS-y way to do things, but we don’t want to go down any rabbit hole that involves disabling SIP, so we can’t just swap the binary path (or binary itself) out.
    • Even if we could, we still need to add launchd support to the ssh-agent in Nix. Thankfully, the needed changes are small, and easy to find; just look for __APPLE_LAUNCHD__ in ssh-agent.c.
  • In my experience, disabling the system-provided launch agent doesn’t work reliably.

We apply two patches:

  1. Add the __APPLE_LAUNCHD__ bits into the Nix-provided OpenSSH.
  2. Change SSH_AUTHSOCKET_ENV_NAME to "VYX_SSH_AUTH_SOCK".

Finally, we install our own launchd user agent (modelled upon the system one, but with our binary), which puts the socket in the VYX_SSH_AUTH_SOCK env var instead. This means we don’t need to worry about the system launch agent; it’ll only get triggered/used when something calls the system ssh binary.
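For the curious, the user agent can be declared from nix-darwin too. A sketch, assuming a hypothetical pkgs.openssh-fido carrying the two patches — I’d double-check the exact option path for launchd socket keys against the nix-darwin docs:

```nix
# modelled on /System/Library/LaunchAgents/com.openssh.ssh-agent.plist
launchd.user.agents.vyx-ssh-agent = {
  serviceConfig = {
    ProgramArguments = [ "${pkgs.openssh-fido}/bin/ssh-agent" "-l" ];
    # launchd owns the socket and publishes its path in the named env
    # var; with the second patch that's VYX_SSH_AUTH_SOCK, so the
    # system agent's SSH_AUTH_SOCK is left entirely alone.
    Sockets.Listeners.SecureSocketWithKey = "VYX_SSH_AUTH_SOCK";
  };
};
```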

Safe travels!

jackalgirls & CUE

Today it was finally time to write a policy file for one of my Anubis instances. I use Timoni as a fairly thin wrapper over CUE to write templates for my own k8s deployments, and I found it really shone in this particular instance. I’ll just tl;dr and show the code; here’s an excerpt from my blog engine’s bundle.cue, which is the “entrypoint” for compiling its manifests:

anubis: {
    secretName: "anubis-20250816-071240"
    policy: permitPaths: [{
        name:       "permit-atom-xml"
        path_regex: "^/atom\\.xml$"
    }, {
        name:       "permit-feed-xml"
        path_regex: "^/feed\\.xml$"
    }]
}

I’m aiming to expose just a minimum of configurability first. Here’s how the schema side of that is defined in config.cue:

anubis?: {
    // Needs to already exist in the target namespace. Should have key
    // "ED25519_PRIVATE_KEY_HEX".
    secretName: string
    policy?: {
        permitPaths: *[] | [...close({
            name:       string
            path_regex: string
        })]
    }
}

I grabbed the default root bot policy file from https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.yaml, and converted it to CUE with cue import botPolicies.yaml. Then we put it in the templates package, add a way to inject our config, and use the config to expand upon the defaults:

package templates

#AnubisBotPolicies: {
    #config: #Config
    //# Anubis has the ability to let you import snippets of configuration into the main
    //# configuration file. This allows you to break up your config into smaller parts
    //# that get logically assembled into one big file.
    // ...
    }, if #config.kv.anubis.policy.permitPaths != _|_
    for setting in #config.kv.anubis.policy.permitPaths {
        name:       setting.name
        path_regex: setting.path_regex
        action:     "ALLOW"
    }, {
    // ...
Finally, the bit I really like: creating the ConfigMap (which gets mounted as a volume) with the policy YAML:

#AnubisConfigMap: timoniv1.#ImmutableConfig & {
    Config=#config: #Config
    #Kind:   timoniv1.#ConfigMapKind
    #Meta:   #config.metadata
    #Suffix: "-anubis-env"
    #Data: {
        "policy.yml": yaml.Marshal(#AnubisBotPolicies & {#config: Config})
    }
}
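Stripped of the Timoni machinery, the whole trick reduces to: unify the config into the defaults, then yaml.Marshal the result. A minimal standalone sketch (all names hypothetical):

```cue
package sketch

import "encoding/yaml"

#config: permitPaths: [{name: "permit-atom-xml", path_regex: "^/atom\\.xml$"}]

// every configured path becomes an ALLOW bot-policy entry
policy: bots: [
    for p in #config.permitPaths {
        p
        action: "ALLOW"
    },
]

// what would land in the ConfigMap's data
out: "policy.yml": yaml.Marshal(policy)
```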

Note the careful lack of hand-written YAML at any stage! 💛🤍💜🖤

the Anubis character, by CELPHASE

lock tf in!!

Drawing of my fursona looking exceptionally Bunny (a bit lost), with the text “lock tf in!!” next to it.

By the wonderful Kinu; base from DeviantArt.

just keep looking deeper! the answer is in there!

The tool of the day is nftrace. I couldn’t work out why some pods weren’t able to communicate with each other across the Tailscale mesh. Suspected ACLs, suspected routes weren’t getting installed correctly (p.s. ip route show table 52 (!?)), suspected local firewalls, suspected so much. tcpdump only gets you so far.

Finally, on the target node:

$ doas -s
# nix shell nixpkgs#nftrace nixpkgs#nftables
# nftrace add ip daddr 10.59.1.213
# nftrace monitor

Try the request that isn’t making it through a bunch of times until you can isolate the exact sequence. ^C, nftrace remove, and read carefully:
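(The table and chain names in the output suggest what the nftrace wrapper sets up under the hood; in plain nftables, the equivalent would be something like the following — the hook and priority are my assumption:)

```
table inet nftrace-table {
    chain nftrace-chain {
        # hook early, before any other table sees the packet
        type filter hook prerouting priority -350; policy accept;
        # mark matching packets for tracing; `nft monitor trace`
        # then prints one record per rule they traverse
        ip daddr 10.59.1.213 meta nftrace set 1
    }
}
```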

trace id daac839a inet nftrace-table nftrace-chain packet: iif "tailscale0" ip saddr 100.67.157.26 ip daddr 10.59.1.213 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 32261 ip protocol tcp ip length 60 tcp sport 33233 tcp dport 9090 tcp flags == syn tcp window 64480
trace id daac839a inet nftrace-table nftrace-chain rule ip daddr 10.59.1.213 meta nftrace set 1 (verdict continue)
trace id daac839a inet nftrace-table nftrace-chain policy accept
trace id daac839a ip filter FORWARD packet: iif "tailscale0" oif "cni0" ip saddr 100.67.157.26 ip daddr 10.59.1.213 ip dscp cs0 ip ecn not-ect ip ttl 63 ip id 32261 ip length 60 tcp sport 33233 tcp dport 9090 tcp flags == syn tcp window 64480
trace id daac839a ip filter FORWARD rule counter packets 44827 bytes 28768164 jump KUBE-ROUTER-FORWARD (verdict jump KUBE-ROUTER-FORWARD)
trace id daac839a ip filter KUBE-ROUTER-FORWARD rule ip daddr 10.59.1.213 counter packets 5001 bytes 6279235 jump KUBE-POD-FW-FIAOHC4WHRKERAQ6 (verdict jump KUBE-POD-FW-FIAOHC4WHRKERAQ6)
trace id daac839a ip filter KUBE-POD-FW-FIAOHC4WHRKERAQ6 rule counter packets 5 bytes 300 jump KUBE-NWPLCY-ZYSQVVSY5LQY7Q46 (verdict jump KUBE-NWPLCY-ZYSQVVSY5LQY7Q46)
trace id daac839a ip filter KUBE-NWPLCY-ZYSQVVSY5LQY7Q46 rule limit rate 10/minute burst 10 packets meta mark & 0x00010000 != 0x00010000 counter packets 5 bytes 300 log prefix "DROP by policy monitoring/prometheus-k8s" group 100 (verdict continue)
trace id daac839a ip filter KUBE-POD-FW-FIAOHC4WHRKERAQ6 rule meta mark & 0x00010000 != 0x00010000 limit rate 10/minute burst 10 packets counter packets 5 bytes 300 log group 100 (verdict continue)
trace id daac839a ip filter KUBE-POD-FW-FIAOHC4WHRKERAQ6 rule meta mark & 0x00010000 != 0x00010000 counter packets 5 bytes 300 reject (verdict drop)

What’s that? log prefix "DROP by policy monitoring/prometheus-k8s"?? Woooooow.

devops? devops! (part 1)

This is part 1 of x in a series.

I have spent most of my life avoiding DevOps-y type things. At GitHub I got familiar enough with kubectl to help debug the applications I had deployed on it, but that was almost a decade ago and I don’t remember a single bit of it.

Most of the things I run I deploy with a really simple systemd unit definition in the Nix module. Here’s an excerpt from the one for the Elixir app this blog ran on:

{
  systemd.services.kv = {
    description = "kv";
    enableStrictShellChecks = true;
    wantedBy = [ "multi-user.target" ];
    after = [ "kv-migrations.service" ];
    requires = [
      "postgresql.service"
      "kv-migrations.service"
    ];
    script = ''
      export KV_STORAGE_ROOT="$STATE_DIRECTORY"
      ${envVarScript}
      ${cfg.package}/bin/kv-server
    '';
    serviceConfig = {
      User = cfg.user;
      ProtectSystem = "strict";
      PrivateTmp = true;
      UMask = "0007";
      Restart = "on-failure";
      RestartSec = "10s";
      StateDirectory = "kv";
      StateDirectoryMode = "0750";
    };
    inherit environment;
  };
}

It’s very basic, and it worked beautifully! I love that, with NixOS, you can package a reproducible build (with all its dependencies), deployment strategy, and configuration schema all in one place. It’s so damn clean, and it works wonderfully for homelab- or personal services-scale systems. (For more, try Xe’s All Systems Go! talk, Writing your own NixOS modules for fun and (hopefully) profit.)

The downside is that this is not exactly a high-availability setup. When any of the dependencies of a service like this change — such as a new cfg.package, or a change in environment — the existing service is stopped, the unit is swapped out, and then the new one is started.

There can often be 10–30 seconds between the stop and start, depending on how much else the nixos-rebuild has to do. And while a failing build won’t leave you with a stopped service — you won’t even get that far — if the build succeeds, but the new service fails to come up for some reason, then you’ll be scrambling fast.

This being NixOS, getting your service back up is as easy as switching to the previous generation, and can be done very fast, but still, it’s not great. Realising this, and still very much wanting to use Nix as a build orchestrator in places where this isn’t an acceptable trade-off, it was time to learn a devops.
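Concretely, that recovery amounts to one command on the affected host (run as root):

```shell
# switch straight back to the previous system generation
nixos-rebuild switch --rollback

# or list generations first and pick a known-good one
nix-env -p /nix/var/nix/profiles/system --list-generations
```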


Structurally, Kubernetes seems relatively sound, giving us language for defining the shape of a deployed system upon many different axes. It is very YAML and it is very containers, neither of which I am the hugest fan of, but I felt pretty sure there would be tools to help with the former, and Nix my beloved has beautiful solutions for the latter.

If, like me before the start of this exercise, you don’t really know the model Kubernetes gives you to work with, you might find David Ventura’s blog post A skeptic’s first contact with Kubernetes useful. If I had found it before, and not immediately after, coming this far, it would’ve been super helpful -_-

One thing worth mentioning is that, as a Very Nix Person (and Very Dissociated Person), I really need my infrastructure to be described in a version-controlled way. Ideally, I would be able to tie all of my infra back into the same place (which is vyx, a Nix flake).

So I decide to start up a cluster and begin experimenting. I hate Docker, Inc. with a passion — I will never forgive them for getting rid of Docker for Mac’s cute whale — plus I want to learn somewhere where I can actually deploy things, so I decide to start with k3s on my VPS. How I chose k3s to begin with, I’m not so sure — maybe because it has relatively few options exposed in its Nix module. Lightweight sounds good, and it’s a “certified Kubernetes distribution”. Whatever that means, it must be good!

NixOS has the option services.k3s.manifests, which is described as “auto-deploying manifests”. Perhaps this is the magic sauce I need to get my infrastructure as code!?

(The answer is, no, it isn’t — the entire cluster is restarted when you change its values, because NixOS. Teehee.)

Nonetheless, I struggled through writing some early manifests this way. Writing YAML in Nix is way better than writing YAML, and very easy to parameterise, extract functions, and so on. I had seen mention of Helm charts here and there, and while I felt like one day I would need to come to terms with them, I preferred to leave that until as late an opportunity as possible. As a bonus, using k3s auto-deploying manifests in this way meant I could write a NixOS module to deploy an application in Kubernetes, without a single line of raw YAML.
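As a sketch of what one of those looked like (the option names are from the NixOS k3s module; the manifest content itself is a hypothetical, minimal Deployment):

```nix
# NixOS serialises `content` to YAML and drops it into
# /var/lib/rancher/k3s/server/manifests/, which k3s auto-applies.
services.k3s.manifests.whoami = {
  target = "whoami.yaml";
  content = {
    apiVersion = "apps/v1";
    kind = "Deployment";
    metadata = { name = "whoami"; namespace = "default"; };
    spec = {
      replicas = 1;
      selector.matchLabels.app = "whoami";
      template = {
        metadata.labels.app = "whoami";
        spec.containers = [{
          name = "whoami";
          image = "traefik/whoami:v1.10";
        }];
      };
    };
  };
};
```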

So, terrible in many respects — now bringing down an entire cluster on each change instead of just the relevant services (!!!) — but an introduction nonetheless. We are now at the following point:

  • Decided to turn that homelab server into a gaming PC instead, haha psyche! Instead decided to learn better how to cross-build things and operate k3s without trying to shove everything through a NixOS module.

Part 2 will cover building our own software ready for orchestration (using Nix — we won’t write a single Dockerfile, promise, and as a little bit of a spoiler, we won’t write a single Go template either), and the unique fun presented by developing on aarch64-darwin while largely deploying to x86_64-linux. :)