My GCP Compute VM times out: it's likely a Cloud NAT problem

TL;DR: go straight to the solutions

Throughout my entire career I have always considered working for startups a risky business. Recently I joined one anyway, simply because working for big tech in times of a world crisis doesn't guarantee job security either. As for the other perks of startups, most people emphasize agility. I wouldn't call that an advantage I can appreciate myself: I'm terribly bad at being agile, and working mostly in infrastructure operations I already have enough stress and context switching to deal with. Instead, I like the other aspect: the exposure to the whole infrastructure stack, with all possible variations of tech implementation. This is how I make sure I'll learn a lot, and learn it quickly.

Problem

Now back to the meat and potatoes. My first task at my new job (at the time of writing) was to set up an in-house GitHub Actions runner, so we could have faster and more predictable builds at a reasonable price. The lift-and-shift approach mostly worked, except for one nasty thing: helmfile diff would never run to completion and would eventually time out.

I should mention first that the VM the GitHub Actions runner is supposed to run on resides in a separate GCP project and runs helm against GKE clusters in other projects (dev, staging, production).

I spent the next three hours in a desperate attempt to identify the root cause of the problem. The first thing to suspect was some kind of distro-bound bug, and indeed, spawning a fresh machine running Debian 12 (vs Ubuntu 22.04 in the initial setup) seemed to have an immediate effect. However, replacing the image in the Terraform manifest resulted in the same failure as before: the CI pipeline breaks with no real evidence of where the timeouts come from.

My second guess was a firewall rule applied to the Kubernetes API endpoint, but then running curl in a loop against my own server outside of GCP's VPC yielded the same result: after the 64th call I was hitting a timeout. But why exactly 64? After reading the Cloud NAT documentation everything started making sense: by default Cloud NAT allocates only 64 source ports per VM, and a closed connection keeps its ip:port binding reserved for the whole TIME_WAIT period, so a rapid series of short-lived connections exhausts the allocation and every subsequent outbound connection simply hangs until it times out.

Possible solutions

  1. The most obvious one: assign a public IP address to the VM. That eliminates the need for Cloud NAT altogether, along with its shortcomings (a minimal Terraform sketch follows after this list).

  2. If you absolutely have to keep your VMs behind Cloud NAT, you can adjust how long NAT ip:port bindings are held and how many ports each VM gets. You can do this by applying the following Terraform code:

module "cloud-nat" {
  source        = "terraform-google-modules/cloud-nat/google"

  ...
  # See https://cloud.google.com/nat/docs/ports-and-addresses for more details
  # on circumventing CloudNAT limitations
  enable_dynamic_port_allocation  = true
  tcp_transitory_idle_timeout_sec = 1
  tcp_time_wait_timeout_sec       = 1
  max_ports_per_vm                = 1024
  min_ports_per_vm                = 32
}
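
For the first option, here is a minimal Terraform sketch of what attaching an ephemeral external IP to the runner VM could look like. All values below (instance name, machine type, zone, image, network) are placeholders rather than my actual configuration; the relevant part is the empty access_config block:

resource "google_compute_instance" "gha_runner" {
  name         = "gha-runner"        # placeholder name
  machine_type = "e2-standard-4"     # placeholder machine type
  zone         = "europe-west1-b"    # placeholder zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"

    # An empty access_config block attaches an ephemeral external IP,
    # so outbound traffic bypasses Cloud NAT entirely.
    access_config {}
  }
}

Keep in mind that a public IP exposes the VM directly to the internet, so the VPC firewall rules need to be tightened accordingly.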

That's it! I hope this memo saved you some brain cells, grey hair, and precious sleep. If you have a suggestion, do not hesitate to @ me on GitHub.

Wed, 12 Jun 2024 22:38:56 +0200
