404inc Stack Infrastructure Layer

Infrastructure as a contract, not a screenshot.

If your platform's topology lives in someone's head, in a console, or in a Slack channel from 2022 — you don't have infrastructure, you have liabilities. We treat the platform substrate the same way we treat code: declarative, reviewed, version-controlled, and reproducible from a clean checkout.

The platform substrate, in production.

A live snapshot from one of our multi-region deployments. Every node is observed. Every change is in git.

/ DEPLOYS / DAY: 11 (▲ across 14 regions)
/ EDGE P50: 22ms (▼ global · Cloudflare)
/ MTTR: 6.2min (▼ mean time to recovery)
/ UPTIME · 90d: 99.997% (3.1 min total downtime)
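
The uptime percentage and the downtime figure are two views of the same number. A minimal sketch of the arithmetic, assuming a 90-day window of exactly 129,600 minutes and assuming the displayed percentage is truncated rather than rounded (the page does not say which). The function names are illustrative.

```python
import math

WINDOW_DAYS = 90
MINUTES_PER_DAY = 24 * 60


def availability_percent(downtime_minutes: float, window_days: int = WINDOW_DAYS) -> float:
    """Return availability over the window as a percentage."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def truncate(value: float, decimals: int) -> float:
    """Truncate (not round) a value to the given number of decimal places."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


# 3.1 minutes of downtime across 90 days.
uptime = availability_percent(3.1)
displayed = truncate(uptime, 3)
```

Under these assumptions `uptime` comes out just above 99.9976 and `displayed` is 99.997, which matches the card.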

When we reach for which platform.

Topology is contract design. We pick the platform by where the workload actually lives, the latency the user actually feels, and the cost of migrating off five years from now.

Compared on: SCALE CEILING · OPS COMPLEXITY · EDGE LATENCY · VENDOR LOCK

PLATFORM              VERDICT
Kubernetes            USE · AT SCALE
Cloudflare Workers    USE · EDGE
Fly.io                USE · MID
AWS ECS / Fargate     USE
AWS Lambda            WATCH
Heroku                RESCUE ONLY
Bare Metal            SPECIAL CASE

Thirty years of compute substrates.

From racks under the desk to functions at the edge. Each shift collapsed a class of problem and surfaced a new one. We've operated through every era — and remember what each one cost.

1995 — 2008
The bare metal era
Capacity was a procurement problem. Servers had names. Migrations took weekends. The good lessons: predictable performance, intimate knowledge of the hardware. The bad: a single rack failure took the product down.
Rackspace · Colo · Apache · nginx
2006 — 2014
The virtualization turn
EC2 made compute fungible. We stopped naming servers and started cattle-tagging them. Auto-scaling groups, AMIs, security groups — primitive, durable, still in use today.
EC2 · VMware · Chef · Puppet
2014 — 2020
The container consolidation
Docker collapsed "works on my machine." Kubernetes won the orchestration war by being thoroughly unopinionated. We built infrastructure-as-code into our DNA — Terraform across every account, Helm at the application boundary.
Docker · Kubernetes · Terraform · Helm
2018 — NOW
The serverless & edge era
Latency stopped being a backend problem and became a geography problem. Cloudflare Workers, Fly.io, and Lambda@Edge let us push compute within 50ms of every user on earth. The trade-off: smaller execution windows, stricter contracts.
CF Workers · Fly.io · Lambda · Vercel
2024 — NOW
The declarative platform era
The platform itself is now a product. Internal Developer Platforms, GitOps, and OpenTelemetry across every layer. The infra team's job stopped being to provision and started being to design abstractions teams actually want to use.
Pulumi · Crossplane · ArgoCD · OTel

Anatomy of a global edge.

A live view of one of our production platforms. Fourteen regions, three failover paths each, every route observed in real time.

Live route topology: 14 REGIONS · ALL HEALTHY
iad · sfo · gru · lhr · fra · jnb · bom · sin · nrt · syd
p50: 22ms · p99: 180ms · 2.3M req/s sustained
degraded: 0 · offline: 0

Six months of shipping.

Every cell is a day. Color saturation is deploys to production. The dark squares are weekends — but only when nothing was on fire.

1,847 deploys · 6mo · 0 emergency rollbacks

How we ship the infrastructure layer.

Four principles that hold across every platform we operate.

/ 01
Everything is in git, or it doesn't exist.
Terraform for infrastructure, Helm for workloads, ArgoCD for state reconciliation. If you can't reproduce the production environment from a clean checkout, you don't have an environment — you have a museum.
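
The clean-checkout test amounts to comparing what git declares with what is actually running. A minimal sketch of that comparison, assuming both sides can be flattened into plain dictionaries. The resource names and `diff_state` are hypothetical, and this is an illustration of the idea, not how Terraform or ArgoCD implement reconciliation internally.

```python
from dataclasses import dataclass, field


@dataclass
class Drift:
    missing: list = field(default_factory=list)    # declared in git, absent in production
    unmanaged: list = field(default_factory=list)  # running in production, not declared
    changed: list = field(default_factory=list)    # present in both, but different

    @property
    def clean(self) -> bool:
        return not (self.missing or self.unmanaged or self.changed)


def diff_state(declared: dict, live: dict) -> Drift:
    """Compare declared state (from git) against observed state (from production)."""
    drift = Drift()
    for name in sorted(declared.keys() | live.keys()):
        if name not in live:
            drift.missing.append(name)
        elif name not in declared:
            drift.unmanaged.append(name)
        elif declared[name] != live[name]:
            drift.changed.append(name)
    return drift


# Hypothetical example: one hand-scaled service and one box nobody declared.
declared = {"edge-worker": {"replicas": 3}, "api": {"replicas": 6}}
live = {"edge-worker": {"replicas": 3}, "api": {"replicas": 4}, "debug-box": {"replicas": 1}}
drift = diff_state(declared, live)
```

Here `drift.changed` is `["api"]` and `drift.unmanaged` is `["debug-box"]`. Anything other than `drift.clean` being true is the museum the principle warns about.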
/ 02
Observability before deployment.
Every service ships with traces, structured logs, and metrics from commit zero. OpenTelemetry across every layer. The first deploy emits telemetry. The hundredth deploy gets to be unremarkable because of it.
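
One way to read "telemetry from commit zero" is that the very first log line is already structured and already carries a trace identifier. A minimal sketch using only the Python standard library as a stand-in for an OpenTelemetry SDK. The `service` and `trace_id` field names, the `checkout` service, and the helper names are assumptions for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes structured JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = build_logger("checkout", buffer)
trace_id = uuid.uuid4().hex
log.info("deploy started", extra={"service": "checkout", "trace_id": trace_id})

# The emitted line parses back into a dictionary that a log pipeline can index.
event = json.loads(buffer.getvalue())
```

In a real service the trace identifier would come from the tracing context rather than a fresh UUID. The point of the sketch is only that the first deploy already emits machine-readable events.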
/ 03
Failover is rehearsed, not assumed.
We run game days quarterly. We pull regions offline on purpose. We chaos-test the database failover. The first real outage is never the time to learn that the runbook references an Atlassian space that was deleted in 2023.
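
A game day can be rehearsed on paper before it is rehearsed in production. A minimal sketch that pulls regions offline in a model and reports where each region's traffic lands, assuming every region has an ordered list of failover targets. The failover map below is hypothetical and reuses region codes only for readability.

```python
from typing import Optional

# Hypothetical failover map: region -> ordered fallback regions.
FAILOVER_PATHS = {
    "iad": ["sfo", "gru", "lhr"],
    "sfo": ["iad", "nrt", "syd"],
    "lhr": ["fra", "iad", "jnb"],
    "fra": ["lhr", "bom", "iad"],
}


def resolve(region: str, offline: set) -> Optional[str]:
    """Return the region that serves traffic for `region`, or None if every path is down."""
    if region not in offline:
        return region
    for candidate in FAILOVER_PATHS.get(region, []):
        if candidate not in offline:
            return candidate
    return None


def game_day(offline: set) -> dict:
    """Simulate pulling regions offline and report where each region's traffic lands."""
    return {region: resolve(region, offline) for region in FAILOVER_PATHS}


# Drill 1: pull a single region.
single = game_day({"lhr"})

# Drill 2: pull a region together with all three of its failover targets.
worst_case = game_day({"lhr", "fra", "iad", "jnb"})
stranded = [region for region, target in worst_case.items() if target is None]
```

In the first drill `lhr` traffic lands on `fra`. In the second, `stranded` is `["lhr"]`, which is exactly the kind of finding that belongs in a rehearsal and not in an incident review.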
/ 04
The platform is a product.
Application teams should never write a Helm chart from scratch. We build internal developer platforms that hide complexity behind APIs the team actually wants to use. The ops budget compounds when the platform compounds.

In the field: JVN Network.

Multi-region edge infrastructure that doesn't blink at 340k concurrent connections.

JVN Network's contract was unforgiving: 14 regions, sub-50ms global p50, cryptographically signed state replication, and zero tolerance for split-brain on identity. The infrastructure decisions made in week one would either compound for years or surface as recurring outages every quarter.

We built it on three layers. Cloudflare Workers handled edge ingress — DDoS absorption, request validation, regional routing — within 22ms p50 of any user on earth. Fly.io ran the per-region application tier, with automatic primary election and connection-draining failover. AWS EKS hosted the central control plane, where consistency mattered more than latency.

The platform has shipped 11 deploys per day for two consecutive quarters with zero emergency rollbacks. That's the metric that compounds.

Every layer is Terraform-defined. Every workload is GitOps-reconciled. When a region needs to be added, a single PR adds it. When a region needs to be drained, the runbook is the same procedure rehearsed monthly in staging.
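
"A single PR adds it" works when the region list is data and the checks run against that data. A minimal sketch, assuming regions are declared as a list with ordered failover targets. The `Region` type, the `validate` helper, and the example entries are hypothetical and are not the JVN Network configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    code: str
    failover: tuple  # ordered fallback region codes


def validate(regions: list) -> list:
    """Return human-readable errors; an empty list means the change is safe to merge."""
    errors = []
    codes = [region.code for region in regions]
    for code in sorted(set(codes)):
        if codes.count(code) > 1:
            errors.append(f"duplicate region: {code}")
    known = set(codes)
    for region in regions:
        for target in region.failover:
            if target == region.code:
                errors.append(f"{region.code} fails over to itself")
            elif target not in known:
                errors.append(f"{region.code} fails over to undeclared region {target}")
    return errors


REGIONS = [
    Region("iad", ("sfo", "lhr")),
    Region("sfo", ("iad", "lhr")),
    Region("lhr", ("iad", "sfo")),
]

# The "single PR": one new entry appended to the declared list.
with_new_region = REGIONS + [Region("fra", ("lhr", "iad"))]

# A PR that references a region nobody declared should fail review automatically.
broken = REGIONS + [Region("fra", ("lhr", "ams"))]
```

`validate(with_new_region)` returns an empty list, while `validate(broken)` reports that `fra` fails over to the undeclared region `ams`. Running a check like this in CI is one way to keep the one-PR promise honest.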

/ INFRASTRUCTURE METRICS · JVN NETWORK

REGIONS: 14
EDGE P50: 22ms
DEPLOYS / DAY: 11
EMERGENCY ROLLBACKS: 0
UPTIME · 90D: 99.997%
MTTR: 6.2min

Build a platform that operates itself.

If your infrastructure is a screenshot, a Notion page, or a manually-edited Helm chart — this is the conversation we open with.
