Infrastructure Layer

The platform substrate, in production.

A live snapshot from one of our multi-region deployments. Every node is observed. Every change is in git.

/ DEPLOYS / DAY

11

▲ Across 14 regions

/ EDGE P50

22ms

▼ Global · Cloudflare

/ MTTR

6.2min

▼ Mean time to recovery

/ UPTIME · 90d

99.997%

3.1 min total downtime
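As a rough check on the uptime figure, the sketch below converts an availability percentage over a window into a downtime budget. The function name `downtime_budget_minutes` is hypothetical and used for illustration only.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int) -> float:
    """Return the maximum downtime, in minutes, that an availability target allows."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


# 90 days is 129,600 minutes; 0.003% of that is the allowed downtime.
budget = downtime_budget_minutes(99.997, 90)
print(f"{budget:.2f} min")
```

At 99.997% over 90 days the budget comes to roughly 3.9 minutes, so 3.1 minutes of recorded downtime sits inside it.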

When we reach for which platform.

Topology is contract design. We pick the platform by where the workload actually lives, the latency the user actually feels, and the cost of migrating off five years from now.

Criteria: SCALE CEILING · OPS COMPLEXITY · EDGE LATENCY · VENDOR LOCK

PLATFORM: VERDICT
Kubernetes: USE · AT SCALE
Cloudflare Workers: USE · EDGE
Fly.io: USE · MID
AWS ECS / Fargate: USE
AWS Lambda: WATCH
Heroku: RESCUE ONLY
Bare Metal: SPECIAL CASE

Thirty years of compute substrates.

From racks under the desk to functions at the edge. Each shift collapsed a class of problem and surfaced a new one. We've operated through every era — and remember what each one cost.

1995 — 2008

The bare metal era

Capacity was a procurement problem. Servers had names. Migrations took weekends. The good lessons: predictable performance, intimate knowledge of the hardware. The bad: a single rack failure took the product down.

Rackspace · Colo · Apache · nginx

2006 — 2014

The virtualization turn

EC2 made compute fungible. We stopped naming servers and started cattle-tagging them. Auto-scaling groups, AMIs, security groups — primitive, durable, still in use today.

EC2 · VMware · Chef · Puppet

2014 — 2020

The container consolidation

Docker collapsed "works on my machine." Kubernetes won the orchestration war by being thoroughly unopinionated. We built infrastructure-as-code into our DNA — Terraform across every account, Helm at the application boundary.

Docker · Kubernetes · Terraform · Helm

2018 — NOW

The serverless & edge era

Latency stopped being a backend problem and became a geography problem. Cloudflare Workers, Fly.io, and Lambda@Edge let us push compute within 50ms of every user on earth. The trade-off: smaller execution windows, stricter contracts.

CF Workers · Fly.io · Lambda · Vercel

2024 — NOW

The declarative platform era

The platform itself is now a product. Internal Developer Platforms, GitOps, and OpenTelemetry across every layer. The infra team's job has shifted from provisioning resources to designing abstractions that teams actually want to use.

Pulumi · Crossplane · ArgoCD · OTel

How we ship the infrastructure layer.

Four principles that hold across every platform we operate.

/ 01

Everything is in git, or it doesn't exist.

Terraform for infrastructure, Helm for workloads, ArgoCD for state reconciliation. If you can't reproduce the production environment from a clean checkout, you don't have an environment — you have a museum.
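The reconciliation idea can be sketched in a few lines: treat what is committed to git as the desired state, compare it with what is actually running, and derive the actions that close the gap. This is a minimal illustration, not ArgoCD's implementation; the `plan_reconcile` function and the resource names are hypothetical.

```python
from typing import Dict, List, Tuple

# Resource name -> spec. In practice both maps would be read from git and from
# the live environment; here they are hard-coded, hypothetical values.
State = Dict[str, dict]


def plan_reconcile(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return sorted (action, resource) pairs that move `live` toward `desired`."""
    plan = []
    for name, spec in desired.items():
        if name not in live:
            plan.append(("create", name))
        elif live[name] != spec:
            plan.append(("update", name))
    for name in live:
        if name not in desired:
            # Anything running that is not in git counts as drift.
            plan.append(("delete", name))
    return sorted(plan)


desired = {"api": {"replicas": 3}, "worker": {"replicas": 2}}
live = {"api": {"replicas": 2}, "debug-pod": {"replicas": 1}}
print(plan_reconcile(desired, live))
# [('create', 'worker'), ('delete', 'debug-pod'), ('update', 'api')]
```

An empty plan means the environment matches the checkout, which is the state the principle asks for.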

/ 02

Observability before deployment.

Every service ships with traces, structured logs, and metrics from commit zero. OpenTelemetry across every layer. The first deploy emits telemetry. The hundredth deploy gets to be unremarkable because of it.
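In its smallest form, telemetry from commit zero might look like one structured log line per request, carrying a trace ID so that logs and traces can be joined later. The sketch below uses only the Python standard library as a stand-in; it is not the OpenTelemetry SDK, and `handle_request`, the field names, and the service name are assumptions.

```python
import json
import time
import uuid


def handle_request(path: str, service: str = "example-service") -> dict:
    """Handle a request and return the structured log record describing it."""
    trace_id = uuid.uuid4().hex  # 32 hex characters, the length of a W3C trace ID
    start = time.perf_counter()
    status = 200  # placeholder for the real work
    record = {
        "service": service,
        "trace_id": trace_id,
        "path": path,
        "status": status,
        "duration_ms": round((time.perf_counter() - start) * 1000, 3),
    }
    print(json.dumps(record))  # one JSON object per line, easy to index
    return record


record = handle_request("/healthz")
```

A real service would hand this job to an instrumentation library; the point here is only that the record exists from the first deploy.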

/ 03

Failover is rehearsed, not assumed.

We run game days quarterly. We pull regions offline on purpose. We chaos-test the database failover. The first real outage is never the time to learn that the runbook references an Atlassian space that was deleted in 2023.
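A game day can be rehearsed in miniature before it is run for real. The sketch below takes a region offline in a toy routing table and checks that traffic still has somewhere to go. The region names, latencies, and the `route` function are hypothetical.

```python
from typing import Dict, Optional, Set


def route(latency_ms: Dict[str, float], offline: Set[str]) -> Optional[str]:
    """Pick the lowest-latency region that is not offline, or None if all are down."""
    healthy = {region: ms for region, ms in latency_ms.items() if region not in offline}
    if not healthy:
        return None
    return min(healthy, key=healthy.get)


# Hypothetical latencies from one user to three regions.
latency_ms = {"fra": 18.0, "ams": 24.0, "lhr": 31.0}

assert route(latency_ms, offline=set()) == "fra"
# Game day: pull the primary region offline on purpose.
assert route(latency_ms, offline={"fra"}) == "ams"
# A total outage must be reported, not silently routed.
assert route(latency_ms, offline={"fra", "ams", "lhr"}) is None
print("failover rehearsal passed")
```

The same assertions can run in CI, so a routing change that breaks failover fails a build before it fails a region.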

/ 04

The platform is a product.

Application teams should never write a Helm chart from scratch. We build internal developer platforms that hide complexity behind APIs the team actually wants to use. The ops budget compounds when the platform compounds.

In the field: JVN Network.

Multi-region edge infrastructure that doesn't blink at 340k concurrent connections.

JVN Network's contract was unforgiving: 14 regions, sub-50ms global p50, cryptographically signed state replication, and zero tolerance for split-brain on identity. The infrastructure decisions made in week one would either compound for years or surface as recurring outages every quarter.

We built it on three layers. Cloudflare Workers handled edge ingress — DDoS absorption, request validation, regional routing — within 22ms p50 of any user on earth. Fly.io ran the per-region application tier, with automatic primary election and connection-draining failover. AWS EKS hosted the central control plane, where consistency mattered more than latency.

The platform has averaged 11 deploys per day for two consecutive quarters with zero emergency rollbacks. That's the metric that compounds.

Every layer is Terraform-defined. Every workload is GitOps-reconciled. When a region needs to be added, a single PR adds it. When a region needs to be drained, the runbook is the same procedure rehearsed monthly in staging.

/ INFRASTRUCTURE METRICS · JVN NETWORK

REGIONS: 14

EDGE P50: 22ms

DEPLOYS / DAY: 11

EMERGENCY ROLLBACKS: 0

UPTIME · 90D: 99.997%

MTTR: 6.2min

Infrastructure as a contract, not a screenshot.

Build a platform that operates itself.