The Developer Tax Problem
In engineering organisations that have grown beyond a handful of teams, a hidden cost accumulates: the developer tax. Before writing a single line of product code, an engineer joining a new initiative must set up CI/CD pipelines, request cloud credentials, configure logging agents, wire up alerting, provision a database, and search six different Slack channels to find the right person to ask for access to a staging environment. This friction, paid repeatedly by every team for every new service, is the developer tax.
The 2023 DORA State of DevOps Report identified platform engineering as the highest-leverage investment for improving software delivery performance. Organisations with mature internal developer platforms (IDPs) deploy 2.5× more frequently and have 2× lower change failure rates than peers without one. The business case is straightforward: reduce the developer tax, and you reclaim engineer-hours for product work.
The tax takes three recurring forms:
- Cognitive overhead. Each team independently learns Terraform, Helm, ArgoCD, and the org's cloud policies. Identical knowledge is acquired by dozens of squads instead of being centralised and reused.
- Inconsistent standards. Without a paved road, every team designs their own approach to secrets management, log formats, and on-call wiring — creating a long tail of non-standard systems that are painful to observe and operate.
- Hidden SRE bottlenecks. SRE or infra teams become a manual approval gate for every new environment, database, or Kubernetes namespace — blocking product velocity while creating a constant toil backlog.
Platform engineering solves this by treating the internal developer platform as a product: opinionated, self-service, built for the developer experience, and measured by adoption rather than just uptime.
IDP Maturity Model — Four Stages
Internal developer platforms evolve through recognisable stages. Knowing your current stage helps you prioritise investment and set realistic expectations for what an IDP can deliver.
# IDP Maturity Model
Stage 1 — WIKI & RUNBOOKS
Capability: Documentation-only. Teams follow written guides to provision
infrastructure manually or request it via tickets.
Bottleneck: Human gates. Every new environment requires a ticket reviewed
by a central team. Lead time measured in days.
Indicator: "Send a request to the infra Slack channel and someone will help."
Stage 2 — SCRIPTS & AUTOMATION
Capability: Shell scripts, Makefiles, or one-off Terraform configurations
checked into individual repos. Automation exists but is not
standardised or shared.
Bottleneck: Copy-paste culture. Each team maintains their own variant of
the same script. Inconsistencies accumulate silently.
Indicator: "I'll share our pipeline YAML — just adapt it for your repo."
Stage 3 — GOLDEN PATHS & SELF-SERVICE
Capability: Curated Terraform modules, Helm charts, CI/CD templates, and
service catalog scaffolding. Teams consume standard building
blocks through self-service interfaces. No ticket required
for standard workloads.
Bottleneck: Adoption. Teams with legacy setups resist migration. The golden
path must be more attractive than the alternative, not
just mandated.
Indicator: "Run this CLI command and your service will be deployed
in < 10 minutes."
Stage 4 — PRODUCT-GRADE IDP
Capability: Full developer portal (Backstage), programmatic APIs for
infrastructure provisioning (Crossplane, Terraform Cloud API),
DORA metrics dashboards, internal SLAs, and a product roadmap
driven by developer NPS surveys.
Bottleneck: Organisational maturity. Platform team must act as a product
team — prioritising features, managing backlog, measuring
adoption — not as an infrastructure team that reacts to
requests.
Indicator: "Our DORA deploy frequency is tracked in the portal and
our platform NPS is 42."

Golden Paths — Not Golden Cages
Spotify popularised the term "golden path" — a well-maintained, opinionated way to build and deploy a service that is faster and safer than building from scratch. The crucial distinction between a golden path and a golden cage is optionality: teams should be able to diverge from the golden path when they have a legitimate reason, without filing a support ticket.
A practical golden path for a microservice typically covers:
- Repository scaffold. A Backstage Software Template that generates a repo with CI/CD, Dockerfile, Helm chart, Prometheus metrics endpoint, and structured logging pre-wired.
- Infrastructure module. A curated Terraform module that provisions a namespace, ServiceAccount, NetworkPolicy, HorizontalPodAutoscaler, and PodDisruptionBudget with sensible defaults.
- Deployment pipeline. A reusable GitHub Actions workflow or ArgoCD ApplicationSet that handles image build, image signing, vulnerability scanning, and promotion across staging and production environments.
- Observability baseline. Grafana dashboard, alerting rules, and log parsing config shipped automatically with every new service — teams get SLO dashboards on day one.
Backstage — Developer Portal and Service Catalog
Backstage is Spotify's open-source developer portal, donated to the CNCF in 2020. It provides three core capabilities that form the foundation of a production IDP: a software catalog, TechDocs for documentation, and Software Templates for scaffolding new services.
Service Catalog — catalog-info.yaml
Every service, library, API, and team is described in a catalog-info.yaml file checked into the repository root. Backstage discovers and indexes these descriptors, creating a searchable catalog of the entire software estate.
# catalog-info.yaml: checked into every service repo root
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles card payment processing and refund workflows
  annotations:
    # Link to GitHub Actions CI
    github.com/project-slug: acme/payment-service
    # Link to Grafana dashboard
    grafana/dashboard-selector: "service=payment-service"
    # Link to PagerDuty escalation policy
    pagerduty.com/service-id: P1234AB
    # Link to ArgoCD application
    argocd/app-name: payment-service-production
    # TechDocs source location (docs live alongside the code)
    backstage.io/techdocs-ref: dir:.
  tags:
    - python
    - payments
    - kafka
    - pci-dss
  links:
    - url: https://grafana.internal/d/payment-slo
      title: SLO Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payment-service
      title: Runbook
      icon: help
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: checkout-platform
  dependsOn:
    - component:postgres-payments
    - component:kafka-cluster
    - resource:stripe-api
  providesApis:
    - payment-service-v2-api

Software Templates — Self-Service Scaffolding
Backstage Software Templates automate the creation of new services from a form-based UI. A developer fills in the service name, team owner, and database requirements; the template generates a repository with CI/CD, Helm chart, Dockerfile, and catalog-info.yaml pre-populated.
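The naming rule that such a form enforces can also be checked outside Backstage, for example in a pre-commit hook. The sketch below assumes the pattern `^[a-z][a-z0-9-]{2,49}$` used for the service name in this article; `is_valid_service_name` is a hypothetical helper written for illustration, not a Backstage API.

```python
import re

# A lowercase letter followed by 2 to 49 lowercase letters, digits, or
# hyphens, i.e. 3 to 50 characters in total.
SERVICE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,49}$")


def is_valid_service_name(name: str) -> bool:
    """Return True if `name` satisfies the service naming rule."""
    # fullmatch also rejects a trailing newline, which `$` alone would allow
    return SERVICE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["payment-service", "Payment-Service", "db", "2fast"]:
        print(f"{candidate!r}: {is_valid_service_name(candidate)}")
```

Note that the pattern accepts digits as well as lowercase letters and hyphens, and that names shorter than three characters are rejected.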
# template.yaml: Backstage Software Template for a Python microservice
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-microservice
  title: Python Microservice
  description: Scaffolds a production-ready Python service with Helm, CI/CD, and observability
  tags:
    - python
    - recommended
spec:
  owner: group:platform-team
  type: service
  parameters:
    - title: Service Details
      required: [name, owner, description]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{2,49}$'
          description: Lowercase letters, digits, and hyphens (e.g. payment-service)
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        description:
          title: Description
          type: string
          maxLength: 200
        enableDatabase:
          title: Provision PostgreSQL database?
          type: boolean
          default: false
        kafkaTopics:
          title: Kafka topics to create (comma-separated)
          type: string
          default: ""
  steps:
    - id: fetch-template
      name: Fetch base template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          description: ${{ parameters.description }}
          enableDatabase: ${{ parameters.enableDatabase }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?repo=${{ parameters.name }}&owner=acme-corp
        defaultBranch: main
        description: ${{ parameters.description }}
        repoVisibility: private
        # The owner is an entity ref (it contains ':' and '/'), which is not a
        # valid GitHub topic, and an unquoted ${{ }} is invalid inside a YAML
        # flow sequence, so only static topics are listed here
        topics: [python, microservice]
    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Open in Catalog
        url: ${{ steps['register'].output.catalogInfoUrl }}

Note: Each action: step is a plugin. The built-in action library covers GitHub, GitLab, Bitbucket, and generic HTTP. Custom actions (e.g., creating a PagerDuty service, provisioning a Vault namespace) are TypeScript plugins registered at startup. All actions run with service account credentials stored in Kubernetes Secrets, never in the template YAML.

Self-Service Infrastructure — Terraform Module Registry
The foundation of self-service infrastructure is a curated module registry. Rather than letting every team write ad-hoc Terraform, the platform team publishes opinionated modules that encode security, cost, and compliance defaults. Teams consume the module with a handful of input variables — they never touch the underlying cloud resources directly.
# modules/microservice-infra/main.tf
# Platform team module: teams consume this, not raw cloud or cluster resources
variable "service_name" {
  type        = string
  description = "Slug used to name all resources (e.g. payment-service)"
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{2,49}$", var.service_name))
    error_message = "service_name must be lowercase alphanumeric with hyphens."
  }
}

variable "team_name" { type = string }
variable "environment" { type = string } # staging | production

# HCL allows only one argument in a single-line block, so variables with a
# default are written out in full
variable "replica_count" {
  type    = number
  default = 2
}

variable "cpu_request" {
  type    = string
  default = "250m"
}

variable "memory_request" {
  type    = string
  default = "256Mi"
}

variable "enable_hpa" {
  type    = bool
  default = true
}

variable "enable_pdb" {
  type    = bool
  default = true
}

locals {
  labels = {
    "app.kubernetes.io/name"       = var.service_name
    "app.kubernetes.io/managed-by" = "terraform"
    "platform.acme.io/team"        = var.team_name
    "platform.acme.io/env"         = var.environment
  }
}

resource "kubernetes_namespace" "this" {
  metadata {
    name   = "${var.service_name}-${var.environment}"
    labels = local.labels
    annotations = {
      "platform.acme.io/owner"       = var.team_name
      "platform.acme.io/cost-center" = var.team_name
    }
  }
}

resource "kubernetes_service_account" "this" {
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
    labels    = local.labels
    annotations = {
      # Enables IRSA: workload identity without static AWS credentials.
      # aws_iam_role.service is not shown in this excerpt.
      "eks.amazonaws.com/role-arn" = aws_iam_role.service.arn
    }
  }
}

resource "kubernetes_network_policy" "default_deny" {
  metadata {
    name      = "default-deny-ingress"
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    pod_selector {}
    policy_types = ["Ingress"]
  }
}

resource "kubernetes_horizontal_pod_autoscaler_v2" "this" {
  count = var.enable_hpa ? 1 : 0
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = var.service_name
    }
    min_replicas = var.replica_count
    max_replicas = var.replica_count * 5
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}

resource "kubernetes_pod_disruption_budget_v1" "this" {
  count = var.enable_pdb ? 1 : 0
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    min_available = "50%"
    selector {
      match_labels = { "app.kubernetes.io/name" = var.service_name }
    }
  }
}

Teams consume the module with a minimal configuration file that specifies what they need, not how to build it:
# services/payment-service/infra/main.tf: owned by the product team
module "infra" {
  source  = "registry.terraform.acme.io/platform/microservice-infra/kubernetes"
  version = "~> 3.0" # platform team manages breaking changes

  service_name  = "payment-service"
  team_name     = "payments"
  environment   = var.environment
  replica_count = var.environment == "production" ? 3 : 1
  enable_hpa    = true
  enable_pdb    = true
}

# Optional: extend with service-specific IAM permissions.
# service_iam_role_name is a module output and stripe_secrets_read a data
# source; neither declaration is shown in this excerpt.
resource "aws_iam_role_policy_attachment" "stripe_secrets" {
  role       = module.infra.service_iam_role_name
  policy_arn = data.aws_iam_policy.stripe_secrets_read.arn
}

GitOps Self-Service with ArgoCD ApplicationSets
ArgoCD ApplicationSets extend ArgoCD with generators that dynamically create Application resources from templates. The Git directory generator scans a repository for directories matching a path pattern and creates one ArgoCD Application per matching directory, giving teams a pull-request-based, self-service deployment mechanism without direct ArgoCD API access.
# argocd/applicationset-services.yaml
# Scans the apps/ directory: each subdirectory becomes an ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://github.com/acme/platform-apps.git
        revision: HEAD
        directories:
          - path: apps/*
  template:
    metadata:
      name: '{{ .path.basename }}'
      annotations:
        notifications.argoproj.io/subscribe.on-sync-failed.slack: deployments-alerts
    spec:
      project: platform-services
      source:
        repoURL: https://github.com/acme/platform-apps.git
        targetRevision: HEAD
        path: '{{ .path.path }}'
        helm:
          # Skip the optional per-service file when a team has not added one
          ignoreMissingValueFiles: true
          valueFiles:
            - values.yaml
            # For apps/<service>, segment 1 is the service directory, so this
            # resolves to values-<service>.yaml. Per-environment overrides
            # would need an environment segment in the directory layout.
            - 'values-{{ index .path.segments 1 }}.yaml'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{ .path.basename }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
        retry:
          limit: 3
          backoff:
            duration: 30s
            factor: 2

Teams onboard a new service by adding a subdirectory under apps/ via pull request. No ArgoCD admin access is required, and no ticket to the platform team. The ApplicationSet controller picks up the new directory and creates the Application automatically after the PR merges.
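To make the template placeholders concrete, the sketch below mimics in plain Python how the Git directory generator's path parameters map a matched directory onto Application fields. `path_params` and `render_application` are hypothetical helpers written for illustration; they are not part of ArgoCD and only reproduce the meaning of `path.path`, `path.basename`, and `path.segments`.

```python
from pathlib import PurePosixPath


def path_params(directory: str) -> dict:
    """Mimic the Git directory generator's path parameters for one directory."""
    p = PurePosixPath(directory)
    return {
        "path": str(p),             # full path within the repository
        "basename": p.name,         # right-most directory name
        "segments": list(p.parts),  # path split on '/'
    }


def render_application(directory: str) -> dict:
    """Fill in the fields that the ApplicationSet template derives from the path."""
    params = path_params(directory)
    return {
        "name": params["basename"],
        "source_path": params["path"],
        "namespace": params["basename"],
        "value_files": ["values.yaml", f"values-{params['segments'][1]}.yaml"],
    }


if __name__ == "__main__":
    print(render_application("apps/payment-service"))
```

For `apps/payment-service`, index 1 of the segments is `payment-service`, so the second values file is keyed by the service directory rather than by environment.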
# apps/payment-service/Chart.yaml: team-owned Helm chart
apiVersion: v2
name: payment-service
version: 1.0.0
appVersion: "2.14.0"
dependencies:
  - name: common
    version: "~> 1.5"
    repository: "https://charts.platform.acme.io" # platform-provided base chart
---
# apps/payment-service/values.yaml
replicaCount: 2
image:
  repository: ghcr.io/acme/payment-service
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
service:
  port: 8080
  metricsPort: 9090
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
ingress:
  enabled: true
  className: nginx
  host: payments.internal.acme.io
env:
  LOG_LEVEL: info
  OTEL_SERVICE_NAME: payment-service

Crossplane — Kubernetes-Native Cloud Provisioning
Crossplane extends Kubernetes with custom resource definitions (CRDs) that represent cloud infrastructure: RDS instances, S3 buckets, Redis clusters, and more. The platform team defines Composite Resource Definitions (XRDs) and the Compositions that implement them: opinionated, org-specific abstractions over raw cloud resources. Product teams create instances of these composites in YAML, the same way they create Deployments.
# platform/xrd-postgres.yaml: Composite Resource Definition (XRD)
# Defines the abstract "PostgreSQL" resource that product teams request
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqlinstances.platform.acme.io
spec:
  group: platform.acme.io
  names:
    kind: XPostgreSQLInstance
    plural: xpostgresqlinstances
  claimNames: # teams create PostgreSQLInstance (namespaced claim)
    kind: PostgreSQLInstance
    plural: postgresqlinstances
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [parameters]
              properties:
                parameters:
                  type: object
                  required: [storageGB, tier]
                  properties:
                    storageGB:
                      type: integer
                      minimum: 20
                      maximum: 500
                    tier:
                      type: string
                      enum: [dev, standard, performance]
                    multiAZ:
                      type: boolean
                      default: false
---
# platform/composition-postgres.yaml: maps the abstract spec to AWS RDS
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgresqlinstances.aws.platform.acme.io
spec:
  compositeTypeRef:
    apiVersion: platform.acme.io/v1alpha1
    kind: XPostgreSQLInstance
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            region: eu-west-1
            engine: postgres
            engineVersion: "16"
            skipFinalSnapshot: false
            deletionProtection: true
            storageEncrypted: true
            copyTagsToSnapshot: true
            backupRetentionPeriod: 7
            performanceInsightsEnabled: true
          providerConfigRef:
            name: aws-production
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.storageGB
          toFieldPath: spec.forProvider.allocatedStorage
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.multiAZ
          toFieldPath: spec.forProvider.multiAz
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.tier
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                dev: db.t3.micro
                standard: db.t3.medium
                performance: db.r6g.large

Teams request a database the same way they would create a Kubernetes Deployment: via a namespaced Claim YAML in their own namespace:
# In the payment-service-production namespace: created by the product team
apiVersion: platform.acme.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: payments-db
  namespace: payment-service-production
spec:
  parameters:
    storageGB: 100
    tier: standard
    multiAZ: true
  compositeDeletePolicy: Foreground
  writeConnectionSecretToRef:
    name: payments-db-connection # Crossplane writes creds here automatically
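The schema limits and the tier-to-instance-class mapping can be mirrored in a few lines of Python, which helps when reasoning about what a claim will produce before applying it. `resolve_rds_settings` is a hypothetical helper; the limits and the map are copied from the XRD and Composition above, and in a real cluster Crossplane performs this validation and patching itself.

```python
# Copied from the Composition's map transform
TIER_TO_INSTANCE_CLASS = {
    "dev": "db.t3.micro",
    "standard": "db.t3.medium",
    "performance": "db.r6g.large",
}


def resolve_rds_settings(parameters: dict) -> dict:
    """Validate claim parameters and return the RDS fields they patch."""
    storage = parameters.get("storageGB")
    tier = parameters.get("tier")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(storage, bool) or not isinstance(storage, int):
        raise ValueError("storageGB must be an integer")
    if not 20 <= storage <= 500:
        raise ValueError("storageGB must be between 20 and 500")
    if tier not in TIER_TO_INSTANCE_CLASS:
        raise ValueError(f"tier must be one of {sorted(TIER_TO_INSTANCE_CLASS)}")
    return {
        "allocatedStorage": storage,
        "multiAz": parameters.get("multiAZ", False),  # schema default is false
        "instanceClass": TIER_TO_INSTANCE_CLASS[tier],
    }


if __name__ == "__main__":
    print(resolve_rds_settings({"storageGB": 100, "tier": "standard", "multiAZ": True}))
```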
DORA Metrics — Measuring Platform Team Impact
The DORA four key metrics are the standard for measuring software delivery performance: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Failed Deployment Recovery Time. For a platform team, these metrics serve a dual purpose: measure the platform itself (how reliable is our golden path?) and measure the impact of the platform on product teams (did our self-service tooling increase their deploy frequency?).
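Of the four metrics, Failed Deployment Recovery Time is the one derived most directly from incident data. A minimal sketch, assuming each incident is a dict with ISO 8601 `created_at` and `resolved_at` fields (hypothetical field names; adapt them to whatever your incident tool returns):

```python
import statistics
from datetime import datetime
from typing import Optional


def _parse(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def recovery_time_hours(incidents: list[dict]) -> Optional[float]:
    """Median hours from incident creation to resolution.

    Unresolved incidents are skipped; returns None if none are resolved.
    """
    durations = [
        (_parse(i["resolved_at"]) - _parse(i["created_at"])).total_seconds() / 3600
        for i in incidents
        if i.get("resolved_at")
    ]
    return statistics.median(durations) if durations else None
```

Returning None instead of 0 when nothing has been resolved avoids reporting an empty window as an instant recovery.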
# scripts/dora-report.py
# Generates DORA metrics from GitHub Actions and PagerDuty data
import os
import statistics
from datetime import datetime, timedelta, UTC  # datetime.UTC needs Python 3.11+

import requests

# Tokens are read from the environment; the variable names are assumptions
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
PD_TOKEN = os.environ.get("PD_TOKEN", "")  # for the PagerDuty fetch (not shown)
ORG = "acme-corp"
LOOKBACK_DAYS = 30
SINCE = (datetime.now(UTC) - timedelta(days=LOOKBACK_DAYS)).isoformat(timespec="seconds")


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def get_production_deployments(repo: str) -> list[dict]:
    """Fetch successful production deploy workflow runs."""
    url = f"https://api.github.com/repos/{ORG}/{repo}/actions/runs"
    runs = []
    page = 1
    while True:
        resp = requests.get(url, params={
            "branch": "main",
            "status": "success",
            "event": "push",
            "created": f">={SINCE}",
            "per_page": 100,
            "page": page,
        }, headers={"Authorization": f"Bearer {GITHUB_TOKEN}"}, timeout=30)
        resp.raise_for_status()  # fail loudly on auth or rate-limit errors
        page_runs = resp.json().get("workflow_runs", [])
        runs.extend(r for r in page_runs if r["name"] == "Deploy to Production")
        if len(page_runs) < 100:
            break
        page += 1
    return runs


def deployment_frequency(deployments: list[dict]) -> float:
    """Deployments per day over the lookback window."""
    return len(deployments) / LOOKBACK_DAYS


def lead_time_hours(deployments: list[dict]) -> float:
    """Median hours from commit to the start of the production deploy run."""
    lead_times = []
    for run in deployments:
        # created_at is when the run was created, not when it finished
        created = parse_ts(run["created_at"])
        # head_commit timestamp = when the triggering commit was authored
        committed = parse_ts(run["head_commit"]["timestamp"])
        lead_times.append((created - committed).total_seconds() / 3600)
    return statistics.median(lead_times) if lead_times else 0


def change_failure_rate(deployments: list[dict], incidents: list[dict]) -> float:
    """Percentage of deployments followed by an incident within one hour."""
    if not deployments:
        return 0.0
    incident_times = [parse_ts(i["created_at"]) for i in incidents]
    window = timedelta(hours=1)
    # Count each deployment at most once, so several incidents after the same
    # deployment cannot push the rate above 100%
    failures = 0
    for dep in deployments:
        dep_time = parse_ts(dep["created_at"])
        if any(timedelta(0) <= inc - dep_time <= window for inc in incident_times):
            failures += 1
    return (failures / len(deployments)) * 100


def print_dora_report(repo: str) -> None:
    deps = get_production_deployments(repo)
    print(f"\n=== DORA Report: {repo} (last {LOOKBACK_DAYS} days) ===")
    if not deps:
        # Avoid reporting a zero lead time as "Elite" when nothing was deployed
        print("  No production deployments in the lookback window")
        return
    df = deployment_frequency(deps)
    lt = lead_time_hours(deps)
    print(f"  Deployment Frequency : {df:.1f} deploys/day")
    print(f"  Median Lead Time     : {lt:.1f} hours")
    # Change Failure Rate requires PagerDuty data (omitted for brevity)
    # Classify into performance tiers using this script's thresholds
    freq_tier = "Elite" if df >= 1 else "High" if df >= 0.14 else "Medium" if df >= 0.03 else "Low"
    lt_tier = "Elite" if lt < 1 else "High" if lt < 24 else "Medium" if lt < 168 else "Low"
    print(f"  Deploy Freq Tier     : {freq_tier}")
    print(f"  Lead Time Tier       : {lt_tier}")

Platform Team as Product Team — Organisational Patterns
The most common failure mode in platform engineering is treating the platform team as a traditional infrastructure team: reactive, ticket-driven, measured by uptime alone. A platform that product teams are forced to use — but do not want to use — is a tax, not an accelerator.
Effective platform teams operate with a product mindset:
- Developer NPS. Run quarterly surveys asking product engineers to rate the platform on ease of use, reliability, and documentation. Target NPS above 30. Treat scores below 20 as incidents.
- Adoption metrics. Track the percentage of services using the golden path template, the standard Terraform module, and the Backstage catalog. Low adoption is a signal, not a discipline problem — it means the platform is not solving the right problems.
- Internal SLAs. Publish SLAs for platform capabilities: "New service scaffold completes in under 5 minutes," "Terraform plan runs complete in under 3 minutes," "Backstage catalog search returns in under 500ms." Track and report these like any production SLO.
- Office hours and embedded engineers. Two hours of platform office hours per week closes the feedback loop faster than any ticket queue. Rotating a platform engineer into a product team for one sprint per quarter generates the most actionable product insights.
- Public roadmap. A visible, prioritised roadmap (even a simple Backstage catalog page listing planned features) builds trust and reduces shadow-IT — product teams stop building their own tooling when they can see that the platform will solve their problem in Q2.
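The NPS thresholds above are easier to act on when the score is computed the same way every quarter. The sketch below uses the standard Net Promoter Score definition (ratings from 0 to 10, promoters score 9 or 10, detractors score 0 to 6, NPS is the percentage of promoters minus the percentage of detractors). The function names and the "watch" label for scores between 20 and 30 are assumptions made for illustration.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Return NPS in the range -100..100 for a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


def classify_platform_nps(score: float) -> str:
    """Apply the article's thresholds: target above 30, below 20 is an incident."""
    if score < 20:
        return "incident"
    if score > 30:
        return "on target"
    return "watch"  # hypothetical label for the band between the two thresholds


if __name__ == "__main__":
    survey = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
    score = net_promoter_score(survey)
    print(f"Platform NPS: {score:.0f} ({classify_platform_nps(score)})")
```

Passives (ratings of 7 or 8) count toward the denominator but toward neither group, which is why a survey full of lukewarm answers yields a score near zero.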
Build vs Buy vs Configure — IDP Technology Decisions
Every platform team faces the same question: for each capability, should we build it in-house, buy a SaaS solution, or configure an open-source tool? The right answer depends on team size, cloud maturity, and the strategic importance of the capability.
- Developer portal (Backstage vs Port vs Cortex). Backstage is the industry default for organisations with strong TypeScript capability and complex integration requirements. Port and Cortex are SaaS alternatives that reduce operational overhead at the cost of flexibility. For teams under 50 engineers, a SaaS portal is almost always the right choice.
- Infrastructure provisioning (Crossplane vs Terraform Cloud vs Pulumi). Crossplane shines when teams are Kubernetes-native and want a single control plane for workloads and infrastructure. Terraform Cloud (or Scalr, Env0) works better for teams with existing Terraform investment and multi-cloud requirements. Use Crossplane only when you have the Kubernetes operational maturity to run it safely.
- GitOps (ArgoCD vs Flux). Both are CNCF-graduated and production-proven. ArgoCD has a richer UI and ApplicationSets for multi-tenancy; Flux is more lightweight and Kubernetes-native in its design. Choose ArgoCD if UI-driven visibility is important to your platform; choose Flux if you prefer pure GitOps with minimal cluster footprint.
- Secrets management (Vault vs External Secrets Operator). HashiCorp Vault is the enterprise default for centralised secrets with fine-grained policies. The External Secrets Operator bridges Kubernetes with AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault — a simpler operational model for teams already committed to a single cloud provider.