Tags: Platform Engineering, IDP, Kubernetes, Backstage, DevOps, GitOps, Developer Experience, Infrastructure

Platform Engineering — Building Internal Developer Platforms That Teams Actually Use

A practical guide to platform engineering in production: the developer tax problem and golden path philosophy, IDP maturity model (wiki through product-grade), Backstage service catalog with catalog-info.yaml component descriptors and Software Templates for self-service scaffolding, self-service Terraform module registry with opinionated modules encoding security and compliance defaults, ArgoCD ApplicationSets with the git generator for pull-request-based deployment self-service, Crossplane Composite Resource Definitions (XRDs) for Kubernetes-native cloud provisioning (RDS, S3, Redis) without exposing raw cloud APIs, DORA metrics instrumentation with GitHub Actions and PagerDuty data for deployment frequency and lead time, and building platform teams as product teams with developer NPS, adoption metrics, internal SLAs, and public roadmaps.

2026-05-30

The Developer Tax Problem

In engineering organisations that have grown beyond a handful of teams, a hidden cost accumulates: the developer tax. Before writing a single line of product code, an engineer joining a new initiative must set up CI/CD pipelines, request cloud credentials, configure logging agents, wire up alerting, provision a database, and search six different Slack channels for the right person to ask about access to a staging environment. This friction, paid repeatedly by every team for every new service, is the developer tax.

The 2023 DORA State of DevOps Report identified platform engineering as the highest-leverage investment for improving software delivery performance. Organisations with mature internal developer platforms (IDPs) deploy 2.5× more frequently and have 2× lower change failure rates than peers without one. The business case is straightforward: reduce the developer tax, and you reclaim engineer-hours for product work.

  • Cognitive overhead. Each team independently learns Terraform, Helm, ArgoCD, and the org's cloud policies. Identical knowledge is acquired by dozens of squads instead of being centralised and reused.
  • Inconsistent standards. Without a paved road, every team designs their own approach to secrets management, log formats, and on-call wiring — creating a long tail of non-standard systems that are painful to observe and operate.
  • Hidden SRE bottlenecks. SRE or infra teams become a manual approval gate for every new environment, database, or Kubernetes namespace — blocking product velocity while creating a constant toil backlog.

Platform engineering solves this by treating the internal developer platform as a product: opinionated, self-service, built for the developer experience, and measured by adoption rather than just uptime.

Note

Platform engineering is neither DevOps nor SRE; it is a discipline that builds on both. Team Topologies formalises the platform team as one of its four fundamental team types (alongside stream-aligned, enabling, and complicated-subsystem teams), with the mission of reducing cognitive load on stream-aligned (product) teams. The platform team provides capabilities as a service; product teams consume them without needing to understand the implementation.

IDP Maturity Model — Four Stages

Internal developer platforms evolve through recognisable stages. Knowing your current stage helps you prioritise investment and set realistic expectations for what an IDP can deliver.

# IDP Maturity Model

Stage 1 — WIKI & RUNBOOKS
  Capability: Documentation-only. Teams follow written guides to provision
              infrastructure manually or request it via tickets.
  Bottleneck: Human gates. Every new environment requires a ticket reviewed
              by a central team. Lead time measured in days.
  Indicator:  "Send a request to the infra Slack channel and someone will help."

Stage 2 — SCRIPTS & AUTOMATION
  Capability: Shell scripts, Makefiles, or one-off Terraform configurations
              checked into individual repos. Automation exists but is not
              standardised or shared.
  Bottleneck: Copy-paste culture. Each team maintains their own variant of
              the same script. Inconsistencies accumulate silently.
  Indicator:  "I'll share our pipeline YAML — just adapt it for your repo."

Stage 3 — GOLDEN PATHS & SELF-SERVICE
  Capability: Curated Terraform modules, Helm charts, CI/CD templates, and
              service catalog scaffolding. Teams consume standard building
              blocks through self-service interfaces. No ticket required
              for standard workloads.
  Bottleneck: Adoption. Teams with legacy setups resist migration. The golden
              path must be more attractive than the alternative, not
              just mandated.
  Indicator:  "Run this CLI command and your service will be deployed
              in < 10 minutes."

Stage 4 — PRODUCT-GRADE IDP
  Capability: Full developer portal (Backstage), programmatic APIs for
              infrastructure provisioning (Crossplane, Terraform Cloud API),
              DORA metrics dashboards, internal SLAs, and a product roadmap
              driven by developer NPS surveys.
  Bottleneck: Organisational maturity. Platform team must act as a product
              team — prioritising features, managing backlog, measuring
              adoption — not as an infrastructure team that reacts to
              requests.
  Indicator:  "Our DORA deploy frequency is tracked in the portal and
              our platform NPS is 42."

Golden Paths — Not Golden Cages

Spotify popularised the term "golden path" — a well-maintained, opinionated way to build and deploy a service that is faster and safer than building from scratch. The crucial distinction between a golden path and a golden cage is optionality: teams should be able to diverge from the golden path when they have a legitimate reason, without filing a support ticket.

A practical golden path for a microservice typically covers:

  • Repository scaffold. A Backstage Software Template that generates a repo with CI/CD, Dockerfile, Helm chart, Prometheus metrics endpoint, and structured logging pre-wired.
  • Infrastructure module. A curated Terraform module that provisions a namespace, ServiceAccount, NetworkPolicy, HorizontalPodAutoscaler, and PodDisruptionBudget with sensible defaults.
  • Deployment pipeline. A reusable GitHub Actions workflow or ArgoCD ApplicationSet that handles image build, image signing, vulnerability scanning, and promotion across staging and production environments.
  • Observability baseline. Grafana dashboard, alerting rules, and log parsing config shipped automatically with every new service — teams get SLO dashboards on day one.

Backstage — Developer Portal and Service Catalog

Backstage is Spotify's open-source developer portal, donated to the CNCF in 2020. It provides three core capabilities that form the foundation of a production IDP: a software catalog, TechDocs for documentation, and Software Templates for scaffolding new services.

Service Catalog — catalog-info.yaml

Every service, library, API, and team is described in a catalog-info.yaml file checked into the repository root. Backstage discovers and indexes these descriptors, creating a searchable catalog of the entire software estate.

# catalog-info.yaml — checked into every service repo root
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles card payment processing and refund workflows
  annotations:
    # Link to GitHub Actions CI
    github.com/project-slug: acme/payment-service
    # Link to Grafana dashboard
    grafana/dashboard-selector: "service=payment-service"
    # Link to PagerDuty escalation policy
    pagerduty.com/service-id: P1234AB
    # Link to ArgoCD application
    argocd/app-name: payment-service-production
    # TechDocs source location (docs are built from this repo)
    backstage.io/techdocs-ref: dir:.
  tags:
    - python
    - payments
    - kafka
    - pci-dss
  links:
    - url: https://grafana.internal/d/payment-slo
      title: SLO Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payment-service
      title: Runbook
      icon: help
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: checkout-platform
  dependsOn:
    - component:postgres-payments
    - component:kafka-cluster
    - resource:stripe-api
  providesApis:
    - payment-service-v2-api
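
A descriptor is only useful if it is complete, so it can help to lint it in CI before Backstage ingests it. The sketch below is a hypothetical helper, not part of Backstage: it takes an already-parsed descriptor as a dict and checks the fields a Component needs (name, plus spec.type, spec.lifecycle, and spec.owner). The name regex is an approximation of the catalog's naming rule, and `validate_component` is an illustrative name.

```python
import re

# Approximation of the catalog naming rule: alphanumerics, optionally
# separated by '-', '_' or '.', at most 63 characters.
NAME_RE = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9._-]*[a-zA-Z0-9])?$")
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: dict) -> list[str]:
    """Return a list of problems; an empty list means the descriptor looks OK."""
    problems = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion must be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind must be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (1 <= len(name) <= 63 and NAME_RE.match(name)):
        problems.append("metadata.name is missing or malformed")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")
    return problems


# Trimmed version of the payment-service descriptor above
payment_service = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payment-service"},
    "spec": {"type": "service", "lifecycle": "production",
             "owner": "group:payments-team"},
}
# Hypothetical descriptor with no owner or lifecycle
incomplete = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "orphan-service"},
    "spec": {"type": "service"},
}

print(validate_component(payment_service))  # []
print(validate_component(incomplete))
```

Running a check like this in the repository's CI keeps ownerless services out of the catalog, which matters because ownership is what makes the catalog useful during incidents.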

Software Templates — Self-Service Scaffolding

Backstage Software Templates automate the creation of new services from a form-based UI. A developer fills in the service name, team owner, and database requirements; the template generates a repository with CI/CD, Helm chart, Dockerfile, and catalog-info.yaml pre-populated.

# template.yaml — Backstage Software Template for a Python microservice
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-microservice
  title: Python Microservice
  description: Scaffolds a production-ready Python service with Helm, CI/CD, and observability
  tags:
    - python
    - recommended
spec:
  owner: group:platform-team
  type: service

  parameters:
    - title: Service Details
      required: [name, owner, description]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{2,49}$'
          description: Lowercase, hyphens only (e.g. payment-service)
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        description:
          title: Description
          type: string
          maxLength: 200
        enableDatabase:
          title: Provision PostgreSQL database?
          type: boolean
          default: false
        kafkaTopics:
          title: Kafka topics to create (comma-separated)
          type: string
          default: ""

  steps:
    - id: fetch-template
      name: Fetch base template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          description: ${{ parameters.description }}
          enableDatabase: ${{ parameters.enableDatabase }}

    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?repo=${{ parameters.name }}&owner=acme-corp
        defaultBranch: main
        description: ${{ parameters.description }}
        repoVisibility: private
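        # The owner value is an entity reference (e.g. group:default/payments-team),
        # which is not a valid GitHub topic, and an unquoted ${{ ... }} inside a
        # YAML flow sequence does not parse, so only static topics are set here
        topics:
          - python
          - microservice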

    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Open in Catalog
        url: ${{ steps['register'].output.catalogInfoUrl }}

Note

The Backstage scaffolder runs as a backend service inside your cluster. Each action: step is implemented by a scaffolder backend module: the actions maintained in the Backstage repository cover fetching templates and publishing to GitHub, GitLab, and Bitbucket, while generic HTTP calls typically come from community modules. Custom actions (e.g., creating a PagerDuty service, provisioning a Vault namespace) are written in TypeScript and registered at startup. Keep the credentials these actions use in the backend's configuration, for example injected from Kubernetes Secrets, and never in the template YAML.
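
The service-name pattern in the template form is worth testing on its own, because the same name later feeds repository, namespace, and resource names. A minimal sketch of what the pattern accepts and rejects follows; the candidate names are made up for illustration.

```python
import re

# Same pattern as the template's `name` parameter
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]{2,49}$")


def is_valid_service_name(name: str) -> bool:
    """True if the name is 3-50 chars, starts with a letter, and uses only a-z, 0-9, '-'."""
    return SERVICE_NAME.fullmatch(name) is not None


candidates = [
    "payment-service",   # accepted
    "db",                # rejected: shorter than 3 characters
    "Payment-Service",   # rejected: uppercase
    "9lives",            # rejected: starts with a digit
    "a" * 51,            # rejected: longer than 50 characters
    "svc-",              # accepted, although it ends with a hyphen
]
results = {name: is_valid_service_name(name) for name in candidates}
for name, ok in results.items():
    print(f"{name[:20]:<20} {'ok' if ok else 'rejected'}")
```

Note the last case: the pattern allows a trailing hyphen, which Kubernetes rejects for most object names. If the service name is reused verbatim as a Kubernetes resource name, a stricter pattern such as `^[a-z][a-z0-9-]{1,48}[a-z0-9]$` may be a safer choice.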

Self-Service Infrastructure — Terraform Module Registry

The foundation of self-service infrastructure is a curated module registry. Rather than letting every team write ad-hoc Terraform, the platform team publishes opinionated modules that encode security, cost, and compliance defaults. Teams consume the module with a handful of input variables — they never touch the underlying cloud resources directly.

# modules/microservice-infra/main.tf
# Platform team module — teams consume this, not raw AWS resources

variable "service_name" {
  type        = string
  description = "Slug used to name all resources (e.g. payment-service)"
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{2,49}$", var.service_name))
    error_message = "service_name must be lowercase alphanumeric with hyphens."
  }
}
variable "team_name"       { type = string }
variable "environment"     { type = string }  # staging | production
# HCL has no ";" separator: a block with more than one argument must span lines
variable "replica_count" {
  type    = number
  default = 2
}
variable "cpu_request" {
  type    = string
  default = "250m"
}
variable "memory_request" {
  type    = string
  default = "256Mi"
}
variable "enable_hpa" {
  type    = bool
  default = true
}
variable "enable_pdb" {
  type    = bool
  default = true
}

locals {
  labels = {
    "app.kubernetes.io/name"       = var.service_name
    "app.kubernetes.io/managed-by" = "terraform"
    "platform.acme.io/team"        = var.team_name
    "platform.acme.io/env"         = var.environment
  }
}

resource "kubernetes_namespace" "this" {
  metadata {
    name   = "${var.service_name}-${var.environment}"
    labels = local.labels
    annotations = {
      "platform.acme.io/owner"       = var.team_name
      "platform.acme.io/cost-center" = var.team_name
    }
  }
}

resource "kubernetes_service_account" "this" {
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
    labels    = local.labels
    annotations = {
      # Enables IRSA: workload identity without static AWS credentials.
      # aws_iam_role.service is declared elsewhere in the module (omitted here).
      "eks.amazonaws.com/role-arn" = aws_iam_role.service.arn
    }
  }
}

resource "kubernetes_network_policy" "default_deny" {
  metadata {
    name      = "default-deny-ingress"
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    pod_selector {}
    policy_types = ["Ingress"]
  }
}

resource "kubernetes_horizontal_pod_autoscaler_v2" "this" {
  count = var.enable_hpa ? 1 : 0
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = var.service_name
    }
    min_replicas = var.replica_count
    max_replicas = var.replica_count * 5
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}

resource "kubernetes_pod_disruption_budget_v1" "this" {
  count = var.enable_pdb ? 1 : 0
  metadata {
    name      = var.service_name
    namespace = kubernetes_namespace.this.metadata[0].name
  }
  spec {
    min_available  = "50%"
    selector {
      match_labels = { "app.kubernetes.io/name" = var.service_name }
    }
  }
}

Teams consume the module with a minimal configuration file — they specify what they need, not how to build it:

# services/payment-service/infra/main.tf — owned by the product team

module "infra" {
  source  = "registry.terraform.acme.io/platform/microservice-infra/kubernetes"
  version = "~> 3.0"   # platform team manages breaking changes

  service_name   = "payment-service"
  team_name      = "payments"
  environment    = var.environment
  replica_count  = var.environment == "production" ? 3 : 1
  enable_hpa     = true
  enable_pdb     = true
}

# Optional: extend with service-specific IAM permissions
resource "aws_iam_role_policy_attachment" "stripe_secrets" {
  role       = module.infra.service_iam_role_name
  policy_arn = data.aws_iam_policy.stripe_secrets_read.arn
}
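
It helps to work out what the module's defaults mean for a given `replica_count`. The sketch below reproduces the arithmetic from the module (HPA bounds of `replica_count` to `replica_count * 5`, PDB `min_available = "50%"`), assuming Kubernetes' documented behaviour of rounding a percentage up to a whole pod. The function names are illustrative.

```python
import math


def hpa_bounds(replica_count: int) -> tuple[int, int]:
    """min_replicas and max_replicas as computed in the module."""
    return replica_count, replica_count * 5


def pdb_min_available(replicas: int, percent: int = 50) -> int:
    """Pods that must stay up; a percentage is rounded up to a whole pod."""
    return math.ceil(replicas * percent / 100)


def evictable_pods(replicas: int) -> int:
    """Pods a node drain may evict at once without violating the PDB."""
    return replicas - pdb_min_available(replicas)


for env, replicas in [("production", 3), ("staging", 1)]:
    low, high = hpa_bounds(replicas)
    print(f"{env}: HPA {low}..{high}, "
          f"min available {pdb_min_available(replicas)}, "
          f"evictable {evictable_pods(replicas)}")
# production: HPA 3..15, min available 2, evictable 1
# staging: HPA 1..5, min available 1, evictable 0
```

The staging line shows a trade-off worth knowing about: with a single replica, a 50% budget leaves nothing evictable, so voluntary node drains wait until the service scales up. Setting `enable_pdb = false` for single-replica environments is one way to handle that.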

GitOps Self-Service with ArgoCD ApplicationSets

ArgoCD ApplicationSets extend ArgoCD with generators that dynamically create Application resources from templates. The Git generator's directory mode, used here, scans a repository for directories matching a path pattern and creates one ArgoCD Application per directory (a separate file mode does the same per configuration file). This gives teams a pull-request-based, self-service deployment mechanism without direct ArgoCD API access.

# argocd/applicationset-services.yaml
# Scans apps/ in the repo: each immediate subdirectory becomes an ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://github.com/acme/platform-apps.git
        revision: HEAD
        directories:
          - path: apps/*
  template:
    metadata:
      name: '{{ .path.basename }}'
      annotations:
        notifications.argoproj.io/subscribe.on-sync-failed.slack: deployments-alerts
    spec:
      project: platform-services
      source:
        repoURL: https://github.com/acme/platform-apps.git
        targetRevision: HEAD
        path: '{{ .path.path }}'
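        helm:
          valueFiles:
            - values.yaml
            # Env-specific overrides. .path.segments is [apps, <service>], so it
            # cannot supply the environment; this ApplicationSet (one per
            # environment) names the file directly.
            - values-production.yaml
          ignoreMissingValueFiles: true  # services may ship only values.yaml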
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{ .path.basename }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
        retry:
          limit: 3
          backoff:
            duration: 30s
            factor: 2

Teams onboard a new service by adding a subdirectory under apps/ via pull request. No ArgoCD admin access is required, and no ticket to the platform team. After the PR merges, the ApplicationSet controller picks up the new directory and creates the Application automatically.

# apps/payment-service/Chart.yaml — team-owned Helm chart
apiVersion: v2
name: payment-service
version: 1.0.0
appVersion: "2.14.0"
dependencies:
  - name: common
    version: "~> 1.5"
    repository: "https://charts.platform.acme.io"  # platform-provided base chart

---
# apps/payment-service/values.yaml
replicaCount: 2
image:
  repository: ghcr.io/acme/payment-service
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi

service:
  port: 8080
  metricsPort: 9090

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  host: payments.internal.acme.io

env:
  LOG_LEVEL: info
  OTEL_SERVICE_NAME: payment-service
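
To make the directory-to-Application mapping concrete, here is a simplified emulation of what the Git directory generator does with an `apps/*` pattern. It is an illustration only, not the controller's implementation; the directory list and the `generate_applications` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(path: str, pattern: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/'."""
    path_parts, pattern_parts = path.split("/"), pattern.split("/")
    return len(path_parts) == len(pattern_parts) and all(
        fnmatchcase(part, glob) for part, glob in zip(path_parts, pattern_parts)
    )


def generate_applications(repo_dirs: list[str], pattern: str = "apps/*") -> list[dict]:
    """One Application per matching directory, mirroring the template fields."""
    apps = []
    for directory in sorted(repo_dirs):
        if not matches(directory, pattern):
            continue
        path = PurePosixPath(directory)
        apps.append({
            "name": path.name,             # {{ .path.basename }}
            "namespace": path.name,        # destination namespace
            "source_path": directory,      # {{ .path.path }}
            "segments": list(path.parts),  # {{ .path.segments }}
        })
    return apps


# Hypothetical repository layout
repo_dirs = [
    "apps/payment-service",
    "apps/checkout-api",
    "apps/payment-service/templates",  # nested: not matched by apps/*
    "docs",
]
apps = generate_applications(repo_dirs)
for app in apps:
    print(app["name"], app["source_path"], app["segments"])
# checkout-api apps/checkout-api ['apps', 'checkout-api']
# payment-service apps/payment-service ['apps', 'payment-service']
```

The output also shows why the second path segment is the service name rather than an environment: with this layout, any per-environment value has to come from somewhere other than the path.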

Crossplane — Kubernetes-Native Cloud Provisioning

Crossplane extends Kubernetes with custom resource definitions (CRDs) that represent cloud infrastructure: RDS instances, S3 buckets, Redis clusters, and more. The platform team defines Composite Resource Definitions (XRDs) and matching Compositions: opinionated, org-specific abstractions over raw cloud resources. Product teams create instances of these composites in YAML, the same way they create Deployments.

# platform/xrd-postgres.yaml — Composite Resource Definition (XRD)
# Defines the abstract "PostgreSQL" resource that product teams request
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqlinstances.platform.acme.io
spec:
  group: platform.acme.io
  names:
    kind: XPostgreSQLInstance
    plural: xpostgresqlinstances
  claimNames:           # teams create PostgreSQLInstance (namespaced claim)
    kind: PostgreSQLInstance
    plural: postgresqlinstances
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [parameters]
              properties:
                parameters:
                  type: object
                  required: [storageGB, tier]
                  properties:
                    storageGB:
                      type: integer
                      minimum: 20
                      maximum: 500
                    tier:
                      type: string
                      enum: [dev, standard, performance]
                    multiAZ:
                      type: boolean
                      default: false

---
# platform/composition-postgres.yaml — maps abstract spec to AWS RDS
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgresqlinstances.aws.platform.acme.io
spec:
  compositeTypeRef:
    apiVersion: platform.acme.io/v1alpha1
    kind: XPostgreSQLInstance
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            region: eu-west-1
            engine: postgres
            engineVersion: "16"
            skipFinalSnapshot: false
            deletionProtection: true
            storageEncrypted: true
            copyTagsToSnapshot: true
            backupRetentionPeriod: 7
            performanceInsightsEnabled: true
          providerConfigRef:
            name: aws-production
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.storageGB
          toFieldPath: spec.forProvider.allocatedStorage
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.multiAZ
          toFieldPath: spec.forProvider.multiAz
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.tier
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                dev:         db.t3.micro
                standard:    db.t3.medium
                performance: db.r6g.large

Teams request a database the same way they would create a Kubernetes Deployment — via a namespaced Claim YAML in their own namespace:

# In the payment-service-production namespace — created by the product team
apiVersion: platform.acme.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: payments-db
  namespace: payment-service-production
spec:
  parameters:
    storageGB: 100
    tier: standard
    multiAZ: true
  compositeDeletePolicy: Foreground
  writeConnectionSecretToRef:
    name: payments-db-connection  # Crossplane writes creds here automatically

Note

Crossplane's Composition layer is where the platform team enforces non-negotiables: encryption at rest, deletion protection, backup retention, and tag compliance. Product teams cannot disable these settings — they are baked into the Composition, not exposed as parameters. This is the security and compliance layer that makes self-service safe.
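
The effect of the patches and the map transform can be traced with a small emulation. This is a sketch of the logic only, not how Crossplane evaluates Compositions: it validates claim parameters against the XRD's bounds, starts from the Composition's fixed base, and applies the three patches. `render_rds` and `validate_parameters` are hypothetical names.

```python
# Fixed settings from the Composition's base; never taken from the claim
BASE_FOR_PROVIDER = {
    "region": "eu-west-1",
    "engine": "postgres",
    "engineVersion": "16",
    "skipFinalSnapshot": False,
    "deletionProtection": True,
    "storageEncrypted": True,
    "copyTagsToSnapshot": True,
    "backupRetentionPeriod": 7,
    "performanceInsightsEnabled": True,
}

# The `map` transform on spec.parameters.tier
TIER_TO_INSTANCE_CLASS = {
    "dev": "db.t3.micro",
    "standard": "db.t3.medium",
    "performance": "db.r6g.large",
}


def validate_parameters(params: dict) -> None:
    """Mirror the XRD schema: required fields, storage bounds, tier enum."""
    for key in ("storageGB", "tier"):
        if key not in params:
            raise ValueError(f"missing required parameter: {key}")
    if not 20 <= params["storageGB"] <= 500:
        raise ValueError("storageGB must be between 20 and 500")
    if params["tier"] not in TIER_TO_INSTANCE_CLASS:
        raise ValueError(f"unknown tier: {params['tier']}")


def render_rds(params: dict) -> dict:
    """Apply the three FromCompositeFieldPath patches on top of the base."""
    validate_parameters(params)
    for_provider = dict(BASE_FOR_PROVIDER)
    for_provider["allocatedStorage"] = params["storageGB"]
    for_provider["multiAz"] = params.get("multiAZ", False)  # XRD default
    for_provider["instanceClass"] = TIER_TO_INSTANCE_CLASS[params["tier"]]
    return for_provider


# The payments-db claim, plus an attempt to switch encryption off
rendered = render_rds({"storageGB": 100, "tier": "standard",
                       "multiAZ": True, "storageEncrypted": False})
print(rendered["instanceClass"], rendered["storageEncrypted"])
# db.t3.medium True
```

Because only patched fields flow from the claim into the managed resource, the stray `storageEncrypted` value has no path to the RDS instance; in a real cluster the API server would not accept it either, since the XRD schema does not declare it.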

DORA Metrics — Measuring Platform Team Impact

The DORA four key metrics are the standard for measuring software delivery performance: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Failed Deployment Recovery Time. For a platform team, these metrics serve a dual purpose: measure the platform itself (how reliable is our golden path?) and measure the impact of the platform on product teams (did our self-service tooling increase their deploy frequency?).

# scripts/dora-report.py
# Generates DORA metrics from GitHub Actions and PagerDuty data

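import os
import statistics
from datetime import datetime, timedelta, UTC  # datetime.UTC needs Python 3.11+

import requests

# Environment variable names are assumptions; adapt them to your secret store
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")
PD_TOKEN = os.environ.get("PD_TOKEN", "")  # for the PagerDuty incident query (not shown)
ORG = "acme-corp"
LOOKBACK_DAYS = 30
# Second-precision UTC timestamp for the `created` filter
SINCE = (datetime.now(UTC) - timedelta(days=LOOKBACK_DAYS)).strftime("%Y-%m-%dT%H:%M:%SZ")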

def get_production_deployments(repo: str) -> list[dict]:
    """Fetch successful production deploy workflow runs."""
    url = f"https://api.github.com/repos/{ORG}/{repo}/actions/runs"
    runs = []
    page = 1
    while True:
        resp = requests.get(url, params={
            "branch": "main",
            "status": "success",
            "event": "push",
            "created": f">={SINCE}",
            "per_page": 100,
            "page": page,
        }, headers={"Authorization": f"Bearer {GITHUB_TOKEN}"}, timeout=30)
        resp.raise_for_status()  # fail loudly on auth or rate-limit errors
        data = resp.json()
        batch = [r for r in data.get("workflow_runs", [])
                 if r["name"] == "Deploy to Production"]
        runs.extend(batch)
        if len(data.get("workflow_runs", [])) < 100:
            break
        page += 1
    return runs


def deployment_frequency(deployments: list[dict]) -> float:
    """Deployments per day over the lookback window."""
    return len(deployments) / LOOKBACK_DAYS


def lead_time_hours(deployments: list[dict]) -> float:
    """Median hours from commit to production deployment."""
    lead_times = []
    for run in deployments:
        # updated_at approximates when the successful run finished; created_at
        # would leave the pipeline's own duration out of the lead time
        deployed = datetime.fromisoformat(run["updated_at"].replace("Z", "+00:00"))
        # head_commit timestamp = when the triggering commit was made
        committed = datetime.fromisoformat(
            run["head_commit"]["timestamp"].replace("Z", "+00:00")
        )
        lead_times.append((deployed - committed).total_seconds() / 3600)
    return statistics.median(lead_times) if lead_times else 0


def change_failure_rate(deployments: list[dict], incidents: list[dict]) -> float:
    """Percentage of deployments that caused a production incident."""
    if not deployments:
        return 0.0
    incident_times = [
        datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
        for inc in incidents
    ]
    # A deployment counts as failed (once) if any incident opened within
    # one hour after it, so the rate can never exceed 100%
    failures = 0
    for dep in deployments:
        dep_time = datetime.fromisoformat(dep["created_at"].replace("Z", "+00:00"))
        if any(timedelta(0) <= (t - dep_time) <= timedelta(hours=1)
               for t in incident_times):
            failures += 1
    return (failures / len(deployments)) * 100


def print_dora_report(repo: str) -> None:
    deps = get_production_deployments(repo)
    df = deployment_frequency(deps)
    lt = lead_time_hours(deps)
    print(f"\n=== DORA Report: {repo} (last {LOOKBACK_DAYS} days) ===")
    print(f"  Deployment Frequency : {df:.1f} deploys/day")
    print(f"  Median Lead Time     : {lt:.1f} hours")
    # Change Failure Rate requires PagerDuty data (omitted for brevity)
    # Classify against DORA performance tiers (Elite / High / Medium / Low)
    freq_tier = "Elite" if df >= 1 else "High" if df >= 0.14 else "Medium" if df >= 0.03 else "Low"
    lt_tier   = "Elite" if lt < 1 else "High" if lt < 24 else "Medium" if lt < 168 else "Low"
    print(f"  Deploy Freq Tier     : {freq_tier}")
    print(f"  Lead Time Tier       : {lt_tier}")
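
The script above covers three of the four metrics. Failed Deployment Recovery Time can be sketched in the same style. The example below runs on synthetic incident records; the `created_at` and `resolved_at` keys are assumptions about the shape of your incident export, so check them against what your PagerDuty integration actually returns.

```python
import statistics
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recovery_time_hours(incidents: list[dict]) -> float:
    """Median hours from incident creation to resolution; open incidents are skipped."""
    durations = [
        (parse_ts(inc["resolved_at"]) - parse_ts(inc["created_at"])).total_seconds() / 3600
        for inc in incidents
        if inc.get("resolved_at")
    ]
    return statistics.median(durations) if durations else 0.0


# Synthetic data for illustration only
sample_incidents = [
    {"created_at": "2026-05-01T10:00:00Z", "resolved_at": "2026-05-01T10:30:00Z"},  # 0.5 h
    {"created_at": "2026-05-07T22:15:00Z", "resolved_at": "2026-05-08T00:15:00Z"},  # 2.0 h
    {"created_at": "2026-05-20T08:00:00Z", "resolved_at": "2026-05-20T09:00:00Z"},  # 1.0 h
    {"created_at": "2026-05-29T14:00:00Z", "resolved_at": None},                    # still open
]

print(f"Median recovery time: {recovery_time_hours(sample_incidents):.1f} hours")
# Median recovery time: 1.0 hours
```

The median is used here for the same reason as in the lead-time function: a single long-running incident should not dominate the figure.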

Platform Team as Product Team — Organisational Patterns

The most common failure mode in platform engineering is treating the platform team as a traditional infrastructure team: reactive, ticket-driven, measured by uptime alone. A platform that product teams are forced to use — but do not want to use — is a tax, not an accelerator.

Effective platform teams operate with a product mindset:

  • Developer NPS. Run quarterly surveys asking product engineers to rate the platform on ease of use, reliability, and documentation. Target NPS above 30. Treat scores below 20 as incidents.
  • Adoption metrics. Track the percentage of services using the golden path template, the standard Terraform module, and the Backstage catalog. Low adoption is a signal, not a discipline problem — it means the platform is not solving the right problems.
  • Internal SLAs. Publish SLAs for platform capabilities: "New service scaffold completes in under 5 minutes," "Terraform plan runs complete in under 3 minutes," "Backstage catalog search returns in under 500ms." Track and report these like any production SLO.
  • Office hours and embedded engineers. Two hours of platform office hours per week closes the feedback loop faster than any ticket queue. Rotating a platform engineer into a product team for one sprint per quarter generates the most actionable product insights.
  • Public roadmap. A visible, prioritised roadmap (even a simple Backstage catalog page listing planned features) builds trust and reduces shadow-IT — product teams stop building their own tooling when they can see that the platform will solve their problem in Q2.
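
Two of these measures reduce to simple arithmetic, sketched below on made-up data. NPS uses the standard definition (share of scores 9 to 10 minus share of scores 0 to 6, on a 0 to 10 scale); the service inventory and its `golden_path_template` flag are hypothetical stand-ins for whatever your catalog exposes.

```python
def net_promoter_score(scores: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def adoption_rate(services: list[dict], capability: str) -> float:
    """Percentage of services for which the given capability flag is truthy."""
    if not services:
        return 0.0
    return 100 * sum(1 for s in services if s.get(capability)) / len(services)


# Made-up quarterly survey: 4 promoters, 4 passives, 2 detractors
survey = [10, 9, 9, 8, 8, 7, 7, 6, 5, 10]

# Made-up service inventory
services = [
    {"name": "payment-service", "golden_path_template": True},
    {"name": "checkout-api", "golden_path_template": True},
    {"name": "ledger", "golden_path_template": True},
    {"name": "legacy-billing", "golden_path_template": False},
]

print(f"Platform NPS: {net_promoter_score(survey):.0f}")
print(f"Golden path adoption: {adoption_rate(services, 'golden_path_template'):.0f}%")
# Platform NPS: 20
# Golden path adoption: 75%
```

In this example the platform sits below the target of 30 despite 75% adoption, which is the kind of gap that suggests teams are using the golden path without being happy with it.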

Build vs Buy vs Configure — IDP Technology Decisions

Every platform team faces the same question: for each capability, should we build it in-house, buy a SaaS solution, or configure an open-source tool? The right answer depends on team size, cloud maturity, and the strategic importance of the capability.

  • Developer portal (Backstage vs Port vs Cortex). Backstage is the industry default for organisations with strong TypeScript capability and complex integration requirements. Port and Cortex are SaaS alternatives that reduce operational overhead at the cost of flexibility. For teams under 50 engineers, a SaaS portal is almost always the right choice.
  • Infrastructure provisioning (Crossplane vs Terraform Cloud vs Pulumi). Crossplane shines when teams are Kubernetes-native and want a single control plane for workloads and infrastructure. Terraform Cloud (or Scalr, Env0) works better for teams with existing Terraform investment and multi-cloud requirements. Use Crossplane only when you have the Kubernetes operational maturity to run it safely.
  • GitOps (ArgoCD vs Flux). Both are CNCF-graduated and production-proven. ArgoCD has a richer UI and ApplicationSets for multi-tenancy; Flux is more lightweight and Kubernetes-native in its design. Choose ArgoCD if UI-driven visibility is important to your platform; choose Flux if you prefer pure GitOps with minimal cluster footprint.
  • Secrets management (Vault vs External Secrets Operator). HashiCorp Vault is the enterprise default for centralised secrets with fine-grained policies. The External Secrets Operator bridges Kubernetes with AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault — a simpler operational model for teams already committed to a single cloud provider.

Note

The most common mistake is building too much too soon. Start with the golden path that addresses your teams' biggest pain point — usually CI/CD templating or Kubernetes namespace provisioning — and measure adoption before investing in a full Backstage deployment. A markdown runbook that saves two hours per new service is more valuable than a half-finished developer portal that nobody uses.

