Building Healthy Relationships

Tell me if this sounds familiar.

Your platform engineering team wants to offer some things in a self-service way to developers. You start with the GitHub repo so that you can get centralized visibility into the projects your teams are creating, establish ownership of each project, and enable approval requests through GitOps workflows. You finish building your object that developers can consume and it looks like this:

apiVersion: platform.your-org.com/v1beta1
kind: GitRepository
metadata:
  name: my-repo
spec:
  access:
    - team: payments
      role: maintain
    - team: search
      role: write
  branchProtection:
    - branchPattern: 'main'
      rules:
        requireConversationResolution: true
        requiredLinearHistory: true
        requiredPullRequestReviews:
          - dismissStaleReviews: true
            requireCodeOwnerReviews: true
            requireLastPushApproval: true
            allowBypass:
              - my-github-app
        requireSignedCommits: false
        requireStatusChecksToPass:
          - security-scan
          - build
          - publish
        allowForcePushes: false
        allowDeletions: false

Everyone is using that composition and it works pretty well. The developers get an experience they are familiar with, just packaged a little differently. Easy peasy. Adoption is high and complaints are low, so you want to try this self-service thing again. Next, you want to offer database management to developers. The resource you come up with looks like this:

apiVersion: platform.your-org.com/v1beta1
kind: Database
metadata:
  name: my-database
spec:
  availabilityZones:
    - us-east-1a
    - us-east-1b
    - us-east-1c
  vpc: production
  subnetGroupName: production-databases
  instances:
    replicas: 4
    class: "db.r6i.2xlarge"
  globalLoadBalancer: false
  backups:
    enabled: true
    retention: 4
    schedule: "0 4 * * SAT"
    destination: s3
    bucketName: my-database-backup
  tags:
    team: payments
    environment: production

Your developers start using this and it works reasonably well too. They still need to request access through a help desk ticket, though, because you don’t want anyone approving their own PRs to grant themselves excessive permissions. Sometimes they also need help with things like specifying the correct availability zones and the correct VPC. There is a fair bit of copying and pasting of this manifest, so these resources tend to cluster in common zones, but overall everyone is still happy. It’s faster to open a PR and have DevOps approve it than the ticket-ops approach you came from anyway.

One day, your manager comes to you and says, “We are going to apply for $X security certification. We need to make sure our platform is compliant with the required specifications.” By this point, you have hundreds of repositories for all of your microservices and dozens of databases in addition to the other resources that you have decided to offer through self-service workflows. It’s going to be a big effort to figure out what settings are required to meet those compliance standards across all of your resources, find what projects are or aren’t already compliant, discuss with the teams what changes need to be made, implement those changes in every individual manifest, and find a way to prevent someone who is uninformed from reverting those changes. This security compliance is turning into a big project.

I’m using Crossplane resources as examples here but the concept would be exactly the same if you were using Terraform. This pattern is extremely common. I’ve built it myself. DevOps teams expose what is necessary to create those resources and hand them off to the dev teams. The thing is, though, I think these patterns, while fine in the short term, are really awful over the long term and especially break down at scale.

Why?

There are two primary issues with the above approach:

First, these examples represent pseudo self-service resources. A developer could use these independently but they will inevitably be confused or misinformed about which options are correct for them to use. The options that are exposed are what ops thinks is important or necessary, not what developers care about. They may not know that db.r6i nodes are what your company’s compute savings plan currently covers. They select db.r7i and are suddenly spending a lot more for a service that doesn’t benefit from the faster machine. Or they may not be aware that there are cross-zonal billing considerations to account for when specifying multiple availability zones. It’s not their fault, it’s just not their responsibility to know these things. In those situations, they need guidance from a DevOps engineer.

These pseudo self-service resources are essentially a shared resource between dev and ops teams. The goal though is to have a fully self-service platform that developers can operate themselves. We don’t get there by exposing configuration options. We expose intent.

Second, there is no connection between the resources in our examples. A repository is a repository. A database is a database. If you need to implement minimum baselines for compliance that span multiple resource types, you need to track down every active usage of every self-service resource across every app. It is very tedious, prone to error, and still requires that the development teams understand which settings they shouldn’t change after the fact because doing so would break the compliance standard you are trying to implement.

Instead, all resources should automatically adjust to the intended state the development team wants their project to exist in. To achieve that we need to establish relationships between the object where those standards are defined and the resources being used by that project. Databases can’t simply be databases anymore. They need to be databases in use by a specific project.

Let’s build an example to see what that looks like.

The High Level

Imagine we have a top-level resource called a Project. When a dev team wants to start a new project, what is the first thing they need? A git repo. It’s not really possible to build something without any code so this is the logical “top” of that particular app’s platform. It might look something like this:

apiVersion: platform.your-org.com/v1beta1
kind: Project
metadata:
  name: really-cool-microservice
spec:
  owner: "[email protected]"
  additionalCollaborators:
    - "[email protected]"
  description: "A really cool feature for searching existing data"
  serviceClass: backend-api
  priorityClass: high
  deploymentLevel: production
  infrastructure:
    buildContainerImage: true
  contactDetails:
    jiraBoard: SEARCH
    slackChannel: "#engineering"

What this creates:

A GitHub repo based on .metadata.name
Team access to that repo. The owner gets Maintain. Any collaborator gets Write.
A standardized set of branch protection rules
A CI workflow and a container registry to build and publish a container image
A catalog-info.yaml file created in the repository so that the project is automatically ingested into Backstage.

This is our top-level resource. What comes under it? Everything that project needs to function.

The core concept is that this Project resource is the unit of ownership and everything it requires (databases, IAM roles, storage buckets, pub/sub topics, etc) is an “attachment” to this top-level construct.

Let’s see what this looks like in practice.

Imagine we want to offer databases and storage buckets through a self-service workflow. Those resources could look like this:

apiVersion: platform.your-org.com/v1beta1
kind: Database
metadata:
  name: really-cool-microservice
spec:
  internalProjectRef:
    name: really-cool-microservice
  engine: postgres
  version: 17
  size: large

apiVersion: platform.your-org.com/v1beta1
kind: Bucket
metadata:
  name: really-cool-microservice
spec:
  internalProjectRef:
    name: really-cool-microservice

In these examples, we define simplified objects and refer to the top-level resource using .spec.internalProjectRef instead of defining individual configuration details. Each child resource reads values from its parent and makes rendering decisions based on the state it finds. How the infrastructure for these resources is ultimately constructed depends on the state of the top-level object.
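As a rough illustration of that lookup, here is a minimal Python sketch of a child resource following its reference and rendering from what it finds. The Project model, the render_database function, the output field names, and the "search-team" owner are all hypothetical. In a real platform this logic would live in whatever composition or templating layer you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """The subset of Project fields a child resource reads (hypothetical model)."""
    name: str
    owner: str
    priority_class: str    # e.g. "high" or "low"
    deployment_level: str  # e.g. "dev" or "production"


def resolve_project(child, projects):
    """Follow .spec.internalProjectRef.name to the parent Project."""
    ref = child["spec"]["internalProjectRef"]["name"]
    if ref not in projects:
        # Without a parent there is no state to render from, so stop here.
        raise LookupError(f"Project {ref!r} not found; refusing to render")
    return projects[ref]


def render_database(child, projects):
    """Combine the child's few fields with state read from its parent."""
    project = resolve_project(child, projects)
    spec = child["spec"]
    return {
        "name": child["metadata"]["name"],
        "engine": spec["engine"],
        "version": spec["version"],
        "size": spec["size"],
        # Everything below comes from the parent, not from the developer.
        "owner": project.owner,
        "multiAZ": project.priority_class == "high",
        "environment": project.deployment_level,
    }


projects = {
    "really-cool-microservice": Project(
        name="really-cool-microservice",
        owner="search-team",  # placeholder owner for illustration
        priority_class="high",
        deployment_level="production",
    )
}
database = {
    "metadata": {"name": "really-cool-microservice"},
    "spec": {
        "internalProjectRef": {"name": "really-cool-microservice"},
        "engine": "postgres",
        "version": 17,
        "size": "large",
    },
}
rendered = render_database(database, projects)
print(rendered)
```

The developer only supplies engine, version, and size. Ownership, availability, and environment are read from the Project, so changing the Project changes every attached resource on the next render.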

What kind of rendering decisions can you make with this kind of structure? Here are some examples:

Automatically delegate access: The deploymentLevel is considered production based on the Project spec. The defined owner could get read-only access to both the bucket and database. Collaborators could get no access. If the deploymentLevel were still dev, both owners and collaborators could have read-write access. Deploying to a production environment would depend on changing deploymentLevel to production. If there is a reorganization of teams in your company, changing the owner or collaborators automatically delegates access to those new teams to both the database and the bucket simultaneously. Expand this pattern to anything (SQS, Jupyter notebooks, your secrets provider, etc) and you can see the bulk of your IAM management is entirely removed from your plate.
Establish standards: With size: large for the database, the node class would be defined and controlled by the platform team to ensure full coverage under the org’s compute savings plan. Developers don’t need to be concerned about choosing the wrong option. They have a big service, they get a big database and move on with their day.
Enhance reliability: With a priorityClass set to high, we could create cross-regional database replicas, backup schedules, retention policies, and global load balancers by default. The bucket could have (or not have) a LifecycleConfiguration that automatically shifts objects to Glacier storage based on that same field. If the priorityClass were low, maybe the database would only operate in a single availability zone to cut costs on cross-zonal network traffic.
Automatic alerting: Because we track .spec.contactDetails in the Project resource, we can automatically set up escalation policies for things like database incidents (CPU throttling, replication delays, long-running queries, etc). This can be expanded to any new infrastructure resource attached to the project. Modifying these centrally defined contact points automatically modifies the alert policies for EVERY resource attached to that project.
Consistent tagging: The tagging approach is pre-defined by the platform team, follows a predictable pattern based on the options defined in the provisioned resources, and not exposed to the end user. This makes it easier to track billing and to give teams the insights to make their own modifications based on cost savings measures. It would no longer be the ops team’s responsibility to find cost savings across the entire platform. The financial data could be exposed to developers directly to act on however they best see fit. If the team decides that they could get away with a medium database instead of large, they can do it. No ops team involvement necessary.
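A few of these decisions can be sketched as plain functions over the Project spec. This is a minimal Python sketch, not a definitive implementation: the function names, the output keys, the team names, and the size-to-instance-class table are assumptions made for illustration, and the instance classes are placeholders rather than recommendations.

```python
# Hypothetical lookup table owned by the platform team.
SIZE_TO_INSTANCE_CLASS = {
    "small": "db.r6i.large",
    "medium": "db.r6i.xlarge",
    "large": "db.r6i.2xlarge",
}


def access_for(project):
    """Map each team to an access level based on deploymentLevel."""
    spec = project["spec"]
    collaborators = spec.get("additionalCollaborators", [])
    if spec["deploymentLevel"] == "production":
        # Production: the owner can read, collaborators get nothing.
        return {spec["owner"]: "read-only"}
    # Before production: owner and collaborators can read and write.
    return {team: "read-write" for team in [spec["owner"], *collaborators]}


def database_settings(project, size):
    """Derive database settings from intent instead of exposed options."""
    spec = project["spec"]
    high_priority = spec["priorityClass"] == "high"
    return {
        # The developer says "large"; the platform team picks the class.
        "instanceClass": SIZE_TO_INSTANCE_CLASS[size],
        "crossRegionReplica": high_priority,
        "backupsEnabled": high_priority,
        "availabilityZoneCount": 3 if high_priority else 1,
        # Alerts go wherever the Project says its contacts are.
        "alertChannel": spec["contactDetails"]["slackChannel"],
        # Tags follow one predictable pattern and are not user-editable.
        "tags": {
            "project": project["metadata"]["name"],
            "owner": spec["owner"],
            "environment": spec["deploymentLevel"],
        },
    }


project = {
    "metadata": {"name": "really-cool-microservice"},
    "spec": {
        "owner": "search",                        # placeholder team name
        "additionalCollaborators": ["payments"],  # placeholder team name
        "priorityClass": "high",
        "deploymentLevel": "production",
        "contactDetails": {"jiraBoard": "SEARCH", "slackChannel": "#engineering"},
    },
}
print(access_for(project))
print(database_settings(project, "large"))
```

Because every function takes the Project as input, a reorganization or a change in priority is a one-line edit to the Project rather than an edit to each attached resource.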

I mentioned compliance standardization earlier. This is one of my favorite use cases for this pattern. For example, imagine adding this to your top-level resource:

apiVersion: platform.your-org.com/v1beta1
kind: Project
metadata:
  name: really-cool-microservice
spec:
  complianceStandards:
    - ISO
    - SOC2
  ...

The effects of setting this would trickle down to all managed resources. You would get minimum standards for branch protection rules, infrastructure access, modified data retention policies, etc. The development teams need to know literally nothing about the details of how this works. They say “I need to be SOC2 compliant” and 5 minutes later ALL resources related to that project are compliant. Everything. Teams can do it themselves, ops is never involved, and we can be more confident that those settings won’t be accidentally reverted because we expose desired intent instead of individual configuration options to end users.
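One way to picture the trickle-down is as a merge of minimum baselines that is applied on top of whatever a resource would otherwise render. The Python sketch below assumes a hypothetical BASELINES table. The setting names and values are invented for illustration and are not taken from the ISO or SOC 2 texts.

```python
# Hypothetical baselines. Values are illustrative only.
BASELINES = {
    "ISO": {"requireSignedCommits": True, "backupRetentionDays": 30},
    "SOC2": {"requireCodeOwnerReviews": True, "backupRetentionDays": 90},
}


def effective_baseline(standards):
    """Merge baselines, keeping the strictest value for each setting."""
    merged = {}
    for standard in standards:
        for key, value in BASELINES[standard].items():
            if key not in merged:
                merged[key] = value
            elif isinstance(value, bool):
                # Once any standard requires a control, it stays required.
                merged[key] = merged[key] or value
            else:
                # For numeric minimums, the larger value wins.
                merged[key] = max(merged[key], value)
    return merged


def apply_baseline(settings, baseline):
    """Raise rendered settings to the baseline; never lower them."""
    result = dict(settings)
    for key, minimum in baseline.items():
        current = result.get(key)
        if isinstance(minimum, bool):
            result[key] = bool(current) or minimum
        else:
            result[key] = max(current or 0, minimum)
    return result


baseline = effective_baseline(["ISO", "SOC2"])
rendered = apply_baseline(
    {"requireSignedCommits": False, "backupRetentionDays": 4},
    baseline,
)
print(rendered)
```

Since the baseline is applied at render time, a setting cannot be reverted by editing a manifest. The only way to drop below the minimum is to remove the standard from the Project, which is a visible and reviewable change.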

If you’ve never been involved in compliance standardization before, let me break it down for you:

That’s huge. Like.. massively huge.

What’s the catch?

This approach pretty heavily abstracts the underlying resource configurations. It takes options out of developers’ hands and is, by nature, more restrictive for them. The trade-off is that they gain speed through predictability and reduced cognitive load. No one asks for a db.r7i.2xlarge postgres db with cross-zonal routing. They just ask for a database and we give it to them. They don’t need to worry about those options so we remove that responsibility from their plates and they get back to the real work of implementing core business logic. If there are rough edges to a composed resource from their perspective, it’s easy to work around because it’s predictable.

To be successful with implementing these abstractions, you will likely need two things: a social contract and a lot of documentation.

The most important, I believe, is the social contract between the ops/platform team and the dev teams consuming these resources. That contract should roughly be established as:

The platform team promises to deliver the widest possible permissions to a resource. An access restriction will only exist if it is in service to a security measure, compliance requirement, or company policy, in that order.

As platform teams, we are often disconnected from the day-to-day needs of the developers. We can’t assume what they should or shouldn’t have. By default, we should try to give them access to whatever they want as long as it isn’t causing an unnecessary burden on another team or violating a security/compliance guideline. The reasons for implementing a restriction in access need to be documented for transparency though. With these things in place, discussions about granting additional access or reasons for a restriction have a foundation in collaboration between dev and ops rather than a friction point of “platform won’t give me access to X.”

The second requirement, docs, is important more from a defensive position. We don’t want an uptick of inbound tickets to platform or SRE teams asking how to achieve something with their infrastructure. We need to describe the effect of modifying each field in our resources so that there are no surprises for the devs. If they need a backup policy implemented for a database, it may not be obvious that they need to set priorityClass to high in their Project manifest. This information needs to be easily discoverable and understandable.

Don’t forget AI

Everyone’s favorite topic..

When we have our top-level Project resource automatically creating a Backstage catalog-info.yaml file, it means that all apps are registered in the software catalog automatically. If, likewise, all of our custom resources are also following this pattern, they should all be included in that catalog as well. Who loves consistent structures, well-documented resources, and a centralized location for discovery? Agents.

It would be trivially easy to build an agent that refers to the Backstage catalog to see if it’s possible to accomplish the task a user asks for. Imagine this exchange:

User

I need a postgres database.

Agent

Let me check to see if I can do that.

Fetch(https://backstage.your-org.com/)
  200 OK

Sure! I can create a postgres database for you. I just have a few questions:

What project will be using this database?
What version of postgres do you need?

Creating this resource is absurdly simple. Does the developer care about the deployment pipeline or provisioning system for this database? NO! Do they need to remember where the web form they need to fill out is? NO! Adhere to a PR template? NO! They shouldn’t have to either. None of that is their responsibility. If they need a database, they get a database and we get out of their way. No Jira tickets for access requests. No Slack messages for clarification on config options.

Offering these custom resources means that we have already established guardrails in our workflow too. It is entirely safe for an agent to be managing your infrastructure when these are the primitives it is working with.
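To make the guardrail idea concrete, here is a minimal Python sketch of the only thing such an agent would need to produce: a small Database manifest built from the user’s answers and checked against a short list of allowed values. The function name, the allowed sets, and the default size are assumptions for illustration. In practice the list of known projects would come from your catalog.

```python
API_VERSION = "platform.your-org.com/v1beta1"
ALLOWED_ENGINES = {"postgres"}                # hypothetical allow-list
ALLOWED_SIZES = {"small", "medium", "large"}  # hypothetical allow-list


def build_database_manifest(project, version, known_projects,
                            engine="postgres", size="small"):
    """Turn the agent's answers into a manifest, rejecting anything off-menu."""
    if project not in known_projects:
        raise ValueError(f"unknown project {project!r}")
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"engine {engine!r} is not offered")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size {size!r} is not offered")
    return {
        "apiVersion": API_VERSION,
        "kind": "Database",
        "metadata": {"name": project},
        "spec": {
            # The agent can only attach the database to an existing Project.
            "internalProjectRef": {"name": project},
            "engine": engine,
            "version": version,
            "size": size,
        },
    }


manifest = build_database_manifest(
    project="really-cool-microservice",
    version=17,
    known_projects={"really-cool-microservice"},
)
print(manifest)
```

The agent never touches instance classes, availability zones, or IAM. Those are still decided by the platform from the Project, so the worst a confused agent can do is request a resource the platform already considers acceptable.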

Key Takeaways

Expose intent, not configuration options.
Build relationships between resources, not standalone objects.
Document:
- What resources are available for consumption.
- What happens when a specific config option is set.
- Why you restricted access to something.
Leave space for discussions with the dev teams to reevaluate the decisions you’ve made.