Platform engineering as a standards body
I believe there is a subtle but important misconception about the role platform engineering plays within an org, one held by both corporate leadership and my engineering colleagues. The assumption is that a platform team’s prime directive is to build tooling that makes other engineers’ lives easier or to productize the infrastructure layer of an organization.
This isn’t necessarily wrong. It just misses a much MUCH more important aspect of our field.
Instead, I argue that platform engineering would be better described as an internal standards body. Our prime directive is to solicit feedback about a pain point or an inconsistent aspect of an internal environment, establish a standard for best practices based on that feedback, and document our findings.
At this point, a platform team’s role could continue with the normal tooling we all expect (IDP, IaC, CI/CD, etc.), but we could also, if appropriate, simply publish the standard along with an enforcement policy that puts the standard into effect. In this case, the work of conforming to the standard shifts away from platform engineering and onto other teams. The platform team didn’t really build anything in this situation, but they are influencing the wider direction of the platform. Are they any less a platform team if this is the approach taken?
The nuts and bolts of actually building a platform with an IDP, IaC stack, and CI/CD pipelines are simply the final output of the effort spent establishing standards. We use all of those things to implement the final product, but the final product is much less important than the standards themselves.
Consider another example: your org wants to adopt an IDP like Backstage. The org determines that the role of Backstage should ultimately be to track details about things like a service’s source repository, its artifacts in a container registry, its Kubernetes manifests in a separate deployment repository, and the health of its resources in a destination cluster.
Now imagine that there is no standardized naming convention across each of those different resource types. Maybe the source repo was created with an original name but the Kubernetes manifests follow a colloquial name. Maybe a suffix was added at some point to distinguish from something else that is no longer relevant. Orchestrating this would be possible but painful and slow. Velocity wouldn’t be achieved by adopting an IDP. It would be achieved by establishing a standard for resource creation, automated tracking of relationships between resources, and, within those automations, building a consistent naming pattern. With those standards established, we can rename resources to match the standard and safely onboard existing services to the platform.
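To make the naming point concrete, here is a minimal sketch of what a published naming standard could look like once it is written down as code. Everything specific in it is an assumption for illustration: the name pattern, the length limits, the `example.com` hosts, and the `derive_resources` and `find_drift` helpers are hypothetical and not part of Backstage or any real tool. The idea is only that every related resource name is derived from one service name, so drift can be detected mechanically.

```python
import re
from dataclasses import asdict, dataclass

# Hypothetical standard: lowercase words joined by single hyphens, 3 to 40 characters.
SERVICE_NAME = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


@dataclass(frozen=True)
class ServiceResources:
    """Every resource name a service owns, derived from one canonical name."""
    source_repo: str
    image: str
    manifest_path: str
    namespace: str


def derive_resources(service: str) -> ServiceResources:
    """Return the resource names the standard expects for a service."""
    if not SERVICE_NAME.fullmatch(service) or not 3 <= len(service) <= 40:
        raise ValueError(f"{service!r} does not follow the naming standard")
    # The hosts and path layouts below are placeholders, not real endpoints.
    return ServiceResources(
        source_repo=f"git.example.com/services/{service}",
        image=f"registry.example.com/services/{service}",
        manifest_path=f"deployments/{service}/",
        namespace=service,
    )


def find_drift(service: str, actual: dict) -> dict:
    """Map each non-conforming field to an (expected, actual) pair."""
    expected = asdict(derive_resources(service))
    return {
        field: (want, actual.get(field))
        for field, want in expected.items()
        if actual.get(field) != want
    }


if __name__ == "__main__":
    legacy = {
        "source_repo": "git.example.com/services/billing-api",
        "image": "registry.example.com/services/billing-api",
        "manifest_path": "deployments/billing/",  # colloquial name
        "namespace": "billing-api-v2",            # stale suffix
    }
    for field, (want, got) in find_drift("billing-api", legacy).items():
        print(f"{field}: expected {want}, found {got}")
```

A report like this is what would let a team rename resources with some confidence before onboarding a service to the IDP, rather than teaching the IDP about every historical exception.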
Velocity, reliability, and security are the goals of platform engineering and they come from predictability. You don’t get predictability without established standards.
The tooling you build once you arrive at those standards must strive to be exceptional. If your platform isn’t flexible enough to fully support the end user you will see a proliferation of one thing:
Exceptions
Even if your platform is indeed exceptional, there will inevitably be a request to deviate from the standards you define. There are a lot of ways to handle these situations. It should be rare that “you must do it this way” is the answer. We need to ensure that our platforms are flexible enough to handle one-off approaches or legacy services.
In these situations, you have a few different options:
1. Be available for architectural discussions for new services
Sometimes development teams can achieve their own end goals by using an approach they didn’t consider before simply because it’s not their area of expertise.
For example, maybe your standards dictate that one microservice can only include one Kubernetes deployment, but a dev team needs to run a secondary workload. They may not have considered that the secondary workload could run as a sidecar container, which avoids both bending the standards to allow multiple deployments and creating an entirely new microservice.
2. Work with the teams to change the architecture of their app
Maybe your platform states that health checks must conform to a specific standard endpoint, authentication method, and test criteria. There are legacy apps that don’t follow those standards though.
If this is a platform requirement, maybe the platform team could develop its own library that implements those health checks. Developers wouldn’t need to worry about the specifics, and the platform team would get both consistency and centralized management of those health checks.
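As a rough illustration of what such a library might offer, here is a minimal, framework-agnostic sketch. The `/healthz` path, the response shape, and the `run_health_checks` function are assumptions made up for this example, and the authentication part of the standard is left out entirely. A service would register its own checks and let the library decide the status code and body.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical standard endpoint; the real path is whatever the standard says.
HEALTH_PATH = "/healthz"

Check = Callable[[], bool]


def run_health_checks(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every registered check and return (HTTP status, JSON body).

    A check that returns False or raises is reported as "failed", so one
    broken dependency cannot crash the health endpoint itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"
    healthy = all(state == "ok" for state in results.values())
    body = json.dumps(
        {"status": "ok" if healthy else "unavailable", "checks": results},
        sort_keys=True,
    )
    return (200 if healthy else 503), body


if __name__ == "__main__":
    status, body = run_health_checks({
        "database": lambda: True,
        "cache": lambda: False,  # stand-in for a failing dependency
    })
    print(HEALTH_PATH, status, body)
```

Because the response format lives in one place, changing the standard later means releasing a new library version instead of asking every team to edit their own handler.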
3. Change the standard
Sometimes development teams have legitimate requests that do require modifying the existing platform.
Maybe the platform mandates that metrics-based horizontal pod autoscaling and automatic vertical pod autoscaling be in place for any workload running in production. That might be overkill for a small service that only ever sees 20 requests per day. Are we going to require that of development teams or do they have more important things to do than evaluate horizontal scaling thresholds in their metrics? The standard might be better if it describes when an HPA and VPA are required rather than that they are required.
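A standard phrased as "when" rather than "always" can be small enough to express as a function. The sketch below is only an illustration: the `WorkloadProfile` fields and both thresholds are invented for this example, and real values would come out of the feedback process described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    environment: str                 # e.g. "production" or "staging"
    peak_requests_per_minute: float
    peak_to_average_ratio: float     # how bursty the traffic is


# Hypothetical thresholds; a real standard would justify its own numbers.
MIN_PEAK_RPM = 100.0
MIN_BURST_RATIO = 2.0


def autoscaling_requirements(workload: WorkloadProfile) -> dict:
    """Return which autoscalers this sketch of a standard would require."""
    if workload.environment != "production":
        return {"hpa": False, "vpa": False}
    busy = workload.peak_requests_per_minute >= MIN_PEAK_RPM
    bursty = workload.peak_to_average_ratio >= MIN_BURST_RATIO
    return {
        # Horizontal scaling only pays off for busy services with bursty traffic.
        "hpa": busy and bursty,
        # Vertical right-sizing is worth the effort once a service is busy at all.
        "vpa": busy,
    }


if __name__ == "__main__":
    tiny = WorkloadProfile("production", 1.0, 1.5)  # roughly 20 requests per day
    print(autoscaling_requirements(tiny))
```

With a rule like this, the 20-requests-per-day service is exempt by definition instead of by negotiated exception, which keeps the exception list short.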
4. The service is considered “unsupported”
This is mostly a worst-case scenario. It should only really be considered as a fallback if the other approaches don’t work. In an ideal situation, your platform would be flexible enough that this approach could be applied only to specific aspects of an application rather than to the application as a whole. Regardless, it is something that should be in your toolbox when the appropriate situation presents itself.
This approach would place the responsibility of things like build automation, manifest authoring, and runtime management exclusively onto the team owning the project. It requires understanding and cooperation from a few different key departments in the org to be successful:
- Platform needs to hold itself to this requirement in order to protect its ability to serve the wider majority of users. If one unsupported service consumes an outsized share of the platform team’s engineering bandwidth with support requests, it negatively affects the experience of other teams.
- SRE needs to know whether a service is unsupported, for much the same reason, but also so that it doesn’t undermine the decisions of platform or end up as an unintentional support channel for services running outside the established standards.
- Security needs to be able to implement guardrails so that information isn’t being unintentionally exposed or mishandled.
- Management needs to support the decision of the platform team to hold firm on standards conformance. If standards can be bent and undermined, they cease to have any true meaning.