Hi there! In this post I’ll describe the first step we took as a team when building out our scalable internal platform. Our platform serves backend services, and while backend applications generally need a compute platform to run on, they also need infrastructure resources (database instances, pub/sub topics, storage, cache, etc.) to consume. Below I’ll discuss how we organized the latter.
For the compute platform we already had a Kubernetes cluster set up, and we had tooling around it to make it easy to use. But what about infrastructure resources? Kubernetes naturally supports the isolation of objects through namespaces, so we wanted to apply something similar to the rest of the infrastructure.
Our dream is to create a platform that enables teams to manage their own resources and helps scale our infrastructure for the (not-so-far) future needs of Bitrise. For this, we need to create an environment where teams are free to allocate and set up resources in a way that best fits their needs, with minimal effort and configuration required, and without having to think about how to connect them to their services running on the compute platform.
In the first pilot version of our platform we only had two environments (projects) set up on GCP where all the infrastructure and the clusters themselves were hosted. While the simplicity of this arrangement was appealing, we quickly ran into issues, especially when trying to separate teams' resources.
To limit the blast radius of a misconfiguration or a security incident, we obviously didn’t want to allow teams to (accidentally) alter other teams' resources. We set up IAM rules using conditions, limiting access for users and service accounts to resources with a given prefix. Unfortunately, not all kinds of resources support conditions (e.g. Pub/Sub doesn’t), so our solution was incomplete at best.
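To illustrate the kind of prefix-based conditional binding we relied on, here is a simplified Terraform sketch. The project ID, service account, and prefix are all hypothetical, and the condition shown only works for services that evaluate it (as noted, Pub/Sub doesn’t):

```hcl
# Illustrative sketch: restrict a team's CI service account to storage
# buckets whose names start with the team's prefix, using an IAM condition.
# All names here are placeholders.
resource "google_project_iam_member" "team_a_storage" {
  project = "bitrise-platform-staging"
  role    = "roles/storage.admin"
  member  = "serviceAccount:team-a-ci@bitrise-platform-staging.iam.gserviceaccount.com"

  condition {
    title      = "team-a-prefix-only"
    expression = "resource.name.startsWith(\"projects/_/buckets/team-a-\")"
  }
}
```

Multiply this by every resource type and every team, and it’s easy to see how the conditions become brittle and hard to audit.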
Moreover, the complexity and maintenance overhead of managing and querying the status of these conditions proved infeasible even for a handful of teams. For example, to allow teams to manage the vast amount of infrastructure resources they own with the service account running their CI, we had to create extensive IAM rules with complex conditions, which made the system quite brittle and hard to change. Also, nothing on the GCP console signalled to teams which resources they could access or how to create resources for their applications. Clearly, this setup was not fit for scaling. ⛔️
Projects and folders
For this reason, we decided to create separate projects on GCP. We defined groups of resources and services called systems, and regarded them as the atomic unit of the project setup. Each system in our platform has its own associated project on GCP for each execution environment (staging, production), and the resources are defined in these projects. This way, access rights need to be defined only per project, making them much easier to manage. Note: this is still about infra elements like databases; Kubernetes objects are still created in their own namespaces, in the same cluster.
As for IAM rules: GCP has the concept of folders, which allow setting IAM rules on a group of projects. Moving the projects of all owned systems under a single folder thus allows granting rights to the owner team in a single, central place. Teams can easily find the resources they own, and it’s obvious what they are allowed to manage. This natural arrangement of resources proved to be much cleaner, and it is actually what GCP officially recommends.
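The folder-per-team, project-per-system-per-environment layout can be sketched in Terraform roughly like this (the org ID, names, and the choice of `roles/editor` are all illustrative, not our actual configuration):

```hcl
# Illustrative sketch of the folder/project hierarchy; all IDs are placeholders.
resource "google_folder" "team_a" {
  display_name = "team-a"
  parent       = "organizations/123456789012" # hypothetical org ID
}

# One project per system per environment, parented under the team's folder.
resource "google_project" "checkout_staging" {
  name       = "checkout-staging"
  project_id = "bitrise-checkout-staging"
  folder_id  = google_folder.team_a.name
}

# A single folder-level grant covers every project beneath the folder.
resource "google_folder_iam_member" "team_a_access" {
  folder = google_folder.team_a.name
  role   = "roles/editor" # placeholder; real setups would use narrower roles
  member = "group:team-a@example.com"
}
```

The key property is that access control lives at the folder, so adding a new system project for a team requires no new IAM wiring.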
In this setup, we have also created a separate system for the cluster(s), that allowed the Internal Platform team to perform administrative actions without the risk of inadvertently messing with teams' own resources. Now that we have answered the question of where to put resources and how to access them, what’s left is to connect them to the consumer services. 🔌
When facing the problem of establishing network connection between different GCP networks (and projects), one would naturally think of VPC network peering. This is a straightforward method of linking two networks (defined in different projects) together to allow communication.
After a brief investigation, however, it was clear that this would not suit our use case either. The main reason is that VPC peering was not designed for the scale we wanted to operate at. Peering is not transitive, so we’d have to connect every project with at least the cluster projects, and possibly with each other (i.e. in a mesh topology). There’s also a limit on how many networks can be peered, which imposes a (quite small) upper bound on the number of projects we could support, which is obviously unacceptable.
Moreover, even if we didn’t face such limits, setting up individual IP ranges and firewall rules on both ends of each participant network in a way that avoids clashes would be an administrative nightmare. We also use Cloud SQL for databases, which itself uses VPC peering to connect to a network. Combined with the lack of transitive peering, this would make it impossible to reach databases from multiple projects. Clearly, there had to be a better way to set up networking. 🤔
There’s another networking pattern on GCP designed to connect multiple projects, called shared VPC. In this setup, a single network is defined in one project (the host project), and other projects (service projects) are connected to it. In this star topology the connections are transitive, and no additional configuration is required to connect projects to each other, which makes this arrangement quite easy to set up and supports the level of scaling we’re aiming for. It quickly became apparent that this was exactly what we needed. 🤩
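The host/service project wiring itself is pleasantly small. A minimal Terraform sketch, assuming hypothetical project IDs:

```hcl
# Illustrative sketch: designate the network project as the shared VPC host,
# then attach a system's project to it as a service project.
# Project IDs are placeholders.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "bitrise-network-host"
}

resource "google_compute_shared_vpc_service_project" "checkout_staging" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "bitrise-checkout-staging"
}
```

Attaching each new system project is one resource, not a new peering plus IP planning plus firewall rules on both sides.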
Besides setting up the host and service projects, there are some special IAM rules involved: for example, we had to grant Compute Network User rights to the generated (and a bit hard to find 👀) built-in Kubernetes service account, and add Host Service Agent User to the container engine robot service account. And because GKE is a managed solution, it automatically creates some firewall rules and resources (e.g. when creating a LoadBalancer), so we also had to create a custom role and grant that to GKE’s service account. While this might sound peculiar, it is a one-time complexity (and one that is properly documented and can be codified).
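For the two documented grants, the setup looks roughly like this in Terraform. The project number and IDs are placeholders; the service account formats and role names are GCP’s documented ones (the custom role for firewall management is omitted here):

```hcl
# Illustrative sketch of the one-time IAM grants on the host project.
# 123456789012 stands in for the cluster project's project number.

# The cluster project's Google APIs service account needs Compute Network
# User on the host project (or on the specific shared subnets).
resource "google_project_iam_member" "gke_network_user" {
  project = "bitrise-network-host"
  role    = "roles/compute.networkUser"
  member  = "serviceAccount:123456789012@cloudservices.gserviceaccount.com"
}

# GKE's service agent (the "container engine robot") needs Host Service
# Agent User on the host project.
resource "google_project_iam_member" "gke_host_agent" {
  project = "bitrise-network-host"
  role    = "roles/container.hostServiceAgentUser"
  member  = "serviceAccount:service-123456789012@container-engine-robot.iam.gserviceaccount.com"
}
```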
The benefits do not stop at the ease of setup and scaling, though. Separating the host and service projects also allows us to separate the administrative configuration associated with networking (setting up subnets, NAT, firewalls, managed services' peering) from the compute platform’s setup or the teams' own resources.
This separation is desirable not only from a security standpoint (only admins have access to host projects), but also decreases management complexity. In this setup, the clusters are set up as service projects just like every other resource, so extending our platform into a multi-region multi-cluster setup can also be easily done. But for now, we are just enjoying the tidiness of separating the cluster’s and the network’s administration.
There was actually one downside to building a shared VPC solution: Google Kubernetes Engine unfortunately does not support changing an existing cluster’s network, so we had to build a new cluster in the new network.
In the case of stateless services we could’ve gotten away with a zero-downtime migration using a network switchover (e.g. via a Cloudflare proxy), but we also had to migrate stateful resources (database instances), which was much easier to do during a planned maintenance window. For a single service, even this could’ve been done in a zero-downtime fashion, but given the many services already running on the platform, we deemed the cost of a planned downtime lower than that of orchestrating a complicated zero-downtime migration for everything.
Having the existing infrastructure codified in Terraform (and wrapped in Terragrunt) helped the migration effort tremendously. Creating the new cluster and the resources to receive the migrated services and data was relatively easy; we just had to account for the different project IDs and, of course, create the modules for the shared VPC network setup. The same goes for teams' existing infrastructure and the CI pipelines managing them. Imagine the nightmare if we had to locate everything by hand and move it manually! 🚀
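To give a feel for the Terragrunt side, here is a hypothetical `terragrunt.hcl` for one system’s staging project; the module repository path, ref, and inputs are all invented for illustration:

```hcl
# Hypothetical terragrunt.hcl for a system's staging project.
# The module source, version, and input names are placeholders, not our
# actual repository layout.
terraform {
  source = "git::https://example.com/infra-modules.git//cloudsql?ref=v1.2.0"
}

inputs = {
  project_id = "bitrise-checkout-staging"
  # Consumers reference the shared network defined in the host project.
  network    = "projects/bitrise-network-host/global/networks/shared-vpc"
}
```

Because every system project follows the same shape, migrating a team’s stack was mostly a matter of updating project IDs and network references in inputs like these.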
Now that resources are well-isolated and IAM rights are set up for teams, our next step was to enable the management of these resources in a self-served way. We’ll discuss how we did that in a future post. But I can tell you now that this shared VPC setup with the above project structure was really worth it — creating automations for such a cleanly separated architecture is an absolute delight.
In the meantime, if you’re interested in more technical details or perhaps if you'd like to work on similar problems, we’d be happy to chat with you, preferably during an interview 🙂 That is, we’re hiring!