How We Built the DataStax Astra DB Control Plane

Our goal was to provide a layer on top of tools like Kubernetes and Helm that developers could be successful with.

Feb 17th, 2022 7:06am by Jim Dickinson

Featued image for: How We Built the DataStax Astra DB Control Plane

Featured image provided by DataStax.

Jim is a senior software engineer at DataStax, focusing on Kubernetes and database orchestration.

After years of iterations and improvements, deploying a DataStax Enterprise (DSE) cluster is a pretty painless process. Download the tarball, run the install script and you’re done.

Oh wait — first you should provision a few servers for it to run on. Then it would be good to make sure the servers can communicate with each other on the right ports. Of course, there’s also ensuring your external clients can access certain ports while not being able to access others. And monitoring, that’s important too. Since we’re collecting metrics for monitoring, it would also be good to have a way to securely view them. Maybe deploying a single DSE cluster is a little involved.

Now imagine doing that thousands of times, in a reliable manner, and at the push of a button. Since we were building DataStax Astra DB on top of DSE, this is what we needed to accomplish. In order to facilitate this and other functionality around Astra, we built the Astra Control Plane.

Enabling Developers

Deploying the database may be one of the most important functions of the Control Plane, even though it’s only a small part of what it takes to run Astra DB. There are numerous other supporting services and jobs required to make it a success, including our web UI, billing process and user management. Each of these areas can be further broken down into other smaller services, all of which are developed and maintained by multiple teams. The one commonality among them, though, is the Control Plane.

When we built it, one of our primary goals was self-service. We wanted to provide the different teams supporting Astra with the ability to get their code from feature branch to production in as automated a manner as possible. This would also serve our own needs in supporting the multiple microservices we would need for deploying databases. Our goal was to provide a layer on top of tools like Kubernetes and Helm that developers could be successful with despite uneven familiarity with these tools and operations in general. Our developers’ primary concern was to write code and deliver features to support our users; they didn’t care how it was released, nor should they.

When we looked for something that would fit our needs (in early 2018) we found that most tools were either too expensive or required too much operations experience and were thus too complex. Therefore, we built a system known as DSCloud in order to abstract away the intricacies of continuous integration and continuous delivery (CI/CD).

At the heart of our build system is a file called dscloud.yaml. This file lives at the root of every git repository and defines how the repository should be deployed. Because each team is responsible for their own monorepo consisting of independent Dockerfiles within each subdirectory, DSCloud uses the organization structure of App, Microservice, Jobs and Cronjobs.

App: Maps one-to-one with a git repository and is deployed to Kubernetes as a namespace containing zero or more deployments, jobs and cronjobs.
Microservice: A subdirectory within the monorepo that should be a Kubernetes deployment (or StatefulSet) with associated ingress via either a private or public service.
Job: A subdirectory within the monorepo that should be a Kubernetes job. This job is guaranteed to run exactly once on each code push.
Cronjob: A subdirectory within the monorepo that should be deployed as a Kubernetes cronjob.

Each of the components that make up App, Microservice, Jobs and Cronjobs have additional configuration options that can be used to define how they should be deployed. Common to all three are environmentConfiguration and secretValues. An environmentConfiguration contains a list of environments (i.e., dev, test, prod) and for each the environment variables with values that should be set on the pods. The secretValues are a list of keys that are used to bind Kubernetes secrets at deployment time.

microservices:

foo-api:

description: "A backend service responsible for accessing the Foo database"

environmentConfiguration:

dev:

DATABASE_HOST: example-dev.com

TRACING: false

test:

DATABASE_HOST: example-test.com

TRACING: false

prod:

DATABASE_HOST: example.com

TRACING: true

secretValues:

- DATABASE_USERNAME

- DATABASE_PASSWORD

...

Where these three components differ is that a Microservice also has deployment, ingress and integrationTests as additional options. The deployment specifies configuration necessary for a Kubernetes deployment like memory and CPU limits, number of instances, and probes for readiness and liveness.

...

deployment:

memoryMB: 1526

cpuPct: 200

instances: 3

probes:

readiness:

protocol: HTTP

port: 8080

path: /

timeoutSeconds: 10

liveness:

protocol: HTTP

port: 8080

path: /health

timeoutSeconds: 10

...

Next, the ingress defines how traffic should be routed to the pod, such as what port should be used, the protocol, and path.

...

ingress:

- port: 80

containerPort: 8080

protocol: HTTP

pathPrefix: /foo-api/

pathPrefixRewrite: /

...

Finally, the integrationTests component is used to define which environments integration tests should run in and what environment variables they need.

...

integrationTests:

test:

env:

TEST_TYPE: full

secrets:

TEST_DB_USERNAME: foo-api

TEST_DB_PASSWORD: foo-api

prod:

env:

TEST_TYPE: happy_paths

secrets:

TEST_DB_USERNAME: foo-api

TEST_DB_PASSWORD: foo-api

...

These tests are convention driven based on subdirectory naming. The Dockerfile within that directory will be used to run a job containing whatever tests you have defined. The deployment tool will rollback or proceed depending on test exit code.

Moving on to the Cronjob, it expands on the base configuration options by adding cron-specific options as shown below.

cronjobs:

foo-updater:

schedule: "*/10 * * * *"

failedJobsHistoryLimit: 1

successfulJobsHistoryLimit: 1

memoryMB: 64

cpuPct: 10

environmentConfiguration:

dev:

DATABASE_HOST: example-dev.com

test:

DATABASE_HOST: example-test.com

prod:

DATABASE_HOST: example.com

secretValues:

- DATABASE_USERNAME

- DATABASE_PASSWORD

For anyone with Kubernetes experience, much of the dscloud.yaml should sound familiar; this is by design. In creating this build system, we tried to walk the line of exposing enough options to be useful to our power users while still abstracting away complexities for those without as much experience.

Now, defining a deployment manifest is all well and good, but at some point we need to actually get the code from git and into production, and this is where our Processor service comes into play. The Processor service serves as an orchestrator to our CI/CD process by:

Constantly polling git looking for changes in any of our repositories.
Triggering a build tool when it detects a new commit, which will build and publish a Docker image to our repository on success.
Creating a new Helm chart based on the dscloud.yaml file.
Kicking off the deployment tool to deploy the App to each environment.

Once we were able to build, test and deploy our code, it was finally time to start building Astra.

First Steps: Cloud-ish

In 2018, when we started on the journey of push-button deployments of DSE, Kubernetes operators were just starting to gain popularity and Apache Cassandra®-specific operators were pre-alpha. Thus, we decided to stick to what we knew, which was VMs in the cloud.

For the base case, we wanted a three-node DSE cluster plus a small Kubernetes cluster to run the ancillary services for the database. Due to familiarity with the platform, our first foray into everything was on AWS. We got to work creating Terraform scripts for provisioning the EC2 instances, EBS volumes and relevant networking. This needed to be an automated, push-button system, so we developed a Golang service to invoke the Terraform from a REST request which also did a fair amount of templating to support the different database sizes and regions.

Because infrastructure provisioning takes some time, we obviously couldn’t make our clients wait until we were done, so this had to be an asynchronous process. To accomplish this, we adopted Argo as our workflow engine, enabling us to break the infrastructure provisioning into multiple steps that, if implemented correctly, could be easily retried on failure. At this point, we had automated the infrastructure provisioning, although it wasn’t very exciting since nothing was deployed yet.

Once the Terraform completed, we were able to kick off another workflow for getting the database into a usable state. To help speed things up, we used custom machine images for building our EC2 instances that contained the DSE tarball and necessary startup scripts.

Order matters when starting up the DSE nodes: if you were doing this manually, you would start the first one, wait until it’s up, set the seed on the second to the IP address of the first, start the second node and so on. Because our nodes were starting in a random order, we couldn’t guarantee that behavior. This meant that we had to create our own locking and coordination mechanism for the nodes as they started up. Since initial DSE startup was handled by the machine image, our workflow could then come in to handle final configuration and setup along with deploying the services to the small Kubernetes cluster.

This process worked well for us for a time, but we soon started to run into issues. From the very beginning, we noticed that creating a new database took up to an hour, which makes sense considering we’re provisioning six EC2 instances (about 45 minutes) along with the other infrastructure needed. Another issue we ran into was repeatability. This method of creating databases did not prove to be very reliable since we were using so many pieces of infrastructure, and our scripts had to handle everything. For example, if an EBS volume didn’t attach correctly, our scripts had to detect that and then resolve the issue on their own. This got even trickier when we introduced new workflows to add more nodes to a database.

After the beta, we found that we were spending too much time on manual fixes, and finally we realized that the cost of this architecture was not sustainable. Fortunately, by this point our own Kubernetes operator, cass-operator, had made enough progress to be deployed for production usage.

Kubernetes All the Things!

With the introduction of the cass-operator, we could rethink our deployment model and move to a Kubernetes-centric architecture.

architecture chart

This meant that instead of spinning up individual compute instances for each cloud provider (i.e., AWS or GCP), we could start provisioning a single managed Kubernetes cluster like EKS or GKE for each new database. The benefits of this were three-fold:

Simplification: It greatly simplified our infrastructure provisioning process, as we no longer had to provision individual infrastructure components.
Reduced maintenance: We were able to reduce our maintenance burden since we could count on the operator to handle installation and validation of the new DSE cluster along with all the supporting services.
Fewer costs: Our cloud provider costs were reduced because we no longer needed to stand up dedicated hardware for DSE and then more for the Kubernetes cluster. Fewer compute instances mean lower costs.

Although this first pass was a vast improvement on our previous architecture, there was still room for improvement. As much as using Kubernetes simplified our deployment process and reduced our maintenance burden, creating new databases was still a painfully slow process and contingent on the whims of the cloud providers. Anyone who has spent time managing cloud infrastructure has experienced how often calls can fail and provisioning doesn’t succeed, so the fewer requests we can make, the better. What we wanted was to remove the need for infrastructure management from the many existing responsibilities of our Control Plane.

We accomplished this by moving to a namespace-per-database model. With this new architecture, infrastructure could be provisioned beforehand because we no longer needed to create a new Kubernetes cluster for each database, which reduced the create database process to a single workflow. Now, whenever we wanted to create a new database, all we had to do was kick off the one workflow that ran Helm, wait for the operator and then do some final configuration. Plus, with this new model, we were able to take advantage of bin packing, the algorithm by which Kubernetes maximizes the resource usage of a compute instance by scheduling the optimal number of pods with varying sizes, thus allowing us to more efficiently utilize our compute resources.

Conclusion

In order to support our goal of creating a database-as-a-service on top of DataStax Enterprise, we built the Astra Control Plane. To facilitate this and support the many different teams contributing to this project required the creation of a custom CI/CD tool that we call DSCloud.

With DSCloud, our development teams are able to quickly and efficiently ship code in support of our users without getting bogged down in the intricacies of cloud deployments. We then used this same tool to bootstrap our own efforts in creating push-button deployments of DSE clusters to the cloud.

Along the way, we didn’t always get things right the first time, but we continued to iterate and improve as we still do today. Furthermore, our Control Plane has proved to be flexible as Astra has grown and changed. It can handle an increasing amount of traffic in addition to accommodating changes to the data plane along the way, including the major re-architecture required by Astra DB Serverless. Thanks to the exceptional work of the Kubernetes community, we were able to much better scale and simplify our processes, leaving us with much more bandwidth to build out features for our users.

Jim is a senior software engineer at DataStax, focusing on Kubernetes and database orchestration.