Behind the Code Series: How We Migrated to Kubernetes

How Choosing Kubernetes Simplifies Code Deployment

In my previous R&D blog post I detailed Cradlepoint's adoption of a microservices architecture. The proliferation of microservices meant we had more applications that needed to be built, deployed, and monitored. This became an operational challenge at scale.

This post is the first in a series that discusses how we solved this challenge by migrating our service deployments to Kubernetes.

Too Many Ways to Deploy Code

Cradlepoint uses Amazon Web Services (AWS) to host both NetCloud Manager (NCM) and NetCloud Perimeter (NCP). Each of these applications is composed of many microservices — each requiring scripts and tools to deploy to AWS.    

  • Since both applications were developed independently (NCP was derived from the Pertino codebase), each application used different deployment tools.
  • When a team created a new service, they deployed it with tools they were familiar with, or they started to use something new — either built in-house or open source.
  • We never migrated older services to use new deployment tools, because the new tools were only marginally better. 

Over time, we had too many ways to deploy our code. The table below illustrates the various ways we deployed a subset of our services.

The lack of standardization quickly grew into a problem. Teams had to understand and maintain multiple deployment scripts and tools. Creating a new microservice had the additional overhead of figuring out how to deploy it since the "correct" way wasn't obvious. We migrated to a microservice architecture to improve developer efficiency, so developers could spend more time building customer features, but this problem was slowing us down.

This problem was not unique to Cradlepoint. The web development community had invented several solutions.

Choosing Kubernetes

We needed to standardize.  Since we already used Docker for our Continuous Integration pipelines, it was logical to use Docker images as our primary deployment artifact. Docker images abstract the application language and its dependencies from our build and deployment pipeline: e.g., Java/Spring applications are built, stored, and accessed the same as Python/Django applications. Docker images are then run as containers on our production servers.

Just as an application needs an operating system to run it (to schedule, allocate resources, etc.), containers need something to run them and to allow them to discover and communicate with each other. Containers need an orchestration platform.

There were four leading contenders for container orchestration when we made our decision. We considered Docker Swarm, Mesos, Amazon ECS, and Kubernetes. We already had some experience with Kubernetes from NCM's Remote Connect feature, and we were happy with it. After weighing the pros and cons of the other options, we decided to build our new pipeline using Kubernetes.

We bet on the right horse. Shortly after we made our decision, it became clear that Kubernetes had "won the orchestration wars." Mesosphere added support for Kubernetes in September 2017. Docker added support for Kubernetes in October 2017. Most importantly for Cradlepoint, AWS announced a managed Kubernetes service in November 2017. 

As a bonus, Kubernetes is more than just a container orchestration platform. It provides patterns and tools to improve service reliability, security, and scalability, and it is adding more capabilities with every release.

Building A New Deployment Pipeline

We created a Squad to build our new Kubernetes-based deployment pipeline. The Squad had engineers from the DevOps, Build, and select Service teams. We collected requirements from the lab, focusing on how we could improve developer, qualification, and deployment workflows. We were delighted to discover that moving to Kubernetes not only solved our "too many ways to deploy code" problem, but it also enabled us to address other challenges.

We made the following design decisions for the new pipeline:

  • We will gradually replace existing production services with Kubernetes (instead of a big-bang cutover). As we replace each service, we will use a canary-first configuration to run a small percentage of load against the new Kubernetes-based deployment until it is verified stable.
  • All services will be packaged and deployed using Helm, the Kubernetes package manager maintained by the Cloud Native Computing Foundation (CNCF).
  • The Helm chart, configuration, and corresponding Dockerfile will be in the same git repository as the service. A chart's secrets will also be stored in its git repo, encrypted using SOPS and AWS KMS. This has the inherent advantage of pinning source code, configuration, and packaging to the same immutable git version.
  • Our entire deployment pipeline will be implemented using "pipeline-as-code" with all source code in git. There will be no untracked changes to the pipeline.
  • The deployment pipeline will only support upgrading one service at a time. This is important. If only one service changes at a time, then it is much easier to detect and address breaking changes. This constraint requires developers to consider deployment as they write their code. Changes to APIs and credential rotations will be a three-step process: the provider adds the new API or credentials, clients are updated to use the new API or credentials, and then the provider removes the deprecated API or credentials.
  • Observability is a first-class design objective. Dashboards and tools should make it trivial to see what is deployed; logging, monitoring, and alerting should be consistent and easy for developers to use.
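
The three-step process for API and credential changes can be illustrated with a small simulation. This is a minimal sketch, not our pipeline code: the Provider class, the rotate function, and the credential names are hypothetical. It shows that when only one service is deployed at a time, every client can still authenticate after each individual deployment.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical service that accepts a set of named credentials."""
    accepted: set = field(default_factory=set)

    def accepts(self, credential: str) -> bool:
        return credential in self.accepted


def rotate(provider: Provider, clients: dict, old: str, new: str) -> list:
    """Rotate a credential as a series of single-service deployments.

    `clients` maps a client name to the credential it currently uses.
    Returns (deployment, all_clients_ok) pairs, one per deployment.
    """
    log = []

    def all_clients_ok() -> bool:
        return all(provider.accepts(cred) for cred in clients.values())

    # Step 1: the provider is deployed accepting both old and new.
    provider.accepted.add(new)
    log.append(("provider adds new", all_clients_ok()))

    # Step 2: each client is deployed on its own, switching to new.
    for name in sorted(clients):
        clients[name] = new
        log.append((f"{name} switches", all_clients_ok()))

    # Step 3: the provider is deployed again, dropping the old credential.
    provider.accepted.discard(old)
    log.append(("provider removes old", all_clients_ok()))
    return log
```

Skipping step 1 or running step 3 before every client has switched would produce a deployment after which some client fails, which is exactly the breakage the one-service-at-a-time rule is meant to surface early.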

Initial Pipeline

Since we are gradually migrating our current (now called "legacy") production deployment to Kubernetes, our initial pipeline supports our hybrid Kubernetes/non-Kubernetes "Test" and "Production" stacks. Here is a brief overview of this pipeline:

(a) The developer commits the service code, Dockerfile, Helm charts, profile configuration, and secrets to the service's git repository.  

(b) The Continuous Integration (CI) system builds, tests, and validates the Docker image and Helm chart for the service. Since the Helm chart is a tar file, the build process also adds the "profiles" directory to the chart artifact. The Docker image and Helm chart are pushed to a registry that is accessible by downstream processes.

(c) We use the umbrella chart Helm pattern to build a "manifest" that contains all versions of all services that have passed a certain quality bar.  We call this quality bar "L3."  Once a Helm chart artifact from step (b) is built, the CI pipeline creates an L3-candidate umbrella chart to be tested in step (d).
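
One way to picture the L3-candidate is as the current L3-manifest with a single service version swapped in. The sketch below is an assumption about how such a candidate could be assembled, not our CI code: the function name, service names, and repository URL are hypothetical. The name, version, and repository keys mirror the fields Helm uses for chart dependency entries.

```python
def build_candidate_dependencies(l3_versions: dict, candidate: tuple,
                                 repository: str) -> list:
    """Return dependency entries for an L3-candidate umbrella chart.

    l3_versions: service name -> chart version in the current L3-manifest.
    candidate:   (service name, chart version) just built by CI.
    repository:  chart repository URL (hypothetical in this sketch).
    """
    versions = dict(l3_versions)   # leave the current manifest untouched
    name, version = candidate
    versions[name] = version       # only one service changes per candidate
    return [
        {"name": svc, "version": ver, "repository": repository}
        for svc, ver in sorted(versions.items())
    ]
```

Because the candidate differs from the L3-manifest by exactly one chart version, a failing system test points directly at the service that changed.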

(d) The L3-candidate is used to create a new Kubernetes-based ephemeral stack, basically using "helm install l3-candidate." System tests are executed against this newly created ephemeral stack. If all tests pass, then the L3-candidate umbrella chart is labeled as the new L3-manifest. If the L3 tests fail, then the L3-candidate is rejected and the L3-manifest remains unchanged.
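
The promotion rule in step (d) is small enough to state as code. This is a minimal sketch with hypothetical names; it also assumes that a run with no test results counts as a failure, which is a choice made for this example.

```python
def next_l3_manifest(current: str, candidate: str, test_results: dict) -> str:
    """Return the umbrella chart that should carry the L3-manifest label.

    test_results maps a system test name to True (passed) or False (failed).
    The candidate is promoted only if at least one test ran and all passed.
    """
    if test_results and all(test_results.values()):
        return candidate
    return current   # candidate rejected; the L3-manifest stays unchanged
```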

(e) The pipeline allows engineers to have their own stack for development using Telepresence, product managers to have their own stack for demos, or anyone else to have their own stack for testing. A single parameterized Jenkins job creates a new stack based on the latest L3-manifest. Since it uses the L3-manifest, the user has confidence their stack has an L3 level of quality. Currently these individual ephemeral stacks are automatically destroyed after seven days of inactivity.
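
The seven-day cleanup rule can be sketched as a simple filter. The function and stack names are hypothetical, and how "activity" is measured is left out; the sketch only assumes that each stack has a timestamp of its last activity.

```python
from datetime import datetime, timedelta, timezone

INACTIVITY_LIMIT = timedelta(days=7)


def stacks_to_destroy(last_activity: dict, now: datetime) -> list:
    """Return the names of stacks idle for at least INACTIVITY_LIMIT.

    last_activity maps a stack name to the time of its last activity.
    """
    return sorted(
        name for name, seen in last_activity.items()
        if now - seen >= INACTIVITY_LIMIT
    )
```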

(f) We have several persistent test stacks used for qualification. While the majority of these stacks are being replaced by the ephemeral test stacks mentioned in (d) above, the pipeline still supports them. Individual service Helm charts are automatically deployed to these stacks at a cadence determined by the stack owners.  

(g) Production deployment is similar to the test stack deployment in (f), but with the "production" profile. Access, schedule, and quality levels also differ from those in (f).

Helm Chart Profiles

As shown in the pipeline above, we deploy a service's Helm chart to a number of different stacks. A single Helm chart could be deployed to a test stack, a stage stack, an ephemeral stack, minikube for local development, or our production stack. Each stack requires a unique configuration; it may have different URLs, database endpoints, credentials, or resource requirements. We needed a Helm-friendly way to manage these different configurations, so we came up with profiles.

Profiles are not native to Helm charts; they are a construct that we use to encapsulate a chart's unique configuration for a given stack type. The Helm package tool will tar all files in your chart's directory, and Helm install ignores any files it doesn't recognize. We leverage this behavior by attaching all profiles for a service chart at package time.

An example Cradlepoint Helm chart is organized as follows:

Helm requires that default configuration values are located in the root values.yaml file. In Cradlepoint charts, our default values.yaml contains reasonable defaults to run the chart in minikube. This simplifies initial chart development. Profile-specific overrides are in the "profiles" directory. Profiles that require credentials contain a SOPS+KMS-encrypted secrets.yaml file. Each secrets.yaml file can only be viewed and modified by developers who have access to the specific AWS KMS key defined in the corresponding .sops.yaml file.
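
The layering of defaults, profile overrides, and secrets behaves like a recursive dictionary merge in which later layers win. The sketch below approximates how Helm combines multiple values files (nested maps are merged, other values are replaced); it is an illustration, not Helm's implementation, and the function names and keys are hypothetical.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)   # merge nested maps
        else:
            merged[key] = copy.deepcopy(value)               # replace scalars and lists
    return merged


def effective_values(defaults: dict, profile_values: dict,
                     profile_secrets: dict) -> dict:
    """Layer profile overrides, then decrypted secrets, over the defaults."""
    return merge_values(merge_values(defaults, profile_values), profile_secrets)
```

With this layering, a profile only needs to state what differs from the minikube-friendly defaults, which keeps each profile file short.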

Conceptually, our build scripts deploy a chart to a stack using the following steps. This example assumes the build system has a credential that allows SOPS to decrypt the "qa1" profile's secrets.yaml file.
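
A minimal sketch of such steps, expressed as the commands a build script might run. Several things here are assumptions made for illustration: the layout profiles/<profile>/values.yaml and profiles/<profile>/secrets.yaml, the temporary secrets.dec.yaml file name, and the function name. The sketch only builds the command lists; it does not execute them.

```python
from pathlib import Path


def deploy_commands(chart_dir: str, release: str, profile: str) -> list:
    """Build the commands to deploy an unpacked chart with one profile.

    Assumed (hypothetical) layout inside the chart directory:
        profiles/<profile>/values.yaml    profile-specific overrides
        profiles/<profile>/secrets.yaml   SOPS-encrypted secrets
    """
    profile_dir = Path(chart_dir) / "profiles" / profile
    values = profile_dir / "values.yaml"
    secrets = profile_dir / "secrets.yaml"
    decrypted = profile_dir / "secrets.dec.yaml"   # temporary plaintext file

    return [
        # 1. Decrypt the profile's secrets with SOPS (needs KMS access).
        ["sops", "--decrypt", "--output", str(decrypted), str(secrets)],
        # 2. Install or upgrade the release; later values files take precedence.
        ["helm", "upgrade", "--install", release, str(chart_dir),
         "--values", str(values), "--values", str(decrypted)],
        # 3. Remove the plaintext secrets so they do not linger on the build host.
        ["rm", "-f", str(decrypted)],
    ]
```

For example, deploy_commands("mysvc", "mysvc", "qa1") yields the three commands for the "qa1" profile, which a build script could pass one by one to subprocess.run with check=True.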

All configuration changes for all charts in all stacks are made in source code. After a developer updates a profile's configuration and pushes to git, the CI pipeline runs and packages a new Helm chart and Docker image for that service. The Helm chart and Docker image are considered immutable artifacts and are pushed into a registry that is available for downstream test and deployment processes.

Takeaway

One of the negative consequences of moving to a microservice-based architecture was the proliferation of ways we deployed services to AWS. We are addressing this problem by consolidating all deployments to a single pipeline that uses Docker Images and Helm Charts as artifacts that are deployed to Kubernetes. We extended Helm with our Profiles pattern to deploy these artifacts to different stacks. We use the Helm Umbrella Chart pattern to simplify stack creation for automated testing gates and to allow anyone in R&D to easily create their own application stacks.

We will elaborate on the implementation details of this pipeline in future posts.

“Behind the Code” is a series of blog posts, written by Cradlepoint engineers, about behind-the-scenes topics and issues affecting software development.