Welcome to Kubernetes on AWS’ documentation!

Note

This documentation is only an extract from our internal Zalando documentation. It’s provided in the hope that it helps the Kubernetes community.

Contents:

User’s Guide

How to use the Kubernetes cluster.

Cheat Sheet

You can also download the Kubernetes Cheat Sheet as PDF.

[Figure: Kubernetes cheat sheet]

Labels and Selectors

Labels are key/value pairs that are attached to Kubernetes objects, such as pods (this is usually done indirectly via deployments). Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users. Labels can be used to organize and to select subsets of objects. See Labels and Selectors in the Kubernetes documentation for more information.

The following Kubernetes labels have a defined meaning in our Zalando context:

application
Application ID as defined in our Kio application registry. Example: “zmon-controller”
version
User-defined application version. This is used as input for the CI/CD pipeline and usually references a Docker image tag. Example: “cd53”
release
Incrementing release counter. This is generated by the CI/CD pipeline and is used for traffic switching. Example: “4”
stage
Deployment stage to allow canary deployments. Allowed values are “canary” and “production”.
owner
Owner of the Kubernetes resource. This needs to reference a valid organizational entity in the context of the cluster’s business partner. Example: “team/eagleeye”

Some labels are required for every deployment resource:

  • application
  • version
  • release
  • stage

Example deployment metadata:

metadata:
  labels:
    application: my-app
    version: "v31"
    release: "r42"
    stage: production

Kubernetes services will usually select only on application and stage:

kind: Service
apiVersion: v1
metadata:
  name: my-app
spec:
  selector:
    application: my-app
    stage: production
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP

You can always define additional custom labels as long as they don’t conflict with the above label catalog.
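For example, a hypothetical custom label such as component can sit next to the labels from the catalog:

metadata:
  labels:
    application: my-app
    version: "v31"
    release: "r42"
    stage: production
    component: frontend  # custom label, name chosen for illustration only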

Zalando Platform IAM Integration

Introductory remark: Based on what we learned from two years of STUPS/IAM integration, we will integrate a more advanced and simpler solution. The integration will be Kubernetes native and will take complexity out of your application. Instead of providing you with a client ID, client secret, username and password that you then have to use to regularly generate tokens, we will provide a simple way to obtain ready-to-use tokens directly, without fiddling with the credentials. Technically speaking, this means you just need to read your current token from a text file in your filesystem and you are done; no complicated token libraries anymore.

The user flow for a new application to obtain OAuth credentials is described below.

Platform IAM Credentials

The PlatformCredentialsSet resource allows application owners to declare needed OAuth credentials.

apiVersion: "zalando.org/v1"
kind: PlatformCredentialsSet
metadata:
   name: my-app-credentials
spec:
   application: my-app # has to match with registered application in kio/yourturn
   tokens:
     full-access: # token name
       privileges: # privileges/scopes for the token.
         # All zalando-specific privileges start with namespace com.zalando, following pattern <namespace>::<privilege>
         # the privileges/scopes you define here should match those you define for your application in yourturn.
         - com.zalando::foobar.write
         - com.zalando::acme.full
     read-only: # token name
       privileges: # privileges/scopes for the token.
         - com.zalando::foobar.read
   clients:
     employee: # client name
       # the allowed grant type, see https://tools.ietf.org/html/rfc6749
       # options: authorization-code, implicit, resource-owner-password-credentials, client-credentials
       # (values directly reference RFC section titles)
       grant: authorization-code
       # the client's account realm
       # options: users, customers, services
       # ("services" realm should not be used for clients, use the "tokens" section instead!)
       realm: users
       # redirection URI as described in https://tools.ietf.org/html/rfc6749#section-2
       redirectUri: https://example.org/auth/callback

The declared credentials will automatically be provided as a secret with the same name.

Following this example you would get a token called full-access with the privileges com.zalando::foobar.write and com.zalando::acme.full, a token called read-only with the privilege com.zalando::foobar.read, and a client named employee which uses the authorization-code grant under the users realm.

Secrets

Automatically generated secrets provide the declared OAuth credentials in the following form:

apiVersion: v1
kind: Secret
metadata:
  name: my-app-credentials
type: Opaque
data:
  full-access-token-type: Bearer
  full-access-token-secret: JwAbc123.. # JWT token
  read-only-token-type: Bearer
  read-only-token-secret: JwBcd456.. # JWT token
  employee-client-id: 67b86a55-61e6-4862-aa14-70fe7be788f4
  employee-client-secret: 5585942c-ce79-44e4-aac2-8af565b51d3e

The secret can conveniently be mounted to read the tokens and client credentials from a volume:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        application: my-app
    spec:
      containers:
      - name: my-app
        image: pierone.stups.zalan.do/myteam/my-app:cd53
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: my-app-credentials
          mountPath: /meta/credentials
          readOnly: true
      volumes:
      - name: my-app-credentials
        secret:
          secretName: my-app-credentials

The application can now simply read the declared tokens from text files, i.e. even a simple Bash script suffices to use OAuth tokens:

#!/bin/bash
type=$(cat /meta/credentials/read-only-token-type)
secret=$(cat /meta/credentials/read-only-token-secret)
curl -H "Authorization: $type $secret" https://resource-server.example.org/protected

Either use one of the supported token libraries or implement the file read on your own. How to read a token in different languages:

# Python
with open('/meta/credentials/{}-token-secret'.format(token_name)) as fd:
    access_token = fd.read().strip()

// JavaScript (node.js)
const accessToken = String(fs.readFileSync(`/meta/credentials/${tokenName}-token-secret`)).trim()

// Java
String accessToken = new String(Files.readAllBytes(Paths.get("/meta/credentials/" + tokenName + "-token-secret"))).trim();

Note

Using the authorization type from the secret instead of hardcoding Bearer allows you to transparently switch to HTTP Basic Auth in a different context (e.g. running an open source application in a non-Zalando environment). Users would simply need to provide an appropriate secret like:

apiVersion: v1
kind: Secret
metadata:
  name: my-app-credentials
type: Opaque
data:
  full-access-token-type: Basic
  full-access-token-secret: dXNlcjpwYXNzCg== # base64 encoded user:pass
  read-only-token-type: Basic
  read-only-token-secret: dXNlcjpwYXNzCg== # base64 encoded user:pass

Problem Feedback

Providing the requested credentials (tokens, clients) may fail for various reasons:

  • the PlatformCredentialsSet has syntactic errors
  • the application (application property) does not exist or is missing required configuration
  • the application is not allowed to obtain the requested credentials (e.g. missing privileges)
  • some other error occurred

All problems with credential distribution are written to the secret with the same name as the PlatformCredentialsSet:

apiVersion: v1
kind: Secret
metadata:
  name: my-app-credentials
  annotations:
    zalando.org/problems: |
      - type: https://credentials-provider.example.org/not-enough-privileges
        title: Forbidden: Not enough privileges
        status: 403
        instance: tokens/full-access
type: Opaque
data:
  # NOTE: the declared "full-access" token is missing as it was denied
  read-only-token-type: Bearer
  read-only-token-secret: JwBcd456.. # JWT token
  employee-client-id: 67b86a55-61e6-4862-aa14-70fe7be788f4
  employee-client-secret: 5585942c-ce79-44e4-aac2-8af565b51d3e

The zalando.org/problems annotation contains a list of “Problem JSON” objects as defined in RFC 7807, serialized as YAML. At least the fields type, title and instance should be set by the component processing the PlatformCredentialsSet resource:

type
Machine-readable URI reference that identifies the problem type (e.g. https://example.org/invalid-grant)
title
Short, human-readable summary of the problem type (e.g. “Invalid client grant”)
instance
Relative path indicating the problem location; this should reference the token or client (e.g. clients/my-client)

See also the Problem OpenAPI schema YAML.

AWS IAM integration

This section describes how to setup an AWS IAM role which can then be assumed by pods running in a Kubernetes cluster. You only need AWS IAM roles if your application calls the AWS API directly (e.g. to store data in some S3 bucket).

Create IAM Role with AssumeRole trust relationship

In order for an AWS IAM role to be assumed by the worker node and passed on to a pod running on the node, it must allow the worker node IAM role to assume it.

This is achieved by adding a trust relation to the role's trust relationship policy document. Assuming the account number is 12345678912 and the cluster name is kube-1, the policy document would look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::12345678912:role/kube-1-worker"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Reference IAM role in pod

In order to use the IAM role in a pod you simply need to reference the role name in an annotation on the pod specification. As an example we can create a simple deployment for an application called myapp which requires the IAM role myapp-iam-role:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        iam.amazonaws.com/role: myapp-iam-role
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0

To test that the pod gets the correct role you can exec into the container and query the metadata endpoint.

$ zkubectl exec -it myapp-podid -- sh
$ curl -s 169.254.169.254/latest/meta-data/iam/security-credentials/
myapp-iam-role

The response should be the name of the role available from within the pod.

Ingress

This section describes how to expose a service to the internet by defining Ingress rules.

What is Ingress?

Ingress allows you to expose a service to the internet by defining its HTTP-layer address. Ingress settings include:

  • TLS certificate
  • host name
  • path endpoint (optional)
  • service and service port

When the Ingress controller detects a new or modified Ingress entry, it creates or updates the DNS record for the defined hostname, updates the load balancer to use a TLS certificate and route requests to the cluster nodes, and defines the routes that map hostname and path to the right service.

More details about Ingress in Kubernetes can be found in the official Ingress Resources documentation.

How to setup Ingress?

Let’s assume that we have a deployment with label application=test-app, providing an API service on port 8080 and an admin UI on port 8081. In order to make them accessible from the internet, we need to create a service first.

Create a service

The service definition looks like this, create it in the apply directory as service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: test-app-service
  labels:
    application: test-app-service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: main-port
  - port: 8081
    protocol: TCP
    targetPort: 8081
    name: admin-ui-port
  selector:
    application: test-app

Note that we didn’t define the type of the service. This means that it gets the default type ClusterIP and is accessible only from inside the cluster.

Create the Ingress rules

Let’s assume that we want to access this API and admin UI from the internet with the base URL https://test-app.playground.zalan.do, and we want to access the UI on the path /admin while all other endpoints should be directed to the API. We can create the following Ingress entry in the apply directory as ingress.yaml:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-app
spec:
  rules:
  - host: test-app.playground.zalan.do
    http:
      paths:
      - backend:
          serviceName: test-app-service
          servicePort: main-port
      - path: /admin
        backend:
          serviceName: test-app-service
          servicePort: admin-ui-port

Once the changes have been applied by the pipeline, the API and the admin UI should be accessible at https://test-app.playground.zalan.do and https://test-app.playground.zalan.do/admin. (If the load balancer and/or the DNS entry are newly created, it can take about a minute for everything to be ready.) An already provisioned X.509 certificate (IAM or ACM) will be found and matched automatically for your Ingress resource.

Manually selecting a certificate

The right certificate is usually discovered automatically, but there might be occasions where the SSL certificate ID (ARN) needs to be specified manually (e.g. if a CNAME in another account points to our Ingress). Let’s assume we want to hard-code the certificate that the ALB uses to terminate TLS for https://test-app.playground.zalan.do/. We can create the following Ingress entry in the apply directory as ingress.yaml:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-app
  annotations:
    zalando.org/aws-load-balancer-ssl-cert: <certificate ARN>
spec:
  rules:
  - host: test-app.playground.zalan.do
    http:
      paths:
      - backend:
          serviceName: test-app-service
          servicePort: main-port

Certificate ARN

In the above template, the token <certificate ARN> is meant to be replaced with the ARN of a valid certificate available for your account. You can find the right certificate in one of the following two ways:

1. For standard IAM certificates:

aws iam list-server-certificates

... should display something like this:

{
    "ServerCertificateMetadataList": [
        {
            "ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE1",
            "ServerCertificateName": "self-signed-cert1",
            "Expiration": "2026-12-13T08:31:06Z",
            "Path": "/",
            "Arn": "arn:aws:iam::123456789012:server-certificate/self-signed-cert1",
            "UploadDate": "2016-12-15T08:48:03Z"
        },
        {
            "ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE2",
            "ServerCertificateName": "self-signed-cert2",
            "Expiration": "2026-12-13T08:51:22Z",
            "Path": "/",
            "Arn": "arn:aws:iam::123456789012:server-certificate/self-signed-cert2",
            "UploadDate": "2016-12-15T08:51:41Z"
        },
        {
            "ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE3",
            "ServerCertificateName": "teapot-zalan-do",
            "Expiration": "2023-05-11T00:00:00Z",
            "Path": "/",
            "Arn": "arn:aws:iam::123456789012:server-certificate/teapot-zalan-do",
            "UploadDate": "2016-05-12T12:26:52Z"
        }
    ]
}

...where you want to use the Arn values.

2. For Amazon Certificate Manager (ACM) certificates:

aws acm list-certificates

...should print something like this:

{
    "CertificateSummaryList": [
        {
            "CertificateArn": "arn:aws:acm:eu-central-1:123456789012:certificate/12345678-1234-1234-1234-123456789012",
            "DomainName": "teapot.zalan.do"
        },
        {
            "CertificateArn": "arn:aws:acm:eu-central-1:123456789012:certificate/12345678-1234-1234-1234-123456789012",
            "DomainName": "*.teapot.zalan.do"
        }
    ]
}

...where you want to use the CertificateArn values.

Alternatives

You can expose an application with its own load balancer, as described in TLS Termination and DNS. The two methods can live next to each other, but they need separate service definitions (due to the different service types).

Container resource limits

Note

This is a preliminary summary from skimming docs and educational guessing. No evaluation done. It could contain errors.

Resource definitions

There are two supported resource types: cpu and memory. In future versions of Kubernetes one will be able to add custom resource types and the current implementation might be based on that.

CPU resources are measured in virtual cores or, more commonly, in “millicores” (e.g. 500m denoting 50% of a vCPU). Memory resources are measured in bytes and the usual suffixes can be used, e.g. 500Mi denoting 500 mebibytes.

For each resource type there are two kinds of definitions: requests and limits. Requests and limits are defined per container. Since the unit of scheduling is a pod one needs to sum them up to get the requests and limits of a pod.

The resulting four combinations are explained in more detail below.
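As a minimal sketch (container names and values are made up), the pod below requests 150m CPU and 300Mi memory in total, which is the number the scheduler accounts for:

spec:
  containers:
  - name: app            # hypothetical main container
    image: my-app:cd53
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
      limits:
        memory: 400Mi
  - name: sidecar        # hypothetical sidecar container
    image: my-sidecar:1.0
    resources:
      requests:
        cpu: 50m
        memory: 100Mi
      limits:
        memory: 200Mi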

Resource requests

In general, requests are used by the scheduler to find a node that has free resources to take the pod. A node is full when the sum of all requests equals the registered capacity of that node for any resource type. So, if a node still has enough unclaimed capacity to cover a pod’s requests, the scheduler can place the pod there.

Note that this is the only metric the scheduler uses (in that context). It doesn’t take the actual usage of the pods into account (which can be lower or higher than whatever is defined in requests).

Memory requests
Used for finding nodes with enough memory and making better scheduling decisions.
CPU requests
Maps to the Docker flag --cpu-shares, which defines a relative weight of that container for CPU time. The relative weight is applied per core, which can lead to unexpected outcomes, but is probably nothing to worry about in our use cases. A container will never be killed because of this metric.

Resource limits

Limits define the upper bound of resources a container can use. Limits must always be greater or equal to requests. The behavior differs between CPU and memory.

Memory limits
Maps to the docker flag --memory, which means processes in the container get killed by the kernel if they hit that memory usage (OOMKilled). Given you run one process per container this will kill the whole container and Kubernetes will try to restart it.
CPU limits
Maps to the Docker flag --cpu-quota, which limits the CPU time of that container’s processes. This means you can define that a container may utilize a core by at most e.g. 50%. However, if you run three such containers on a single-core node, this can still lead to over-utilizing it.

Conclusion

  • requests are for making scheduling decisions
  • limits are real resource limits of containers
  • effect of CPU limits is not completely straight forward to understand
  • choosing higher limits than requests allows you to over-provision nodes, but carries the risk of over-utilizing them
  • requests are required for using the horizontal pod autoscaler

Persistent Storage

Some of your pods need to persist data across pod restarts (e.g. databases). In order to facilitate this we can mount folders into our pods that are backed by EBS volumes on AWS.

Deploying Redis

In this example we’re going to deploy a non-highly-available but persistent Redis container.

We start out by deploying a non-persistent version first and then extend it to keep our data across pod and node restarts. Submit the following two manifests to your cluster to create a deployment and a service for your redis instance.

apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
  - port: 6379
    targetPort: 6379
  selector:
    application: redis
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        application: redis
        version: 3.2.5
    spec:
      containers:
      - name: redis
        image: redis:3.2.5

Your service can be accessed from other pods by using the automatically generated cluster-internal DNS name or service IP address. So given you use the manifests as printed above and you’re running in the default namespace you should find your Redis instance at redis.default.svc.cluster.local from any other pod.

You can run an interactive pod and test that it works. You can use the same Redis image as it contains the redis CLI.

$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> quit

Creating a volume

There’s one major problem with your Redis container: It lacks some persistent storage. So let’s add it.

We’ll be using something that’s called a PersistentVolumeClaim. Claims are an abstraction over the actual storage system in your cluster. With a claim you define that you need some amount of storage at some path inside your container. Based on your needs the cluster management system will provision you some storage out of its available storage pool. In case of AWS you usually get an EBS volume attached to the node and mounted into your container.

Submit the following file to your cluster in order to claim 10GiB of standard storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  annotations:
    volume.beta.kubernetes.io/storage-class: standard
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

standard is a storage class that we defined in the cluster. It’s implemented via an SSD-EBS volume. ReadWriteOnce means that this storage can only be attached to one instance at a time. Both of these values can be safely ignored; more important for you are the name and the requested size of storage.
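For reference, a storage class like standard could be defined by cluster administrators roughly as follows; this is only a sketch and not necessarily our exact cluster configuration:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2  # SSD-backed EBS volumes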

After submitting the manifest to the cluster you can list your storage claims:

$ zkubectl get persistentVolumeClaims
NAME            STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
redis-data      Bound     pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7   10Gi       RWO           4s

Status Bound means that your claim was successfully implemented and is now bound to a persistent volume. You can also list all volumes:

$ zkubectl get persistentVolumes
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                      REASON    AGE
pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7   10Gi       RWO           Delete          Bound     default/redis-data                   8m

If you want to dig deeper you can describe the volume and see that it’s backed by an EBS volume.

$ zkubectl describe persistentVolume pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7
Name:               pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7
Labels:             failure-domain.beta.kubernetes.io/region=eu-central-1
    failure-domain.beta.kubernetes.io/zone=eu-central-1b
Status:             Bound
Claim:              default/redis-data
Reclaim Policy:     Delete
Access Modes:       RWO
Capacity:   10Gi
Message:
Source:
    Type:   AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:       aws://eu-central-1b/vol-a36c7039
    FSType: ext4
    Partition:      0
    ReadOnly:       false
No events.

Here, you can also see in which zone the EBS volume was created. Any pod that wants to mount this volume must be scheduled to a node running in that same zone. Luckily, Kubernetes takes care of that.

Attaching a volume to a pod

Modify your deployment in the following way in order to use the persistent volume claim we created above.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        application: redis
        version: 3.2.5
    spec:
      containers:
      - name: redis
        image: redis:3.2.5
        volumeMounts:
        - mountPath: /data
          name: redis-data
      volumes:
        - name: redis-data
          persistentVolumeClaim:
            claimName: redis-data

We did two things here: first, we registered the persistentVolumeClaim under the volumes section of the pod definition and gave it a name; then, using that name, we mounted the volume under a path in the container in the volumeMounts section. The reason for this two-level definition is that multiple containers in the same pod can mount the same volume under different paths, e.g. for sharing data.

Our Redis container uses /data to store its data, which is exactly where we mounted our persistent volume. This way, anything that Redis stores is written to the EBS volume and can therefore be mounted on another node in case of node failure.

Note that you usually want replicas to be 1 when using this approach. You can use more replicas, which would result in many pods mounting the same volume; as this volume is backed by an EBS volume, that forces Kubernetes to schedule all replicas on the same node. If you require multiple replicas, each with their own persistent volume, you should consider using a StatefulSet instead (a minimal sketch follows below).
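For illustration, a minimal StatefulSet sketch for this Redis example could look like the following (assuming the apps/v1beta1 API). The volumeClaimTemplates section gives each replica its own EBS-backed volume; actual Redis replication between the pods would still have to be configured separately:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  template:
    metadata:
      labels:
        application: redis
        version: 3.2.5
    spec:
      containers:
      - name: redis
        image: redis:3.2.5
        volumeMounts:
        - mountPath: /data
          name: redis-data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
      annotations:
        volume.beta.kubernetes.io/storage-class: standard
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi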

Trying it out

Find out where your pod currently runs:

$ zkubectl get pods -o wide
  NAME                        READY     STATUS    RESTARTS   AGE       IP          NODE
  redis-3548935762-qevsk      1/1       Running   0          2m        10.2.1.66   ip-172-31-15-65.eu-central-1.compute.internal

The node it landed on is ip-172-31-15-65.eu-central-1.compute.internal. Connect to your Redis endpoint and create some data:

$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> set foo bar
OK
redis-default.hackweek.zalan.do:6379> get foo
"bar"
redis-default.hackweek.zalan.do:6379> quit

Simulate a pod failure by deleting your pod. This will make Kubernetes create a new one potentially on another node but always in the same zone due to using an EBS volume.

$ zkubectl delete pod redis-3548935762-qevsk
pod "redis-3548935762-qevsk" deleted

$ zkubectl get pods -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP          NODE
redis-3548935762-p4z9y      1/1       Running   0          1m        10.2.72.2   ip-172-31-10-115.eu-central-1.compute.internal

In this example the new pod landed on another node (ip-172-31-10-115.eu-central-1.compute.internal). Let’s check that it’s available and didn’t lose any data. Connect to Redis in the same way as before.

$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> get foo
"bar"
redis-default.hackweek.zalan.do:6379> quit

And indeed, everything is still there.

Deleting a volume

All it takes to delete a volume is to delete the corresponding claim that initiated its creation in the first place.

$ zkubectl delete persistentVolumeClaim redis-data
persistentvolumeclaim "redis-data" deleted

To fully clean up after yourself also delete the deployment and the service:

$ zkubectl delete deployment,service redis
service "redis" deleted
deployment "redis" deleted

Logging

The Zalando cluster ships logs to Scalyr for all containers running on a cluster node. The logs include extra attributes/tags/metadata depending on the deployment manifests. Whenever a new container starts on a cluster node, its logs will be shipped.

Note

Logs are shipped per container and not per application. To view all logs from a certain application you can use the Scalyr UI https://www.scalyr.com/events and filter using Logs attributes.

One Scalyr account will be provisioned for each community, i.e. the same Scalyr account is used for both test and production clusters.

You need to make sure the minimum requirements are satisfied to start viewing logs on Scalyr.

Requirements

Logging output

Always make sure your application logs to stdout & stderr. This allows the cluster log shipper to follow application logs, and also allows you to follow logs via the Kubernetes native logs command.

$ zkubectl logs -f my-pod-name my-container-name

Labels

In order for the container logs to be shipped, your deployment must include the following metadata labels:

  • application
  • version

Logs attributes

All logs are shipped with extra attributes that can help in filtering from Scalyr UI (or API). Usually those extra fields are extracted from deployment labels, or the Kubernetes cluster/API.

application
Application ID. Retrieved from metadata labels.
version
Application version. Retrieved from metadata labels.
release
Application release. Retrieved from metadata labels. [optional]
cluster
Cluster ID. Retrieved from Kubernetes cluster.
container
Container name. Retrieved from Kubernetes API.
node
Cluster node running this container. Retrieved from Kubernetes cluster.
pod
Pod name running the container. Retrieved from Kubernetes cluster.
namespace
Namespace running this deployment (pod). Retrieved from Kubernetes cluster.

Log parsing

The default parser for application logs is the json parser. In some cases however you might want to use a custom Scalyr parser for your application. This can be achieved via Pod annotations.

However, the json parser only parses the JSON generated by the Docker logging driver. If your application itself logs in JSON, the default parser only sees it as an escaped JSON string. Scalyr provides a special escapedJson parser for that case.

Scalyr’s default parser can even be configured to also make a pass with the escapedJson parser. That way there is no need to configure anything on a per application level to get properly parsed fields from JSON based application logs in Scalyr. Just edit the JSON parser to contain the following config.

// Parser for log files containing JSON records.
{
   attributes: {
     // Tag all events parsed with this parser so we can easily select them in queries.
     dataset: "json"
   },

   formats: [
     {format: "${parse=json}$", repeat: true},
     {format: "\\{\"log\":\"$log{parse=escapedJson}$", repeat: true}
   ]
 }

The following example shows how to annotate a pod to instruct the log watcher to use the custom parser json-java-parser for pod container my-app.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        application: my-app
      annotations:
        # specify scalyr log parser
        kubernetes-log-watcher/scalyr-parser: '[{"container": "my-app-container", "parser": "json-java-parser"}]'
    spec:
      containers:
      - name: my-app-container
        image: pierone.stups.zalan.do/myteam/my-app:cd53
        ports:
        - containerPort: 8080

The value of the kubernetes-log-watcher/scalyr-parser annotation should be a JSON-serialized list. If the container value does not match, the log watcher falls back to the default parser (i.e. json).

Note

You need to specify the container in the parser annotation because you can have multiple containers in a pod which may use different log formats.
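For example, a pod with two containers could map each container to its own parser; the parser names below are purely illustrative, and any container not listed keeps the default json parser:

annotations:
  kubernetes-log-watcher/scalyr-parser: '[{"container": "my-app-container", "parser": "json-java-parser"}, {"container": "nginx-sidecar", "parser": "accessLog"}]'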

Running in Production

Number of Replicas

Always run at least two replicas (three or more are recommended) of your application to survive cluster updates and autoscaling without downtime.

Readiness Probes

Web applications should always configure a readinessProbe to make sure that the container only gets traffic after a successful startup:

containers:
- name: mycontainer
  image: myimage
  readinessProbe:
    httpGet:
      # Path to probe; should be cheap, but representative of typical behavior
      path: /.well-known/health
      port: 8080
    timeoutSeconds: 1

See Configuring Liveness and Readiness Probes for details.

Resource Requests

Always configure resource requests for both CPU and memory. The Kubernetes scheduler and cluster autoscaler need this information in order to make the right decisions. Example:

containers:
  - name: mycontainer
    image: myimage
    resources:
      requests:
        cpu: 100m     # 100 millicores
        memory: 200Mi # 200 MiB

Resource Limits

You should configure a resource limit for memory if possible. The memory resource limit will get your container OOMKilled when it reaches the limit. Set the JVM heap memory dynamically by using the java-dynamic-memory-opts script from Zalando’s OpenJDK base image and setting MEM_TOTAL_KB to limits.memory:

containers:
  - name: mycontainer
    image: myjvmdockerimage
    env:
      # set the maximum available memory as JVM would assume host/node capacity otherwise
      # this is evaluated by java-dynamic-memory-opts in the Zalando OpenJDK base image
      # see https://github.com/zalando/docker-openjdk
      - name: MEM_TOTAL_KB
        valueFrom:
          resourceFieldRef:
            resource: limits.memory
            divisor: 1Ki
    resources:
      requests:
        cpu: 100m
        memory: 2Gi
      limits:
        memory: 2Gi

Example application with IAM credentials

Note

This section describes the legacy way of getting OAuth credentials via Mint. Please read Zalando Platform IAM Integration for the recommended new approach.

This is a full example manifest of an application (myapp) which uses IAM credentials distributed via a mint-bucket (zalando-stups-mint-12345678910-eu-central-1).

Here is an example of a policy that grants access to the specific folder in the Mint’s S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Resource": [
        "arn:aws:s3:::zalando-stups-mint-12345678910-eu-central-1/myapp/*"
      ],
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Sid": "AllowMintRead"
    }
  ]
}

In this example the AWS access role for the S3 bucket is called myapp-iam-role (see also AWS IAM integration for how to correctly set up such a role in AWS):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        iam.amazonaws.com/role: myapp-iam-role
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0
          env:
            - name: CREDENTIALS_DIR
              value: /meta/credentials
          volumeMounts:
            - name: credentials
              mountPath: /meta/credentials
              readOnly: true
        - name: gerry
          image: registry.opensource.zalan.do/teapot/gerry:v0.0.9
          args:
            - /meta/credentials
            - --application-id=myapp
            - --mint-bucket=s3://zalando-stups-mint-12345678910-eu-central-1
          volumeMounts:
            - name: credentials
              mountPath: /meta/credentials
              readOnly: false
      volumes:
        - name: credentials
          emptyDir:
            medium: Memory # share a tmpfs between the two containers

The first important part of the manifest is the annotations section:

annotations:
  iam.amazonaws.com/role: myapp-iam-role

Here we specify the role needed in order for the pod to get access to the S3 bucket with the credentials.

The next important part is the gerry sidecar.

- name: gerry
  image: registry.opensource.zalan.do/teapot/gerry:v0.0.9
  args:
    - /meta/credentials
    - --application-id=myapp
    - --mint-bucket=s3://zalando-stups-mint-12345678910-eu-central-1
  volumeMounts:
    - name: credentials
      mountPath: /meta/credentials
      readOnly: false

The gerry sidecar container mounts the shared credentials mount point under /meta/credentials and writes the credential files user.json and client.json to this location.

To read these files from the myapp container, the shared credentials mount point is also mounted into the myapp container.

- name: myapp
  image: myapp:v1.0.0
  env:
    - name: CREDENTIALS_DIR
      value: /meta/credentials
  volumeMounts:
    - name: credentials
      mountPath: /meta/credentials
      readOnly: true

TLS Termination and DNS

This section describes how to expose a service via TLS to the internet.

Note

You usually want to use Ingress instead to automatically expose your application with TLS and DNS.

Expose your app

Let’s deploy a simple web server to test that our TLS termination works.

Submit the following yaml files to your cluster.

Note that this guide uses a top-down approach and starts with deploying the service first. This allows Kubernetes to better distribute pods belonging to the same service across the cluster to ensure high availability. You can, however, submit the files in any order you like and it will work. It’s all declarative.

Create a service

Create a Service of type LoadBalancer so that your pods become accessible from the internet through an ELB. For TLS termination to work you need to annotate the service with the ARN of the certificate you want to serve.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:eu-central-1:some-account-id:certificate/some-cert-id
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 80
  selector:
    app: nginx

This creates a logical service called nginx that forwards all traffic to any pods matching the label selector app=nginx, which we haven’t created yet. The service (logically) listens on port 443 and forwards to port 80 on each of the upstream pods, which is where the nginx processes listen.

We also define the protocol that our upstreams use. Often your upstreams will just speak plain HTTP so the second annotation’s value is actually the default value and can be omitted.

Make sure to define your service to listen on port 443 as this will be used as the listening port for your ELB.

Wait for a couple of minutes for AWS to provision an ELB for you and for DNS to propagate. Check the list of services to find out the endpoint of the ELB that was created for you.

$ zkubectl get svc -o wide
NAME      CLUSTER-IP   EXTERNAL-IP                                     PORT(S)   AGE       SELECTOR
nginx     10.3.0.245   some-long-hash.eu-central-1.elb.amazonaws.com   443/TCP   6m        app=nginx

Create the deployment

Now let’s deploy some pods that actually implement our service.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

This creates a deployment called nginx that ensures two copies of the nginx image from Docker Hub are running, listening on port 80. They match exactly the labels that our service is looking for, so they are dynamically added to the service’s pool of upstreams.

Make sure your pods are running.

$ zkubectl get pods
NAME                     READY     STATUS    RESTARTS   AGE
nginx-1447934386-iblb3   1/1       Running   0          7m
nginx-1447934386-jj559   1/1       Running   0          7m

Now curl the service endpoint. You’ll get a certificate warning since the hostname doesn’t match the served certificate.

$ curl --insecure https://some-long-hash.eu-central-1.elb.amazonaws.com
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>

DNS records

For convenience you can assign a DNS name for your service so you don’t have to use the arbitrary ELB endpoints. The DNS name can be specified by adding an additional annotation to your service containing the desired DNS name.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: my-nginx.playground.zalan.do
spec:
  ...

Note that although you specify the full DNS name here you must pick a name that is inside the zone of the cluster, e.g. in this case *.playground.zalan.do. Also keep in mind that when doing this you can clash with other users’ service names.

Make sure it works:

$ curl https://my-nginx.playground.zalan.do
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>

For reference, the full service description should look like this:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:eu-central-1:some-account-id:certificate/some-cert-id
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    external-dns.alpha.kubernetes.io/hostname: my-nginx.playground.zalan.do
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 80
  selector:
    app: nginx

Common pitfalls

When accessing your service from another pod make sure to specify both port and protocol

Kubernetes clusters usually run an internal DNS server that allows you to reference services from inside the cluster via DNS names rather than IPs. The internal DNS name for this example is nginx.default.svc.cluster.local. So, from inside any pod of the cluster you can lookup your service with:

$ dig +short nginx.default.svc.cluster.local
10.3.0.245

But don’t get confused by the mixed ports: your service just forwards to the plain HTTP endpoints of your nginx pods but serves them on port 443, still as plain HTTP. So, to avoid confusion when accessing your service from another pod, make sure to specify both port and protocol.

$ curl http://nginx.default.svc.cluster.local:443
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>

Note that we use HTTP on port 443 here.

Service accounts

In Kubernetes, service accounts are used to provide an identity for pods. Pods that want to interact with the API server will authenticate with a particular service account. By default, applications will authenticate as the default service account in the namespace they are running in. This means, for example, that an application running in the test namespace will use the default service account of the test namespace.

Access Control

Applications are authorized to perform certain actions based on the service account selected. We currently allow the following service accounts:

kube-system:system
Used only for admin access in kube-system namespace.
kube-system:default
Used for read only access in the kube-system namespace.
default:default
Gives read-only access to the Kubernetes API.
*:operator
Gives full access to the used namespace and read-write access to ThirdPartyResources, storage classes and persistent volumes in all namespaces.

Additional service accounts are used by the Kubernetes’ controller manager to allow it to work properly.

How to create service accounts

Service accounts can be created for your namespace via the pipelines (or via zkubectl in test clusters) by placing the respective YAML in the apply folder and applying it. For example, to request operator access you need to create the following service account:

apiVersion: v1
kind: ServiceAccount
imagePullSecrets:
- name: pierone.stups.zalan.do  # required to pull images from private registry
metadata:
  name: operator
  namespace: $YOUR_NAMESPACE

The service account can be used in an example deployment like this:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  namespace: acid
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      serviceAccountName: operator  # this is where your service account is specified
      hostNetwork: true

FAQ

How do I...

... ensure that my application runs in multiple Availability Zones?
The Kubernetes scheduler will automatically try to distribute pods across multiple “failure domains” (the Kubernetes term for AZs).
... use the AWS API from my application on Kubernetes?
Create an IAM role via CloudFormation and assign it to your application pods. The AWS SDKs will automatically use the assigned IAM role. See AWS IAM integration for details.
... get OAuth access tokens in my application on Kubernetes?
Your application can declare needed OAuth credentials (tokens and clients) via the PlatformCredentialsSet. See Zalando Platform IAM Integration for details.
... read the logs of my application?
The most convenient way to read your application’s logs (stdout and stderr) is by filtering by the application label in the Scalyr UI. See Logging for details.
... get access to my Scalyr account?
You can approach one of your colleagues who already has access to invite you to your Scalyr account.
... switch traffic gradually to a new application version?
Traffic switching is currently implemented by scaling up the new deployment and scaling down the old version. This process is fully automated and cannot be controlled in the current CI/CD Jenkins pipeline. The future deployment infrastructure will probably support manual traffic switching.
... use different namespaces?
We recommend using the “default” namespace, but you can create your own if you want to.
... quickly get write access to production clusters for 24x7 emergencies?
We still need to set up the Emergency Operator workflow: the idea is to quickly give full access to production accounts and clusters in case of incidents. Eric’s idea is to require a real 24x7 INCIDENT ticket for getting access (this would ensure that it’s not misused for day-to-day work). Right now (2017-05-15) you can call STUPS 24x7 2nd level (via 1st level) to ask for emergency access.
... use a single Jenkins for both building (CI) and deployment (CD)? That would enable more sophisticated pipelines because no extra communication between CI and CD would be needed; CI needs a feedback loop from CD in order to perform joint activities.
The Jenkins setup will be replaced by the Continuous Delivery Platform which performs both builds and deploys. See https://pages.github.bus.zalan.do/continuous-delivery/cdp-docs/ and watch out for announcements and Friday Demos.
... test deployment YAMLs from CLI?
$ zdeploy render-template deployment.yaml application=xxx version=xx | zkubectl create
... access a production service from my test cluster?
Test clusters are not allowed to get production OAuth credentials, please use a staging service and sandbox OAuth credentials.
... decide when to place a declaration under the apply folder, and when at the root (it doesn’t seem to be standard)?
The current Jenkins CI/CD pipeline relies on some Zalando conventions: every .yaml file in the apply folder is applied as a Kubernetes manifest or CloudFormation template. Some files need to be in the root folder as they are processed in a special way, e.g. deployment.yaml, autoscaling.yaml and pipeline.yaml.
... use Helm together with Kubernetes on AWS?
We don’t currently (May 2017) support it because it requires the installation of some components in the kube-system namespace. This namespace is reserved for core cluster components as defined in the Kubernetes on AWS configuration and is not accessible to users. Furthermore, the Zalando “compliance by default” requirements (delivering stacks over declarations in a Zalando git repo) would clash with Helm defaults.

Will the cluster scale up automatically and quickly in case of surprise need of more pods?

Cluster autoscaling is purely based on resource requests, i.e. as soon as the resource requests increase (e.g. because the number of pods goes up) the autoscaler will set a new DesiredCapacity of the ASG. The autoscaler is very simple and not based on deltas, but on absolute numbers, i.e. it will potentially scale up by many nodes at once (not one by one). See https://github.com/hjacobs/kube-aws-autoscaler#how-it-works

Admin’s Guide

How to create, update and operate Kubernetes clusters.

Running Kubernetes in Production

Tip

Start by watching our meetup talk “Kubernetes on AWS at Europe’s Leading Online Fashion Platform” on YouTube, to learn how we run Kubernetes on AWS in production. (slides)

This document briefly describes what we learned at Zalando Tech while running Kubernetes on AWS in production. As we only recently started to migrate to Kubernetes, we consider ourselves far from being experts in the field. This document is shared in the hope that others in the community can benefit from our learnings.

Context

We are a team of infrastructure engineers provisioning Kubernetes clusters for our Zalando Tech delivery teams. We plan to have more than 30 production Kubernetes clusters. The following goals might help to understand the remainder of the document, our Kubernetes setup and our specific challenges:

  • No manual operations: all cluster updates and operations need to be fully automated.
  • No pet clusters: clusters should all look the same and not require any specific configurations/tweaking
  • Reliability: the infrastructure should be rock-solid for our delivery teams to entrust our clusters with their most critical applications
  • Autoscaling: clusters should automatically adapt to deployed workloads and hourly scaling events are expected
  • Seamless migration: Dockerized twelve-factor apps currently deployed on AWS/STUPS should work without modifications on Kubernetes

Cluster Provisioning

There are many tools out there to provision Kubernetes clusters. We chose to adapt kube-aws as it matches our current way of working on AWS: immutable nodes configured via cloud-init and CloudFormation for declarative infrastructure. CoreOS’ Container Linux perfectly matches our understanding of the node OS: only provide what is needed to run containers, not more.

Only one Kubernetes cluster is created per AWS account. We create separated AWS accounts/clusters for production and test environments.

We always create two AWS Auto Scaling Groups (ASGs, “node pools”) right now:

  • One master ASG with always two nodes which run the API server and controller-manager
  • One worker ASG with 2 to N nodes to run application pods

Both ASGs span multiple Availability Zones (AZ). The API server is exposed with TLS via a “classic” TCP/SSL Elastic Load Balancer (ELB).

We use a custom built Cluster Registry REST service to manage our Kubernetes clusters. Another component (Cluster Lifecycle Manager, CLM) is regularly polling the Cluster Registry and updating clusters to the desired state. The desired state is expressed with CloudFormation and Kubernetes manifests stored in git.

[Figure: Cluster Lifecycle Manager]

Different clusters can use different channel configurations, i.e. some non-critical clusters might use the “alpha” channel with latest features while others rely on the “stable” channel. The channel concept is similar to how CoreOS manages releases of Container Linux.

Clusters are automatically updated as soon as changes are merged into the respective branch. Configuration changes are first tested in a separate feature branch, afterwards the pull request to the “dev” branch (channel) is automatically tested end-to-end (this includes the official Kubernetes conformance tests).

[Figure: Cluster update flow]

AWS Integration

We provision clusters on AWS and therefore want to integrate with AWS services where possible. The kube2iam daemon conveniently allows assigning an AWS IAM role to a pod by adding an annotation. Our infrastructure components such as the autoscaler use the same mechanism to access the AWS API with special (restricted) IAM roles.

Ingress

There is no official way of implementing Ingress on AWS. We decided to create a new component Kube AWS Ingress Controller to achieve our goals:

  • SSL termination by ALB: convenient usage of ACM (free Amazon CA) and certificates uploaded to AWS IAM
  • Using the “new” ELBv2 Application Load Balancer

[Figure: Kube AWS Ingress Controller]

We use Skipper as our HTTP proxy to route based on Host header and path. Skipper is running as a DaemonSet on all worker nodes for convenient AWS ASG integration (new nodes are automatically registered in the ALB’s Target Group). Skipper directly comes with a Kubernetes data client to automatically update its routes periodically.

External DNS is automatically configuring the Ingress hosts as DNS records in Route53 for us.

Resources

Understanding the Kubernetes resource requests and limits is crucial.

Default resource requests and limits can be configured via the LimitRange resource. This can prevent “stupid” incidents like JVM deployments without any settings (no memory limit and no JVM heap set) eating all the node’s memory. We currently use the following default limits:

$ kubectl describe limits
Name:       limits
Namespace:  default
Type        Resource    Min Max  Default Request Default Limit Max Limit/Request Ratio
----        --------    --- ---- --------------- ------------- -----------------------
Container   cpu         -   16   100m            3             -
Container   memory      -   64Gi 100Mi           1Gi           -

The default limit for CPU is 3 cores, as we discovered that this is a sweet spot for JVM apps to start up quickly. See our LimitRange YAML manifest for details.
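Reconstructed from the table above, the corresponding LimitRange manifest would look roughly like this (a sketch; see the linked manifest for the authoritative version):

apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
  - type: Container
    max:
      cpu: "16"
      memory: 64Gi
    defaultRequest:   # applied when a container does not specify a request
      cpu: 100m
      memory: 100Mi
    default:          # applied as limit when a container does not specify a limit
      cpu: "3"
      memory: 1Gi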

We provide a tiny script and use the Downward API to conveniently run JVM applications on Kubernetes without the need to manually set the maximum heap size. The container spec of a Deployment for a JVM app would look like this:

# ...
env:
  # set the maximum available memory as JVM would assume host/node capacity otherwise
  # this is evaluated by java-dynamic-memory-opts in the Zalando OpenJDK base image
  # see https://github.com/zalando/docker-openjdk
  - name: MEM_TOTAL_KB
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
        divisor: 1Ki
resources:
  limits:
    memory: 1Gi

The kubelet can be instructed to reserve a certain amount of resources for the system and for Kubernetes components (the kubelet itself, Docker, etc.). Reserved resources are subtracted from the node’s allocatable resources. This improves scheduling and makes resource allocation and usage more transparent. Node allocatable resources, or rather reserved resources, are also visible in Kubernetes Operational View:

[Figure: Reserved resources shown in Kubernetes Operational View]

Graceful Pod Termination

Kubernetes will cause service disruptions on pod terminations by default, as applications and configuration need to be prepared for graceful shutdown. On termination, pods receive the TERM signal and kube-proxy reconfigures the iptables rules to stop traffic to the pod. The pod is killed 30 seconds later by a KILL signal if it did not terminate by itself before.

Kubernetes expects the container to handle the TERM signal and at least wait some seconds for kube-proxy to change the iptables rules. Note that the readinessProbe behavior does not matter after having received the TERM signal.

There are two cases leading to failing requests:

  • The pod’s container terminates immediately when receiving the TERM signal — thus not giving kube-proxy enough time to remove the forwarding rule
  • Keep-alive connections are not handed over by Kubernetes, i.e. requests from clients with keep-alive connection will still be routed to the pod

Keep-alive connections are the default when using connection pools. This means that nearly all client connections between microservices are affected by pod terminations.

Kubernetes’ default behavior is a blocker for seamless migration from our AWS/STUPS infrastructure to Kubernetes. In STUPS, single Docker containers run directly on EC2 instances. Graceful container termination is not needed as AWS automatically deregisters EC2 instances and drains connections from the ELB on instance termination. We therefore consider solving the graceful pod termination issue in Kubernetes on the infrastructure level. This would not require any application code changes by our users (application developers).
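One common workload-level mitigation (not the infrastructure-level solution we aim for, and assuming the image ships a sleep binary) is a preStop hook that delays shutdown for a few seconds so kube-proxy can remove the forwarding rules before the application receives TERM:

containers:
- name: mycontainer
  image: myimage
  lifecycle:
    preStop:
      exec:
        command: ["sleep", "20"]  # runs before TERM is sent, giving kube-proxy time to update iptables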

For further reading on the topic, you can find a blog post about graceful shutdown of node.js on Kubernetes and a small test app to see the pod termination behavior.

Autoscaling

Pod Autoscaling

We are using the HorizontalPodAutoscaler resource to scale the number of deployment replicas. Pod autoscaling requires implementing graceful pod termination (see above) to downscale safely in all circumstances. We have only used CPU-based pod autoscaling so far.
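A minimal CPU-based HorizontalPodAutoscaler manifest (names and thresholds are illustrative) looks like this; the target deployment must define CPU requests for the utilization percentage to be meaningful:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70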

Node Autoscaling

Our experimental AWS Autoscaler is an attempt to implement a simple and elastic autoscaling with AWS Auto Scaling Groups.

Graceful node shutdown is required to allow safe downscaling at any time. We simply added a small systemd unit to run kubectl drain on shutdown.

Upscaling or node replacement poses the risk of race conditions between application pods and required system pods (DaemonSets). We have not yet figured out a good way of postponing application scheduling until the node is fully ready. The kubelet’s Ready condition is not enough, as it does not ensure that all system pods such as kube-proxy and kube2iam are running. One idea is using taints during node initialization to prevent application pods from being scheduled until the node is fully ready.
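A sketch of the taint idea (the taint key is a made-up example): the node registers with a NoSchedule taint, system DaemonSets tolerate it, and the taint is removed once the node is fully initialized.

# Hypothetical taint key, for illustration only.
# 1) The kubelet registers the node with:
#      --register-with-taints=example.org/initializing=true:NoSchedule
# 2) System DaemonSets (kube-proxy, kube2iam, ...) tolerate the taint:
tolerations:
  - key: example.org/initializing
    operator: Exists
    effect: NoSchedule
# 3) A startup script or controller removes the taint once all system pods are ready:
#      kubectl taint nodes <node> example.org/initializing:NoSchedule-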

Monitoring

We use our Open Source ZMON monitoring platform to monitor all Kubernetes clusters. ZMON agent and workers are part of every Kubernetes cluster deployment. The agent automatically pushes both AWS and Kubernetes entities to the global ZMON data service. The Prometheus Node Exporter is deployed on every Kubernetes node (as a DaemonSet) to expose system metrics such as disk space, memory and CPU to ZMON workers. Another component, kube-state-metrics, is deployed in every cluster to expose cluster-level metrics such as the number of waiting pods. ZMON workers also have access to the internal Kubernetes API server endpoint to build more complex checks. AWS resources can be monitored by using ZMON’s CloudWatch wrapper. We defined global ZMON checks for cluster health, e.g.:

  • Number of ready and unschedulable nodes (collected via API server)
  • Disk, memory and CPU usage per node (collected via Prometheus Node Exporter and/or CloudWatch)
  • Number of endpoints per Kubernetes service (collected via API server)
  • API server requests and latency (collected via API server metrics endpoint)

We use Kubernetes Operational View for ad-hoc insights and troubleshooting.

Jobs

We use the very convenient Kubernetes CronJob resource for various tasks such as updating all our SSH bastion hosts every week.
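A minimal CronJob manifest looks roughly like this (the name, schedule and image are placeholders; depending on the Kubernetes version the resource lives in the batch/v2alpha1 or batch/v1beta1 API group):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: update-bastion-hosts # placeholder name
spec:
  schedule: "0 4 * * 1" # every Monday at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: update
              image: registry.example.org/update-bastion:latest # placeholder image
          restartPolicy: OnFailure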

Kubernetes jobs are not cleaned up by default and completed pods are never deleted. Running jobs frequently (e.g. every few minutes) quickly clutters the Kubernetes API server with unnecessary pod resources. We observed a significant slowdown of the API server with an increasing number of completed jobs/pods hanging around. To mitigate this, a small kube-job-cleaner script runs as a CronJob every hour and cleans up completed jobs/pods.

Security

We authorize access to the API server via a proprietary webhook which verifies the OAuth Bearer access token and looks up the user’s roles via another small REST service (historically backed by LDAP).

Access to etcd should be restricted as it holds all of Kubernetes’ cluster data thus allowing tampering when accessed directly.

We use flannel as our overlay network which requires etcd by default to configure its network ranges. There is experimental support for the flannel backend to be switched to the Kubernetes API server. This allows restricting etcd access to the master nodes.

Kubernetes allows defining PodSecurityPolicy resources to restrict the use of “privileged” containers and similar features which allow privilege escalation.
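A minimal sketch of such a restrictive policy could look like the following (field values are illustrative and the API group depends on the Kubernetes version):

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false # disallow privileged containers
  hostNetwork: false
  hostPID: false
  hostIPC: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - configMap
    - emptyDir
    - secret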

Docker

Docker is often beautiful and sometimes painful, especially when trying to run containers reliably in production. We encountered various issues with Docker, none of which are really Kubernetes related, e.g.:

  • Docker 1.11 to 1.12.5 included an evil bug where the Docker daemon becomes unresponsive (docker ps hangs). We hit this problem every week on at least one of our Kubernetes nodes. Our workaround was upgrading to Docker 1.13 RC2 (we have since moved back to 1.12.6 as the fix was backported).
  • We saw some processes getting stuck in “pipe wait” while writing to STDOUT when using the default Docker json logger (the root cause has not been identified yet).
  • There seem to be a lot more race conditions in Docker, and you can find many “Docker daemon hangs” issues reported; we expect to hit some of them once in a while.
  • Upgrading Docker clients to 1.13 broke pulls from our Pier One registry (pulls from gcr.io were broken too). We implemented a quick workaround in Pier One until Docker fixed the issue upstream.
  • A thread on Twitter suggested adding the --iptables=false flag for Docker 1.13. We spent some time before finding out that this is a bad idea: NAT for the Flannel overlay network breaks when adding --iptables=false.

We learned that Docker can be quite painful to run in production because of the many tiny bugs (race conditions). You can be sure to hit some of them when running enough nodes 24x7. It’s also better not to touch your Docker version once you have a running setup.

etcd

Kubernetes relies on etcd for storing the state of the whole cluster. Losing etcd consensus makes the Kubernetes API server essentially read-only, i.e. no changes can be performed in the cluster. Losing etcd data requires rebuilding the whole cluster state and would probably cause a major downtime. Luckily all data can be restored as long as at least one etcd node is alive.

Knowing the criticality of the etcd cluster, we decided to use our existing, production-grade STUPS etcd cluster running on EC2 instances separate from Kubernetes. The STUPS etcd cluster registers all etcd nodes in Route53 DNS and we use etcd’s DNS discovery feature to connect Kubernetes to the etcd nodes. The STUPS etcd cluster is deployed across availability zones (AZ) with five nodes in total. All etcd nodes run our own STUPS Taupage AMI, which (similar to CoreOS) runs a Docker image specified via AWS user data (cloud-init).

Developers’ Guide

Developers’ guide for the Kubernetes on AWS project.

Contents:

Repositories

Relevant Repositories

The following OSS repositories are relevant for the project; their code, issues and information must be checked regularly:

Kubernetes on AWS (GH)
https://github.com/zalando-incubator/kubernetes-on-aws
STUPS etcd
https://github.com/zalando/stups-etcd-cluster
Zalando Kubectl Wrapper
https://github.com/zalando-incubator/zalando-kubectl
External DNS
https://github.com/kubernetes-incubator/external-dns
Kubernetes AWS Cluster Autoscaler
https://github.com/hjacobs/kube-aws-autoscaler
Kubernetes Job Cleaner
https://github.com/hjacobs/kube-job-cleaner
Kubernetes Operational View
https://github.com/hjacobs/kube-ops-view

The responsible maintainers are indicated in the respective MAINTAINERS file in each repository.

Ingress

These repositories are relevant for supporting Kubernetes Ingress resources on AWS:

Skipper with Kubernetes data client
https://github.com/zalando/skipper
Ingress AWS ELB Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller

ADR-001: Store cluster versions in Cluster Registry

Context

The Cluster Lifecycle Manager (CLM) can pull configuration from two separate sources: the Cluster Registry and a channel source (a git repository).

The CLM must be able to go away and later come back to continue where it left off. Therefore it must store the current cluster configuration state somewhere. The cluster configuration state is defined by a version of the channel source and the configuration currently stored in a Cluster Registry.

The configuration state should be in a format that is not tied to a specific implementation of the CLM; this way the CLM can support multiple ways of provisioning clusters.

The CLM should be able to support multiple configuration sources and be able to provision clusters in multiple ways. Therefore the configuration format should not be tied to a specific implementation of configuration sources or provisioning method.

Decision

CLM will store the current configuration state under the status field of the Cluster resource in the Cluster Registry. The configuration state will be stored as three versions:

next_version
This indicates the version the cluster will be updated to next; it is mostly used for debugging purposes.
current_version
This is the current version the cluster has.
last_version
This is the last working version. The last version is also used for rolling back a cluster in case the new version is broken.

Each version is a string defined as the channel version (git commit sha1) concatenated with a separator character “#” and the sha1 hash of the current cluster config (excluding the status field) from the Cluster Registry.
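An illustrative example (both sha1 values are made up): a channel commit 85bc2978… combined with a cluster config hash f1d2d2f9… would be stored as:

current_version: "85bc2978e7c98bd6a7af8d0f4a6cfdbe85e9f703#f1d2d2f924e986ac86fdf7b36c94bcdf32beec15"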

We decided to encode the version as a simple string because:

  • Splitting into multiple fields or properties would push the CLM implementation detail unnecessarily to the Cluster Registry schema (as we might think of different implementations with more version parts, e.g. a Kops provisioner relying on a certain Kops version)
  • The concrete version string format only needs to be known in one place (the CLM provisioner implementation), i.e. the string can be opaque to all other systems
  • A simple string field is easily read and “parsed” by humans for debugging
  • KISS

Status

Accepted.

Consequences

  • CLM will be responsible for deriving the cluster config hash based on the cluster config from the Cluster Registry.
  • CLM will be responsible for concatenating/comparing version strings. The Cluster Registry will for instance not be aware of the format of the versions which are stored there.
  • CLM can have several provisioner implementations which each can define its own versioning format without requiring changes in the Cluster Registry.

ADR-002: Installation of Kubernetes non-core system components

Context

In cluster.py we used to install all the kube-system components using a systemd unit. This basically consisted of a bash script that deployed all the manifests from /srv/kubernetes/manifests/*/*.yaml using kubectl. We obviously do not want to update versions manually via kubectl. Furthermore, this approach also meant that we had to launch a new master instance in order to apply the updated manifests.

Decision

We will do the following:

  • entirely remove the “install-kube-system” unit from the master user data
  • create a folder with all the manifests for each Kubernetes artifact
  • apply all the manifests from the Cluster Lifecycle Manager code

Some of the possible alternatives for the folder structures are:

  1. /manifests/APPLICATION_NAME/deployment.yaml - which uses a folder structure that includes the APPLICATION_NAME
  2. /manifests/APPLICATION_NAME/KIND/mate.yaml - which uses a folder structure that includes APPLICATION_NAME and KIND
  3. /manifests/mate-deployment.yaml - where we have a flat structure and the filenames contain the name of the application and the kind
  4. /manifests/mate.yaml - where mate.yaml contains all the artifacts of all kinds related to mate

We choose number 1 as it seems the most compelling alternative. Number 2 would only introduce an additional folder level that does not provide any benefit. Number 3 would instead rely on a naming convention for the given kind. Number 4 is a competitive alternative to number 1 and could be adopted, but we prefer to go with number 1 as it is very flexible and probably more readable for the maintainer. For the file naming convention, we recommend splitting into one file per kind where possible and putting the kind name (or just a prefix) in the file name. We will not make any assumption about the file naming scheme in the code. Also, no assumption will be made about the order in which the files are applied.

Status

Accepted.

Consequences

The chosen file convention will be relevant when discussing the removal of components from kube-system. This is currently out of scope for this ADR as this only covers the “apply” case.

ADR-003: Organize cluster versions in branches

Context

When managing multiple clusters with different SLOs there is a need for pinning different clusters to different channels of the cluster configuration. For instance a production cluster might require a more stable channel of the cluster configuration than a test or playground cluster where we want to try out new, not yet stable, features.

To be able to manage multiple channels for different clusters we need to define a process describing:

  • What defines a channel.
  • How to move patches/hotfixes between channels.
  • How to promote an “unstable” channel to “stable”.
  • How to try out experimental features.

Decision

Cluster configuration channels will map to git branches in the configuration repository. The branch layout is shown below.

PR (experimental-branch-1)-
                           \
PR (feature-2) ------------------> dev
                           /         \
PR (hotfix-3) ----------------------> alpha
                           \             \
                            \----------> beta
                                          \
                                          stable

dev is the default branch and is the main entrypoint for new feature PRs. Every new feature should therefore start as a PR targeting dev and should flow to the other channels only from the dev channel. Critical hotfixes can go directly to the relevant channels.

Experimental features should be tested on a separate branch based on dev before they are merged into the dev branch.

Specifying the channel for a cluster is done by assigning a branch/channel name to the channel field of a cluster resource in the Cluster Registry.
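For illustration only (apart from the channel field, the attributes of a cluster resource are placeholders, not defined by this ADR), a cluster pinned to the alpha channel could look like:

# Hypothetical cluster resource in the Cluster Registry
id: example-cluster
channel: alpha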

  • TBD: when is something considered ready to be promoted? (after X days automatically)?
  • TBD: how is something promoted from dev to alpha (and further up)?
  • TBD: what controls do we need when promoting (four eyes?)?

Status

Proposed.

Consequences

  • The default branch of kubernetes-on-aws becomes dev.
  • We need to protect dev/alpha/beta/stable branches.

ADR-004: Roles and Service Accounts

Context

We need to define roles and service accounts to allow all our use cases. Our first concerns are to allow the following:

  • Users should be able to deploy (manually in test clusters, via the deploy API in production clusters), but we do not want them to be able to read secrets by default
  • Admins should get full access to all resources, mostly for emergency access
  • Applications should not get by default write access to the Kubernetes API
  • It should be possible for some applications to write to the Kubernetes API.

Decision

We define the following Roles:

  • ReadOnly: allowed to read every resource, but not secrets. “exec” and “proxy” and similar operations are not allowed. Allowed to do “port-forward” to special proxy, which will enable DB access.
  • PowerUser: “restricted” Pod Security Policy with write access to all namespaces but kube-system, ReadOnly access to kube-system namespace, “exec” and “proxy” are allowed, RW for secrets, no write of daemonsets. DB access through “port-forward” and special proxy.
  • Operator: “privileged” Pod Security Policy with write access to the own namespace and read and write access to third party resources in all namespaces.
  • Controller: the Kubernetes controller-manager component is not allowed to “use” Pod Security Policies other than “restricted”, such that serviceAccount authorization is used to check the permission. It has full access to all other resources.
  • Admin: full access to all resources

And the following <namespace, service account> pairs will get the listed role, assigned by the webhook:

  • “kube-system:default” - Admin
  • “default:default” - ReadOnly
  • “*:operator” - Operator
  • kube-controller-manager - Controller
  • kubelet - Admin

Applications that want write access to the Kubernetes API will have to use the “operator” service account.
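In the pod template this boils down to setting the service account, roughly as in the following sketch:

# ...
spec:
  serviceAccountName: operator # mapped to the Operator role by the webhook
  containers:
    - name: my-controller # placeholder name
      image: registry.example.org/my-controller:latest # placeholder image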

Status

Accepted.

Consequences

This decision is a breaking change of what was previously defined for applications. Users that need applications with write access to the Kubernetes API will need to select the right service account. The controller-manager now has an identity and uses the secured kube-apiserver endpoint, so that it can be authorized by the webhook.

ADR-005: How to use channels

Context

In ADR-003 we defined some initial characteristics of channels, how they map to git branches, and the definition of the dev branch/channel. We still needed to define how many branches we use and how to promote from one to the other. In this ADR we answer those remaining questions.

Decision

We decided to use the following branches/channels:

dev
this is the default branch for the project. By default, all new features and bug fixes that are not considered hotfixes will target this channel.
alpha
this is the branch immediately after dev.
stable
this is the most stable branch.

The following diagram shows the process of working with the channels:

_images/ADR-005.svg

Every branch with an open pull request will trigger end-to-end testing as soon as the pull request is labelled as “ready-to-test”. While discussing the end-to-end (e2e) testing strategy is out of scope for this ADR, we define here the following requirements:

  • The e2e testing infrastructure will create a new cluster
  • All the e2e tests will run on the aforementioned cluster
  • The e2e tests will report the status to the PR
  • The cluster will be deleted as soon as the tests finish

We decided against polluting the Cluster Registry, which means that the testing infrastructure will create the cluster locally using the Cluster Lifecycle Manager (CLM) functionality and not by creating entries in the Cluster Registry.

Once the PR is approved and merged into the dev branch/channel, all the clusters using the channel will be updated. The list includes, as a minimum and by design, the following clusters:

  • Infrastructure cluster to test cluster setup

The cluster above will be tested with smoke tests that could include the end-to-end tests executed on the branch. Testing on this cluster has the following goals:

  • Testing the update on an existing cluster (an in-place update) rather than on a freshly created cluster, as this might show different behavior.
  • Testing the impact of the update on already running applications. This requires that the cluster being updated has running applications covering different Kubernetes features.

In addition to the tests, ZMON metrics will be monitored closely. If nothing looks wrong after X hours, an automatic merge into the alpha branch will be executed. This will trigger updates to all the clusters running the alpha channel. This includes, as a minimum, the following clusters:

  • Infrastructure prod cluster
  • Infrastructure test cluster
  • playground

A transition to the stable channel will be started by the automatic creation of a PR after Y days. Unlike the previous step, this PR will not be merged automatically, but will require additional human approval.

Hotfixes

Hotfixes can be made via PRs to any of the channels. This increases the speed at which we can fix important issues in all channels. A hotfix PR will still need to pass the e2e tests in order to be merged.

Status

Accepted.

Consequences

The following are important consequences:

  • No manual merges to alpha or stable channels are possible for changes that are not hotfixes.
  • With this ADR we rely heavily on our e2e testing infrastructure to guarantee stability/quality of our PRs.
  • Clusters might be assigned to different channels depending on different requirements like SLOs.
