Welcome to Kubernetes on AWS’ documentation!¶
Note
This documentation is only an extract from our internal Zalando documentation. It’s provided in the hope that it helps the Kubernetes community.
Contents:
User’s Guide¶
How to use the Kubernetes cluster.
Labels and Selectors¶
Labels are key/value pairs that are attached to Kubernetes objects, such as pods (this is usually done indirectly via deployments). Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users. Labels can be used to organize and to select subsets of objects. See Labels and Selectors in the Kubernetes documentation for more information.
The following Kubernetes labels have a defined meaning in our Zalando context:
- application
- Application ID as defined in our Kio application registry. Example: “zmon-controller”
- version
- User-defined application version. This is used as input for the CI/CD pipeline and usually references a Docker image tag. Example: “cd53”
- release
- Incrementing release counter. This is generated by the CI/CD pipeline and is used for traffic switching. Example: “4”
- stage
- Deployment stage to allow canary deployments. Allowed values are “canary” and “production”.
- owner
- Owner of the Kubernetes resource. This needs to reference a valid organizational entity in the context of the cluster’s business partner. Example: “team/eagleeye”
Some labels are required for every deployment resource:
- application
- version
- release
- stage
Example deployment metadata:
metadata:
labels:
application: my-app
version: "v31"
release: "r42"
stage: production
Kubernetes services will usually select only on application and stage:
kind: Service
apiVersion: v1
metadata:
name: my-app
spec:
selector:
application: my-app
stage: production
ports:
- port: 80
targetPort: 8080
protocol: TCP
You can always define additional custom labels as long as they don’t conflict with the above label catalog.
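Labels also come in handy on the command line. For example, assuming the manifests above have been applied, you could list all production pods of an application with a label selector (label values here are only illustrative):
# list all pods of my-app currently serving production traffic
zkubectl get pods -l application=my-app,stage=production

# additionally show the release label to see which release is live
zkubectl get pods -l application=my-app,stage=production -L release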
Zalando Platform IAM Integration¶
Introductory remark: after two years of learnings from the STUPS/IAM integration, we will integrate a more advanced and simpler solution. The integration will be Kubernetes native and will take complexity out of your application. Instead of providing you with a client ID, client secret, username and password that you then have to use to regularly generate tokens, we provide a simple way to obtain ready-to-go tokens directly, without fiddling with the credentials. Technically speaking, this means you just need to read your current token from a text file in your filesystem and you are done - no complicated token libraries anymore.
The user flow for a new application to get OAuth credentials looks like:
- Register the new application in the Kio application registry via the YOUR TURN frontend.
- Configure OAuth scopes in Mint via the “Access Control” in the YOUR TURN frontend.
- Configure required OAuth credentials (tokens and/or clients) in Kubernetes via a new Platform IAM Credentials resource.
Platform IAM Credentials¶
The PlatformCredentialsSet resource allows application owners to declare needed OAuth credentials.
apiVersion: "zalando.org/v1"
kind: PlatformCredentialsSet
metadata:
name: my-app-credentials
spec:
application: my-app # has to match with registered application in kio/yourturn
tokens:
full-access: # token name
privileges: # privileges/scopes for the token.
# All zalando-specific privileges start with namespace com.zalando, following pattern <namespace>::<privilege>
# the privileges/scopes you define here should match those you define for your application in yourturn.
- com.zalando::foobar.write
- com.zalando::acme.full
read-only: # token name
privileges: # privileges/scopes for the token.
- com.zalando::foobar.read
clients:
employee: # client name
# the allowed grant type, see https://tools.ietf.org/html/rfc6749
# options: authorization-code, implicit, resource-owner-password-credentials, client-credentials
# (values directly reference RFC section titles)
grant: authorization-code
# the client's account realm
# options: users, customers, services
# ("services" realm should not be used for clients, use the "tokens" section instead!)
realm: users
# redirection URI as described in https://tools.ietf.org/html/rfc6749#section-2
redirectUri: https://example.org/auth/callback
The declared credentials will automatically be provided as a secret with the same name.
Following this example you would get a token called full-access with the privileges com.zalando::foobar.write and com.zalando::acme.full, a token called read-only with the privilege com.zalando::foobar.read, and a client named employee which uses the authorization-code grant under realm users.
Secrets¶
Automatically generated secrets provide the declared OAuth credentials in the following form:
apiVersion: v1
kind: Secret
metadata:
name: my-app-credentials
type: Opaque
data:
full-access-token-type: Bearer
full-access-token-secret: JwAbc123.. # JWT token
read-only-token-type: Bearer
read-only-token-secret: JwBcd456.. # JWT token
employee-client-id: 67b86a55-61e6-4862-aa14-70fe7be788f4
employee-client-secret: 5585942c-ce79-44e4-aac2-8af565b51d3e
The secret can conveniently be mounted to read the tokens and client credentials from a volume:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
metadata:
labels:
application: my-app
spec:
containers:
- name: my-app
image: pierone.stups.zalan.do/myteam/my-app:cd53
ports:
- containerPort: 8080
volumeMounts:
- name: my-app-credentials
mountPath: /meta/credentials
readOnly: true
volumes:
- name: my-app-credentials
secret:
secretName: my-app-credentials
The application can now simply read the declared tokens from text files, i.e. even a simple Bash script suffices to use OAuth tokens:
#!/bin/bash
type=$(cat /meta/credentials/read-only-token-type)
secret=$(cat /meta/credentials/read-only-token-secret)
curl -H "Authorization: $type $secret" https://resource-server.example.org/protected
Either use one of the supported token libraries or implement the file read on your own. How to read a token in different languages:
# Python
with open('/meta/credentials/{}-token-secret'.format(token_name)) as fd:
access_token = fd.read().strip()
// JavaScript (node.js)
const accessToken = String(fs.readFileSync(`/meta/credentials/${tokenName}-token-secret`)).trim()
// Java
String accessToken = new String(Files.readAllBytes(Paths.get("/meta/credentials/" + tokenName + "-token-secret"))).trim();
Note
Using the authorization type from the secret instead of hardcoding Bearer allows you to transparently switch to HTTP Basic Auth in a different context (e.g. running an Open Source application in a non-Zalando environment).
Users would simply need to provide an appropriate secret like:
apiVersion: v1
kind: Secret
metadata:
name: my-app-credentials
type: Opaque
data:
full-access-token-type: Basic
full-access-token-secret: dXNlcjpwYXNzCg== # base64 encoded user:pass
read-only-token-type: Basic
read-only-token-secret: dXNlcjpwYXNzCg== # base64 encoded user:pass
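For example, such a secret could also be created with kubectl from literal values (kubectl base64-encodes them automatically); all names and values below are placeholders:
# create a Basic auth secret by hand (placeholder values)
kubectl create secret generic my-app-credentials \
  --from-literal=full-access-token-type=Basic \
  --from-literal=full-access-token-secret=user:pass \
  --from-literal=read-only-token-type=Basic \
  --from-literal=read-only-token-secret=user:pass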
Problem Feedback¶
Providing the requested credentials (tokens, clients) may fail for various reasons:
- the PlatformCredentialsSet has syntactic errors
- the application (application property) does not exist or is missing required configuration
- the application is not allowed to obtain the requested credentials (e.g. missing privileges)
- some other error occurred
All problems with credential distribution are written to the secret with the same name as the PlatformCredentialsSet:
apiVersion: v1
kind: Secret
metadata:
name: my-app-credentials
annotations:
zalando.org/problems: |
- type: https://credentials-provider.example.org/not-enough-privileges
title: Forbidden: Not enough privileges
status: 403
instance: tokens/full-access
type: Opaque
data:
# NOTE: the declared "full-access" token is missing as it was denied
read-only-token-type: Bearer
read-only-token-secret: JwBcd456.. # JWT token
employee-client-id: 67b86a55-61e6-4862-aa14-70fe7be788f4
employee-client-secret: 5585942c-ce79-44e4-aac2-8af565b51d3e
The zalando.org/problems annotation contains a list of “Problem JSON” objects (as defined in RFC 7807), serialized as YAML.
At least the fields type, title and instance should be set by the component processing the PlatformCredentialsSet resource:
- type
- Machine-readable URI reference that identifies the problem type (e.g. https://example.org/invalid-grant)
- title
- Short, human-readable summary of the problem type (e.g. “Invalid client grant”)
- instance
- Relative path indicating the problem location; this should reference the token or client (e.g. clients/my-client)
See also the Problem OpenAPI schema YAML.
AWS IAM integration¶
This section describes how to setup an AWS IAM role which can then be assumed by pods running in a Kubernetes cluster. You only need AWS IAM roles if your application calls the AWS API directly (e.g. to store data in some S3 bucket).
Create IAM Role with AssumeRole trust relationship¶
In order for an AWS IAM role to be assumed by the worker node and passed on to a pod running on the node, it must allow the worker node IAM role to assume it.
This is achieved by adding a trust relation to the role’s trust relationship policy document. Assuming the account number is 12345678912 and the cluster name is kube-1, the policy document would look like this:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::12345678912:role/kube-1-worker"
},
"Action": "sts:AssumeRole"
}
]
}
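If you manage the role outside of CloudFormation, it could be created with the AWS CLI roughly like this (role name, policy name and file paths are only examples):
# save the trust policy above as trust-policy.json, then create the role
aws iam create-role \
  --role-name myapp-iam-role \
  --assume-role-policy-document file://trust-policy.json

# attach the permissions your application needs, e.g. an S3 read policy
aws iam put-role-policy \
  --role-name myapp-iam-role \
  --policy-name myapp-s3-read \
  --policy-document file://myapp-s3-read.json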
Reference IAM role in pod¶
In order to use the IAM role in a pod you simply need to reference the role name in an annotation on the pod specification. As an example we can create a simple deployment for an application called myapp which requires the IAM role myapp-iam-role:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 1
template:
metadata:
labels:
app: myapp
annotations:
iam.amazonaws.com/role: myapp-iam-role
spec:
containers:
- name: myapp
image: myapp:v1.0
To test that the pod gets the correct role you can exec into the container and query the metadata endpoint.
$ zkubectl exec -it myapp-podid -- sh
$ curl -s 169.254.169.254/latest/meta-data/iam/security-credentials/
myapp-iam-role
The response should be the name of the role available from within the pod.
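You can also fetch the temporary credentials themselves from the metadata endpoint to verify that the role can actually be assumed (role name as in the example above):
# returns temporary AWS credentials (AccessKeyId, SecretAccessKey, Token, Expiration)
curl -s 169.254.169.254/latest/meta-data/iam/security-credentials/myapp-iam-role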
Ingress¶
This section describes how to expose a service to the internet by defining Ingress rules.
What is Ingress?¶
Ingress allows you to expose a service to the internet by defining its HTTP-layer address. Ingress settings include:
- TLS certificate
- host name
- path endpoint (optional)
- service and service port
When the Ingress services detect a new or modified Ingress entry, they create or update the DNS record for the defined hostname, update the load balancer to use a TLS certificate and route requests to the cluster nodes, and define the routes that find the right service based on hostname and path.
More details about the general Ingress in Kubernetes can be found in the official Ingress Resources.
How to setup Ingress?¶
Let’s assume that we have a deployment with the label application=test-app, providing an API service on port 8080 and an admin UI on port 8081. In order to make them accessible from the internet, we need to create a service first.
Create a service¶
The service definition looks like this; create it in the apply directory as service.yaml:
apiVersion: v1
kind: Service
metadata:
name: test-app-service
labels:
application: test-app-service
spec:
ports:
- port: 8080
protocol: TCP
targetPort: 8080
name: main-port
- port: 8081
protocol: TCP
targetPort: 8081
name: admin-ui-port
selector:
application: test-app
Note that we didn’t define the type of the service. This means that the service type will be the default ClusterIP, and the service will be accessible only from inside the cluster.
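After submitting the manifest you can verify that the service exists and that it picked up the pods of your deployment as endpoints:
# the service should have a cluster IP but no external IP (type ClusterIP)
zkubectl get service test-app-service

# each ready pod matching application=test-app should show up here with both ports
zkubectl get endpoints test-app-service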
Create the Ingress rules¶
Let’s assume that we want to access this API and admin UI from the internet with the base URL https://test-app.playground.zalan.do, and we want to access the UI on the path /admin while all other endpoints should be directed to the API. We can create the following Ingress entry in the apply directory as ingress.yaml:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: test-app
spec:
rules:
- host: test-app.playground.zalan.do
http:
paths:
- backend:
serviceName: test-app-service
servicePort: main-port
- path: /admin
backend:
serviceName: test-app-service
servicePort: admin-ui-port
Once the changes have been applied by the pipeline, the API and the admin UI should be accessible at https://test-app.playground.zalan.do and https://test-app.playground.zalan.do/admin. (If the load balancer and/or the DNS entry are newly created, it can take ~1 minute for everything to be ready.) An already provisioned X.509 certificate (IAM or ACM) will be found and matched automatically for your Ingress resource.
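You can check the state of the Ingress and test the endpoints once DNS has propagated, for example:
# shows the hostname and the load balancer address assigned to the Ingress
zkubectl get ingress test-app

# the API is served at the root path, the admin UI under /admin
curl https://test-app.playground.zalan.do/
curl https://test-app.playground.zalan.do/admin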
Manually selecting a certificate¶
The right certificate is usually discovered automatically, but there might be occasions where the SSL certificate ID (ARN) needs to be specified manually (e.g. if a CNAME in another account points to our Ingress).
Let’s assume we want to hard-code the certificate that the ALB uses to terminate TLS for https://test-app.playground.zalan.do/. We can create the following Ingress entry in the apply directory as ingress.yaml:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: test-app
annotations:
zalando.org/aws-load-balancer-ssl-cert: <certificate ARN>
spec:
rules:
- host: test-app.playground.zalan.do
http:
paths:
- backend:
serviceName: test-app-service
servicePort: main-port
Certificate ARN¶
In the above template, the token <certificate ARN> is meant to be replaced with the ARN of a valid certificate available for your account. You can find the right certificate in one of the following two ways:
1. For standard IAM certificates:
aws iam list-server-certificates
... should display something like this:
{
"ServerCertificateMetadataList": [
{
"ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE1",
"ServerCertificateName": "self-signed-cert1",
"Expiration": "2026-12-13T08:31:06Z",
"Path": "/",
"Arn": "arn:aws:iam::123456789012:server-certificate/self-signed-cert1",
"UploadDate": "2016-12-15T08:48:03Z"
},
{
"ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE2",
"ServerCertificateName": "self-signed-cert2",
"Expiration": "2026-12-13T08:51:22Z",
"Path": "/",
"Arn": "arn:aws:iam::123456789012:server-certificate/self-signed-cert2",
"UploadDate": "2016-12-15T08:51:41Z"
},
{
"ServerCertificateId": "ABCDEFGHIJKLMNOPFAKE3",
"ServerCertificateName": "teapot-zalan-do",
"Expiration": "2023-05-11T00:00:00Z",
"Path": "/",
"Arn": "arn:aws:iam::123456789012:server-certificate/teapot-zalan-do",
"UploadDate": "2016-05-12T12:26:52Z"
}
]
}
...where you want to use the Arn values.
2. For Amazon Certificate Manager (ACM) certificates:
aws acm list-certificates
...should print something like this:
{
"CertificateSummaryList": [
{
"CertificateArn": "arn:aws:acm:eu-central-1:123456789012:certificate/12345678-1234-1234-1234-123456789012",
"DomainName": "teapot.zalan.do"
},
{
"CertificateArn": "arn:aws:acm:eu-central-1:123456789012:certificate/12345678-1234-1234-1234-123456789012",
"DomainName": "*.teapot.zalan.do"
}
]
}
...where you want to use the CertificateArn values.
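If the list is long, you can filter for the domain you are interested in with a JMESPath query; the domain below is just an example:
# print only the ARN of the wildcard certificate for *.teapot.zalan.do
aws acm list-certificates \
  --query "CertificateSummaryList[?DomainName=='*.teapot.zalan.do'].CertificateArn" \
  --output text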
Alternatives¶
You can expose an application with its own load balancer, as described in TLS Termination and DNS. The two methods can live next to each other, but they need separate service definitions (due to the different service types).
Container resource limits¶
Note
This is a preliminary summary from skimming docs and educated guessing. No evaluation has been done. It could contain errors.
Resource definitions¶
There are two supported resource types: cpu and memory. In future versions of Kubernetes one will be able to add custom resource types, and the current implementation might be based on that.
CPU resources are measured in virtual cores or, more commonly, in “millicores” (e.g. 500m denoting 50% of a vCPU).
Memory resources are measured in bytes, and the usual suffixes can be used, e.g. 500Mi denoting 500 mebibytes.
For each resource type there are two kinds of definitions: requests and limits.
Requests and limits are defined per container. Since the unit of scheduling is a pod, one needs to sum them up to get the requests and limits of a pod.
The resulting four combinations are explained in more detail below.
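For reference, a container spec declaring both kinds of definitions for both resource types could look like the following sketch (the values are arbitrary examples):
containers:
- name: mycontainer
  image: myimage
  resources:
    requests:
      cpu: 100m      # used by the scheduler to find a node with free capacity
      memory: 200Mi
    limits:
      cpu: 500m      # maps to --cpu-quota, caps CPU time
      memory: 500Mi  # maps to --memory, container is OOMKilled above this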
Resource requests¶
In general, requests are used by the scheduler to find a node that has free resources to take the pod. A node is full when the sum of all requests equals the registered capacity of that node in any resource type. So, if a node still has enough unclaimed capacity to satisfy a pod’s requests, the scheduler can place the pod there.
Note that this is the only metric the scheduler uses (in that context). It doesn’t take the actual usage of the pods into account (which can be lower or higher than whatever is defined in requests).
- Memory requests
- Used for finding nodes with enough memory and making better scheduling decisions.
- CPU requests
- Maps to the Docker flag --cpu-shares, which defines a relative weight of that container for CPU time. The relative share is applied per core, which can lead to unexpected outcomes, but is probably nothing to worry about in our use cases. A container will never be killed because of this metric.
Resource limits¶
Limits define the upper bound of resources a container can use. Limits must always be greater than or equal to requests. The behavior differs between CPU and memory.
- Memory limits
- Maps to the Docker flag --memory, which means processes in the container get killed by the kernel if they hit that memory usage (OOMKilled). Given that you run one process per container, this will kill the whole container and Kubernetes will try to restart it.
- CPU limits
- Maps to the Docker flag --cpu-quota, which limits the CPU time of that container’s processes. It seems you can define that a container may utilize a core only up to, e.g., 50%. However, if you run three such containers on a single-core node, this can still lead to over-utilizing it.
Conclusion¶
- requests are for making scheduling decisions
- limits are real resource limits of containers
- the effect of CPU limits is not completely straightforward to understand
- choosing higher limits than requests allows over-provisioning nodes, but carries the danger of over-utilizing them
- requests are required for using the horizontal pod autoscaler (a minimal sketch follows below)
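A minimal HorizontalPodAutoscaler sketch, assuming a deployment named my-app with CPU requests configured, could look like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  # target average CPU utilization relative to the pods' CPU requests
  targetCPUUtilizationPercentage: 70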
Persistent Storage¶
Some of your pods need to persist data across pod restarts (e.g. databases). In order to facilitate this we can mount folders into our pods that are backed by EBS volumes on AWS.
Deploying Redis¶
In this example we’re going to deploy a Redis container that is not highly available but is persistent.
We start out by deploying a non-persistent version first and then extend it to keep our data across pod and node restarts. Submit the following two manifests to your cluster to create a deployment and a service for your Redis instance.
apiVersion: v1
kind: Service
metadata:
name: redis
spec:
ports:
- port: 6379
targetPort: 6379
selector:
application: redis
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: redis
spec:
replicas: 1
template:
metadata:
labels:
application: redis
version: 3.2.5
spec:
containers:
- name: redis
image: redis:3.2.5
Your service can be accessed from other pods by using the automatically generated cluster-internal DNS name or service IP address. So, given you use the manifests as printed above and you’re running in the default namespace, you should find your Redis instance at redis.default.svc.cluster.local from any other pod.
You can run an interactive pod and test that it works. You can use the same Redis image as it contains the redis CLI.
$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> quit
Creating a volume¶
There’s one major problem with your Redis container: it lacks persistent storage. So let’s add some.
We’ll be using something that’s called a PersistentVolumeClaim. Claims are an abstraction over the actual storage system in your cluster. With a claim you define that you need some amount of storage at some path inside your container. Based on your needs the cluster management system will provision you some storage out of its available storage pool. In case of AWS you usually get an EBS volume attached to the node and mounted into your container.
Submit the following file to your cluster in order to claim 10GB of standard storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: redis-data
annotations:
volume.beta.kubernetes.io/storage-class: standard
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
standard is a storage class that we defined in the cluster. It’s implemented via an SSD EBS volume. ReadWriteOnce means that this storage can only be attached to one instance at a time. Both of these values can be safely ignored; more important for you are the name and the requested size of storage.
After submitting the manifest to the cluster you can list your storage claims:
$ zkubectl get persistentVolumeClaims
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
redis-data Bound pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7 10Gi RWO 4s
Status Bound means that your claim was successfully implemented and is now bound to a persistent volume. You can also list all volumes:
$ zkubectl get persistentVolumes
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM REASON AGE
pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7 10Gi RWO Delete Bound default/redis-data 8m
If you want to dig deeper you can describe the volume and see that it’s backed by an EBS volume.
$ zkubectl describe persistentVolume pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7
Name: pvc-fc26de82-b577-11e6-b2a5-02c15a33e7b7
Labels: failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1b
Status: Bound
Claim: default/redis-data
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 10Gi
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://eu-central-1b/vol-a36c7039
FSType: ext4
Partition: 0
ReadOnly: false
No events.
Here, you can also see in which zone the EBS volume was created. Any pod that wants to mount this volume must be scheduled to a node running in that same zone. Luckily, Kubernetes takes care of that.
Attaching a volume to a pod¶
Modify your deployment in the following way in order to use the persistent volume claim we created above.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: redis
spec:
replicas: 1
template:
metadata:
labels:
application: redis
version: 3.2.5
spec:
containers:
- name: redis
image: redis:3.2.5
volumeMounts:
- mountPath: /data
name: redis-data
volumes:
- name: redis-data
persistentVolumeClaim:
claimName: redis-data
We did two things here: First we registered the persistentVolumeClaim under the volumes section in the pod definition and gave it a name. Then, by using the name, we mounted that volume under a path in the container in the volumeMounts section. The reason for having a two-level definition here is that multiple containers in the same pod can mount the same volume under different paths, e.g. for sharing data.
Secondly, our Redis container uses /data to store its data, which is where we mounted our persistent volume. This way, anything that Redis stores will be written to the EBS volume and can thus be mounted on another node in case of node failure.
Note that you usually want replicas to be 1 when using this approach. You can use more replicas, which would result in many pods mounting the same volume; as this volume is backed by an EBS volume, this forces Kubernetes to schedule all replicas on the same node. If you require multiple replicas, each with their own persistent volume, you should rather think about using a StatefulSet instead.
Trying it out¶
Find out where your pod currently runs:
$ zkubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
redis-3548935762-qevsk 1/1 Running 0 2m 10.2.1.66 ip-172-31-15-65.eu-central-1.compute.internal
The node it landed on is ip-172-31-15-65.eu-central-1.compute.internal. Connect to your Redis endpoint and create some data:
$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> set foo bar
OK
redis-default.hackweek.zalan.do:6379> get foo
"bar"
redis-default.hackweek.zalan.do:6379> quit
Simulate a pod failure by deleting your pod. This will make Kubernetes create a new one, potentially on another node, but always in the same zone because the volume is backed by EBS.
$ zkubectl delete pod redis-3548935762-qevsk
pod "redis-3548935762-qevsk" deleted
$ zkubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
redis-3548935762-p4z9y 1/1 Running 0 1m 10.2.72.2 ip-172-31-10-115.eu-central-1.compute.internal
In this example the new pod landed on another node (ip-172-31-10-115.eu-central-1.compute.internal).
Let’s check that it’s available and didn’t lose any data. Connect to Redis in the same way as before.
$ zkubectl run redis-cli --rm -ti --image=redis:3.2.5 --restart=Never /bin/bash
$ redis-cli -h redis.default.svc.cluster.local
redis-default.hackweek.zalan.do:6379> get foo
"bar"
redis-default.hackweek.zalan.do:6379> quit
And indeed, everything is still there.
Deleting a volume¶
All it takes to delete a volume is to delete the corresponding claim that initiated its creation in the first place.
$ zkubectl delete persistentVolumeClaim redis-data
persistentvolumeclaim "redis-data" deleted
To fully clean up after yourself also delete the deployment and the service:
$ zkubectl delete deployment,service redis
service "redis" deleted
deployment "redis" deleted
Logging¶
The Zalando cluster ships logs to Scalyr for all containers running on a cluster node. The logs include extra attributes/tags/metadata depending on the deployment manifest. Whenever a new container starts on a cluster node, its logs will be shipped.
Note
Logs are shipped per container and not per application. To view all logs from a certain application you can use the Scalyr UI https://www.scalyr.com/events and filter using the log attributes.
One Scalyr account will be provisioned for each community, i.e. the same Scalyr account is used for both test and production clusters.
You need to make sure the minimum requirements are satisfied to start viewing logs on Scalyr.
Requirements¶
Logging output¶
Always make sure your application logs to stdout and stderr. This will allow the cluster log shipper to follow the application logs, and also allows you to follow logs via the Kubernetes-native logs command.
$ zkubectl logs -f my-pod-name my-container-name
Labels¶
In order for the container logs to be shipped, your deployment must include the following metadata labels:
- application
- version
Logs attributes¶
All logs are shipped with extra attributes that can help in filtering from Scalyr UI (or API). Usually those extra fields are extracted from deployment labels, or the Kubernetes cluster/API.
application
- Application ID. Retrieved from metadata labels.
version
- Application version. Retrieved from metadata labels.
release
- Application release. Retrieved from metadata labels. [optional]
cluster
- Cluster ID. Retrieved from Kubernetes cluster.
container
- Container name. Retrieved from Kubernetes API.
node
- Cluster node running this container. Retrieved from Kubernetes cluster.
pod
- Pod name running the container. Retrieved from Kubernetes cluster.
namespace
- Namespace running this deployment(pod). Retrieved from Kubernetes cluster.
Log parsing¶
The default parser for application logs is the json parser.
In some cases, however, you might want to use a custom Scalyr parser for your application. This can be achieved via pod annotations.
Note that the json parser only parses the JSON generated by the Docker logs. If your application generates logs in JSON, the default parser will only see them as an escaped JSON string. Scalyr provides a special parser, escapedJson, for that.
Scalyr's default parser can even be configured to also make a pass with the escapedJson parser. That way there is no need to configure anything on a per-application level to get properly parsed fields from JSON-based application logs in Scalyr. Just edit the JSON parser to contain the following config.
// Parser for log files containing JSON records.
{
attributes: {
// Tag all events parsed with this parser so we can easily select them in queries.
dataset: "json"
},
formats: [
{format: "${parse=json}$", repeat: true},
{format: "\\{\"log\":\"$log{parse=escapedJson}$", repeat: true}
]
}
The following example shows how to annotate a pod to instruct the log watcher to use the custom parser json-java-parser for the my-app-container container of the my-app deployment.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
template:
metadata:
labels:
application: my-app
annotations:
# specify scalyr log parser
kubernetes-log-watcher/scalyr-parser: '[{"container": "my-app-container", "parser": "json-java-parser"}]'
spec:
containers:
- name: my-app-container
image: pierone.stups.zalan.do/myteam/my-app:cd53
ports:
- containerPort: 8080
The value of the kubernetes-log-watcher/scalyr-parser annotation should be a JSON-serialized list. If the container value does not match, the watcher falls back to the default parser (i.e. json).
Note
You need to specify the container in the parser annotation because you can have multiple containers in a pod which may use different log formats.
Running in Production¶
Number of Replicas¶
Always run at least two replicas (three or more are recommended) of your application to survive cluster updates and autoscaling without downtime.
Readiness Probes¶
Web applications should always configure a readinessProbe to make sure that the container only gets traffic after a successful startup:
containers:
- name: mycontainer
image: myimage
readinessProbe:
httpGet:
# Path to probe; should be cheap, but representative of typical behavior
path: /.well-known/health
port: 8080
timeoutSeconds: 1
See Configuring Liveness and Readiness Probes for details.
Resource Requests¶
Always configure resource requests for both CPU and memory. The Kubernetes scheduler and cluster autoscaler need this information in order to make the right decisions. Example:
containers:
- name: mycontainer
image: myimage
resources:
requests:
cpu: 100m # 100 millicores
memory: 200Mi # 200 MiB
Resource Limits¶
You should configure a resource limit for memory if possible. The memory resource limit will get your container OOMKilled when it reaches the limit.
Set the JVM heap memory dynamically by using the java-dynamic-memory-opts script from Zalando's OpenJDK base image and setting MEM_TOTAL_KB to limits.memory:
containers:
- name: mycontainer
image: myjvmdockerimage
env:
# set the maximum available memory as JVM would assume host/node capacity otherwise
# this is evaluated by java-dynamic-memory-opts in the Zalando OpenJDK base image
# see https://github.com/zalando/docker-openjdk
- name: MEM_TOTAL_KB
valueFrom:
resourceFieldRef:
resource: limits.memory
divisor: 1Ki
resources:
requests:
cpu: 100m
memory: 2Gi
limits:
memory: 2Gi
Example application with IAM credentials¶
Note
This section describes the legacy way of getting OAuth credentials via Mint. Please read Zalando Platform IAM Integration for the recommended new approach.
This is a full example manifest of an application (myapp) which uses IAM credentials distributed via a Mint bucket (zalando-stups-mint-12345678910-eu-central-1).
Here is an example of a policy that grants access to the application's folder in the Mint S3 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Resource": [
"arn:aws:s3:::zalando-stups-mint-12345678910-eu-central-1/myapp/*"
],
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Sid": "AllowMintRead"
}
]
}
In this example the AWS access role for the S3 bucket is called myapp-iam-role (see also AWS IAM integration for how to correctly set up such a role in AWS):
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 1
template:
metadata:
labels:
app: myapp
annotations:
iam.amazonaws.com/role: myapp-iam-role
spec:
containers:
- name: myapp
image: myapp:v1.0.0
env:
- name: CREDENTIALS_DIR
value: /meta/credentials
volumeMounts:
- name: credentials
mountPath: /meta/credentials
readOnly: true
- name: gerry
image: registry.opensource.zalan.do/teapot/gerry:v0.0.9
args:
- /meta/credentials
- --application-id=myapp
- --mint-bucket=s3://zalando-stups-mint-12345678910-eu-central-1
volumeMounts:
- name: credentials
mountPath: /meta/credentials
readOnly: false
volumes:
- name: credentials
emptyDir:
medium: Memory # share a tmpfs between the two containers
The first important part of the manifest is the annotations section:
annotations:
iam.amazonaws.com/role: myapp-iam-role
Here we specify the role needed in order for the pod to get access to the S3 bucket with the credentials.
The next important part is the gerry sidecar.
- name: gerry
image: registry.opensource.zalan.do/teapot/gerry:v0.0.9
args:
- /meta/credentials
- --application-id=myapp
- --mint-bucket=s3://zalando-stups-mint-12345678910-eu-central-1
volumeMounts:
- name: credentials
mountPath: /meta/credentials
readOnly: false
The gerry sidecar container mounts the shared credentials mount point under /meta/credentials and writes the credential files user.json and client.json to this location.
To read these files from the myapp container, the shared credentials mount point is also mounted into the myapp container.
- name: myapp
image: myapp:v1.0.0
env:
- name: CREDENTIALS_DIR
value: /meta/credentials
volumeMounts:
- name: credentials
mountPath: /meta/credentials
readOnly: true
TLS Termination and DNS¶
This section describes how to expose a service via TLS to the internet.
Note
You usually want to use Ingress instead to automatically expose your application with TLS and DNS.
Expose your app¶
Let’s deploy a simple web server to test that our TLS termination works.
Submit the following yaml files to your cluster.
Note that this guide uses a top-down approach and starts with deploying the service first. This allows Kubernetes to better distribute pods belonging to the same service across the cluster to ensure high availability. You can, however, submit the files in any order you like and it will work. It’s all declarative.
Create a service¶
Create a Service of type LoadBalancer so that your pods become accessible from the internet through an ELB. For TLS termination to work you need to annotate the service with the ARN of the certificate you want to serve.
apiVersion: v1
kind: Service
metadata:
name: nginx
annotations:
service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:eu-central-1:some-account-id:certificate/some-cert-id
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
type: LoadBalancer
ports:
- port: 443
targetPort: 80
selector:
app: nginx
This creates a logical service called nginx that forwards all traffic to any pods matching the label selector app=nginx, which we haven’t created yet. The service (logically) listens on port 443 and forwards to port 80 on each of the upstream pods, which is where the nginx processes will listen.
We also define the protocol that our upstreams use. Often your upstreams will just speak plain HTTP so the second annotation’s value is actually the default value and can be omitted.
Make sure to define your service to listen on port 443 as this will be used as the listening port for your ELB.
Wait for a couple of minutes for AWS to provision an ELB for you and for DNS to propagate.
Check the list of services to find out the endpoint of the ELB that was created for you.
$ zkubectl get svc -o wide
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
nginx 10.3.0.245 some-long-hash.eu-central-1.elb.amazonaws.com 443/TCP 6m app=nginx
Create the deployment¶
Now let’s deploy some pods that actually implement our service.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nginx
spec:
replicas: 2
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
This creates a deployment called nginx that ensures two copies of the nginx image from Docker Hub are running, listening on port 80. They match exactly the labels that our service is looking for, so they are dynamically added to the service’s pool of upstreams.
Make sure your pods are running.
$ zkubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-1447934386-iblb3 1/1 Running 0 7m
nginx-1447934386-jj559 1/1 Running 0 7m
Now curl the service endpoint. You’ll get a certificate warning since the hostname doesn’t match the served certificate.
$ curl --insecure https://some-long-hash.eu-central-1.elb.amazonaws.com
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>
DNS records¶
For convenience you can assign a DNS name for your service so you don’t have to use the arbitrary ELB endpoints. The DNS name can be specified by adding an additional annotation to your service containing the desired DNS name.
apiVersion: v1
kind: Service
metadata:
name: nginx
annotations:
external-dns.alpha.kubernetes.io/hostname: my-nginx.playground.zalan.do
spec:
...
Note that although you specify the full DNS name here, you must pick a name that is inside the zone of the cluster, e.g. in this case *.playground.zalan.do.
Also keep in mind that when doing this you can clash with other users’ service names.
Make sure it works:
$ curl https://my-nginx.playground.zalan.do
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>
For reference, the full service description should look like this:
apiVersion: v1
kind: Service
metadata:
name: nginx
annotations:
service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:eu-central-1:some-account-id:certificate/some-cert-id
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
external-dns.alpha.kubernetes.io/hostname: my-nginx.playground.zalan.do
spec:
type: LoadBalancer
ports:
- port: 443
targetPort: 80
selector:
app: nginx
Common pitfalls¶
When accessing your service from another pod make sure to specify both port and protocol¶
Kubernetes clusters usually run an internal DNS server that allows you to reference services from inside the cluster via DNS names rather than IPs. The internal DNS name for this example is nginx.default.svc.cluster.local. So, from inside any pod of the cluster you can look up your service with:
$ dig +short nginx.default.svc.cluster.local
10.3.0.245
But don’t get confused by the mixed ports: your service just forwards to the plain HTTP endpoints of your nginx pods but serves them on port 443, as HTTP. So, to avoid confusion when accessing your service from another pod, make sure to specify both port and protocol.
$ curl http://nginx.default.svc.cluster.local:443
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>
Note that we use HTTP on port 443 here.
Service accounts¶
In Kubernetes, service accounts are used to provide an identity for pods.
Pods that want to interact with the API server will authenticate with a particular service account. By default, applications will authenticate as the default service account in the namespace they are running in. This means, for example, that an application running in the test namespace will use the default service account of the test namespace.
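The service account’s credentials are mounted into every pod by default, so you can inspect them from inside a container; a quick check could look like this:
# the token and CA certificate for the pod's service account are mounted here
ls /var/run/secrets/kubernetes.io/serviceaccount/
# ca.crt  namespace  token

# call the API server using the service account token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api/v1/namespaces/default/pods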
Access Control¶
Applications are authorized to perform certain actions based on the service account selected. We currently allow the following service accounts:
- kube-system:system
- Used only for admin access in kube-system namespace.
- kube-system:default
- Used for read only access in the kube-system namespace.
- default:default
- Gives read-only access to the Kubernetes API.
- *:operator
- Gives full access to the used namespace and read-write access to TPR, storage classes, persistent volumes in all namespaces.
Additional service accounts are used by the Kubernetes controller manager to allow it to work properly.
How to create service accounts¶
Service accounts can be created for your namespace via pipelines (or via zkubectl in test clusters) by placing the respective YAML in the apply folder and executing it. For example, to request operator access you will need to create the following service account:
apiVersion: v1
kind: ServiceAccount
imagePullSecrets:
- name: pierone.stups.zalan.do # required to pull images from private registry
metadata:
name: operator
namespace: $YOUR_NAMESPACE
The service account can be used in an example deployment like this:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nginx
namespace: acid
spec:
replicas: 1
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
serviceAccountName: operator #this is where your service account is specified
hostNetwork: true
FAQ¶
How do I...¶
- ... ensure that my application runs in multiple Availability Zones?
- The Kubernetes scheduler will automatically try to distribute pods across multiple “failure domains” (the Kubernetes term for AZs).
- ... use the AWS API from my application on Kubernetes?
- Create an IAM role via CloudFormation and assign it to your application pods. The AWS SDKs will automatically use the assigned IAM role. See AWS IAM integration for details.
- ... get OAuth access tokens in my application on Kubernetes?
- Your application can declare needed OAuth credentials (tokens and clients) via the PlatformCredentialsSet. See Zalando Platform IAM Integration for details.
- ... read the logs of my application?
- The most convenient way to read your application’s logs (stdout and stderr) is by filtering by the application label in the Scalyr UI. See Logging for details.
- ... get access to my Scalyr account?
- You can approach one of your colleagues who already has access to invite you to your Scalyr account.
- ... switch traffic gradually to a new application version?
- Traffic switching is currently implemented by scaling up the new deployment and scaling down the old version. This process is fully automated and cannot be controlled in the current CI/CD Jenkins pipeline. The future deployment infrastructure will probably support manual traffic switching.
- ... use different namespaces?
- We recommend using the “default” namespace, but you can create your own if you want to.
- ... quickly get write access to production clusters for 24x7 emergencies?
- We still need to set up the Emergency Operator workflow: the idea is to quickly give full access to production accounts and clusters in case of incidents. Eric’s idea is to require a real 24x7 INCIDENT ticket for getting access (this would ensure that it’s not misused for day-to-day work). Right now (2017-05-15) you can call STUPS 24x7 2nd level (via 1st level) to ask for emergency access.
- ... use a single Jenkins for both building (CI) and deployment (CD)? That would enable more sophisticated pipelines because no extra communication between CI and CD would be needed. CI needs a feedback loop from CD in order to perform joint activities.
- The Jenkins setup will be replaced by the Continuous Delivery Platform which performs both builds and deploys. See https://pages.github.bus.zalan.do/continuous-delivery/cdp-docs/ and watch out for announcements and Friday Demos.
- ... test deployment YAMLs from CLI?
$ zdeploy render-template deployment.yaml application=xxx version=xx | zkubectl create
- ... access a production service from my test cluster?
- Test clusters are not allowed to get production OAuth credentials, please use a staging service and sandbox OAuth credentials.
- ... decide when to place a declaration under the apply folder, and when at the root (it doesn’t seem to be standard)?
- The current Jenkins CI/CD pipeline relies on some Zalando conventions: every .yaml file in the apply folder is applied as a Kubernetes manifest or CloudFormation template. Some files need to be in the “root” folder as they are processed in a special way; these files are e.g. deployment.yaml, autoscaling.yaml and pipeline.yaml.
- ... use Helm together with Kubernetes on AWS?
- We don’t currently (May 2017) support it because it requires the installation of some components in the kube-system namespace. This namespace is reserved for core cluster components as defined in the Kubernetes on AWS configuration and is not accessible to users. Furthermore, the Zalando “compliance by default” requirements (delivering stacks over declarations in a Zalando git repo) would clash with Helm defaults.
Will the cluster scale up automatically and quickly in case of surprise need of more pods?¶
Cluster autoscaling is purely based on resource requests, i.e. as soon as the resource requests increase (e.g. because the number of pods goes up) the autoscaler will set a new DesiredCapacity of the ASG. The autoscaler is very simple and not based on deltas, but on absolute numbers, i.e. it will potentially scale up by many nodes at once (not one by one). See https://github.com/hjacobs/kube-aws-autoscaler#how-it-works
Admin’s Guide¶
How to create, update and operate Kubernetes clusters.
Running Kubernetes in Production¶
Tip
Start by watching our meetup talk “Kubernetes on AWS at Europe’s Leading Online Fashion Platform” on YouTube, to learn how we run Kubernetes on AWS in production. (slides)
This document should briefly describe our learnings in Zalando Tech while running Kubernetes on AWS in production. As we just recently started to migrate to Kubernetes, we consider ourselves far from being experts in the field. This document is shared in the hope that others in the community can benefit from our learnings.
Context¶
We are a team of infrastructure engineers provisioning Kubernetes clusters for our Zalando Tech delivery teams. We plan to have more than 30 production Kubernetes clusters. The following goals might help to understand the remainder of the document, our Kubernetes setup and our specific challenges:
- No manual operations: all cluster updates and operations need to be fully automated.
- No pet clusters: clusters should all look the same and not require any specific configurations/tweaking
- Reliability: the infrastructure should be rock-solid for our delivery teams to entrust our clusters with their most critical applications
- Autoscaling: clusters should automatically adapt to deployed workloads and hourly scaling events are expected
- Seamless migration: Dockerized twelve-factor apps currently deployed on AWS/STUPS should work without modifications on Kubernetes
Cluster Provisioning¶
There are many tools out there to provision Kubernetes clusters. We chose to adapt kube-aws as it matches our current way of working on AWS: immutable nodes configured via cloud-init and CloudFormation for declarative infrastructure. CoreOS’ Container Linux perfectly matches our understanding of the node OS: only provide what is needed to run containers, not more.
Only one Kubernetes cluster is created per AWS account. We create separated AWS accounts/clusters for production and test environments.
We always create two AWS Auto Scaling Groups (ASGs, “node pools”) right now:
- One master ASG with always two nodes which run the API server and controller-manager
- One worker ASG with 2 to N nodes to run application pods
Both ASGs span multiple Availability Zones (AZ). The API server is exposed with TLS via a “classic” TCP/SSL Elastic Load Balancer (ELB).
We use a custom built Cluster Registry REST service to manage our Kubernetes clusters. Another component (Cluster Lifecycle Manager, CLM) is regularly polling the Cluster Registry and updating clusters to the desired state. The desired state is expressed with CloudFormation and Kubernetes manifests stored in git.
Different clusters can use different channel configurations, i.e. some non-critical clusters might use the “alpha” channel with latest features while others rely on the “stable” channel. The channel concept is similar to how CoreOS manages releases of Container Linux.
Clusters are automatically updated as soon as changes are merged into the respective branch. Configuration changes are first tested in a separate feature branch, afterwards the pull request to the “dev” branch (channel) is automatically tested end-to-end (this includes the official Kubernetes conformance tests).
AWS Integration¶
We provision clusters on AWS and therefore want to integrate with AWS services where possible. The kube2iam daemon conveniently allows assigning an AWS IAM role to a pod by adding an annotation. Our infrastructure components such as the autoscaler use the same mechanism to access the AWS API with special (restricted) IAM roles.
Ingress¶
There is no official way of implementing Ingress on AWS. We decided to create a new component Kube AWS Ingress Controller to achieve our goals:
- SSL termination by the ALB: convenient usage of ACM (free Amazon CA) and of certificates uploaded to AWS IAM
- Using the “new” ELBv2 Application Load Balancer
We use Skipper as our HTTP proxy to route based on Host header and path. Skipper runs as a DaemonSet on all worker nodes for convenient AWS ASG integration (new nodes are automatically registered in the ALB’s Target Group).
Skipper comes with a Kubernetes data client to automatically update its routes periodically.
External DNS is automatically configuring the Ingress hosts as DNS records in Route53 for us.
Resources¶
Understanding the Kubernetes resource requests and limits is crucial.
Default resource requests and limits can be configured via the LimitRange resource. This can prevent “stupid” incidents like JVM deployments without any settings (no memory limit and no JVM heap set) eating all the node’s memory. We currently use the following default limits:
$ kubectl describe limits
Name: limits
Namespace: default
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- ---- --------------- ------------- -----------------------
Container cpu - 16 100m 3 -
Container memory - 64Gi 100Mi 1Gi -
The default limit for CPU is 3 cores as we discovered that this is a sweet spot for JVM apps to startup quickly. See our LimitRange YAML manifest for details.
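A LimitRange manifest producing defaults similar to the ones above could look like this sketch (the values reflect the table above, not necessarily our exact manifest):
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
  - type: Container
    max:
      cpu: "16"
      memory: 64Gi
    defaultRequest:   # applied when a container specifies no requests
      cpu: 100m
      memory: 100Mi
    default:          # applied when a container specifies no limits
      cpu: "3"
      memory: 1Gi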
We provide a tiny script and use the Downward API to conveniently run JVM applications on Kubernetes without the need to manually set the maximum heap size. The container spec of a Deployment for some JVM app would look like this:
# ...
env:
# set the maximum available memory as JVM would assume host/node capacity otherwise
# this is evaluated by java-dynamic-memory-opts in the Zalando OpenJDK base image
# see https://github.com/zalando/docker-openjdk
- name: MEM_TOTAL_KB
valueFrom:
resourceFieldRef:
resource: limits.memory
divisor: 1Ki
resources:
limits:
memory: 1Gi
The kubelet can be instructed to reserve a certain amount of resources for the system and for Kubernetes components (the kubelet itself, Docker, etc.). Reserved resources are subtracted from the node’s allocatable resources. This improves scheduling and makes resource allocation/usage more transparent. Node allocatable resources, or rather reserved resources, are also visible in Kubernetes Operational View.
Graceful Pod Termination¶
By default, Kubernetes will cause service disruptions on pod terminations, as applications and configuration need to be prepared for graceful shutdown.
By default, pods receive the TERM signal and kube-proxy reconfigures the iptables rules to stop traffic to the pod. The pod will be killed 30s later by a KILL signal if it did not terminate by itself before. Kubernetes expects the container to handle the TERM signal and to wait at least some seconds for kube-proxy to change the iptables rules.
Note that the readinessProbe behavior does not matter after having received the TERM signal.
There are two cases leading to failing requests:
- The pod’s container terminates immediately when receiving the TERM signal, thus not giving kube-proxy enough time to remove the forwarding rule
- Keep-alive connections are not handed over by Kubernetes, i.e. requests from clients with a keep-alive connection will still be routed to the pod
Keep-alive connections are the default when using connection pools. This means that nearly all client connections between microservices are affected by pod terminations.
Kubernetes’ default behavior is a blocker for seamless migration from our AWS/STUPS infrastructure to Kubernetes. In STUPS, single Docker containers run directly on EC2 instances. Graceful container termination is not needed as AWS automatically deregisters EC2 instances and drains connections from the ELB on instance termination. We therefore consider solving the graceful pod termination issue in Kubernetes on the infrastructure level. This would not require any application code changes by our users (application developers).
For further reading on the topic, you can find a blog post about graceful shutdown of node.js on Kubernetes and a small test app to see the pod termination behavior.
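As an interim, per-application mitigation, a preStop hook that delays shutdown for a few seconds gives kube-proxy time to remove the forwarding rules before the process receives TERM; a minimal sketch (assuming the image provides a sleep binary):
containers:
- name: mycontainer
  image: myimage
  lifecycle:
    preStop:
      exec:
        # sleep a bit so kube-proxy can update iptables before TERM is sent
        command: ["sleep", "10"]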
Autoscaling¶
Pod Autoscaling¶
We are using the HorizontalPodAutoscaler resource to scale the number of deployment replicas. Pod autoscaling requires implementing graceful pod termination (see above) to downscale safely in all circumstances. So far we have only used CPU-based pod autoscaling.
Node Autoscaling¶
Our experimental AWS Autoscaler is an attempt to implement simple and elastic autoscaling with AWS Auto Scaling Groups.
Graceful node shutdown is required to allow safe downscaling at any time. We simply added a small systemd unit to run kubectl drain on shutdown.
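The drain invoked by such a unit would roughly be the following command (flags are what we would typically use, assuming the node name equals the hostname; adapt to your setup):
# cordon the node and evict all pods, ignoring DaemonSet-managed ones
kubectl drain "$(hostname)" --force --ignore-daemonsets --delete-local-data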
Upscaling or node replacement poses the risk of race conditions between application pods and required system pods (DaemonSet). We have not yet figured out a good way of postponing application scheduling until the node is fully ready. The kubelet’s Ready condition is not enough as it does not ensure that all system pods such as kube-proxy and kube2iam are running. One idea is using taints during node initialization to prevent application pods from being scheduled until the node is fully ready.
Monitoring¶
We use our Open Source ZMON monitoring platform to monitor all Kubernetes clusters.
ZMON agent and workers are part of every Kubernetes cluster deployment. The agent automatically pushes both AWS and Kubernetes entities to the global ZMON data service.
The Prometheus Node Exporter is deployed on every Kubernetes node (as a DaemonSet) to expose system metrics such as disk space, memory and CPU to ZMON workers.
Another component kube-state-metrics is deployed in every cluster to expose cluster-level metrics such as number of waiting pods. ZMON workers also have access to the internal Kubernetes API server endpoint to build more complex checks. AWS resources can be monitored by using ZMON’s CloudWatch wrapper.
We defined global ZMON checks for cluster health, e.g.:
- Number of ready and unschedulable nodes (collected via API server)
- Disk, memory and CPU usage per node (collected via Prometheus Node Exporter and/or CloudWatch)
- Number of endpoints per Kubernetes service (collected via API server)
- API server requests and latency (collected via API server metrics endpoint)
We use Kubernetes Operational View for ad-hoc insights and troubleshooting.
Jobs¶
We use the very convenient Kubernetes CronJob resource for various tasks such as updating all our SSH bastion hosts every week.
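A CronJob of that kind could be sketched as follows (the API version and image are illustrative for the Kubernetes versions we ran at the time; the image and arguments are hypothetical):
apiVersion: batch/v2alpha1   # CronJob was still alpha/beta in the versions we ran
kind: CronJob
metadata:
  name: update-bastion-hosts
spec:
  schedule: "0 3 * * 1"      # every Monday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: update
            image: myteam/update-bastion-hosts:latest  # hypothetical image
            args: ["--update-all"]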
Kubernetes jobs are not cleaned up by default and completed pods are never deleted. Running jobs frequently (like every few minutes) quickly thrashes the Kubernetes API server with unnecessary pod resources. We observed a significant slowdown of the API server with an increasing number of completed jobs/pods hanging around. To mitigate this, a small kube-job-cleaner script runs as a CronJob every hour and cleans up completed jobs/pods.
Security¶
We authorize access to the API server via a proprietary webhook which verifies the OAuth Bearer access token and looks up the user’s roles via another small REST service (historically backed by LDAP).
Access to etcd should be restricted as it holds all of Kubernetes’ cluster data thus allowing tampering when accessed directly.
We use flannel as our overlay network which requires etcd by default to configure its network ranges. There is experimental support for the flannel backend to be switched to the Kubernetes API server. This allows restricting etcd access to the master nodes.
Kubernetes allows defining PodSecurityPolicy resources to restrict the use of “privileged” containers and similar features which allow privilege escalation.
Docker¶
Docker is often beautiful and sometimes painful, especially when trying to run containers reliably in production. We encountered various issues with Docker, none of which are really Kubernetes related, e.g.:
- Docker 1.11 to 1.12.5 included an evil bug where the Docker daemon becomes unresponsive (docker ps hangs). We hit this problem every week on at least one of our Kubernetes nodes. Our workaround was upgrading to Docker 1.13 RC2 (we have since moved back to 1.12.6 as the fix was backported).
- We saw some processes getting stuck in “pipe wait” while writing to STDOUT when using the default Docker json logger (the root cause was not identified yet).
- There seem to be a lot more race conditions in Docker and you can find many “Docker daemon hangs” issues reported; we already expect to hit them once in a while.
- Upgrading Docker clients to 1.13 broke pulls from our Pier One registry (pulls from gcr.io were broken too). We implemented a quick workaround in Pier One until Docker fixed the issue upstream.
- A thread on Twitter suggested adding the --iptables=false flag for Docker 1.13. We spent some time until we found out that this is a bad idea: NAT for the Flannel overlay network breaks when adding --iptables=false.
We learned that Docker can be quite painful to run in production because of the many tiny bugs (race conditions). You can be sure to hit some of them when running enough nodes 24x7. Also, better not to touch your Docker version once you have a running setup.
etcd¶
Kubernetes relies on etcd for storing the state of the whole cluster. Losing etcd consensus makes the Kubernetes API server essentially read only, i.e. no changes can be performed in the cluster. Losing etcd data requires rebuilding the whole cluster state and would probably cause a major downtime. Luckily all data can be restored as long as at least one etcd node is alive.
Knowing the criticality of the etcd cluster, we decided to use our existing, production-grade STUPS etcd cluster running on EC2 instances separate from Kubernetes. The STUPS etcd cluster registers all etcd nodes in Route53 DNS and we use etcd’s DNS discovery feature to connect Kubernetes to the etcd nodes. The STUPS etcd cluster is deployed across availability zones (AZ) with five nodes in total. All etcd nodes run our own STUPS Taupage AMI, which (similar to CoreOS) runs a Docker image specified via AWS user data (cloud-init).
Public Presentations¶
- Large Scale Kubernetes on AWS at Europe’s Leading Online Fashion Platform - Docker Hamburg Meetup
- PostgreSQL on Kubernetes - Docker Hamburg Meetup
- From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
- Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
- Kubernetes at Zalando - CNCF End User Committee Presentation
- Kubernetes on AWS at Europe’s Leading Online Fashion Platform
- Running Kubernetes in Production on AWS
- “Kubernetes on AWS at Europe’s Leading Online Fashion Platform” on YouTube
Developers’ Guide¶
Developers’ guide for the Kubernetes on AWS project.
Contents:
Repositories¶
Relevant Repositories¶
The following OSS repositories are relevant for the project; their code, issues and information must be checked regularly:
- Kubernetes on AWS (GH)
- https://github.com/zalando-incubator/kubernetes-on-aws
- STUPS etcd
- https://github.com/zalando/stups-etcd-cluster
- Zalando Kubectl Wrapper
- https://github.com/zalando-incubator/zalando-kubectl
- External DNS
- https://github.com/kubernetes-incubator/external-dns
- Kubernetes AWS Cluster Autoscaler
- https://github.com/hjacobs/kube-aws-autoscaler
- Kubernetes Job Cleaner
- https://github.com/hjacobs/kube-job-cleaner
- Kubernetes Operational View
- https://github.com/hjacobs/kube-ops-view
The responsible maintainers are indicated in the respective MAINTAINERS file in each repository.
Ingress¶
These repositories are relevant for supporting Kubernetes Ingress resources on AWS:
- Skipper with Kubernetes data client
- https://github.com/zalando/skipper
- Ingress AWS ELB Controller
- https://github.com/zalando-incubator/kube-ingress-aws-controller
ADR-001: Store cluster versions in Cluster Registry¶
Context¶
The Cluster Lifecycle Manager (CLM) can pull configuration from two separate sources: the Cluster Registry and a channel source (a git repository).
The CLM must be able to go away and later come back to continue where it left off. Therefore it must store the current cluster configuration state somewhere. The cluster configuration state is defined by a version of the channel source and the configuration currently stored in a Cluster Registry.
The configuration state should be in a format that is not tied to a specific implementation of the CLM; this way the CLM can support multiple ways of provisioning clusters.
The CLM should be able to support multiple configuration sources and be able to provision clusters in multiple ways. Therefore the configuration format should not be tied to a specific implementation of configuration sources or provisioning method.
Decision¶
CLM will store the current configuration state under the status field of the Cluster resource in the Cluster Registry. The configuration state will be stored as three versions:
- next_version
- This indicates that the cluster is being updated to this version next; this is mostly used for debugging purposes.
- current_version
- This is the current version the cluster has.
- last_version
- This is the last working version. The last version is also used for rolling back a cluster in case the new version is broken.
Each version is a string defined as the channel version (git commit sha1) concatenated with a separator character “#” and the sha1 hash of the current cluster config (excluding the status field) from the Cluster Registry.
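For illustration, the status field of a Cluster resource might then look roughly like this (the hash values are placeholders):

status:
  # format: <channel git commit sha1>#<sha1 of the cluster config without status>
  current_version: "0123456789abcdef0123456789abcdef01234567#fedcba9876543210fedcba9876543210fedcba98"
  last_version: "89abcdef0123456789abcdef0123456789abcdef#76543210fedcba9876543210fedcba9876543210"
  next_version: ""             # only set while an update to a new version is in progress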
We decided to encode the version as a simple string because:
- Splitting into multiple fields or properties would push the CLM implementation detail unnecessarily to the Cluster Registry schema (as we might think of different implementations with more version parts, e.g. a Kops provisioner relying on a certain Kops version)
- The concrete version string format only needs to be known in one place (the CLM provisioner implementation), i.e. the string can be opaque to all other systems
- A simple string field is easily read and “parsed” by humans for debugging
- KISS
Status¶
Accepted.
Consequences¶
- CLM will be responsible for deriving the cluster config hash based on the cluster config from the Cluster Registry.
- CLM will be responsible for concatenating/comparing version strings. The Cluster Registry will for instance not be aware of the format of the versions which are stored there.
- CLM can have several provisioner implementations which each can define its own versioning format without requiring changes in the Cluster Registry.
ADR-002: Installation of Kubernetes non core system components¶
Context¶
In cluster.py we used to install all the kube-system components using a systemd unit. This was basically a bash script that deployed all the manifests from /srv/kubernetes/manifests/*/*.yaml using kubectl.
We obviously do not want to update versions manually via kubectl. Furthermore, this approach also meant that we had to launch a new master instance in order to apply the updated manifests.
Decision¶
We will do the following:
- remove the “install-kube-system” unit entirely from the master user data.
- create a folder with all the manifests for each Kubernetes artifact.
- apply all the manifests from the Cluster Lifecycle Manager code
Some of the possible alternatives for the folder structures are:
- /manifests/APPLICATION_NAME/deployment.yaml - which uses a folder structure that includes the APPLICATION_NAME
- /manifests/APPLICATION_NAME/KIND/mate.yaml - which uses a folder structure that includes APPLICATION_NAME and KIND
- /manifests/mate-deployment.yaml - where we have a flat structure and the filenames contain the name of the application and the kind
- /manifests/mate.yaml - where mate.yaml contains all the artifacts of all kinds related to mate
We chose number 1 as it seems the most compelling alternative. Number 2 would only introduce an additional folder level that does not provide any benefit. Number 3 would instead rely on a naming convention for the given kind. Number 4 is a competitive alternative to number 1 and could be adopted, but we prefer number 1 as it is very flexible and probably more readable for the maintainer. For the file naming convention, we recommend splitting into separate files per kind when possible and putting the name (or just a prefix) in the file name. We will not make any assumption on the file naming scheme in the code. Also, no assumption will be made on the order of execution of such files.
Status¶
Accepted.
Consequences¶
The chosen file convention will be relevant when discussing the removal of components from kube-system.
This is currently out of scope for this ADR as this only covers the “apply” case.
ADR-003: Organize cluster versions in branches¶
Context¶
When managing multiple clusters with different SLOs there is a need for pinning different clusters to different channels of the cluster configuration. For instance a production cluster might require a more stable channel of the cluster configuration than a test or playground cluster where we want to try out new, not yet stable, features.
To be able to manage multiple channels for different clusters we need to define a process describing:
- What defines a channel.
- How to move patches/hotfixes between channels.
- How to promote an “unstable” channel to “stable”.
- How to try out experimental features.
Decision¶
Cluster configuration channels will map to git branches in the configuration repository. The branch layout is shown below.
PR (experimental-branch-1)-
\
PR (feature-2) ------------------> dev
/ \
PR (hotfix-3) ----------------------> alpha
\ \
\----------> beta
\
stable
dev is the default branch and is the main entrypoint for new feature PRs.
Every new feature should therefore start as a PR targeting dev and should flow to the other channels only from the dev channel. Critical hotfixes can go directly to the relevant channels.
Experimental features should be tested on a separate branch which is based on dev before they are merged into the dev branch.
Specifying the channel for a cluster is done by assigning a branch/channel name to the channel field of a cluster resource in the Cluster Registry.
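For illustration, a cluster entry in the Cluster Registry might carry the channel assignment roughly like this (the id and the other field names are made up; only the channel field is defined by this ADR):

id: "aws:123456789012:eu-central-1:example-cluster"   # made-up cluster id
channel: "alpha"                                      # maps to the alpha branch of the configuration repository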
- TBD: when is something considered ready to be promoted? (after X days automatically)?
- TBD: how is something promoted from dev to alpha (and further up)?
- TBD: what controls do we need when promoting (four eyes?)?
Status¶
Proposed.
Consequences¶
- The default branch of kubernetes-on-aws becomes dev.
- We need to protect the dev/alpha/beta/stable branches.
ADR-004: Roles and Service Accounts¶
Context¶
We need to define roles and service accounts to allow all our use cases. Our first concerns are to allow the following:
- Users should be able to deploy (manually in test clusters, via the deploy API in production clusters), but by default we do not want them to read secrets
- Admins should get full access to all resources, mostly for emergency access
- Applications should not get write access to the Kubernetes API by default
- It should be possible for some applications to write to the Kubernetes API.
Decision¶
We define the following Roles:
- ReadOnly: allowed to read every resource, but not secrets. “exec” and “proxy” and similar operations are not allowed. Allowed to do “port-forward” to special proxy, which will enable DB access.
- PowerUser: “restricted” Pod Security Policy with write access to all namespaces but kube-system, ReadOnly access to the kube-system namespace, “exec” and “proxy” allowed, read/write for secrets, no write access to DaemonSets. DB access through “port-forward” and a special proxy.
- Operator: “privileged” Pod Security Policy with write access to the own namespace and read and write access to third party resources in all namespaces.
- Controller: the Kubernetes controller-manager component is not allowed to “use” Pod Security Policies other than “restricted”, such that serviceAccount authorization is used to check the permission. It has full access to all other resources.
- Admin: full access to all resources
And the following <namespace, service account> pairs will get the listed roles, assigned by the webhook:
- “kube-system:default” - Admin
- “default:default” - ReadOnly
- “*:operator” - Operator
- kube-controller-manager - Controller
- kubelet - Admin
Applications that want write access to the Kubernetes API will have to use the “operator” service account.
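A minimal sketch of how an application could request the Operator role via its service account (the application name and image are made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-controller
  labels:
    application: my-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      application: my-controller
  template:
    metadata:
      labels:
        application: my-controller
    spec:
      serviceAccountName: operator       # "*:operator" is mapped to the Operator role by the webhook
      containers:
      - name: my-controller
        image: registry.example.org/my-controller:1.0   # made-up image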
Status¶
Accepted.
Consequences¶
This decision is a breaking change of what was previously defined for applications. Users that need applications with write access to the Kubernetes API will need to select the right service account. The controller-manager now has an identity and uses the secured kube-apiserver endpoint, such that it can be authorized by the webhook.
ADR-005: How to use channels¶
Context¶
In ADR-003 we defined some initial characteristics of clusters, how they map to git branches, and the definition of the dev branch/channel.
We still needed to define how many branches we use and how to promote from one to the other. In this ADR we answer those remaining questions.
Decision¶
We decided to use the following branches/channels:
- dev
- This is the default branch for the project. By default, all new features and bug fixes that are not considered hotfixes will target this channel.
- alpha
- This is the branch immediately after dev.
- stable
- This is the most stable branch.
The following diagram shows how the process of working with the channels works:
Every branch with a pull request open will trigger end to end testing as soon as the pull request is labelled as “ready-to-test”. While discussing the end to end (e2e) strategy is out of scope for this ADR, we define here the following requirements:
- The e2e testing infrastructure will create a new cluster
- All the e2e tests will run on the aforementioned cluster
- The e2e tests will report the status to the PR
- The cluster will be deleted as soon as the tests finish
We decided against polluting the Cluster Registry, which means that the testing infrastructure will create the cluster locally using the Cluster Lifecycle Manager (CLM) functionality and not by creating entries in the Cluster Registry.
Once the PR is approved and merged into the dev branch/channel, all the clusters using that channel will be updated. The list includes, as a minimum and by design, the following clusters:
- Infrastructure cluster to test cluster setup
The cluster above will be tested with smoke tests that could include the end to end tests executed on the branch. Testing on this cluster has the following goals:
- Testing the update on an updated cluster and not on a fresh cluster as this might show some different behavior.
- Testing the impact of the update on already running applications. This requires that the cluster being updated has running applications covering different Kubernetes features.
In addition to the tests, ZMON metrics will be monitored closely.
If nothing wrong is seen, after X hours an automatic merge into the alpha branch will be executed. This will trigger updates to all the clusters running the alpha branch. This includes, as a minimum, the following clusters:
- Infrastructure prod cluster
- Infrastructure test cluster
- playground
A transition to the stable channel will be started by the automatic creation of a PR after Y days. Unlike the previous step, this PR will not be automatically merged, but will require additional human approval.
Hotfixes¶
Hotfixes can be made via PRs to any of the channels. This increases the speed at which we can fix important issues in all the channels. A hotfix PR will still need to pass the e2e tests in order to be merged.
Status¶
Accepted.
Consequences¶
The following are important consequences:
- No manual merges to the alpha or stable channels are possible for changes that are not hotfixes.
- With this ADR we rely heavily on our e2e testing infrastructure to guarantee the stability/quality of our PRs.
- Clusters might be assigned to different channels depending on different requirements like SLOs.