Are failing Jobs consuming resources on Kubernetes clusters?

[Image: lemmings off the cliff - Uh oh]

Introduction

It is sometimes useful to assign tasks to containers that run them to completion. Unlike server processes, these containers are expected to terminate. Typical use cases are:

  • Batch processing
  • Maintenance operations
  • Data processing

In OpenShift / Kubernetes a resource type called Job exists for this purpose. Unlike a bare-bones Pod, a Job is monitored by a controller that keeps retrying execution of its Pods until a specified number of them terminate successfully.

When the underlying mechanisms are not well understood, the behavior of Jobs in failure scenarios and their impact on the cluster and its resources can be surprising. I got this simple question from one of the customers I work with: are my failing Jobs consuming resources on my OpenShift / Kubernetes cluster?
As often in IT, the simple answer to a simple question is “it depends”. I realized that the pieces of information needed to give a more complete (and useful!) answer are available but scattered across multiple places. This post is an attempt to bring them all together.

Separation of concerns

What is commonly understood as a Job failure (which in many cases is really the failure of a container or a Pod managed by the Job) can be handled at two levels.

  • The Kubelet: the Kubernetes component that manages the Pod life cycle. It runs on every node and is responsible only for the Pods on that node.
  • The Job controller: resources are managed in Kubernetes by controllers; the Job controller is a central process that manages all Jobs for the whole cluster.

Now that we have set the scene let’s take concrete failure scenarios and look at what happens at the Node and cluster level.

CrashLoopBackOff

Our first failure scenario is when Pods get stuck in a back-off loop such as CrashLoopBackOff or ImagePullBackOff. This is not unusual, as it can happen under multiple circumstances.

  • The process inside the container terminates, possibly crashing with a non-zero (failure) exit code.
  • Fields of the Pod definition have been configured incorrectly, the image reference being a frequent culprit.
  • Faulty configuration in Kubernetes. The list can be long, as it can be anything from an unreachable registry or API to issues with mounting volumes.

Let’s create a simple Job referencing an image that does not exist and see what happens.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  selector: {}
  template:
    metadata:
      name: test
    spec:
      containers:
        - name: test
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
          image: xxxx # intentionally invalid image reference
          command:
            - /bin/bash
          args: 
            - "-c"
            - "sleep 30; exit 0;"
      restartPolicy: Never

Let’s check the status (one way to do it on the command line is shown after this list).

  • At the Pod level we see that the events are piling up: Pulling image “xxxx” followed by Error: ImagePullBackOff.
  • At the Job level everything seems normal: there is 1 active Pod and no failure event reported.
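
As a minimal sketch, this is how the status can be inspected on the command line (assuming the Job above was created in the current namespace; oc can be replaced with kubectl):

$ oc get job test                  # reports 1 active Pod and no failure
$ oc get pods -l job-name=test     # lists the Pod created by the Job controller
$ oc describe pod -l job-name=test # the Events section shows the image pull errors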

What is happening is the following:

  • The Job controller requested a Pod to be created to align the cluster status to what was requested by the Job creation
  • The Scheduler assigned the Pod to a Node
  • The Kubelet on the Node tried to create the containers for the Pod and failed to pull the image
  • The Kubelet continues to retry creating the containers

One may think that “restartPolicy: Never” would avoid this scenario, but this is not the case. It would only apply if the container process itself exited with a failure (non-zero exit code). We will look at that later.

What does it mean for resources? At the node level, failed containers are restarted by the Kubelet with an exponential back-off delay (10s, 20s, 40s…) capped at 5 minutes. In this scenario it means that CPU, memory and storage consumption is close to zero, as the container cannot even be created.

At the cluster level it looks a bit different. When the Scheduler allocated the Pod to the Node, the requested resources (cpu: 100m and memory: 200Mi) were reserved for the Pod. In a scenario with automated creation you may end up with thousands of such Pods reserving resources, so that at some point all the cluster resources are exhausted and new Pods requesting resources cannot be scheduled anymore.
A mechanism for protecting the cluster against this is to define a quota at the namespace level. When the quota has been reached, Pods requesting resources inside that namespace are no longer admitted, but Pods in other namespaces can still be created.
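
As a sketch, such a quota could look like the following (the name, namespace and values are arbitrary and would need to be adapted):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: test-namespace
spec:
  hard:
    requests.cpu: "2"     # total CPU that Pods in the namespace may request
    requests.memory: 4Gi  # total memory that Pods in the namespace may request
    pods: "20"            # maximum number of Pods in the namespace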

Application failure

Let’s now make a very small change to our scenario. We will be referencing a valid image but the container will exit with a failure code.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  selector: {}
  template:
    metadata:
      name: test
    spec:
      containers:
        - name: test
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
          image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          command:
            - /bin/bash
          args: 
            - "-c"
            - "sleep 30; exit 1;"
      restartPolicy: Never

Let’s check the status again

  • At the Pod level we see that the status is now Failed and a new Pod gets created after the failure of the previous one.
  • At the Job level the failed Pods are now reported and after some time the Job is marked as Failed.

One thing that may be confusing is that we specified restartPolicy: Never. It is important to understand that the restart policy in the Job specification only applies to the Pods, not to the Job controller. If we had specified restartPolicy: OnFailure we would have the same behavior as in the preceding scenario: the Kubelet would restart the container and the controller would not see the failure, the only difference being that the restart count is incremented at the Pod level (not at the Job level).

With Never as restartPolicy the Pod fails and is reported as such by the Kubelet. The Job controller then reacts by starting a new Pod, which may or may not be scheduled on the same Node. The number of times a new Pod is started is controlled by two parameters: backoffLimit (default value 6) and activeDeadlineSeconds (unlimited by default). In our example we had not specified these parameters, the Pod got restarted 6 times and we can see 7 failed Pods at the end.
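
For illustration, here is a sketch of the same Job with both parameters set (the values are arbitrary and only show where the fields go):

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  backoffLimit: 2             # mark the Job as failed after 2 retries
  activeDeadlineSeconds: 300  # give up after 5 minutes, whatever the number of retries
  template:
    spec:
      containers:
        - name: test
          image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          command:
            - /bin/bash
          args:
            - "-c"
            - "sleep 30; exit 1;"
      restartPolicy: Never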

In terms of resources there is the regular consumption while a Pod is running. When Pods have terminated they don’t actively consume CPU or RAM anymore, but they still use storage space on the Nodes for their logs. The resources for the Pod are not continuously reserved either: they are reserved when a Pod is scheduled and released after it has terminated… until the next Pod gets started. Scheduling however goes through the Scheduler work queue: if no Node with enough resources is available, the new Pod will have to wait.

Here it is important to have a limit, hence the default value for backoffLimit. Without it you can quickly end up with thousands of Pods that consume:

  • storage space on the nodes for the logs
  • etcd space for the object definitions
  • storage space in central logging
  • memory and storage space in Prometheus for metrics, as Pod names form an unbounded set

Let’s consider the impact of automation, CronJobs or Jobs created by Continuous Integration (CI) tools, combined with this scenario. I have seen clusters with more than a hundred thousand Jobs, where:

  • there were Out Of Memory (OOM) events for Prometheus
  • new Jobs were slow to start (taking 10 to 20 minutes)
  • the API was slower than usual

An interesting point is that the cluster was still operational. It was possible to create Pods, but creating the same Pods as part of a Job took a long time (without failing). It was really the creation of the Pods by the Job controller, not the allocation (scheduling) of the Pods to Nodes by the Scheduler, that was slow. The reason was the huge work queue of the Job controller. Note that the Job controller may also throttle new Pod creation after excessive Pod failures in the same Job, so that Pods for other Jobs get a chance to be created.

Note that Jobs don’t get automatically deleted after completion. It is a general good practice to remove them when they are no longer needed, as they increase the number of objects the API and etcd have to deal with. In the case of CronJobs this can (and should) be done by configuring successfulJobsHistoryLimit and failedJobsHistoryLimit.
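
A minimal sketch of a CronJob using these fields (the schedule, image and limits are placeholders to be adapted) could look like this:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-cron
spec:
  schedule: "*/10 * * * *"
  successfulJobsHistoryLimit: 3  # keep at most 3 completed Jobs
  failedJobsHistoryLimit: 1      # keep at most 1 failed Job
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
            - name: test
              image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
              command:
                - /bin/bash
              args:
                - "-c"
                - "sleep 30; exit 0;"
          restartPolicy: Never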

Conclusion

To conclude, I will briefly mention two other scenarios.

When the service account specified in the Job does not exist, the Job controller does not create a Pod and keeps retrying indefinitely, irrespective of the configured backoffLimit.

When a PersistentVolumeClaim (PVC) that does not exist is specified, the Pod gets created but scheduling fails. The Pod stays in “Pending” state until it is deleted or the PVC is created.

I hope this blog post provided a good overview of the components involved during Job failures, the caveats, the protection mechanisms and, as the title suggests, the impact of Job failures on cluster resources and their availability.

Thanks for reading!

Creating seccomp profiles on Fedora Silverblue

After watching this video on the current state of container security I wanted to create seccomp profiles specific to my applications.

As I usually experiment in a virtual machine (or container) when it requires significant changes at the system level, I thought it was a good time to test Fedora 32 Silverblue with a real-life exercise.

“Fedora Silverblue is an immutable desktop operating system. It aims to be extremely stable and reliable. It also aims to be an excellent platform for developers and for those using container-focused workflows.”

After installing the VM the first step was to create a toolbox:

$ toolbox create

and then to use it 🙂

$ toolbox enter

I just followed the clear instructions provided in this blog post to install the development environment and to build the OCI hook that will record the system calls generated by my container and create a matching seccomp profile.

When I first tried to run my container with podman inside the toolbox it failed. A toolbox runs as an unprivileged container, and this issue seems to prevent podman from running inside an unprivileged container. I could surely have run the toolbox as a privileged container by using the podman command directly instead of toolbox, but I wanted to stay with the Silverblue approach.

To work around this I installed the OCI hook in directories inside my home directory, which made them available on my Silverblue host. It was no more complicated than setting two environment variables:

$ export HOOK_DIR=/var/home/fgiloux/oci/hooks/config/
$ export HOOK_BIN_DIR=/var/home/fgiloux/oci/hooks/bin
$ make install

The default directories on the host for OCI hooks are /usr/libexec/oci/hooks.d and /usr/share/containers/oci/hooks.d/. Everything under /usr is read-only in Silverblue. This was fine with me and I just kept the OCI hook in my home directory. I then ran my container with the appropriate parameters to generate the seccomp profile:

$ sudo podman run --name fakeapp --hooks-dir=/var/home/fgiloux/oci/hooks/config/ --annotation io.containers.trace-syscall=of:/var/home/fgiloux/rest-json-fakeapp/seccomp.json quay.io/fgiloux/rest-json-fakeapp:1.0-SNAPSHOT / > /dev/null

I made a call to my application (in reality you would run proper soak tests) and checked that everything was fine:

$ sudo podman exec fakeapp curl -s http://localhost:8080/fruits

Then I just had to harvest my seccomp profile. For my REST app built with Java / Quarkus, the number of system call types went down from around 300 to 85. That’s quite substantial!

Running my container with my new profile is easy:

$ podman run --name fakeapp --security-opt seccomp=/var/home/fgiloux/rest-json-fakeapp/seccomp.json quay.io/fgiloux/rest-json-fakeapp:1.0-SNAPSHOT / > /dev/null

But what about Kubernetes and OpenShift?

The use of seccomp profiles in pods can be controlled via annotations on the PodSecurityPolicy, as documented here. It is possible to specify a profile stored on the host for the pod and for individual containers by setting annotations:

seccomp.security.alpha.kubernetes.io/pod: "localhost/profile.json"
container.seccomp.security.alpha.kubernetes.io/<container_name>: "localhost/profile.json"
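
As a sketch, on a Pod this would look like the following (assuming profile.json has been copied into the Kubelet seccomp directory on the node, and reusing the image from the podman examples above):

apiVersion: v1
kind: Pod
metadata:
  name: fakeapp
  annotations:
    # profile applied to the whole pod
    seccomp.security.alpha.kubernetes.io/pod: "localhost/profile.json"
    # profile applied to a single container named fakeapp
    container.seccomp.security.alpha.kubernetes.io/fakeapp: "localhost/profile.json"
spec:
  containers:
    - name: fakeapp
      image: quay.io/fgiloux/rest-json-fakeapp:1.0-SNAPSHOT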

What I would really like is an easy way to provision my application-specific seccomp profile. Seccomp is however an alpha feature in Kubernetes, although work is being done to make it GA, and this is not possible today. I have good hope that this will be made possible through a ConfigMap or another means in the near future, as it seems to be one of the aims of the seccomp-operator.

For people interested in the subject I would also recommend the reading of Seccomp in Kubernetes — Part I: 7 things you should know before you even start!

To wrap up my small Fedora 32 Silverblue experiment: I really liked it, although I only scratched the surface. One thing that is clear to me is that what I have described in this blog post is exploration. Once I am settled on a change I am unlikely to just manually run commands inside a container/toolbox to configure my environment. I have read that people have been leveraging Ansible to get a reproducible / amendable path. I may use this approach, or simple batch scripts with buildah and possibly podman-compose if I happen to keep many toolboxes. The bottom line is: no “dnf update”, but rerunning image creation scripts pointing to newer package versions whenever required.

For now I will keep my Silverblue VM around, try to leverage it and see whether it becomes my default environment soon!

Automating tests and metrics gathering for Kubernetes and OpenShift (part 3)

This is the third of a series of three articles based on a session I held at Red Hat Tech Exchange EMEA 2018. They were first published on developers.redhat.com. In the first article, I presented the rationale and approach for leveraging Red Hat OpenShift or Kubernetes for automated performance testing, and I gave an overview of the setup. In the second article, we looked at building an observability stack. In this third part, we will see how the execution of the performance tests can be automated and related metrics gathered.

An example of what is described in this article is available in my GitHub repository.

Continue reading

Building an observability stack for automated performance tests on Kubernetes and OpenShift (part 2)

This is the second of a series of three articles based on a session I held at Red Hat Tech Exchange 2018 in EMEA. They were first published on developers.redhat.com. In the first article, I presented the rationale and approach for leveraging Red Hat OpenShift or Kubernetes for automated performance testing, and I gave an overview of the setup.

In this article, we will look at building an observability stack. In production, the observability stack can help verify that the system is working correctly and performing well. It can also be leveraged during performance tests to provide insight into how the application performs under load.

An example of what is described in this article is available in my GitHub repository.

Continue reading

Leveraging Kubernetes and OpenShift for automated performance tests (part 1)

This is the first article in a series of three articles based on a session I held at Red Hat Tech Exchange EMEA 2018. They were first published on developers.redhat.com. In this first article, I present the rationale and approach for leveraging Red Hat OpenShift or Kubernetes for automated performance testing, give an overview of the setup, and discuss points that are worth considering when executing and analyzing performance tests. I will also say a few words about performance tuning.

In the second article, we will look at building an observability stack, which, beyond the support it provides in production, can be leveraged during performance tests. Open source projects like Prometheus, Jaeger, Elasticsearch, and Grafana will be used for that purpose. The third article will present the details for building an environment for performance testing and automating the execution with JMeter and Jenkins.

Continue reading

Structured application logs in OpenShift

Logs are like gold dust. Taken alone they may not be worth much, but put together and worked by a skillful goldsmith they may become very valuable. OpenShift comes with the EFK stack: Elasticsearch, Fluentd, and Kibana. Applications running on OpenShift get their logs automatically aggregated to provide valuable information on their state and health during tests and in production.

The only requirement is that the application sends its logs to the standard output. OpenShift does the rest. Simple enough!

In this blog I am covering a few points that may help you with bringing your logs from raw material to a more valuable product.

Continue reading

Container Images for OpenShift – Part 4: Cloud readiness

This is a transcript of a session I gave at EMEA Red Hat Tech Exchange 2017, a gathering of all Red Hat solution architects and consultants across EMEA. It is about considerations and good practices when creating images that will run on OpenShift. This fourth and last part focuses on the specific aspects of cloud-ready applications and the consequences concerning the design of the container images. Continue reading

Container Images for OpenShift – Part 3: Making your images consumable

This is a transcript of a session I gave at EMEA Red Hat Tech Exchange 2017, a gathering of all Red Hat solution architects and consultants across EMEA. It is about considerations and good practices when creating images that will run on OpenShift. This third part focuses on how you can make your images easier to consume by application developers or release managers. Continue reading

Container Images for OpenShift – Part 2: Structuring your images

This is a transcript of a session I gave at EMEA Red Hat Tech Exchange 2017, a gathering of all Red Hat solution architects and consultants across EMEA. It is about considerations and good practices when creating images that will run on OpenShift. This second part focuses on how you should structure images and group of images to achieve the objectives stated in part one. Continue reading

Container Images for OpenShift – Part 1: Objectives

This is a transcript of a session I gave at EMEA Red Hat Tech Exchange 2017, a gathering of all Red Hat solution architects and consultants across EMEA. It is about considerations and good practices when creating images that will run on OpenShift. The content is structured in a series of four posts:

  • Objectives
  • Structuring your images
  • Making your images consumable
  • Cloud readiness

Continue reading