Are failing Jobs consuming resources on Kubernetes clusters?

[Image: lemmings off the cliff - uh oh]

Introduction

It is sometimes useful to assign tasks to containers that run them to completion. Unlike server processes, these containers are expected to terminate. Typical use cases are:

  • Batch processing
  • Maintenance operations
  • Data processing

In OpenShift / Kubernetes, the Job resource type exists for this purpose. Unlike a bare Pod, a Job is monitored by a controller that keeps retrying execution of its Pods until a specified number of them terminate successfully.

When the underlying mechanisms are not well understood, the behavior of Jobs in failure scenarios and their impact on the cluster and its resources can be surprising. I got this simple question from one of the customers I work with: are my failing Jobs consuming resources on my OpenShift / Kubernetes cluster?
As often in IT, the simple answer to a simple question is “it depends”. I realized that the pieces of information needed to give a more complete (and useful!) answer are available but scattered across multiple places. This post is an attempt to bring them all together.

Separation of concerns

What is commonly understood as a Job failure (which, in some cases, is really the failure of a container or a Pod managed by the Job) can be handled at two levels.

  • The Kubelet: the part of the Kubernetes platform that manages the Pod life cycle. It runs on every node and is only responsible for the Pods on that node.
  • The Job controller: resources in Kubernetes are managed by controllers; the Job controller is a central process that manages all Jobs for the whole cluster.

Now that we have set the scene, let’s take concrete failure scenarios and look at what happens at the Node and at the cluster level.

CrashLoopBackOff

Our first failure scenario is when Pods get into CrashLoopBackOff. This is not unusual, as it can happen under multiple circumstances.

  • The process inside the container terminates, possibly crashing, with a non-zero (failure) exit code.
  • Fields of the Pod definition have been configured incorrectly, the image reference being a frequent culprit
  • Faulty configuration in Kubernetes. The list may be long, as it can be anything from an unreachable registry or API to issues with mounting volumes

Let’s create a simple Job referencing an image that does not exist and see what happens.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  selector: {}
  template:
    metadata:
      name: test
    spec:
      containers:
        - name: test
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
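          # intentionally invalid image reference: the pull will always fail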
          image: xxxx
          command:
            - /bin/bash
          args: 
            - "-c"
            - "sleep 30; exit 0;"
      restartPolicy: Never

Let’s check the status

  • At the Pod level we see that the events are piling up: Pulling image “xxxx” followed by Error: ImagePullBackOff (see the status sketch after this list).
  • At the Job level everything seems normal: there is 1 active Pod and no failure event reported.
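
For reference, the relevant part of the Pod status looks roughly like this (a sketch; the exact reason and message wording may vary with the cluster and container runtime):

status:
  phase: Pending
  containerStatuses:
    - name: test
      ready: false
      restartCount: 0
      state:
        waiting:
          reason: ImagePullBackOff
          message: Back-off pulling image "xxxx"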

What is happening is the following:

  • The Job controller requested the creation of a Pod to align the cluster state with what the Job specifies
  • The Scheduler assigned the Pod to a Node
  • The Kubelet on the Node tried to create the containers for the Pod and failed to pull the image
  • The Kubelet continues to retry creating the containers

One may think that “restartPolicy: Never” would avoid this scenario, but this is not the case. It only applies when the container process fails (non-zero exit code). We will look at that later.

What does it mean for resources? At the node level, failed containers are restarted by the Kubelet with an exponential back-off delay (10s, 20s, 40s…) capped at 5 minutes. In this scenario it means that CPU, memory and storage consumption is close to zero, as the container cannot even be created.

At the cluster level it looks a bit different. When the Scheduler assigned the Pod to a Node, the requested resources (cpu: 100m and memory: 200Mi) were reserved for it. In a scenario with automated creation you may end up with thousands of such Pods reserving resources, so that at some point all the cluster resources may be exhausted and new Pods requesting resources can no longer be scheduled.
A mechanism for protecting the cluster against this is to define a quota at the namespace level. When the quota has been reached, Pods requesting resources in that namespace are no longer admitted, but Pods in other namespaces can still be created.
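
As an illustration, here is a minimal ResourceQuota sketch (the name, namespace and limit values are hypothetical and would need to be adapted to your environment):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-quota          # hypothetical name
  namespace: test-jobs     # hypothetical namespace
spec:
  hard:
    requests.cpu: "2"      # total CPU that Pods in the namespace can request
    requests.memory: 4Gi   # total memory that Pods in the namespace can request
    pods: "20"             # maximum number of Pods in the namespace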

Application failure

Let’s now make a very small change to our scenario. We will reference a valid image, but the container will exit with a failure code.

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  selector: {}
  template:
    metadata:
      name: test
    spec:
      containers:
        - name: test
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
          image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
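          # valid image; the command below exits with a non-zero code to simulate an application failure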
          command:
            - /bin/bash
          args: 
            - "-c"
            - "sleep 30; exit 1;"
      restartPolicy: Never

Let’s check the status again

  • At the Pod level we see that the status is now Failed and a new Pod gets created after the failure of the previous one.
  • At the Job level the failed Pods are now reported and after some time the Job is marked as Failed.

One thing that may be confusing is that we specified restartPolicy: Never. It is important to understand that the restart policy in the Job specification only applies to the Pods, not to the Job controller. If we had specified restartPolicy: OnFailure we would have the same behavior as in the preceding scenario: the Kubelet would restart the container and the Job controller would not see the failure, the only difference being that the restart count is incremented at the Pod level (not at the Job level).

With Never as restartPolicy, the Pod fails and is reported as such by the Kubelet. The Job controller then reacts by starting a new Pod, which may or may not be scheduled on the same Node. The number of times a new Pod is started is controlled by two parameters: backoffLimit (default value 6) and activeDeadlineSeconds (no deadline by default). In our example we had not specified these parameters, the Pod got restarted 6 times and we can see 7 failed Pods at the end.
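
For illustration, here is how the same Job could be capped more aggressively; backoffLimit and activeDeadlineSeconds are standard Job fields, and the values 3 and 600 are arbitrary choices for this sketch:

apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  backoffLimit: 3              # retry at most 3 times before marking the Job as Failed (default is 6)
  activeDeadlineSeconds: 600   # fail the Job if it is still running 10 minutes after it started
  template:
    spec:
      containers:
        - name: test
          image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          command:
            - /bin/bash
          args:
            - "-c"
            - "sleep 30; exit 1;"
      restartPolicy: Never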

In terms of resources, there is the regular consumption while a Pod is running. Once Pods have terminated they no longer actively consume CPU or RAM, but they still use storage space on the Nodes for their logs. The resources requested by the Pod are not continuously reserved either: they are reserved while a Pod is scheduled and released once it has terminated, until a new Pod gets started. Each new Pod goes through the Scheduler work queue, and if no Node with enough free resources is available, the Pod has to wait.

Here it is important to have a limit, hence the default value for backoffLimit. Without it you can quickly end up with thousands of Pods that consume

  • storage space on the nodes for the logs
  • etcd space for the object definition
  • storage space in central logging
  • memory and storage space in Prometheus for metrics. Pod names are an unbounded set

Let’s now consider the impact of automation, CronJobs or Jobs created by Continuous Integration (CI) tools, in this scenario. I have seen clusters with more than a hundred thousand Jobs, where:

  • there were Out Of Memory (OOM) events for Prometheus
  • new Jobs were slow to start (taking 10 to 20 minutes)
  • the API was slower than usual

An interesting point is that the cluster was still operational. It was possible to create Pods directly, but creating the same Pods as part of a Job took a long time (it was not failing). It was really the creation of the Pod by the Job controller that was slow, not the allocation (scheduling) of the Pod to a Node by the Scheduler: the reason was the huge work queue of the Job controller. Note also that the Job controller may throttle new Pod creation after excessive Pod failures in the same Job, so that Pods of other Jobs get a chance to be created.

Note that Jobs don’t get automatically deleted after completion. It is a good general practice to remove them when they are no longer needed, as they increase the number of objects the API and etcd have to deal with. In the case of CronJobs this can (and should) be done by configuring successfulJobsHistoryLimit and failedJobsHistoryLimit, as sketched below.
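
A hedged sketch of such a CronJob follows; the name, schedule and container are illustrative, and on older clusters the CronJob API may still be served under batch/v1beta1 rather than batch/v1:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: history-limit-example     # hypothetical name
spec:
  schedule: "*/10 * * * *"        # run every 10 minutes
  successfulJobsHistoryLimit: 3   # keep only the last 3 successful Jobs
  failedJobsHistoryLimit: 1       # keep only the last failed Job
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: test
              image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
              command:
                - /bin/bash
              args:
                - "-c"
                - "sleep 30; exit 0;"
          restartPolicy: Never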

Conclusion

To conclude, I will briefly mention two other scenarios.

When the service account specified in the Job does not exist, the Job controller does not create any Pod and keeps retrying indefinitely, irrespective of the configured backoffLimit.

When a PersistentVolumeClaim (PVC) that does not exist is referenced, the Pod gets created but its scheduling fails. The Pod stays in the “Pending” state until it is deleted or the PVC is created.

I hope this blog post provided a good overview of the components involved in Job failures, the caveats and protection mechanisms and, as the title suggests, the impact of Job failures on cluster resources and their availability.

Thanks for reading!