kube-scheduler

kube-scheduler is responsible for assigning Pods in the cluster to Nodes. It watches the kube-apiserver for Pods that have not yet been assigned a Node and binds them to suitable Nodes according to a set of scheduling policies (which it does by setting the NodeName field of those Pods).

The scheduler takes into account a series of factors, including:

  • Fair distribution

  • Efficient utilization of resources

  • Quality of Service (QoS)

  • Affinity and anti-affinity

  • Data locality

  • Inter-workload interference

  • Deadlines

Specifying Node Scheduling

There are three ways to specify that a Pod should only run on a predetermined Node:

  • nodeSelector: Only schedules the Pod onto Nodes that match the specified labels

  • nodeAffinity: A more expressive Node selector that supports set-based operations (such as In, NotIn, Exists)

  • podAffinity: Schedules the Pod onto a Node that already runs Pods satisfying the given conditions.

nodeSelector Example

First, label the Node:
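
A sketch of this step (the node name node-1 is illustrative):

```sh
# Label the node so that Pods with a matching nodeSelector can target it
kubectl label nodes node-1 disktype=ssd
```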

Then, specify nodeSelector as disktype=ssd in the DaemonSet:
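
A minimal sketch of such a DaemonSet, assuming an nginx workload (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ssd
spec:
  selector:
    matchLabels:
      app: nginx-ssd
  template:
    metadata:
      labels:
        app: nginx-ssd
    spec:
      nodeSelector:
        disktype: ssd   # only schedule onto Nodes labelled disktype=ssd
      containers:
        - name: nginx
          image: nginx
```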

nodeAffinity Example

nodeAffinity currently supports two modes: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, which express hard (must be met) and soft (preferred) conditions, respectively. The example below schedules the Pod onto a Node that carries the label kubernetes.io/e2e-az-name with value e2e-az1 or e2e-az2, and preferably one that also carries the label another-node-label-key=another-node-label-value.
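
A sketch of the corresponding Pod spec (the Pod name and container image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the Node must carry one of these label values
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/e2e-az-name
                operator: In
                values:
                  - e2e-az1
                  - e2e-az2
      # Soft preference: prefer Nodes with this additional label
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: another-node-label-key
                operator: In
                values:
                  - another-node-label-value
  containers:
    - name: with-node-affinity
      image: nginx
```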

podAffinity Example

podAffinity selects Nodes based on the labels of Pods already running on them: the Pod is scheduled only onto Nodes (or topology domains) that host Pods satisfying the given conditions. Both podAffinity and podAntiAffinity are supported.
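
A sketch combining both (the security=S1/S2 labels, topology keys, and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      # Must run in the same zone as Pods labelled security=S1
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      # Prefer not to share a Node with Pods labelled security=S2
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: security
                  operator: In
                  values:
                    - S2
            topologyKey: kubernetes.io/hostname
  containers:
    - name: with-pod-affinity
      image: nginx
```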

Taints and Tolerations

Taints and Tolerations are used to keep Pods off unsuitable Nodes: a Taint is applied to a Node, while a Toleration is applied to a Pod.

A Pod can only be scheduled onto a Node if its Tolerations match all of the Node's Taints; if the Pod is already running on that Node, it will not be evicted. Note that Pods created by a DaemonSet automatically get NoExecute Tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady so that they are not evicted when those conditions occur.
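
For example, a Node can be tainted and a Pod given a matching Toleration (the node name, key, and value are illustrative):

```sh
# Prevent Pods without a matching Toleration from being scheduled onto node-1
kubectl taint nodes node-1 dedicated=gpu:NoSchedule
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-toleration
spec:
  tolerations:
    # Matches the taint above, so this Pod may still be scheduled onto node-1
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: nginx
      image: nginx
```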

Priority Scheduling

Starting from version 1.8, kube-scheduler supports Pod priority, ensuring that high-priority Pods are scheduled first. To use it, first define a PriorityClass:
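
A sketch (the name high-priority and its value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000            # higher values are scheduled first
globalDefault: false
description: "Used for important workloads that must be scheduled ahead of others."
```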

Then, set the priority of the Pod by referencing the PriorityClass via priorityClassName in the PodSpec:
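
A sketch of a Pod referencing the PriorityClass above (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-high-priority
spec:
  priorityClassName: high-priority   # references the PriorityClass defined above
  containers:
    - name: nginx
      image: nginx
```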

Multiple Schedulers

If the default scheduler does not meet the requirements, you can deploy a custom scheduler. In the entire cluster, multiple instances of the scheduler can run at the same time, and podSpec.schedulerName is used to select which scheduler to use (the built-in scheduler is used by default).
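
A sketch of a Pod selecting a custom scheduler (the scheduler name my-scheduler is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-custom-scheduler
spec:
  schedulerName: my-scheduler   # omit this field to use the default scheduler
  containers:
    - name: nginx
      image: nginx
```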

Scheduler Extensions

Scheduler Plugins

From version 1.19, you can extend the scheduler with plug-ins through the Scheduling Framework, which exposes a set of extension points (such as QueueSort, Filter, Score, and Bind) throughout the Pod scheduling context.
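
Plug-ins are enabled or disabled per extension point through a KubeSchedulerConfiguration; a minimal sketch, assuming a hypothetical out-of-tree Score plug-in named MyCustomPlugin:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: MyCustomPlugin              # hypothetical plug-in registered at the Score extension point
        disabled:
          - name: NodeResourcesLeastAllocated # turn off an in-tree score plug-in
```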

Scheduler Policy

kube-scheduler also supports the --policy-config-file flag for specifying a scheduling policy file that customizes the scheduling policy, for example:
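
A sketch of such a policy file (the chosen predicates and priorities are only an illustration):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsHostPorts" },
    { "name": "PodFitsResources" },
    { "name": "NoDiskConflict" },
    { "name": "MatchNodeSelector" },
    { "name": "HostName" }
  ],
  "priorities": [
    { "name": "LeastRequestedPriority", "weight": 1 },
    { "name": "BalancedResourceAllocation", "weight": 1 },
    { "name": "ServiceSpreadingPriority", "weight": 1 }
  ]
}
```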

Other Factors Affecting Scheduling

  • If the Node Condition is in MemoryPressure, all new BestEffort Pods (those that haven't specified resource limits and requests) will not be scheduled on that Node.

  • If the Node Condition is in DiskPressure, all new Pods will not be scheduled on that Node.

  • To ensure that Critical Pods keep running, they are automatically rescheduled when they enter an abnormal state. Critical Pods are those whose:

    • Annotations include scheduler.alpha.kubernetes.io/critical-pod=''

    • Tolerations include [{"key":"CriticalAddonsOnly", "operator":"Exists"}]

    • PriorityClass is system-cluster-critical or system-node-critical.

Launch kube-scheduler Example
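
A minimal sketch of launching kube-scheduler (the kubeconfig path is illustrative):

```sh
kube-scheduler \
  --kubeconfig=/etc/kubernetes/scheduler.conf \
  --leader-elect=true \
  --v=2
```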

How kube-scheduler Works

kube-scheduler schedules each Pod in two phases, a predicate phase followed by a priority phase:

  • Predicate: Filters out ineligible nodes

  • Priority: Prioritizes nodes and selects the highest priority one.

Predicate strategies include:

  • PodFitsPorts: Same as PodFitsHostPorts.

  • HostName: Checks whether pod.Spec.NodeName matches the candidate node.

  • NoVolumeZoneConflict: Checks for volume zone conflict.

  • GeneralPredicates: Divided into noncriticalPredicates and EssentialPredicates.

  • PodToleratesNodeTaints: Checks whether the Pod tolerates Node Taints.

Priority strategies include:

  • SelectorSpreadPriority: Tries to reduce the number of Pods belonging to the same Service or Replication Controller on each node.

  • NodeAffinityPriority: Tries to schedule Pods to Nodes that match NodeAffinity.

  • TaintTolerationPriority: Tries to schedule Pods to Nodes that match TaintToleration.
