
kube-scheduler

kube-scheduler is responsible for assigning Pods to Nodes in the cluster. It watches kube-apiserver for Pods that have not yet been assigned a Node and, based on a set of scheduling strategies, binds each one to a Node (by updating the Pod's NodeName field).

The scheduler takes into account a series of factors, including:

  • Fair distribution

  • Efficient utilization of resources

  • Quality of Service (QoS)

  • Affinity and anti-affinity

  • Data locality

  • Inter-workload interference

  • Deadlines

Specifying Node Scheduling

There are three ways to restrict a Pod to run only on particular Nodes:

  • nodeSelector: only schedules onto Nodes that match the specified labels

  • nodeAffinity: a more expressive Node selector that supports set operations

  • podAffinity: schedules the Pod onto a Node that already hosts Pods satisfying the given conditions

nodeSelector Example

First, label the Node:

kubectl label nodes node-01 disktype=ssd

Then, specify nodeSelector as disktype=ssd in the Pod spec (for a DaemonSet, under spec.template.spec):

spec:
  nodeSelector:
    disktype: ssd
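
For context, the snippet above sits inside a complete Pod manifest like the following sketch (the name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd            # hypothetical name, for illustration only
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd            # only Nodes labeled disktype=ssd are eligible
```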

nodeAffinity Example

nodeAffinity currently supports two modes: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. They represent the conditions that must be met and preferred conditions, respectively. The example below indicates scheduling to a Node with labels kubernetes.io/e2e-az-name and the values either e2e-az1 or e2e-az2, and preferably, the Node also carries the label another-node-label-key=another-node-label-value.

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  ...
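
The elided spec can be sketched as follows, using exactly the labels described above (the container name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the Node must carry one of these zone labels.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      # Soft preference: prefer Nodes with this additional label.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity   # illustrative container
    image: nginx
```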

podAffinity Example

podAffinity selects Nodes based on the labels of Pods already running on them: the Pod is scheduled onto a Node hosting Pods that satisfy the given conditions. Both podAffinity and podAntiAffinity are supported.

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  ...
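
A sketch of the elided spec, assuming hypothetical security=S1 and security=S2 Pod labels (not from the original text):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      # Required: co-locate with Pods labeled security=S1 (hypothetical label).
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      # Preferred: avoid Nodes already running Pods labeled security=S2.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity    # illustrative container
    image: nginx
```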

Taints and Tolerations

Taints and Tolerations are used to ensure that a Pod is not scheduled on an unsuitable Node: Taint is applied to the Node, while Toleration is applied to the Pod.

kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule

A Pod can be scheduled onto a Node only when its Tolerations match all of the Node's Taints; if the Pod is already running when a Taint is added, it is not evicted unless the Taint's effect is NoExecute. Note that Pods created by a DaemonSet automatically get NoExecute Tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady, so they are not evicted for these conditions.
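
A Pod that tolerates the first two taints above could declare Tolerations like this sketch (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod           # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  # Matches the taint key1=value1:NoSchedule applied above.
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  # Matches key1=value1:NoExecute; tolerationSeconds bounds how long the
  # Pod may keep running on the Node after such a taint is added.
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
```

To actually be schedulable on node1, the Pod would also need a toleration for key2=value2:NoSchedule, since all of a Node's taints must be tolerated.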

Priority Scheduling

Starting from version 1.8, kube-scheduler supports defining the priority of a Pod, ensuring that high priority Pods are scheduled first.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

Then, set the Pod's priority via priorityClassName in the PodSpec:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority

Multiple Schedulers

If the default scheduler does not meet the requirements, you can deploy a custom scheduler. In the entire cluster, multiple instances of the scheduler can run at the same time, and podSpec.schedulerName is used to select which scheduler to use (the built-in scheduler is used by default).

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # Choose to use the custom scheduler my-scheduler
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: nginx:1.10

Scheduler Extensions

Scheduler Plugins

Scheduler Policy

kube-scheduler also supports a --policy-config-file flag that points at a scheduling policy file, which can be used to customize the scheduling policy, for example:

{
    ...
}
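
A sketch of what such a policy file might contain, using predicate and priority names from the legacy scheduler policy API (the particular selection and weights here are illustrative):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1}
  ]
}
```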

Other Factors Affecting Scheduling

  • If the Node Condition is in MemoryPressure, all new BestEffort Pods (those that haven't specified resource limits and requests) will not be scheduled on that Node.

  • If the Node Condition is in DiskPressure, all new Pods will not be scheduled on that Node.

  • To ensure the normal operation of Critical Pods, they will be automatically rescheduled when they are in an abnormal state. Critical Pods refer to:

    • Annotations include scheduler.alpha.kubernetes.io/critical-pod=''

    • Tolerations include [{"key":"CriticalAddonsOnly", "operator":"Exists"}]

    • PriorityClass is system-cluster-critical or system-node-critical.

Launch kube-scheduler Example

kube-scheduler --address=127.0.0.1 --leader-elect=true --kubeconfig=/etc/kubernetes/scheduler.conf

How kube-scheduler Works

kube-scheduler's scheduling principle: for a given Pod, it first filters the cluster down to the set of feasible Nodes, then ranks those Nodes and binds the Pod to the one with the highest score.

The kube-scheduler schedules in two phases, the predicate phase and priority phase:

  • Predicate: Filters out ineligible nodes

  • Priority: Prioritizes nodes and selects the highest priority one.
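
The two phases above can be illustrated with a toy sketch (this is not the real kube-scheduler algorithm; the predicate and priority functions below are simplified stand-ins):

```python
def schedule(pod, nodes, predicates, priorities):
    # Predicate phase: filter out ineligible nodes.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        raise RuntimeError("no feasible node for pod %s" % pod["name"])

    # Priority phase: score each feasible node and pick the highest.
    def score(node):
        return sum(prio(pod, node) for prio in priorities)

    return max(feasible, key=score)

# Hypothetical predicate: the node must have enough free CPU for the pod.
def pod_fits_resources(pod, node):
    return node["free_cpu"] >= pod["cpu_request"]

# Hypothetical priority: prefer nodes with the most CPU left after placement.
def least_requested(pod, node):
    return node["free_cpu"] - pod["cpu_request"]

pod = {"name": "nginx", "cpu_request": 2}
nodes = [
    {"name": "node-1", "free_cpu": 1},   # filtered out by the predicate
    {"name": "node-2", "free_cpu": 4},
    {"name": "node-3", "free_cpu": 8},
]
best = schedule(pod, nodes, [pod_fits_resources], [least_requested])
print(best["name"])  # node-3: feasible and highest score
```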

Predicate strategies include:

  • PodFitsPorts: Same as PodFitsHostPorts.

  • HostName: Checks whether pod.Spec.NodeName matches the candidate node.

  • NoVolumeZoneConflict: Checks for volume zone conflict.

  • GeneralPredicates: Divided into noncriticalPredicates and EssentialPredicates.

  • PodToleratesNodeTaints: Checks whether the Pod tolerates Node Taints.

Priority strategies include:

  • SelectorSpreadPriority: Tries to reduce the number of Pods belonging to the same Service or Replication Controller on each node.

  • NodeAffinityPriority: Tries to schedule Pods to Nodes that match NodeAffinity.

  • TaintTolerationPriority: Tries to schedule Pods to Nodes that match TaintToleration.

Reference Documents


Starting with version 1.19, you can use the scheduling framework to extend the scheduler in the form of plug-ins; the figure this text referred to showed the Pod scheduling context and the extension points exposed by the scheduling framework.

  • Scheduling Framework

  • Pod Priority and Preemption

  • Configure Multiple Schedulers

  • Taints and Tolerations

  • Advanced Scheduling in Kubernetes