History of Kubernetes

K8s' predecessor: Google Borg

I recently read [Large-scale cluster management at Google with Borg](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf). Borg is the cluster management system that Google developed for its own internal use. Kubernetes, the open source cluster management system that the community later incubated and launched, traces its origins directly to Borg.

Starting with Kubernetes, let’s look at the components of K8s:

  1. kubelet

    The kubelet runs on every node and manages the containers on that node.

  2. kube-apiserver

    Validates and configures data for API objects such as Pods, Services, and ReplicationControllers. The other components call its API to get the data they need.

  3. kube-controller-manager

    Controls the state of the system. It watches the cluster state through the API and works to move the current state toward the desired state, for example from state A to state B.

  4. kube-proxy

    Runs on each node as a network proxy. It can do simple TCP and UDP forwarding, or round-robin forwarding across a set of backends.

  5. kube-scheduler

    Schedules Pods. When a new Pod needs to be created, the scheduler scores each worker node and assigns the Pod to a node that meets its requirements.

Now for the paper's description of Borg. Borg manages machines in units called cells. The machines in a cell belong to a single cluster, and a cluster lives inside a single datacenter building.

Borg Architecture

Borg's design rests on three main points:

  1. It hides the details of resource management and failure handling, so that developers can focus on application development without paying much attention to the underlying infrastructure.

  2. Applications running in the system are highly available: the failure of a single node does not make a service unavailable.

  3. It runs and scales applications efficiently across tens of thousands of nodes.

Borg's users are mostly Google's internal developers and SREs. It is because Google invests so heavily in SRE that the three points above gradually took shape inside the company. Any mention of Google SRE has to include the book Site Reliability Engineering: How Google Runs Production Systems. I have only read the Chinese edition. After reading the English edition, I will write a dedicated post that discusses the book in detail.

Cluster design architecture

Borg has its own configuration language, BCL. Configurations written in BCL are submitted with the borgcfg tool, and the resulting state is kept in a distributed, Paxos-based store. In k8s we write the desired state as YAML and use kubectl to submit it through the kube-apiserver, which persists it in etcd, a distributed store built on the Raft consensus protocol. Edge-oriented distributions make different choices: MicroK8s uses Dqlite, a distributed SQL engine built on SQLite, by default, while k3s defaults to an embedded SQLite database. Dqlite also replicates data with Raft, so that k8s on edge devices can be highly available.

In the Borg system, SRE engineers access and modify the system mostly through the Borgctl command line tool.

The core components of Borg are the scheduler, the Borgmaster, the Paxos-based data store, the link shards, and the Borglet.

Borgmaster: the master has two parts. The first handles external requests, such as RPC calls that query cluster status, task assignments, and Borglet status. The second is the cluster's internal task scheduling module. The master is highly available: all of its data is stored in the Paxos-based store, and if the current leader goes down, a new leader is elected from the master replicas and continues to serve requests. The most compelling feature is checkpointing. The master periodically saves its current state as a checkpoint, so when the cluster has a problem we can roll it back to an earlier known-good checkpoint, and we can also use checkpoints offline for debugging.
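
The checkpoint-and-rollback idea described above can be illustrated with a toy Python sketch. This is not how Borg implements it: the `CheckpointedStore` class and its method names are hypothetical, and the state is a plain in-memory dict rather than a Paxos-replicated store.

```python
import copy


class CheckpointedStore:
    """Toy key-value state with snapshots (hypothetical names, not Borg's API)."""

    def __init__(self):
        self.state = {}
        self.checkpoints = []  # deep-copied snapshots, oldest first

    def apply(self, key, value):
        """Apply one state change."""
        self.state[key] = value

    def checkpoint(self):
        """Save a snapshot of the current state and return its index."""
        self.checkpoints.append(copy.deepcopy(self.state))
        return len(self.checkpoints) - 1

    def rollback(self, index):
        """Restore the state recorded at the given checkpoint."""
        self.state = copy.deepcopy(self.checkpoints[index])


store = CheckpointedStore()
store.apply("job/web", "running")
good = store.checkpoint()                 # known-good state
store.apply("job/web", "crash-looping")   # something goes wrong later
store.rollback(good)
print(store.state)
```

Because every snapshot is a deep copy, old checkpoints can also be inspected offline without touching the live state.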

Scheduler: when a task is submitted through the master, the master records it in the Paxos store and adds it to the pending queue. The scheduler processes pending tasks in priority order. For each task, it first works out which machines meet the task's requirements, then scores those machines with a scoring algorithm, and finally assigns the task to a node based on the scores.
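
The filter-then-score flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Machine` and `Task` types, the resource fields, and the best-fit scoring rule are all made up for the example, and the real scoring algorithms in Borg and k8s are considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_mem: float


@dataclass
class Task:
    name: str
    priority: int
    cpu: float
    mem: float


def feasible(machine, task):
    """A machine is feasible if it has enough free CPU and memory."""
    return machine.free_cpu >= task.cpu and machine.free_mem >= task.mem


def score(machine, task):
    """Assumed rule: prefer the machine with the least capacity left over (best fit).

    Adding CPU and memory together is crude; it only serves the illustration.
    """
    return -((machine.free_cpu - task.cpu) + (machine.free_mem - task.mem))


def schedule(pending, machines):
    """Place tasks from highest to lowest priority; return {task name: machine name or None}."""
    placement = {}
    for task in sorted(pending, key=lambda t: t.priority, reverse=True):
        candidates = [m for m in machines if feasible(m, task)]
        if not candidates:
            placement[task.name] = None  # no feasible machine: the task stays pending
            continue
        best = max(candidates, key=lambda m: score(m, task))
        best.free_cpu -= task.cpu
        best.free_mem -= task.mem
        placement[task.name] = best.name
    return placement


machines = [Machine("m1", 4, 8), Machine("m2", 2, 4)]
pending = [Task("batch", 1, 2, 4), Task("web", 10, 2, 4)]
print(schedule(pending, machines))
```

Here the higher-priority `web` task is placed first and lands on the machine it fits most tightly, which leaves the larger machine for `batch`.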

Borglet: runs on every node and manages the life cycle of the tasks there, for example stopping and restarting them. It also maintains the operating system's kernel configuration. The Borgmaster periodically polls each Borglet for its status, and if a Borglet does not answer, the master assumes the node is offline or down.
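
The polling behaviour can be sketched like this. The sketch assumes a node is declared down after a fixed number of consecutive missed polls; the threshold `MAX_MISSED_POLLS`, the function name `poll_round`, and the `responds` callback (a stand-in for the status RPC) are all hypothetical.

```python
MAX_MISSED_POLLS = 3  # assumed threshold, not taken from the text


def poll_round(nodes, missed, responds):
    """Poll every node once and return the set of nodes considered down.

    nodes:    iterable of node names
    missed:   dict of node name -> consecutive missed polls (updated in place)
    responds: callable(name) -> bool, stands in for the status RPC
    """
    down = set()
    for name in nodes:
        if responds(name):
            missed[name] = 0  # any answer resets the counter
        else:
            missed[name] = missed.get(name, 0) + 1
        if missed[name] >= MAX_MISSED_POLLS:
            down.add(name)
    return down


missed = {}
alive = {"node-a"}  # node-b never answers
for _ in range(3):
    down = poll_round(["node-a", "node-b"], missed, lambda n: n in alive)
print(down)
```

Counting consecutive misses rather than reacting to a single one keeps a brief network hiccup from marking a healthy node as down.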

Jobs and tasks: Borg manages what it runs as jobs and tasks. A job can consist of multiple tasks. The following is the life cycle of a task.

When a job is submitted to the Borgmaster and accepted, its tasks enter the Pending state and wait for the scheduler. Once scheduling is complete, the tasks start running. If the system's resource limits are exceeded, tasks are evicted and return to the Pending state. When a task finishes running, its state is updated to Dead. This is a bit like the life cycle of a process.
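
The life cycle above is a small state machine, which can be written down as a transition table. This sketch covers only the transitions named in the text; the event names (`schedule`, `evict`, `finish`) and the helper `next_state` are chosen here for illustration.

```python
# Allowed transitions, following the life cycle described above.
TRANSITIONS = {
    ("Pending", "schedule"): "Running",
    ("Running", "evict"): "Pending",
    ("Running", "finish"): "Dead",
}


def next_state(state, event):
    """Return the next task state, or raise ValueError on an invalid transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"cannot apply {event!r} in state {state!r}") from None


state = "Pending"  # the task was submitted and accepted
for event in ["schedule", "evict", "schedule", "finish"]:
    state = next_state(state, event)
print(state)
```

A task that is evicted goes back to Pending and can be scheduled again, while Dead has no outgoing transitions.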

Next, let's look at the overall architecture of k8s and how it schedules tasks, so that we can compare it with Borg and see the advantages and disadvantages of each.

K8s architecture

Control plane:

  1. kube-apiserver -> all requests that read or store data go through the API

  2. kube-scheduler -> uses scheduling algorithms to find the nodes that meet a task's requirements, assigns the task to one of them, and records the decision in etcd through the API

  3. kube-controller-manager -> reads task information from the API and keeps the observed task state close to the desired definition; when the two differ, it asks the API to create or destroy tasks

  4. etcd -> distributed storage system
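
The controller-manager behaviour in item 3 is commonly described as a reconciliation loop: compare the desired state with the observed state and compute the actions that close the gap. A minimal sketch of one reconciliation step, assuming the desired state is just a replica count (the `reconcile` function and the Pod names are hypothetical):

```python
def reconcile(desired, running):
    """Compare the desired replica count with the Pods currently running.

    desired: int, how many replicas should exist
    running: list of names of the replicas that do exist
    Returns (number_to_create, names_to_delete).
    """
    diff = desired - len(running)
    if diff > 0:
        return diff, []            # too few: create the missing replicas
    if diff < 0:
        return 0, running[diff:]   # too many: delete the surplus (last ones here)
    return 0, []                   # already in the desired state


print(reconcile(3, ["pod-a"]))
print(reconcile(1, ["pod-a", "pod-b", "pod-c"]))
```

A real controller runs this comparison repeatedly and sends the resulting create or delete requests to the API server rather than returning them.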

Node:

  1. Pods -> the smallest runnable unit of work in k8s

  2. container runtime engine -> the back-end container engine, such as Docker or containerd

  3. kubelet -> communicates with the API and manages the life cycle of tasks on its own node

  4. kube-proxy -> sets up network routing rules on the node so that applications can be reached from outside

It is easy to see that Kubernetes is a bit like an upgraded version of Borg. For example, Borg uses a Linux chroot jail to isolate a task's runtime environment, while k8s uses more advanced container technology. k8s is also split more finely: the master's functionality is divided into multiple services, which can increase the request processing capacity of the master node more effectively.


Google has used Borg internally for a long time, and Borg's developers recognized that it had many shortcomings. In designing k8s they tried to improve the architecture of the cluster management system and make it more useful, which is why k8s keeps growing in popularity. Personally, I think k8s is oriented toward developers. But to put k8s into production efficiently, operations and development have to do a lot of custom work together, so that the cluster operates in line with the company's internal business architecture. Automatic handling of application failures and automatic node scaling, for example, are tied to business logic, and considering only the operations side is not enough to bring out k8s's powerful cluster management capabilities. By reading the Borg paper, we can take the strengths and weaknesses it describes into account in our own production systems, and then tailor the system to our own logic.


Deploying the cluster


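Excerpt from the kubeadm configuration (the rest of the file is not shown):

etcd:
  external:
    # ...
    keyFile: /etc/etcd/etcd-server.key
# ...
kind: ClusterConfiguration
kubernetesVersion: v1.19.1
networking:
  dnsDomain: cluster.local
  # ...
scheduler: {}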

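Upgrading the cluster

Upgrade the master node

sudo kubeadm upgrade apply v1.20.0

Upgrade a worker node

# From a machine with kubectl access: move workloads off the node first
kubectl drain <node-to-drain> --ignore-daemonsets
# On the worker node: upgrade the packages and the node configuration.
# Unpinned, apt installs the newest version in the configured repository;
# pin the package versions if they must match the control plane (v1.20.0 here).
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
sudo apt-mark hold kubeadm kubelet kubectl
# From a machine with kubectl access: make the node schedulable again
kubectl uncordon <node-to-drain>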
