Volcano High Performance Workloads
Kubernetes-native batch scheduling system
Introduction
Volcano is a Kubernetes-based container batch engine for high-performance workload scenarios.
Application scenarios:
- Machine learning and deep learning
- Biological and genetic computing
- Big data applications
Concepts
Queue
A queue holds a set of podgroups.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 50
Fields:
weight -> The queue's relative share when cluster resources are divided. The share of resources assigned to the queue is (weight / total-weight) * total-resource. This is a soft constraint.
capability -> The upper limit on the total resources used by all podgroups in the queue. This is a hard constraint.
reclaimable -> Whether other queues may reclaim the excess resources held by this queue when its usage exceeds its allocated share.
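The weight formula above can be worked through with a small sketch. This is a simplified model for illustration, not Volcano's scheduler code: the function name `queue_shares` and the dictionary layout are assumptions, and capping the share at `capability` ignores any redistribution of the unused remainder.

```python
# Simplified model of the queue fields described above.
# queue_shares and the dict layout are hypothetical, chosen for illustration.

def queue_shares(total_cpu, queues):
    """Return the CPU share of each queue.

    queues maps a queue name to a dict with a "weight" and an optional
    "capability" (hard CPU cap). The share is
    (weight / total-weight) * total-resource, capped by capability.
    """
    total_weight = sum(q["weight"] for q in queues.values())
    shares = {}
    for name, q in queues.items():
        share = q["weight"] / total_weight * total_cpu
        cap = q.get("capability")
        if cap is not None:
            share = min(share, cap)  # capability is a hard upper limit
        shares[name] = share
    return shares


shares = queue_shares(4, {"test1": {"weight": 1}, "test2": {"weight": 3}})
print(shares)  # {'test1': 1.0, 'test2': 3.0}
```

With 4 cores and weights 1 and 3, the queues are entitled to 1C and 3C respectively.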
Usage Scenarios
Total cluster CPU = 4 cores
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 3
# Create podgroups p1 and p2 in queues test1 and test2 respectively, then put job1 in p1 and job2 in p2. The jobs request 1C and 3C respectively, so both run normally.
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1
# First create queue test1 and podgroup p1, then create job1 and job2 in p1 with requests of 1C and 3C respectively. Both jobs run normally.
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 3
# Next create queue test2 and podgroup p2 in it, then create job3 in p2 with a 3C request. Because test2 has weight 3, job2 is evicted and the 3C it held in test1 is returned to test2.
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  capability:
    cpu: 2
# Create queue test1 with capability set to 2, that is, a resource limit of 2C. Create podgroup p1 and, in p1, job1 and job2 requesting 1C and 3C respectively. job1 runs normally and job2 stays pending.
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1
  reclaimable: false
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 1
# Create queue test1 with reclaimable set to false, that is, the queue does not return over-occupied resources. Create podgroups p1 and p2 in queues test1 and test2 respectively. Create job1 in p1 with a 3C request; because the weight ratio is 1:1, test1 now over-occupies 1C. Then create job2 in p2 with a 2C request. Because test1 does not return the over-occupied resources, job2 stays pending.
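The arithmetic behind the last scenario can be sketched as follows. This is an illustrative model under stated assumptions, not the scheduler's actual reclaim logic: the helper `can_start` and its parameters are hypothetical, and the sketch only asks whether free CPU plus reclaimable excess covers the request.

```python
# Hypothetical helper modelling the reclaimable scenario on a 4-core cluster.

def can_start(request, free_cpu, other_overuse, other_reclaimable):
    """Return True if a job requesting `request` CPUs can start.

    The job may use the free CPUs plus whatever another queue holds beyond
    its share, but only if that queue is reclaimable.
    """
    reclaimable_cpu = other_overuse if other_reclaimable else 0
    return request <= free_cpu + reclaimable_cpu


total_cpu = 4
test1_share = total_cpu * 1 / (1 + 1)        # weights 1:1, so 2C each
test1_used = 3                               # job1 requests 3C
overuse = max(0, test1_used - test1_share)   # test1 holds 1C beyond its share
free = total_cpu - test1_used                # 1C is still free

print(can_start(2, free, overuse, other_reclaimable=False))  # False
print(can_start(2, free, overuse, other_reclaimable=True))   # True
```

With `reclaimable: false` only the 1C of free CPU is available to job2, so its 2C request cannot be met and it stays pending.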
PodGroup
A podgroup is a collection of strongly associated pods for batch workload scenarios.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  capability:
    cpu: 50
  reclaimable: false
  weight: 1
status:
  running: 1
  state: Open
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  labels:
    volcano.sh/job-type: COMPILER
  name: distcc
  namespace: eth
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: distcc
    uid: 871a0e89-9478-4369-88ae-9b9105910965
  resourceVersion: "211427823"
  uid: 6a610eec-e135-496c-8e4c-84e66dd15b21
spec:
  minMember: 6
  minResources:
    cpu: "4"
  queue: distcc
status:
  conditions:
  - lastTransitionTime: "2021-06-02T05:38:05Z"
    message: '6/0 tasks in gang unschedulable: pod group is not ready, 6 minAvailable.'
    reason: NotEnoughResources
    status: "True"
    transitionID: 2c741e36-4b2d-4fba-a35f-0be5f5d454eb
    type: Unschedulable
  - lastTransitionTime: "2021-06-02T05:40:30Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 1d85da8b-2a6a-4969-9aa3-8d41dab98dd7
    type: Scheduled
  phase: Running
  running: 11
minMember -> The minimum number of pods or tasks that need to be run under this podgroup. If the cluster resources do not meet the running requirements of minMember tasks, the scheduler will not schedule any of the tasks in this podgroup.
queue -> Indicates the queue that the podgroup belongs to.
minResources -> Indicates the minimum resources needed to run the podgroup.
When a VolcanoJob is created without specifying an associated podGroup, a podGroup with the same name as the VolcanoJob is created automatically.
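The all-or-nothing behaviour of minMember can be illustrated with a short sketch. It is a simplified model, not the scheduler's implementation: the function names are hypothetical, and it assumes every task requests the same whole number of CPUs.

```python
# Hypothetical model of the minMember rule: nothing is scheduled unless
# min_member tasks fit into the free resources at once.

def gang_schedulable(min_member, cpu_per_task, free_cpu):
    """All-or-nothing check: True only if min_member tasks fit together."""
    return min_member * cpu_per_task <= free_cpu


def tasks_to_schedule(pending_tasks, min_member, cpu_per_task, free_cpu):
    """Return how many pending tasks to place; 0 if the gang cannot start."""
    if not gang_schedulable(min_member, cpu_per_task, free_cpu):
        return 0
    return min(pending_tasks, free_cpu // cpu_per_task)


print(tasks_to_schedule(11, 6, 1, 4))  # 0
print(tasks_to_schedule(11, 6, 1, 8))  # 8
```

With 4 free CPUs the six required tasks do not fit, so none of the eleven pending tasks is placed, even though four of them could run individually.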
VolcanoJob
Volcano's custom Job type.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distcc
  labels:
    "volcano.sh/job-type": "COMPILER"
spec:
  minAvailable: 6
  schedulerName: volcano
  queue: distcc
  plugins:
    svc: []
    env: []
  policies:
  - event: PodEvicted
    action: RestartJob
  volumes:
  - mountPath: "/src"
    volumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "managed-nfs-storage"
      resources:
        requests:
          storage: 1Gi
  tasks:
  - replicas: 1
    name: master
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - tail
          - -f
          - /dev/null
          image: 192.168.1.114:5000/distcc:k8s-2021-06-02
          name: master
          resources:
            requests:
              cpu: 4
            limits:
              cpu: 4
        restartPolicy: OnFailure
  - replicas: 10
    name: worker
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - image: 192.168.1.114:5000/distcc:k8s-2021-06-02
          name: worker
        restartPolicy: OnFailure
schedulerName -> the scheduler used by the job, default is: volcano
minAvailable -> indicates the minimum number of pods to be run by the job
volumes -> the volumes to be mounted for the job, following the k8s volumes configuration requirements
tasks.replicas -> indicates the specific number of replicas of a task pod
tasks.template -> indicates the specific configuration definition of a task pod
tasks.policies -> indicates the lifecycle policy of a task pod
plugins -> the plugins used by the job in the scheduling process
queue -> the queue to which the job belongs
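As a rough illustration of how minAvailable relates to tasks.replicas, the sketch below sums the replicas of the example job and checks that minAvailable can be reached. The dictionary mirrors only the relevant fields of the manifest, and the helper names are hypothetical; this is a sanity check for illustration, not a description of any validation Volcano performs.

```python
# Relevant fields of the example job, copied into a plain dict.
job_spec = {
    "minAvailable": 6,
    "tasks": [
        {"name": "master", "replicas": 1},
        {"name": "worker", "replicas": 10},
    ],
}


def total_replicas(spec):
    """Total number of pods the job creates across all tasks."""
    return sum(task["replicas"] for task in spec["tasks"])


def min_available_reachable(spec):
    """True if the job defines at least minAvailable pods in total."""
    return spec["minAvailable"] <= total_replicas(spec)


print(total_replicas(job_spec))           # 11
print(min_available_reachable(job_spec))  # True
```

The example job defines 11 pods in total (1 master and 10 workers), of which at least 6 must be schedulable before the job starts.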
Distributed compilation
To create a distcc queue and use distcc for distributed compilation of MPICH, we need a master to start the distcc compilation and several distcc worker nodes to receive compilation requests.
# Dockerfile
FROM ubuntu:20.04
LABEL maintainer="[email protected]"
RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y g++ gcc clang distcc build-essential
RUN apt-get -y -q autoremove && apt-get -y -q clean
EXPOSE 3632
CMD distccd --jobs $(nproc) --log-stderr --no-detach --daemon --allow 10.0.0.0/8 --log-level info
Create distcc queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  weight: 1
Create Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distcc
  labels:
    "volcano.sh/job-type": "COMPILER"
spec:
  minAvailable: 6
  schedulerName: volcano
  queue: distcc
  plugins:
    svc: []
    env: []
  policies:
  - event: PodEvicted
    action: RestartJob
  volumes:
  - mountPath: "/src"
    volumeClaimName: distcc-data
    #volumeClaim:
    #  accessModes: ["ReadWriteOnce"]
    #  storageClassName: "managed-nfs-storage"
    #  resources:
    #    requests:
    #      storage: 1Gi
  tasks:
  - replicas: 1
    name: master
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            cd /tmp;
            echo "start...";
            cp -v /src/mpich-3.3.2.tar.gz ./;
            tar xf mpich-3.3.2.tar.gz;
            cd mpich-3.3.2;
            export DISTCC_HOSTS="$(cat /etc/volcano/worker.host | tr '\n' ' ')";
            CC=distcc CXX=distcc ./configure --disable-fortran;
            make -j50;
            mkdir -pv /src/mpich;
            make install DESTDIR=/src/mpich;
          image: 192.168.1.114:5000/distcc:k8s-2021-06-02
          name: master
          resources:
            requests:
              cpu: 4
            limits:
              cpu: 4
        restartPolicy: OnFailure
  - replicas: 10
    name: worker
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - image: 192.168.1.114:5000/distcc:k8s-2021-06-02
          name: worker
        restartPolicy: OnFailure
After compiling, the installed MPICH binaries are available in /src/mpich.
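The master's script builds DISTCC_HOSTS by reading /etc/volcano/worker.host and replacing newlines with spaces. The sketch below shows the same transformation in Python, assuming the file holds one worker hostname per line. The function name and the sample hostnames are hypothetical, and unlike `tr '\n' ' '` this version drops blank lines and leaves no trailing space.

```python
# Hypothetical helper mirroring `cat /etc/volcano/worker.host | tr '\n' ' '`.

def distcc_hosts(worker_host_text):
    """Join one-host-per-line text into a space-separated host list."""
    hosts = [line.strip() for line in worker_host_text.splitlines()]
    return " ".join(host for host in hosts if host)


# Sample content; these hostnames are made up for illustration.
sample = "distcc-worker-0.distcc\ndistcc-worker-1.distcc\n"
print(distcc_hosts(sample))  # distcc-worker-0.distcc distcc-worker-1.distcc
```

distcc reads the resulting space-separated list from the DISTCC_HOSTS environment variable to decide where to send compile jobs.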