Volcano High Performance Workloads

Introduction

Volcano is a Kubernetes-based container batch engine for high-performance workload scenarios.

Application scenarios:

machine learning and deep learning
biological and genetic computing
big data applications

Concepts

Queue

A queue holds a set of podgroups.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 50

Fields:

weight -> The queue's relative share when cluster resources are divided. The resources the queue is entitled to are (weight / total-weight) * total-resource. This is a soft constraint.

capability -> The upper limit on the sum of resources used by all podgroups in the queue. This is a hard constraint.

reclaimable -> Whether the queue allows other queues to reclaim the excess resources it is using when its resource usage exceeds the queue's limit.
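
As a worked example of the weight formula, the Python sketch below computes each queue's share as (weight / total-weight) * total-resource. The function name `queue_shares` is illustrative, and the code models only the arithmetic, not Volcano's actual scheduler.

```python
def queue_shares(weights, total_resource):
    """Return each queue's share: (weight / total_weight) * total_resource."""
    total_weight = sum(weights.values())
    return {name: weight / total_weight * total_resource
            for name, weight in weights.items()}


# Two queues with weights 1 and 3 sharing a 4-core cluster.
shares = queue_shares({"test1": 1, "test2": 3}, total_resource=4)
print(shares)
```

With these inputs test1 is entitled to 1 core and test2 to 3 cores. Because the constraint is soft, a queue may still use more than its share while the other queue is idle.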

Usage Scenarios

Total cluster CPUs = 4 cores

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 3

# Create podgroups p1 and p2 in the test1 and test2 queues respectively, then put job1 into p1 and job2 into p2 with resource requests of 1C and 3C respectively. Both jobs run normally.

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1

# First create the test1 queue and podgroup p1, then create job1 and job2 in p1 with resource requests of 1C and 3C respectively. Both jobs run normally.

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 3

# Then create the test2 queue, create podgroup p2 in it, and create job3 in p2 with a resource request of 3C. Because test2 has weight=3, job2 is evicted and the 3C it was using in test1 is returned to test2.

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  capability:
    cpu: 2

# Create the test1 queue with capability cpu set to 2, that is, a resource limit of 2C. Create podgroup p1, then create job1 and job2 in p1 with resource requests of 1C and 3C respectively. job1 runs normally and job2 stays in the Pending state.
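
The capability scenario above comes down to simple arithmetic. The Python below is a simplified model, not Volcano's implementation: it admits jobs in order as long as the queue's total CPU stays within the capability. `admit_jobs` is a hypothetical helper.

```python
def admit_jobs(capability_cpu, requests):
    """Admit jobs in order while the queue's total CPU stays within capability."""
    used = 0
    states = {}
    for job, cpu in requests:
        if used + cpu <= capability_cpu:
            used += cpu
            states[job] = "Running"
        else:
            # Admitting this job would exceed the hard limit.
            states[job] = "Pending"
    return states


# capability cpu = 2; job1 requests 1C and job2 requests 3C.
states = admit_jobs(2, [("job1", 1), ("job2", 3)])
print(states)
```

job1 fits under the 2C limit, while job2 would push the queue to 4C and therefore stays pending.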

---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 1
  reclaimable: false
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test2
spec:
  weight: 1

# Create the test1 queue with reclaimable set to false, meaning the queue does not return over-occupied resources. Create podgroups p1 and p2 in the test1 and test2 queues respectively. Create job1 in p1 with a resource request of 3C; because the weight ratio is 1:1, test1 is now over-occupying 1C. Then create job2 in p2 with a resource request of 2C. Because test1 does not return the over-occupied resources, job2 stays in the Pending state.
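
The numbers in the reclaimable scenario can be checked with a short Python sketch. This is a simplified model under stated assumptions, not Volcano's scheduling logic: a job runs only if its request fits into the free CPU plus whatever may be reclaimed from an over-using queue. `job_state` is a hypothetical helper, and the "reclaimable: true" case is added only for contrast.

```python
def job_state(request_cpu, free_cpu, reclaimable_cpu):
    """Return 'Running' if the request fits in free plus reclaimable CPU."""
    return "Running" if request_cpu <= free_cpu + reclaimable_cpu else "Pending"


total_cpu = 4
deserved_test1 = 1 / (1 + 1) * total_cpu   # weight ratio 1:1
used_test1 = 3                             # job1 in p1 requests 3C
over_used = used_test1 - deserved_test1    # CPU held beyond test1's share
free_cpu = total_cpu - used_test1

# reclaimable: false means none of the over-used CPU can be taken back.
state_not_reclaimable = job_state(2, free_cpu, 0)
# For contrast (assumption): the same request if test1 allowed reclaiming.
state_reclaimable = job_state(2, free_cpu, over_used)
print(state_not_reclaimable, state_reclaimable)
```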

PodGroup

A podgroup is a collection of strongly associated pods for batch workload scenarios.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  capability:
    cpu: 50
  reclaimable: false
  weight: 1
status:
  running: 1
  state: Open

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  labels:
    volcano.sh/job-type: COMPILER
  name: distcc
  namespace: eth
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: distcc
    uid: 871a0e89-9478-4369-88ae-9b9105910965
  resourceVersion: "211427823"
  uid: 6a610eec-e135-496c-8e4c-84e66dd15b21
spec:
  minMember: 6
  minResources:
    cpu: "4"
  queue: distcc
status:
  conditions:
  - lastTransitionTime: "2021-06-02T05:38:05Z"
    message: '6/0 tasks in gang unschedulable: pod group is not ready, 6 minAvailable.'
    reason: NotEnoughResources
    status: "True"
    transitionID: 2c741e36-4b2d-4fba-a35f-0be5f5d454eb
    type: Unschedulable
  - lastTransitionTime: "2021-06-02T05:40:30Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 1d85da8b-2a6a-4969-9aa3-8d41dab98dd7
    type: Scheduled
  phase: Running
  running: 11

minMember -> The minimum number of pods or tasks that need to be run under this podgroup. If the cluster resources do not meet the running requirements of minMember tasks, the scheduler will not schedule any of the tasks in this podgroup.

queue -> Indicates the queue that the podgroup belongs to.

minResources -> Indicates the minimum resources needed to run the podgroup.
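
The all-or-nothing behavior of minMember can be illustrated with a small Python sketch. It is a simplified model that considers CPU only and ignores node placement; `pods_to_start` is a hypothetical helper, not part of Volcano.

```python
def pods_to_start(free_cpu, pod_cpu_requests, min_member):
    """Count the pods that fit; start none if fewer than min_member fit."""
    fitted = 0
    remaining = free_cpu
    for cpu in pod_cpu_requests:
        if cpu <= remaining:
            remaining -= cpu
            fitted += 1
    return fitted if fitted >= min_member else 0


requests = [1] * 6  # six pods, 1 CPU each (illustrative values)
print(pods_to_start(4, requests, min_member=6))
print(pods_to_start(8, requests, min_member=6))
```

With 4 free CPUs only four pods would fit, which is below minMember, so none are started. With 8 free CPUs all six start.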

If a volcanoJob is created without specifying an associated podGroup, a podGroup with the same name as the volcanoJob is created automatically.

volcanoJob

A volcanoJob is Volcano's custom Job type.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distcc
  labels:
    "volcano.sh/job-type": "COMPILER"
spec:
  minAvailable: 6
  schedulerName: volcano
  queue: distcc
  plugins:
    svc: []
    env: []
  policies:
    - event: PodEvicted
      action: RestartJob
  volumes:
    - mountPath: "/src"
      volumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "managed-nfs-storage"
        resources:
          requests:
            storage: 1Gi
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - tail
                - -f
                - /dev/null
              image: 192.168.1.114:5000/distcc:k8s-2021-06-02
              name: master
              resources:
                requests:
                  cpu: 4
                limits:
                  cpu: 4
          restartPolicy: OnFailure
    - replicas: 10
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - image: 192.168.1.114:5000/distcc:k8s-2021-06-02
              name: worker
          restartPolicy: OnFailure

schedulerName -> the scheduler used by the job, default is: volcano

minAvailable -> indicates the minimum number of pods to be run by the job

volumes -> the volumes to be mounted for the job, following the k8s volumes configuration requirements

tasks.replicas -> indicates the specific number of replicas of a task pod

tasks.template -> indicates the specific configuration definition of a task pod

tasks.policies -> indicates the lifecycle policy of a task pod

plugins -> the plugins used by the job in the scheduling process

queue -> the queue to which the job belongs
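
To relate tasks.replicas to minAvailable, the Python sketch below sums the replicas from the manifest above and compares the total with minAvailable. The dictionary copies only the relevant fields, and `total_replicas` is a hypothetical helper.

```python
# Relevant fields copied from the distcc Job manifest above.
job_spec = {
    "minAvailable": 6,
    "tasks": [
        {"name": "master", "replicas": 1},
        {"name": "worker", "replicas": 10},
    ],
}


def total_replicas(spec):
    """Sum the replicas across all tasks of a job spec."""
    return sum(task["replicas"] for task in spec["tasks"])


total = total_replicas(job_spec)
# True when the job asks for more pods than its required minimum.
has_optional_pods = job_spec["minAvailable"] < total
print(total, has_optional_pods)
```

The job defines 11 pods in total, which matches the `running: 11` status in the PodGroup shown earlier, while only 6 are required as the minimum.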

Distributed compilation

This example creates a distcc queue and uses distcc for distributed compilation of MPICH. It needs a master to start the distcc compilation and several distcc workers to receive compilation requests.

# Dockerfile
FROM ubuntu:20.04
LABEL maintainer="[email protected]"

RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y g++ gcc clang distcc build-essential
RUN apt-get -y -q autoremove && apt-get -y -q clean

EXPOSE 3632

CMD distccd --jobs $(nproc) --log-stderr --no-detach --daemon --allow 10.0.0.0/8 --log-level info

Create distcc queue

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: distcc
spec:
  weight: 1

Create Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distcc
  labels:
    "volcano.sh/job-type": "COMPILER"
spec:
  minAvailable: 6
  schedulerName: volcano
  queue: distcc
  plugins:
    svc: []
    env: []
  policies:
    - event: PodEvicted
      action: RestartJob
  volumes:
    - mountPath: "/src"
      volumeClaimName: distcc-data
      #volumeClaim:
      #  accessModes: ["ReadWriteOnce"]
      #  storageClassName: "managed-nfs-storage"
      #  resources:
      #    requests:
      #      storage: 1Gi
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  cd /tmp;
                  echo "start...";
                  cp -v /src/mpich-3.3.2.tar.gz ./;
                  tar xf mpich-3.3.2.tar.gz;
                  cd mpich-3.3.2;
                  export DISTCC_HOSTS="$(cat /etc/volcano/worker.host | tr '\n' ' ')";
                  CC=distcc CXX=distcc ./configure --disable-fortran;
                  make -j50;
                  mkdir -pv /src/mpich;
                  make install DESTDIR=/src/mpich;
              image: 192.168.1.114:5000/distcc:k8s-2021-06-02
              name: master
              resources:
                requests:
                  cpu: 4
                limits:
                  cpu: 4
          restartPolicy: OnFailure
    - replicas: 10
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - image: 192.168.1.114:5000/distcc:k8s-2021-06-02
              name: worker
          restartPolicy: OnFailure

After compilation finishes, the installed MPICH binaries can be found in /src/mpich.
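
The master script builds DISTCC_HOSTS by replacing the newlines in /etc/volcano/worker.host with spaces. The Python sketch below performs the same transformation on sample text. The host names are made up for illustration, and the real file contents depend on the cluster.

```python
def distcc_hosts(worker_host_text):
    # Same effect as piping the file through tr: every newline becomes a space.
    return worker_host_text.replace("\n", " ")


# Hypothetical contents of /etc/volcano/worker.host, one host per line.
sample = "distcc-worker-0.distcc\ndistcc-worker-1.distcc\n"
hosts = distcc_hosts(sample)
print(repr(hosts))
```

The trailing newline becomes a trailing space, which does no harm in a space-separated host list.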
