High Availability Configuration
Follow this guide to ensure the highest availability possible for a Testkube deployment.
Replicas
Most of the components support multiple replicas. Specify the proper topologySpreadConstraints to schedule each pod in a different availability/fault zone within your infrastructure to provide fault tolerance against zonal failure.
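For reference, a zone-spread constraint in a pod template could look like the sketch below. The topology key and label selector are assumptions: the key must match whatever zone label your nodes actually carry (the well-known label is topology.kubernetes.io/zone, while the chart examples further down use a custom zone key), and the selector must match the pods being spread.
```yaml
# Sketch only: spread matching pods evenly across availability zones
topologySpreadConstraints:
  - maxSkew: 1                                  # allow at most one pod of imbalance between zones
    topologyKey: topology.kubernetes.io/zone    # assumption: adjust to your nodes' zone label
    whenUnsatisfiable: DoNotSchedule            # keep pods pending rather than piling into one zone
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: testkube-api    # hypothetical selector; match the pods being spread
```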
Priority Class
To ensure that Testkube pods are not preempted by other, less important workloads, make sure to create an appropriate priority class for the Testkube pods and specify it in the corresponding priorityClassName values.
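A minimal sketch of such a priority class, named high to match the priorityClassName values used in the examples below; the numeric value is an assumption and should be chosen to fit your cluster's existing priority scheme.
```yaml
# Sketch only: priority class referenced by priorityClassName: high
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 1000000              # assumption: higher values are preempted last
globalDefault: false        # only pods that explicitly reference this class get it
description: "Priority class for Testkube components"
```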
Dedicated Nodes
Resources must be available across multiple availability/fault zones to ensure Testkube pods can always respawn on a different node/zone in case of failure.
The example configurations below contain tolerations which assume dedicated nodes have been allocated in several availability zones for Testkube. The examples assume the nodes were tainted with a testkube key and a NoSchedule effect.
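For reference, the matching taint on the dedicated nodes could look like the node spec excerpt below (a sketch; in practice the taint would typically be applied by your node-provisioning tooling or with kubectl taint). Because the tolerations in the examples use operator: Exists, the taint's value does not matter.
```yaml
# Sketch only: node spec excerpt with the taint assumed by the tolerations below
spec:
  taints:
    - key: testkube
      value: "true"         # arbitrary; the tolerations match on the key alone
      effect: NoSchedule
```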
Example Configuration
Operator and Agent
Example values for the testkube chart for the agent and operator:
```yaml
testkube-api:
  # Create a priority class for Testkube, i.e. high
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-operator:
  priorityClassName: high
  # Dedicate nodes to Testkube, i.e. taint with testkube key
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
nats:
  config:
    cluster:
      enabled: true
      replicas: 3
  podTemplate:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
```
Control Plane
Example values for the testkube-enterprise chart for the control plane:
```yaml
testkube-cloud-api:
  replicaCount: 3
  # Spread replicas across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: testkube-cloud-api
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-cloud-ui:
  replicaCount: 3
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: testkube-cloud-ui
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-worker-service:
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
dex:
  replicaCount: 3
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: dex
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
nats:
  config:
    cluster:
      enabled: true
      replicas: 3
  podTemplate:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
```
Caveats
- Under the current architecture, the agent can only run as a single instance. Coordinating multiple replicas would require implementing leader election, but electing a new leader would most likely take longer than spawning a new pod on a different node and reconnecting.
- The operator can only run as a single instance, but it is responsible for running a periodic reconciliation process which, at worst, will be delayed while a new pod spawns on a different node.
- Setting up MongoDB for high availability is outside the scope of this guide, but for production deployments we highly recommend using a managed service such as MongoDB Atlas as an external MongoDB cluster.
- Dex should be backed by highly available storage such as etcd; see the sketch after this list.
- The NATS chart is currently missing the ability to specify tolerations and priorityClassName.
- The worker service could possibly work with multiple replicas, but this has not been verified in a production environment.
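As a rough sketch of the Dex storage caveat above, Dex itself supports an etcd storage backend; the endpoints below are placeholders, and the exact place to put this configuration depends on how your Dex chart exposes the Dex config.
```yaml
# Sketch only: Dex storage backed by an external etcd cluster (endpoints are placeholders)
storage:
  type: etcd
  config:
    endpoints:
      - http://etcd-0.etcd:2379
      - http://etcd-1.etcd:2379
      - http://etcd-2.etcd:2379
```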