High Availability Configuration
Follow this guide to ensure the highest availability possible for a Testkube deployment.
Replicas
Most of the components support multiple replicas. Specify the proper topologySpreadConstraints to schedule each pod in a different availability/fault zone within your infrastructure to provide fault tolerance against zonal failure.
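For reference, a zone-spread constraint in a pod template could look like the sketch below. The topology key and label selector are assumptions: the key must match whatever zone label your nodes actually carry (the well-known label is topology.kubernetes.io/zone, while the chart examples further down use a custom zone key), and the selector must match the pods being spread.
```yaml
# Sketch only: spread matching pods evenly across availability zones
topologySpreadConstraints:
  - maxSkew: 1                                  # allow at most one pod of imbalance between zones
    topologyKey: topology.kubernetes.io/zone    # assumption: adjust to your nodes' zone label
    whenUnsatisfiable: DoNotSchedule            # keep pods pending rather than piling into one zone
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: testkube-api    # hypothetical selector; match the pods being spread
```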
Priority Class
To ensure that Testkube pods are not preempted by other, less important workloads, make sure to create an appropriate priority class for the Testkube pods and specify it in the corresponding priorityClassName values.
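A minimal sketch of such a priority class, named high to match the priorityClassName values used in the examples below; the numeric value is an assumption and should be chosen to fit your cluster's existing priority scheme.
```yaml
# Sketch only: priority class referenced by priorityClassName: high
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 1000000              # assumption: higher values are preempted last
globalDefault: false        # only pods that explicitly reference this class get it
description: "Priority class for Testkube components"
```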
Dedicated Nodes
Resources must be available across multiple availability/fault zones to ensure Testkube pods can always respawn on a different node/zone in case of failure.
The example configurations below contain tolerations which assume dedicated nodes have been allocated in several availability zones for Testkube. The examples assume the nodes were tainted with a testkube key and a NoSchedule effect.
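For reference, the matching taint on the dedicated nodes could look like the node spec excerpt below (a sketch; in practice the taint would typically be applied by your node-provisioning tooling or with kubectl taint). Because the tolerations in the examples use operator: Exists, the taint's value does not matter.
```yaml
# Sketch only: node spec excerpt with the taint assumed by the tolerations below
spec:
  taints:
    - key: testkube
      value: "true"         # arbitrary; the tolerations match on the key alone
      effect: NoSchedule
```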
Example Configuration
Operator and Agent
Example values for the testkube chart for the agent and operator:
```yaml
testkube-api:
  # Create a priority class for Testkube, i.e. high
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-operator:
  priorityClassName: high
  # Dedicate nodes to Testkube, i.e. taint with testkube key
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
nats:
  config:
    cluster:
      enabled: true
      replicas: 3
  podTemplate:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
```
Control Plane
Example values for the testkube-enterprise chart for the control plane:
```yaml
testkube-cloud-api:
  replicaCount: 3
  # Spread replicas across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: testkube-cloud-api
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-cloud-ui:
  replicaCount: 3
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: testkube-cloud-ui
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
testkube-worker-service:
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
dex:
  replicaCount: 3
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: dex
  priorityClassName: high
  tolerations:
    - key: testkube
      operator: Exists
      effect: NoSchedule
nats:
  config:
    cluster:
      enabled: true
      replicas: 3
  podTemplate:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
```
Caveats
- Under the current architecture, the agent can only run as a single instance. Coordinating multiple replicas would require implementing leader election, but electing a new leader would most likely take longer than spawning a new pod on a different node and reconnecting.
- The operator can only run as a single instance, but it is responsible for running a periodic reconciliation process which, at worst, will be delayed while a new pod spawns on a different node.
- Setting up MongoDB for high availability is outside the scope of this guide, but for production deployments we highly recommend using a managed service such as MongoDB Atlas as an external MongoDB cluster.
- Dex should be backed by highly available storage such as etcd; see the sketch after this list.
- The NATS chart is currently missing the ability to specify tolerations and priorityClassName.
- The worker service could possibly work with multiple replicas, but this has not been verified in a production environment.
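As a rough sketch of the Dex storage caveat above, Dex itself supports an etcd storage backend; the endpoints below are placeholders, and the exact place to put this configuration depends on how your Dex chart exposes the Dex config.
```yaml
# Sketch only: Dex storage backed by an external etcd cluster (endpoints are placeholders)
storage:
  type: etcd
  config:
    endpoints:
      - http://etcd-0.etcd:2379
      - http://etcd-1.etcd:2379
      - http://etcd-2.etcd:2379
```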