Group CRD Reference

The Group Custom Resource Definition (CRD) defines autoscaling policies for collections of physical nodes.

API Version

  • Group: infra.homecluster.dev
  • Version: v1alpha1
  • Kind: Group

Specification

GroupSpec

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the group |
| scaleDownUtilizationThreshold | float64 | Yes | CPU utilization threshold for scale-down (0.0-1.0) |
| scaleDownGpuUtilizationThreshold | float64 | Yes | GPU utilization threshold for scale-down (0.0-1.0) |
| scaleDownUnneededTime | *metav1.Duration | Yes | Time a node must be unneeded before scale-down |
| scaleDownUnreadyTime | *metav1.Duration | Yes | Time an unready node waits before scale-down |
| maxNodeProvisionTime | *metav1.Duration | Yes | Maximum time to wait for node provisioning |
| zeroOrMaxNodeScaling | bool | Yes | Scale the group only between zero and its maximum size, with no intermediate sizes |
| ignoreDaemonSetsUtilization | bool | Yes | Ignore DaemonSet pods in utilization calculations |

Example

apiVersion: infra.homecluster.dev/v1alpha1
kind: Group
metadata:
  name: worker-nodes
  namespace: homelab-autoscaler-system
spec:
  name: worker-nodes
  maxSize: 5
  nodeSelector:
    node-type: worker
    zone: homelab
  scaleDownUtilizationThreshold: 0.5
  scaleDownGpuUtilizationThreshold: 0.5
  scaleDownUnneededTime: "10m"
  scaleDownUnreadyTime: "20m"
  maxNodeProvisionTime: "15m"
  zeroOrMaxNodeScaling: false
  ignoreDaemonSetsUtilization: true

Usage

Creating a Group

  1. Set scaling thresholds for CPU and GPU utilization
  2. Configure timing parameters for scale-down behavior
  3. Apply the Group resource to your cluster:

kubectl apply -f group.yaml

Monitoring Group Status

# Check group status
kubectl get groups -n homelab-autoscaler-system

# Get detailed group information
kubectl describe group worker-nodes -n homelab-autoscaler-system

# Watch group status changes
kubectl get groups -n homelab-autoscaler-system -w

Scaling Behavior

Scale-Up Triggers

  • Pod scheduling failures due to insufficient resources
  • CPU/memory pressure on existing nodes
  • External scaling requests via gRPC API
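For example, a workload whose resource requests cannot be satisfied by the group's current nodes will leave pods Pending and trigger a scale-up. A minimal sketch, reusing the nodeSelector from the example above; the Deployment name, image, and request sizes are illustrative, not part of this API:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-heavy  # illustrative workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cpu-heavy
  template:
    metadata:
      labels:
        app: cpu-heavy
    spec:
      nodeSelector:
        node-type: worker  # schedules onto the group's nodes
      containers:
        - name: worker
          image: busybox
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "2"  # large enough that existing nodes run out of capacity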

Scale-Down Triggers

  • Node utilization below scaleDownUtilizationThreshold
  • GPU utilization below scaleDownGpuUtilizationThreshold
  • Node unneeded for longer than scaleDownUnneededTime
  • Node unready for longer than scaleDownUnreadyTime

Timing Parameters

| Parameter | Purpose | Typical Value |
|---|---|---|
| scaleDownUnneededTime | Prevents flapping by requiring sustained low utilization | 10m |
| scaleDownUnreadyTime | Removes unresponsive nodes | 20m |
| maxNodeProvisionTime | Timeout for node startup operations | 15m |
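As a spec fragment (values copied from the table above, not hard defaults):

spec:
  scaleDownUnneededTime: "10m"  # node must stay unneeded this long before removal
  scaleDownUnreadyTime: "20m"   # unready node is removed after this long
  maxNodeProvisionTime: "15m"   # abandon the scale-up if the node has not joined by then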

Best Practices

Resource Planning

  • Account for startup time in maxNodeProvisionTime
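For example, if a physical node typically needs about 10 minutes to power on, boot, and register its kubelet, leave headroom above that (the 10-minute figure is illustrative):

spec:
  maxNodeProvisionTime: "15m"  # ~10m observed startup time plus headroom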

Threshold Tuning

  • Start with conservative thresholds (0.5-0.7) and adjust based on workload patterns (see the fragment after this list)
  • Monitor actual utilization patterns before optimizing thresholds
  • Consider workload characteristics (CPU vs memory intensive)
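A conservative starting point within that range might look like the following spec fragment (0.6 is an illustrative value, not a recommended default):

spec:
  scaleDownUtilizationThreshold: 0.6     # scale down only when CPU utilization drops below 60%
  scaleDownGpuUtilizationThreshold: 0.6  # likewise for GPU utilization
  ignoreDaemonSetsUtilization: true      # exclude DaemonSet overhead from the calculation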

Troubleshooting

Common Issues

  1. Nodes not joining group

    • Check that the group label is set on the Node CRDs (see the sketch after this list)
  2. Scaling not working

    • Verify the gRPC server is running and accessible
    • Check the controller logs for errors
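The debugging commands below select nodes by a group label. A minimal sketch of that label on a Node CRD, assuming group membership is driven by a label matching the Group's name (as the label selectors below suggest); other Node spec fields are omitted:

apiVersion: infra.homecluster.dev/v1alpha1
kind: Node
metadata:
  name: worker-01  # illustrative node name
  labels:
    group: worker-nodes  # must match the Group's name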

Debugging Commands

# Check group controller logs
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager

# List nodes in group
kubectl get nodes -l group=worker-nodes

# Check node CRDs
kubectl get nodes.infra.homecluster.dev -l group=worker-nodes

# Verify gRPC server status
kubectl port-forward -n homelab-autoscaler-system service/homelab-autoscaler-grpc 50051:50051