State Management

Homelab Autoscaler uses a finite state machine (FSM) for reliable node power state management.

FSM Overview

The FSM manages node power transitions using the looplab/fsm library with coordination locks to prevent race conditions.

Shutdown ←→ StartingUp ←→ Ready ←→ ShuttingDown
                      \              /
                       \            /
                        JobFailed

States

State	Description
Shutdown	Node is powered off, ready for startup
StartingUp	Node startup operation in progress
Ready	Node is powered on and operational
ShuttingDown	Node shutdown operation in progress

State Transitions

Startup Flow

StartNode event → Shutdown → StartingUp
The startup operation runs as a Kubernetes Job
JobCompleted → StartingUp → Ready
JobFailed → StartingUp → Shutdown (retry)

Shutdown Flow

ShutdownNode event → Ready → ShuttingDown
The shutdown operation runs as a Kubernetes Job
JobCompleted → ShuttingDown → Shutdown
JobFailed → ShuttingDown → Ready (retry)
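
The two flows above amount to a small transition table. The following is a minimal, standard-library-only sketch of that table, not the project's looplab/fsm setup; the `State`, `Event`, `key`, and `Next` names are assumptions made for illustration.

```go
package main

import "fmt"

// State and Event values mirror the names used in the flows above.
// The Go types themselves are illustrative, not taken from the project.
type State string
type Event string

const (
	Shutdown     State = "Shutdown"
	StartingUp   State = "StartingUp"
	Ready        State = "Ready"
	ShuttingDown State = "ShuttingDown"
)

const (
	StartNode    Event = "StartNode"
	ShutdownNode Event = "ShutdownNode"
	JobCompleted Event = "JobCompleted"
	JobFailed    Event = "JobFailed"
)

// key identifies one edge: an event fired while in a given state.
type key struct {
	from  State
	event Event
}

// transitions lists every edge named in the startup and shutdown flows.
var transitions = map[key]State{
	{Shutdown, StartNode}:        StartingUp,
	{StartingUp, JobCompleted}:   Ready,
	{StartingUp, JobFailed}:      Shutdown,
	{Ready, ShutdownNode}:        ShuttingDown,
	{ShuttingDown, JobCompleted}: Shutdown,
	{ShuttingDown, JobFailed}:    Ready,
}

// Next returns the state reached by firing event in state from, or an
// error when the flows define no such transition.
func Next(from State, event Event) (State, error) {
	to, ok := transitions[key{from, event}]
	if !ok {
		return from, fmt.Errorf("event %s is not valid in state %s", event, from)
	}
	return to, nil
}

func main() {
	// Walk a startup that fails once and then succeeds.
	s := Shutdown
	for _, e := range []Event{StartNode, JobFailed, StartNode, JobCompleted} {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s --%s--> %s\n", s, e, next)
		s = next
	}
}
```

Any pair missing from the table, such as `StartNode` while the node is already `Ready`, is rejected rather than silently ignored, which is the property the FSM relies on to block invalid operations.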

Coordination Locks

The FSM uses Kubernetes annotations for coordination:

annotations:
  homelab-autoscaler.dev/operation-lock: "scale-up"
  homelab-autoscaler.dev/lock-owner: "node-controller"
  homelab-autoscaler.dev/lock-timestamp: "2024-01-01T00:00:00Z"
  homelab-autoscaler.dev/lock-timeout: "5m"

Lock Lifecycle

Acquired in before_<EVENT> callbacks, before the state transition runs
Released in after_<EVENT> callbacks, once the transition has succeeded
Prevents race conditions between controllers
5-minute timeout for stuck operations
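
The lifecycle above can be sketched against a plain annotation map. This is a simplified illustration, not the project's implementation: `TryAcquire`, `Release`, `lockExpired`, `defaultTimeout`, and `demoStart` are hypothetical names, and treating a lock with unreadable metadata as expired is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation keys as shown in the example above.
const (
	lockKey      = "homelab-autoscaler.dev/operation-lock"
	ownerKey     = "homelab-autoscaler.dev/lock-owner"
	timestampKey = "homelab-autoscaler.dev/lock-timestamp"
	timeoutKey   = "homelab-autoscaler.dev/lock-timeout"
)

// defaultTimeout matches the 5-minute lock timeout described above.
const defaultTimeout = 5 * time.Minute

// demoStart is a fixed clock value used by the example in main.
var demoStart = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// lockExpired reports whether the recorded lock has outlived its timeout.
// Assumption: a lock with an unparsable timestamp or timeout counts as expired.
func lockExpired(ann map[string]string, now time.Time) bool {
	ts, err := time.Parse(time.RFC3339, ann[timestampKey])
	if err != nil {
		return true
	}
	timeout, err := time.ParseDuration(ann[timeoutKey])
	if err != nil {
		return true
	}
	return now.After(ts.Add(timeout))
}

// TryAcquire (hypothetical) writes the lock annotations unless a live lock exists.
func TryAcquire(ann map[string]string, op, owner string, timeout time.Duration, now time.Time) error {
	if held, ok := ann[lockKey]; ok && !lockExpired(ann, now) {
		return fmt.Errorf("operation %q is locked by %q", held, ann[ownerKey])
	}
	ann[lockKey] = op
	ann[ownerKey] = owner
	ann[timestampKey] = now.UTC().Format(time.RFC3339)
	ann[timeoutKey] = timeout.String() // 5 minutes is written as "5m0s"
	return nil
}

// Release (hypothetical) removes every lock annotation.
func Release(ann map[string]string) {
	for _, k := range []string{lockKey, ownerKey, timestampKey, timeoutKey} {
		delete(ann, k)
	}
}

func main() {
	ann := map[string]string{}
	// The first caller gets the lock.
	fmt.Println(TryAcquire(ann, "scale-up", "node-controller", defaultTimeout, demoStart))
	// A second caller one minute later is refused.
	fmt.Println(TryAcquire(ann, "scale-down", "grpc-server", defaultTimeout, demoStart.Add(time.Minute)))
	// After the timeout has passed, the stale lock can be taken over.
	fmt.Println(TryAcquire(ann, "scale-down", "grpc-server", defaultTimeout, demoStart.Add(6*time.Minute)))
}
```

In a real cluster the map would be the annotations of a Kubernetes object, and the write would need to go through an update that the API server rejects on a stale `resourceVersion`; otherwise two controllers could both read "no lock" and both acquire it.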

Async Operations

Power operations use Kubernetes Jobs for async execution:

Startup Job

apiVersion: batch/v1
kind: Job
metadata:
  name: node-startup-{node-name}
spec:
  template:
    spec:
      restartPolicy: Never  # Jobs require Never or OnFailure
      containers:
      - name: power-manager
        image: homelab/power-manager:latest
        command: ["wake-on-lan"]
        args: ["00:11:22:33:44:55"]

Shutdown Job

apiVersion: batch/v1
kind: Job
metadata:
  name: node-shutdown-{node-name}
spec:
  template:
    spec:
      restartPolicy: Never  # Jobs require Never or OnFailure
      containers:
      - name: power-manager
        image: homelab/power-manager:latest
        command: ["ssh-shutdown"]
        args: ["worker-01.local"]

Error Handling

The FSM recovers from failed jobs, timeouts, and coordination conflicts:

Job Failures

Failed startup jobs → Shutdown state (ready for retry)
Failed shutdown jobs → Ready state (ready for retry)
Automatic retry with backoff strategy

Timeout Handling

5-minute timeout for job completion
State-aware backoff (30s → 2m → 5m)
Force cleanup after 15 minutes of stuck state
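
As a rough illustration of the numbers above, the sketch below encodes the 30s → 2m → 5m schedule, the 5-minute job timeout, and the 15-minute force-cleanup threshold. The `Backoff`, `Decide`, and `Action` names are hypothetical, and the sketch uses one schedule for every state because the source does not spell out how the backoff varies by state.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the list above; the identifiers are illustrative.
var backoffSteps = []time.Duration{30 * time.Second, 2 * time.Minute, 5 * time.Minute}

const (
	jobTimeout        = 5 * time.Minute
	forceCleanupAfter = 15 * time.Minute
)

// Backoff returns the delay before retry number attempt (0-based).
// Once the schedule is exhausted it stays at the last step.
func Backoff(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= len(backoffSteps) {
		attempt = len(backoffSteps) - 1
	}
	return backoffSteps[attempt]
}

// Action (hypothetical) is what a controller does about an in-flight operation.
type Action string

const (
	Wait         Action = "wait"
	Retry        Action = "retry"
	ForceCleanup Action = "force-cleanup"
)

// Decide maps the time spent in a transitional state to an action.
func Decide(inState time.Duration) Action {
	switch {
	case inState >= forceCleanupAfter:
		return ForceCleanup
	case inState >= jobTimeout:
		return Retry
	default:
		return Wait
	}
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("retry %d waits %s\n", attempt, Backoff(attempt))
	}
	for _, d := range []time.Duration{time.Minute, 6 * time.Minute, 20 * time.Minute} {
		fmt.Printf("after %s in state: %s\n", d, Decide(d))
	}
}
```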

Coordination Conflicts

FSM prevents invalid concurrent transitions
Clear error reporting and retry logic
Automatic lock cleanup

Integration Points

Controller Integration

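// NodeReconciler uses the FSM for state management.
// Fetching node for req.NamespacedName is omitted for brevity.
func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Named sm so the variable does not shadow the fsm package.
    sm := fsm.NewNodeStateMachine(node, r.Client)

    if node.Spec.DesiredPowerState == "on" {
        sm.StartNode()
    } else {
        sm.ShutdownNode()
    }
    return ctrl.Result{}, nil
}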

gRPC Server Integration

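// Cluster Autoscaler triggers FSM events.
// Selecting node from the requested node group is omitted for brevity.
func (s *Server) NodeGroupIncreaseSize(ctx context.Context, req *pb.NodeGroupIncreaseSizeRequest) (*pb.NodeGroupIncreaseSizeResponse, error) {
    sm := fsm.NewNodeStateMachine(node, s.Client)
    sm.StartNode()
    return &pb.NodeGroupIncreaseSizeResponse{}, nil
}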

Benefits

Reliability

Formal state transitions prevent invalid operations
Coordination locks prevent race conditions
Automatic error recovery and retry

Observability

Clear current state and valid transitions
State transition logging and metrics
Easy debugging of stuck operations

Flexibility

Extensible FSM architecture
Customizable timeout and backoff
Support for various power management methods

Configuration

Timeout Settings

Job timeout: 5 minutes (configurable)
Lock timeout: 5 minutes (configurable)
Backoff strategy: 30s → 2m → 5m

Power Methods

Supported power control methods:

Wake-on-LAN
IPMI/BMC interfaces
Smart PDUs
Custom scripts via SSH
Hardware-specific protocols
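
To make the first method concrete, here is a minimal sketch of how a Wake-on-LAN sender can build and broadcast its magic packet. It is not the code inside the power-manager image, which this page does not show; `MagicPacket` and `Wake` are hypothetical names, and UDP port 9 with a broadcast address is a common convention rather than a project setting.

```go
package main

import (
	"fmt"
	"net"
)

// MagicPacket builds a Wake-on-LAN payload: 6 bytes of 0xFF followed by
// the 6-byte hardware address repeated 16 times (102 bytes in total).
func MagicPacket(mac string) ([]byte, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return nil, err
	}
	// net.ParseMAC also accepts 8- and 20-byte addresses; only 6-byte ones apply here.
	if len(hw) != 6 {
		return nil, fmt.Errorf("expected a 6-byte MAC address, got %d bytes", len(hw))
	}
	packet := make([]byte, 0, 102)
	for i := 0; i < 6; i++ {
		packet = append(packet, 0xFF)
	}
	for i := 0; i < 16; i++ {
		packet = append(packet, hw...)
	}
	return packet, nil
}

// Wake sends the magic packet over UDP to addr, for example
// "255.255.255.255:9" (an assumed, conventional target).
func Wake(mac, addr string) error {
	packet, err := MagicPacket(mac)
	if err != nil {
		return err
	}
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write(packet)
	return err
}

func main() {
	// Build the packet for the MAC address used in the startup Job example.
	// Nothing is sent on the network here; call Wake to actually broadcast.
	packet, err := MagicPacket("00:11:22:33:44:55")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("packet bytes:", len(packet))
}
```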

Related Documentation

Architecture Overview - System design
Node CRD - Node resource definition
Development Setup - FSM implementation details
