Debugging Guide
This guide helps you debug issues with the homelab-autoscaler system.
⚠️ IMPORTANT: Many symptoms are caused by documented Known Issues. Check there first.
General Debugging Approach
1. Check System Status
# Check if controller is running
kubectl get pods -n homelab-autoscaler-system
# Check CRD installation
kubectl get crds | grep homecluster
# Check resource status
kubectl get groups -n homelab-autoscaler-system
kubectl get nodes.infra.homecluster.dev -n homelab-autoscaler-system
2. Examine Logs
# Controller logs
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager
# Follow logs in real-time
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager -f
# Get logs from specific container
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager -c manager
3. Check Resource Details
# Describe resources for detailed information
kubectl describe group <group-name> -n homelab-autoscaler-system
kubectl describe node.infra.homecluster.dev <node-name> -n homelab-autoscaler-system
# Check events
kubectl get events -n homelab-autoscaler-system --sort-by='.lastTimestamp'
Component-Specific Debugging
Controller Issues
Symptoms
- Resources not being reconciled
- Status not updating
- Controllers crashing or restarting
Debugging Steps
# Check controller pod status
kubectl get pods -n homelab-autoscaler-system -l control-plane=controller-manager
# Get detailed pod information
kubectl describe pod -n homelab-autoscaler-system -l control-plane=controller-manager
# Check controller logs for errors
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep ERROR
# Check for reconciliation loops
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep "Reconciling"
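If the logs are too sparse, verbosity can often be raised in place. This is a hedged sketch: it assumes the manager binary exposes controller-runtime's standard zap flags (kubebuilder-scaffolded managers usually do) and that the manager is the first container in the pod; adjust the container index if a sidecar such as kube-rbac-proxy comes first.
# Raise log verbosity (assumes controller-runtime zap flags and container index 0)
kubectl patch deployment homelab-autoscaler-controller-manager -n homelab-autoscaler-system \
  --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--zap-log-level=debug"}]'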
Common Issues
- RBAC Permissions (see the kubectl auth can-i check after this list)
# Check service account
kubectl get serviceaccount -n homelab-autoscaler-system
# Check role bindings
kubectl get rolebinding,clusterrolebinding | grep homelab-autoscaler
- CRD Installation
# Verify CRDs are installed
kubectl get crds | grep infra.homecluster.dev
kubectl get crd groups.infra.homecluster.dev
kubectl get crd nodes.infra.homecluster.dev
- Resource Validation
# Check for validation errors
kubectl apply -f config/samples/ --dry-run=server
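If a permission gap is suspected, kubectl auth can-i can check the controller's effective access directly. A minimal sketch: the service account name used below is an assumption; substitute whatever the serviceaccount listing above actually reports.
# Check whether the controller's service account can manage the CRs
# (service account name is an assumption; replace with the real one)
kubectl auth can-i list groups.infra.homecluster.dev \
  --as=system:serviceaccount:homelab-autoscaler-system:homelab-autoscaler-controller-manager \
  -n homelab-autoscaler-system
kubectl auth can-i update nodes.infra.homecluster.dev/status \
  --as=system:serviceaccount:homelab-autoscaler-system:homelab-autoscaler-controller-manager \
  -n homelab-autoscaler-system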
gRPC Server Issues
Symptoms
- Cluster Autoscaler cannot connect
- gRPC methods returning errors
- Scaling operations not working
Debugging Steps
# Check if gRPC server is running
kubectl get pods -n homelab-autoscaler-system
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep "gRPC"
# Port forward to gRPC server
kubectl port-forward -n homelab-autoscaler-system service/homelab-autoscaler-grpc 50051:50051 &
# Test gRPC connectivity
grpcurl -plaintext localhost:50051 list
# Test specific methods
grpcurl -plaintext -d '{}' localhost:50051 externalgrpc.CloudProvider/NodeGroups
Known gRPC Issues
- NodeGroupTargetSize Logic Bug
  - Symptom: Incorrect target sizes returned
  - Location: internal/grpcserver/server.go:306-311
  - Workaround: Manual node management
- NodeGroupDecreaseTargetSize Not Persisting
  - Symptom: Scale-down operations appear to work but don't persist
  - Location: internal/grpcserver/server.go:461-473
  - Workaround: Manual power state changes (see the sketch after this list)
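Until these bugs are fixed, scale-down can be applied by hand using the same spec.powerState patch shown under Recovery Procedures below:
# Manually power off a node the autoscaler failed to scale down
kubectl patch node.infra.homecluster.dev <node-name> -n homelab-autoscaler-system \
  --type=merge -p '{"spec":{"powerState":"off"}}'
# Verify the change persisted
kubectl get node.infra.homecluster.dev <node-name> -n homelab-autoscaler-system \
  -o jsonpath='{.spec.powerState}'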
Node Power Management Issues
Symptoms
- Nodes stuck in transitional states
- Power operations failing
- Jobs not completing
💡 Future Solution: The FSM Architecture provides a comprehensive solution for state management issues, with formal state transitions, coordination lock integration, and automatic error recovery mechanisms.
Debugging Steps
# Check node status
kubectl get nodes.infra.homecluster.dev -n homelab-autoscaler-system
# Check power operation jobs
kubectl get jobs -n homelab-autoscaler-system
# Check job logs
kubectl logs -n homelab-autoscaler-system job/<job-name>
# Check job pod logs
kubectl get pods -n homelab-autoscaler-system -l job-name=<job-name>
kubectl logs -n homelab-autoscaler-system <pod-name>
Common Power Issues
- Startup Jobs Failing (see the manual power test sketch after this list)
# Check startup job logs
kubectl logs -n homelab-autoscaler-system job/node-startup-<hash>
# Common issues:
# - Network connectivity to target node
# - Incorrect credentials
# - Wrong MAC address for WoL
# - IPMI/BMC not accessible
- Shutdown Jobs Failing
# Check shutdown job logs
kubectl logs -n homelab-autoscaler-system job/node-shutdown-<hash>
# Common issues:
# - SSH connectivity problems
# - Insufficient permissions
# - Node already powered off
# - Network timeouts
- Nodes Stuck in Starting Up
# Check if node actually powered on
ping <node-ip>
# Check if kubelet is running on node
ssh <node> "systemctl status kubelet"
# Check node registration
kubectl get nodes | grep <node-name>
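When a startup or shutdown job fails, it can help to exercise the power path by hand, outside the job. A sketch assuming the standard wakeonlan and ipmitool CLIs are installed on your workstation and that you know the node's MAC and BMC details:
# Send a Wake-on-LAN magic packet directly (replace with the node's MAC address)
wakeonlan <mac-address>
# Query power state via IPMI, if the node has a BMC
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> power status
If the manual path works but the job fails, the problem is likely in the job's credentials or network access rather than the node itself.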
Group Management Issues
Symptoms
- Groups not managing nodes
- Health status not updating
- Autoscaling policies not working
Debugging Steps
# Check group status
kubectl get groups -n homelab-autoscaler-system -o wide
# Check group conditions
kubectl describe group <group-name> -n homelab-autoscaler-system
# Check nodes in group
kubectl get nodes.infra.homecluster.dev -n homelab-autoscaler-system -l group=<group-name>
Known Group Issues
- Group Controller Incomplete
  - Symptom: Groups only show "Loaded" condition
  - Cause: Controller only implements basic status setting
  - Workaround: Manual node management
- Health Status Not Updating
  - Symptom: Group health always shows "unknown"
  - Cause: Health calculation logic not implemented
  - Workaround: Check individual node health (see the sketch after this list)
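Since group health is not aggregated, per-node health has to be inspected directly. A sketch only: the status field paths below are assumptions about the Node CRD schema, so adjust the jsonpath to match what kubectl describe actually shows.
# List each node with its reported condition types (field paths are assumptions)
kubectl get nodes.infra.homecluster.dev -n homelab-autoscaler-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[*].type}{"\n"}{end}'
# Or fall back to describe, which prints the full status
kubectl describe nodes.infra.homecluster.dev -n homelab-autoscaler-system | grep -A 10 "Conditions"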
Diagnostic Commands
System Overview
#!/bin/bash
# homelab-autoscaler-debug.sh - Comprehensive system check
echo "=== Homelab Autoscaler Debug Report ==="
echo "Generated: $(date)"
echo
echo "=== Namespace and Pods ==="
kubectl get pods -n homelab-autoscaler-system
echo
echo "=== CRDs ==="
kubectl get crds | grep homecluster
echo
echo "=== Groups ==="
kubectl get groups -n homelab-autoscaler-system -o wide
echo
echo "=== Nodes ==="
kubectl get nodes.infra.homecluster.dev -n homelab-autoscaler-system -o wide
echo
echo "=== Jobs ==="
kubectl get jobs -n homelab-autoscaler-system
echo
echo "=== Recent Events ==="
kubectl get events -n homelab-autoscaler-system --sort-by='.lastTimestamp' | tail -10
echo
echo "=== Controller Logs (last 20 lines) ==="
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager --tail=20
gRPC Testing
#!/bin/bash
# grpc-test.sh - Test gRPC server functionality
# Port forward to gRPC server
kubectl port-forward -n homelab-autoscaler-system service/homelab-autoscaler-grpc 50051:50051 &
PF_PID=$!
sleep 2
echo "=== gRPC Server Test ==="
echo "Testing connectivity..."
grpcurl -plaintext localhost:50051 list
echo
echo "Testing NodeGroups..."
grpcurl -plaintext -d '{}' localhost:50051 externalgrpc.CloudProvider/NodeGroups
echo
echo "Testing NodeGroupTargetSize..."
grpcurl -plaintext -d '{"id": "test-group"}' localhost:50051 externalgrpc.CloudProvider/NodeGroupTargetSize
# Clean up
kill $PF_PID
Performance Debugging
Resource Usage
# Check controller resource usage
kubectl top pods -n homelab-autoscaler-system
# Check node resource usage
kubectl top nodes
# Check for resource limits
kubectl describe pod -n homelab-autoscaler-system -l control-plane=controller-manager | grep -A 5 -B 5 "Limits\|Requests"
Memory and CPU Analysis
# Get detailed resource metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/homelab-autoscaler-system/pods
# Check for memory leaks in logs
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep -i "memory\|oom"
Network Debugging
Connectivity Issues
# Test cluster DNS
kubectl run debug-pod --image=busybox --rm -it -- nslookup kubernetes.default
# Test TCP reachability of the gRPC service (gRPC will not answer plain HTTP,
# so expect an error body; an immediate "connection refused" is the real signal)
kubectl run debug-pod --image=busybox --rm -it -- wget -qO- http://homelab-autoscaler-grpc.homelab-autoscaler-system.svc.cluster.local:50051
# Check service endpoints
kubectl get endpoints -n homelab-autoscaler-system
gRPC Connectivity
# Test gRPC from within the cluster
# (the fullstorydev/grpcurl image uses grpcurl as its entrypoint, so pass only the arguments)
kubectl run grpc-test --image=fullstorydev/grpcurl --rm -it --restart=Never -- \
  -plaintext homelab-autoscaler-grpc.homelab-autoscaler-system.svc.cluster.local:50051 list
Log Analysis
Error Patterns
# Common error patterns to search for
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep -E "(ERROR|FATAL|panic|failed)"
# Reconciliation issues
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep "reconcile"
# gRPC errors
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | grep -i grpc
Structured Log Analysis
# Extract JSON logs if using structured logging
kubectl logs -n homelab-autoscaler-system deployment/homelab-autoscaler-controller-manager | jq 'select(.level == "error")'
Recovery Procedures
Restart Components
# Restart controller
kubectl rollout restart deployment/homelab-autoscaler-controller-manager -n homelab-autoscaler-system
# Force pod recreation
kubectl delete pods -n homelab-autoscaler-system -l control-plane=controller-manager
Reset Resource States
# Clear finalizers if resources are stuck
kubectl patch group <group-name> -n homelab-autoscaler-system -p '{"metadata":{"finalizers":[]}}' --type=merge
# Reset node power state
kubectl patch node.infra.homecluster.dev <node-name> -n homelab-autoscaler-system \
--type='merge' -p='{"spec":{"powerState":"off"}}'
Clean Up Failed Jobs
# Remove failed jobs
kubectl delete jobs -n homelab-autoscaler-system --field-selector=status.successful=0
# Clean up completed jobs
kubectl delete jobs -n homelab-autoscaler-system --field-selector=status.successful=1
When to Escalate
Critical Issues Requiring Code Fixes
- gRPC Logic Bugs: Cannot be worked around, require code changes
- Controller Race Conditions: May cause data corruption
- Missing Node Draining: Risk of data loss
Issues That Can Be Worked Around
- Incomplete Group Controller: Use manual node management
- Missing Error Handling: Monitor and restart manually
- Namespace Hardcoding: Deploy in correct namespace
Related Documentation
- Known Issues - Critical bugs and limitations
- Architecture Overview - System design
- Development Setup - Setting up for debugging
- API Reference - Resource specifications