Known Issues

This document outlines current known issues, limitations, and workarounds for the homelab-autoscaler system.

Current Limitations

1. Configuration Constraints

Advanced gRPC Configuration

Location: internal/grpcserver/server.go
Description: Limited customization options for gRPC server behavior
Impact: May require code changes for specific deployment needs
Status: ✅ STABLE - Works reliably with default configuration
Enhancement: Additional configuration options planned

Node Draining Customization

Location: Node shutdown process
Description: Standard pod eviction process with limited customization
Impact: May not suit all workload types
Status: ✅ IMPLEMENTED - Graceful pod eviction working
Enhancement: Advanced draining policies planned

2. Operational Considerations

State Management

Location: internal/controller/infra/node_controller.go
Description: Core FSM implemented, power operations in development
Impact: Basic state transitions work, power operations require completion
Status: 🚧 PARTIAL - FSM architecture ready, power operations in development
Architecture: FSM Architecture provides robust foundation

Group Controller Features

Location: internal/controller/infra/group_controller.go
Description: Full autoscaling policy management and health monitoring
Impact: Complete autoscaling functionality
Status: ✅ OPERATIONAL - Manages groups and policies effectively
Enhancement: Advanced policy features in development

Functional Limitations

1. Error Handling

Missing Comprehensive Error Recovery

Symptom: System doesn't gracefully handle job failures
Impact: Manual intervention required for stuck operations
Status: ⚠️ LIMITED - Basic error handling only
Workaround: Monitor logs and restart manually

Job Timeout Handling

Symptom: Jobs that exceed their timeout may leave the system in an inconsistent state
Impact: Requires manual cleanup
Status: ⚠️ PARTIAL - Basic timeout detection implemented
Workaround: Monitor job status and clean up manually
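The manual monitoring step can be narrowed to jobs that Kubernetes has already marked as failed. The following is a minimal sketch, assuming the jobs created by the controllers report failures through the standard Job `Failed` condition (for example with reason `DeadlineExceeded`); the helper name `failed_jobs` is hypothetical and not part of the project.

```shell
# Print "<job-name><TAB><reason>" for every Job that carries a Failed condition.
# The namespace defaults to the one the controllers currently require.
# Assumption: failures surface through the standard Job "Failed" condition.
failed_jobs() {
  ns="${1:-homelab-autoscaler-system}"
  kubectl get jobs -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Failed")].reason}{"\n"}{end}' |
    awk -F '\t' '$2 != "" { print $1 "\t" $2 }'
}

# Usage: failed_jobs                 (default namespace shown above)
#        failed_jobs my-namespace    (any other namespace)
```

Reviewing this list before deleting anything avoids removing jobs that are still running.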

2. Configuration Limitations

Namespace Hardcoding

Location: Multiple controller files
Symptom: The system works only in the homelab-autoscaler-system namespace
Impact: No multi-tenancy support
Status: ⚠️ HARDCODED - Configuration option needed
Workaround: Deploy into the homelab-autoscaler-system namespace
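Because the namespace is fixed, it helps to confirm that it exists before deploying. This is a small sketch using only standard kubectl commands; the variable `AUTOSCALER_NS` and the helper `ensure_autoscaler_namespace` are hypothetical names introduced for illustration.

```shell
# The namespace the controllers currently expect.
AUTOSCALER_NS="homelab-autoscaler-system"

# Create the required namespace only if it does not exist yet.
ensure_autoscaler_namespace() {
  if kubectl get namespace "$AUTOSCALER_NS" >/dev/null 2>&1; then
    echo "namespace $AUTOSCALER_NS already exists"
  else
    kubectl create namespace "$AUTOSCALER_NS"
  fi
}

# Usage: ensure_autoscaler_namespace
```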

Limited Customization

Symptom: Few configuration options for timeouts and thresholds
Impact: One-size-fits-all behavior
Status: ⚠️ LIMITED - Basic configuration only
Workaround: Modify source code for custom behavior

3. Monitoring and Observability

Limited Metrics

Symptom: Few Prometheus metrics exposed
Impact: Difficult to monitor system health
Status: ⚠️ BASIC - Controller-runtime metrics only
Workaround: Monitor logs and resource status
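Until more metrics are exposed, resource status can be checked directly with kubectl. The sketch below assumes that `spec.powerState` (the field used by the patch command in this document) is the value worth watching; the helper names are hypothetical.

```shell
# Show each autoscaler Node resource with its desired power state.
# -A lists across all namespaces; kubectl also accepts it for
# cluster-scoped resources.
# Assumption: spec.powerState is the field of interest.
power_states() {
  kubectl get node.infra.homecluster.dev -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,POWER:.spec.powerState'
}

# List events in the autoscaler namespace, sorted by last timestamp.
recent_events() {
  kubectl get events -n homelab-autoscaler-system --sort-by=.lastTimestamp
}

# Usage: power_states
#        recent_events
```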

Insufficient Logging

Symptom: Limited structured logging for debugging
Impact: Difficult to troubleshoot issues
Status: ⚠️ BASIC - Basic logging implemented
Workaround: Increase log verbosity

Development Status Issues

1. Testing Coverage

Limited Integration Tests

Symptom: Few end-to-end test scenarios
Impact: Bugs may not be caught before release
Status: ⚠️ PARTIAL - Basic tests only
Solution: Comprehensive test suite needed

Missing Unit Tests for gRPC

Symptom: gRPC server methods not thoroughly tested
Impact: Logic bugs not caught in CI
Status: ⚠️ MISSING - No gRPC-specific tests
Solution: Add comprehensive gRPC test coverage

2. Documentation Gaps

API Documentation

Symptom: Limited API reference documentation
Impact: Difficult for users to understand CRD schemas
Status: ⚠️ PARTIAL - Basic CRD docs only
Solution: Comprehensive API documentation

Troubleshooting Guides

Symptom: Limited troubleshooting information
Impact: Users struggle to debug issues
Status: ✅ IMPROVED - Debugging guide available
Solution: Continue expanding troubleshooting docs

Workarounds and Mitigation

For Development and Testing

Use Manual Node Management

# Manually set power states instead of relying on gRPC
kubectl patch node.infra.homecluster.dev <node-name> \
  --type='merge' -p='{"spec":{"powerState":"on"}}'

Monitor Job Status

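# Watch for stuck jobs (press Ctrl+C to stop watching)
kubectl get jobs -n homelab-autoscaler-system --watch
# Clean up manually. Caution: status.successful=0 also matches jobs that are still running
kubectl delete jobs -n homelab-autoscaler-system --field-selector=status.successful=0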

Manual Pod Draining

# Drain nodes before shutdown
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
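The drain and power-state commands above can be chained so that a node is never powered down while pods are still running on it. This is a sketch under two assumptions: the Kubernetes node and the autoscaler Node resource share the same name, and the caller passes a power state value that the CRD accepts (only "on" appears in this document). The helper name `drain_then_set_power` is hypothetical.

```shell
# Drain a Kubernetes node, then set the desired power state on the
# autoscaler Node resource of the same name.
# Assumptions: both objects share one name, and <power-state> is a value
# the CRD accepts.
drain_then_set_power() {
  node="$1"
  state="$2"
  if [ -z "$node" ] || [ -z "$state" ]; then
    echo "usage: drain_then_set_power <node-name> <power-state>" >&2
    return 2
  fi
  # Stop here if the drain fails, so the power state is left unchanged.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  kubectl patch node.infra.homecluster.dev "$node" \
    --type='merge' -p="{\"spec\":{\"powerState\":\"$state\"}}"
}

# Usage: drain_then_set_power <node-name> <power-state>
```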

For Production Deployment

Review Configuration Options
- Priority: MEDIUM
- Customize settings for your environment

Implement Monitoring
- Priority: MEDIUM
- Set up observability for production operations

Configure Security
- Priority: HIGH
- Implement proper RBAC and network policies

Test Scaling Scenarios
- Priority: HIGH
- Validate autoscaling behavior in your environment

Issue Tracking

How to Report Issues

1. Check This Document First - Verify whether the issue is already known
2. Gather Debug Information - Use the Debugging Guide
3. Create GitHub Issue - Include logs, configuration, and reproduction steps
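Gathering the logs and configuration for an issue report can be scripted. The following is a rough sketch that uses only standard kubectl commands; the helper name `collect_debug_info`, the output file names, and the 500-line log tail are arbitrary choices made for illustration, not project conventions.

```shell
# Write a small debug bundle into a directory for attaching to an issue.
# Review the files for sensitive data before sharing them.
collect_debug_info() {
  out="${1:-autoscaler-debug}"
  ns="homelab-autoscaler-system"
  mkdir -p "$out" || return 1
  kubectl get jobs -n "$ns" -o wide > "$out/jobs.txt" 2>&1
  kubectl get pods -n "$ns" -o wide > "$out/pods.txt" 2>&1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  kubectl get node.infra.homecluster.dev -A -o yaml > "$out/nodes.yaml" 2>&1
  # Capture the most recent log lines from every pod in the namespace.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" --all-containers --tail=500
  done > "$out/logs.txt" 2>&1
  echo "debug bundle written to $out"
}

# Usage: collect_debug_info ./autoscaler-debug
```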

Contributing Fixes

1. Review Architecture - Understand the system design
2. Check FSM Implementation - Consider the FSM Architecture for state management fixes
3. Add Tests - Include unit and integration tests with fixes
4. Update Documentation - Update this document when issues are resolved

Related Documentation

Debugging Guide - How to troubleshoot issues
Architecture Overview - System design and components
FSM Architecture - Planned state management improvements
Development Setup - Setting up development environment
