AWS Cost Optimization Using Lambda Functions and Terraform
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
LinkedIn
Introduction
In modern cloud infrastructure, cost optimization and proactive incident prevention are crucial for maintaining efficient operations. This document outlines our implementation of AWS Lambda-based scheduling and monitoring systems that help reduce costs and prevent potential issues before they impact production.
💡 Note: For the complete code implementation and examples, please check my GitHub repository linked in the References section below.
Resource Scheduling System
We utilize several specialized Lambda functions to manage different AWS resources:
1. ASG (Auto Scaling Group) Scheduler
The ASG scheduler manages compute resources based on time schedules:
    import logging

    logger = logging.getLogger(__name__)

    # is_weekend, is_working_hours and update_asg_capacity are helpers not shown here
    def lambda_handler(event, context):
        """
        Manages Auto Scaling Groups based on schedule:
        - Working hours (9:00-18:00): Normal capacity
        - Off hours: Minimum capacity
        - Weekends: Zero capacity (for non-production)
        """
        try:
            asg_name = event.get('asg_name')
            if is_weekend():
                update_asg_capacity(asg_name, min=0, desired=0, max=0)
            elif is_working_hours():
                update_asg_capacity(asg_name, min=1, desired=2, max=4)
            else:
                update_asg_capacity(asg_name, min=1, desired=1, max=2)
        except Exception as e:
            logger.error(f"ASG scheduling failed: {str(e)}")
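The handler depends on time helpers that are not shown. A minimal sketch of how `is_weekend` and `is_working_hours` might look, assuming the 9:00-18:00 window from the docstring and a Monday-to-Friday work week. The `now` parameter and the `WORK_START`/`WORK_END` constants are additions for illustration, and the default clock is UTC because that is what the Lambda runtime uses; a real schedule would likely convert to the team's local time zone first.

```python
from datetime import datetime, time, timezone

WORK_START = time(9, 0)   # working hours begin at 9:00 (from the docstring)
WORK_END = time(18, 0)    # and end at 18:00


def is_weekend(now=None):
    """Return True on Saturday or Sunday."""
    now = now or datetime.now(timezone.utc)
    return now.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend


def is_working_hours(now=None):
    """Return True on a weekday from WORK_START (inclusive) to WORK_END (exclusive)."""
    now = now or datetime.now(timezone.utc)
    return not is_weekend(now) and WORK_START <= now.time() < WORK_END
```

Passing an explicit `now` keeps the schedule logic testable without waiting for a particular day or hour.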
2. RDS (Relational Database Service) Maintenance Scheduler
The RDS scheduler stops non-production databases during off-hours and starts them again for working hours:
    def lambda_handler(event, context):
        """
        Manages RDS instances:
        - Stops development databases during off-hours
        - Maintains production databases 24/7
        - Schedules maintenance windows
        """
        for instance in get_rds_instances():
            if instance.tags.get('Environment') != 'production':
                if is_off_hours():
                    stop_rds_instance(instance.id)
                else:
                    start_rds_instance(instance.id)
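One detail the loop glosses over is that an RDS instance can only be stopped when it is available and only started when it is stopped. The sketch below shows one way to make that decision explicit. It assumes each instance is a dict shaped like an entry in the boto3 `describe_db_instances` response (`DBInstanceStatus`, `TagList`); the function names `tags_to_dict` and `plan_rds_action` are hypothetical.

```python
def tags_to_dict(tag_list):
    """Convert a boto3-style TagList ([{'Key': ..., 'Value': ...}]) to a dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list or []}


def plan_rds_action(instance, off_hours):
    """Return 'stop', 'start', or None for one DB instance description."""
    tags = tags_to_dict(instance.get("TagList"))
    if tags.get("Environment") == "production":
        return None  # production databases stay up 24/7

    status = instance.get("DBInstanceStatus")
    if off_hours and status == "available":
        return "stop"
    if not off_hours and status == "stopped":
        return "start"
    return None  # already in the desired state, or mid-transition
```

The returned action would then map to the RDS client's `stop_db_instance` or `start_db_instance` call. Note that RDS starts a stopped instance again automatically after seven days, so a scheduler like this needs to keep running to re-stop long-idle databases.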
3. EKS (Elastic Kubernetes Service) Scheduler
The EKS scheduler scales the node groups of Kubernetes clusters down during off-hours and restores them afterwards:
    def lambda_handler(event, context):
        """
        Manages EKS node groups:
        - Scales down during off-hours
        - Adjusts capacity based on workload patterns
        """
        for nodegroup in list_nodegroups():
            if should_scale_down(nodegroup):
                update_nodegroup_size(nodegroup, desired=0)
            else:
                restore_nodegroup_capacity(nodegroup)
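Restoring capacity implies that the original sizes were saved before scaling down. A rough sketch of that bookkeeping follows, using dicts shaped like the `scalingConfig` argument of the boto3 EKS `update_nodegroup_config` call (`minSize`, `desiredSize`, `maxSize`). The helper names and the `min:desired:max` string format are hypothetical, and where the string is stored (a tag, a parameter store entry) is left open.

```python
def scale_down_config(current):
    """Build a scalingConfig that drains a node group to zero nodes."""
    return {
        "minSize": 0,
        "desiredSize": 0,
        # Assumption: EKS expects maxSize to be at least 1, so it is kept as-is
        "maxSize": max(1, current["maxSize"]),
    }


def encode_capacity(config):
    """Serialize the sizes to a compact 'min:desired:max' string for safekeeping."""
    return f'{config["minSize"]}:{config["desiredSize"]}:{config["maxSize"]}'


def decode_capacity(saved):
    """Parse a 'min:desired:max' string back into a scalingConfig dict."""
    min_size, desired, max_size = (int(part) for part in saved.split(":"))
    return {"minSize": min_size, "desiredSize": desired, "maxSize": max_size}
```

Saving the encoded value before applying `scale_down_config` lets the morning run restore exactly what was there, rather than a hard-coded default.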
Incident Prevention System
CloudWatch Metrics Monitoring
Our system implements proactive monitoring of critical metrics:
- Database Metrics
  - Storage space utilization
  - CPU usage
  - Connection count
  - IOPS utilization
- Application Metrics
  - Response times
  - Error rates
  - Queue lengths
  - Memory usage
Automated Prevention Actions
Example of automated response to metrics:
    def handle_metric_alarm(event, context):
        """
        Responds to CloudWatch alarms:
        - Executes database maintenance (VACUUM)
        - Adjusts resource capacity
        - Sends notifications
        """
        metric_name = event['detail']['metricName']
        if metric_name == 'FreeStorageSpace':
            execute_vacuum_maintenance()
        elif metric_name == 'CPUUtilization':
            scale_compute_resources()
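The handler reads the metric name from `event['detail']['metricName']`. Depending on how the alarm reaches Lambda, the name may sit elsewhere: to the best of my understanding, EventBridge "CloudWatch Alarm State Change" events nest it under `detail.configuration.metrics[].metricStat.metric.name`. The sketch below tolerates both shapes and routes through a dispatch table; treat the nested path as an assumption to verify against a real event, and `extract_metric_name` and `route_alarm` as hypothetical names.

```python
def extract_metric_name(event):
    """Return the metric name from either event shape, or None if absent."""
    detail = event.get("detail", {})
    if "metricName" in detail:  # flat shape used by the handler in this article
        return detail["metricName"]
    # Assumed EventBridge alarm shape: the name is nested in the alarm configuration
    for metric in detail.get("configuration", {}).get("metrics", []):
        name = metric.get("metricStat", {}).get("metric", {}).get("name")
        if name:
            return name
    return None


def route_alarm(event, actions):
    """Call the action registered for the event's metric; return its result."""
    action = actions.get(extract_metric_name(event))
    if action is None:
        return None  # unknown metric: nothing to do
    return action()
```

A dispatch table such as `{"FreeStorageSpace": execute_vacuum_maintenance, "CPUUtilization": scale_compute_resources}` keeps the handler flat as more metrics are added.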
Slack Notifications
Our system sends notifications to Slack channels as part of the alarm response described above.
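A minimal sketch of how such a notification could be posted, assuming a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the way the webhook URL is supplied are all hypothetical; only the standard library is used so the function fits in a Lambda package without extra dependencies.

```python
import json
import urllib.request


def build_slack_message(alarm_name, metric_name, action_taken):
    """Build the JSON payload for a Slack incoming webhook."""
    text = (
        f":warning: Alarm *{alarm_name}* fired on `{metric_name}`. "
        f"Action: {action_taken}"
    )
    return {"text": text}


def send_slack_notification(webhook_url, message, timeout=5):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The webhook URL is a secret, so it would typically come from an environment variable or a secrets store rather than from the source code.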
IAM Security Configuration
Each Lambda function has its own IAM role with least-privilege access:
EC2 Scheduler Role
    resource "aws_iam_role" "ec2_scheduler_lambda" {
      name = "Ec2SchedulerLambda"
      # Permissions for EC2 management
    }
RDS Scheduler Role
    resource "aws_iam_role" "rds_scheduler_lambda" {
      name = "RDSSchedulerLambda"
      # Permissions for RDS management
    }
EKS Scheduler Role
    resource "aws_iam_role" "eks_scheduler_lambda" {
      name = "eksSchedulerLambda"
      # Permissions for EKS management
    }
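The role blocks leave the permissions as comments. As an illustration of what least privilege could mean for the RDS scheduler, here is a sketch, written in Python for consistency with the Lambda code, that builds a policy document allowing only the calls a start/stop scheduler needs. The helper name and the statement IDs are hypothetical, and the exact action list should be checked against what the real function calls.

```python
import json


def rds_scheduler_policy(region, account_id):
    """Return an IAM policy document limited to listing, starting and stopping."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInstancesAndTags",
                "Effect": "Allow",
                "Action": ["rds:DescribeDBInstances", "rds:ListTagsForResource"],
                "Resource": "*",
            },
            {
                "Sid": "StartStopInstances",
                "Effect": "Allow",
                "Action": ["rds:StartDBInstance", "rds:StopDBInstance"],
                # Scoped to DB instances in one account and region
                "Resource": f"arn:aws:rds:{region}:{account_id}:db:*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(rds_scheduler_policy("eu-west-1", "123456789012"), indent=2))
```

Nothing here grants delete or modify rights, so a bug in the scheduler can at worst stop or start an instance. The resource ARN could be narrowed further to specific instance names.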
Cost Optimization Features
1. Automated Resource Management 🤖
- Scheduled start/stop of development resources
- Capacity adjustment based on usage patterns
- Weekend and holiday scheduling
- ⏰ Shutdown of dev/stage/test EKS clusters during off-hours (~8-12 hours/day)
- 🛑 Stopping RDS instances for dev/stage environments
- 📉 Reducing the EC2 instance count in ASGs for dev/stage environments
2. Preventive Maintenance 🔧
- Automated database VACUUM operations
- Storage space monitoring
- Performance optimization
3. Resource Right-sizing 📊
- Regular utilization analysis
- Automatic scaling adjustments
- Cost-effective resource allocation
Benefits Achieved
1. Cost Reduction 💰
- 40-60% reduction in development environment costs through:
  - Shutting down EKS clusters during off-hours
  - Stopping RDS instances during non-working hours
  - Reducing the EC2 instance count in ASGs
- Elimination of idle resource costs during weekends
- Monthly savings of approximately $5000-7000 on dev/stage environments
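A quick back-of-the-envelope check puts these figures in context. Assuming resources run only during the 9:00-18:00 window on five weekdays (the schedule from the ASG docstring), the share of weekly hours that is switched off can be computed as follows; the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def hours_saved_fraction(hours_per_day=9, days_per_week=5):
    """Fraction of the week during which scheduled resources are off."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


print(f"{hours_saved_fraction():.1%}")  # 73.2%
```

Roughly 73% of running hours disappear under that schedule, yet the reported cost reduction is 40-60%. A plausible explanation is that some charges continue while compute is off: storage for stopped RDS instances, the EKS control plane, and the minimum ASG capacity kept on weeknights.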
2. Improved Reliability ⚡
- Zero downtime due to storage issues
- Proactive issue detection
- Automated maintenance procedures
3. Operational Efficiency 🎯
- Reduced manual intervention
- Consistent resource management
- Automated incident response
Implementation Details
Here's an example of our RDS maintenance implementation:
    def maintenance_task():
        conn = None
        try:
            # Connect to database
            conn = connect_to_database()
            # Execute maintenance (VACUUM FULL locks each table while it rewrites it)
            execute_vacuum_full(conn)
            # Notify success
            send_notification("Maintenance completed successfully")
        finally:
            # Release the connection, then auto-terminate instance
            if conn is not None:
                conn.close()
            terminate_instance()
Monitoring and Alerting
- Critical Metrics
  - Database storage utilization
  - Application error rates
  - Resource utilization patterns
  - Performance metrics
- Alert Thresholds
  - Warning: 70% utilization
  - Critical: 85% utilization
  - Emergency: 95% utilization
- Response Actions
  - Automated maintenance
  - Resource scaling
  - Team notifications
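The three thresholds above can be expressed as a small classifier. This is a sketch under two assumptions not stated in the source: each threshold is inclusive, and anything below the warning level is labelled `"ok"`. The names `THRESHOLDS` and `classify_utilization` are hypothetical.

```python
# Ordered from most to least severe so the first match wins
THRESHOLDS = [
    ("emergency", 95.0),
    ("critical", 85.0),
    ("warning", 70.0),
]


def classify_utilization(percent):
    """Map a utilization percentage to an alert level."""
    for level, limit in THRESHOLDS:
        if percent >= limit:
            return level
    return "ok"
```

Keeping the levels in one ordered table makes it easy to attach a different response to each level, for example notification only at warning and automated maintenance at critical.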
Best Practices
- Resource Tagging

      tags = {
        Name        = "resource-name"
        Environment = "dev"
        Schedule    = "business-hours"
      }

- Monitoring Configuration
  - Set appropriate thresholds based on historical data
  - Implement graduated response actions
  - Maintain comprehensive monitoring documentation
- Security Measures
  - Use least-privilege IAM roles
  - Implement proper error handling
  - Maintain audit logs
Conclusion
Our AWS Lambda-based scheduling and monitoring system has proven highly effective in:
- Reducing operational costs through automated resource management
- Preventing incidents through proactive monitoring
- Improving system reliability through automated maintenance
- Reducing team workload through automation
The combination of scheduled resource management and proactive monitoring ensures optimal resource utilization while maintaining system stability and performance.
References
- 📚 GitHub repository - Complete implementation of AWS Lambda schedulers for cost optimization
- AWS Lambda Documentation
- CloudWatch Documentation
- AWS Auto Scaling
- AWS RDS Documentation