AWS Cost Optimization Using Lambda Functions and Terraform

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
LinkedIn

Introduction

In modern cloud infrastructure, cost optimization and proactive incident prevention are crucial for maintaining efficient operations. This document outlines our implementation of AWS Lambda-based scheduling and monitoring systems that help reduce costs and prevent potential issues before they impact production.

💡 Note: For the complete code implementation and examples, please check my GitHub repository linked in the References section below.

Resource Scheduling System

We utilize several specialized Lambda functions to manage different AWS resources:

1. ASG (Auto Scaling Group) Scheduler

The ASG scheduler manages compute resources based on time schedules:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """
    Manages Auto Scaling Groups based on schedule:
    - Working hours (9:00-18:00): Normal capacity
    - Off hours: Minimum capacity
    - Weekends: Zero capacity (for non-production)
    """
    try:
        # The target Auto Scaling Group is passed in by the invoking event
        asg_name = event.get('asg_name')
        if is_weekend():
            update_asg_capacity(asg_name, min=0, desired=0, max=0)
        elif is_working_hours():
            update_asg_capacity(asg_name, min=1, desired=2, max=4)
        else:
            update_asg_capacity(asg_name, min=1, desired=1, max=2)
    except Exception as e:
        # Log the failure, then re-raise so the invocation is reported as failed
        logger.error(f"ASG scheduling failed: {e}")
        raise
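
The handler above calls helpers that are not shown here. As a minimal sketch of what they might look like, assuming the 9:00-18:00 window from the docstring, a clock already expressed in the team's local time, and an injected boto3 Auto Scaling client (the function signatures and parameter names below are illustrative, not taken from the repository):

```python
from datetime import datetime

WORK_START_HOUR = 9   # assumed from the 9:00-18:00 window in the docstring
WORK_END_HOUR = 18

def is_weekend(now: datetime) -> bool:
    """Saturday (5) and Sunday (6) count as the weekend."""
    return now.weekday() >= 5

def is_working_hours(now: datetime) -> bool:
    """True on weekdays from 09:00 (inclusive) to 18:00 (exclusive)."""
    return not is_weekend(now) and WORK_START_HOUR <= now.hour < WORK_END_HOUR

def capacity_for(now: datetime) -> dict:
    """Pick ASG sizes for a point in time, mirroring the handler's three branches."""
    if is_weekend(now):
        return {"min_size": 0, "desired": 0, "max_size": 0}
    if is_working_hours(now):
        return {"min_size": 1, "desired": 2, "max_size": 4}
    return {"min_size": 1, "desired": 1, "max_size": 2}

def update_asg_capacity(client, asg_name: str, min_size: int, desired: int, max_size: int) -> None:
    """Apply the sizes through a boto3 Auto Scaling client passed in by the caller."""
    client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=min_size,
        DesiredCapacity=desired,
        MaxSize=max_size,
    )
```

With boto3 available, a call could look like update_asg_capacity(boto3.client("autoscaling"), "my-asg", **capacity_for(datetime.now())), where "my-asg" is a placeholder name. Lambda runs in UTC by default, so the clock would need converting to the team's timezone before the hour comparison.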

2. RDS (Relational Database Service) Maintenance Scheduler

The RDS scheduler handles database maintenance tasks:

def lambda_handler(event, context):
    """
    Manages RDS instances:
    - Stops development databases during off-hours
    - Maintains production databases 24/7
    - Schedules maintenance windows
    """
    for instance in get_rds_instances():
        if instance.tags.get('Environment') != 'production':
            if is_off_hours():
                stop_rds_instance(instance.id)
            else:
                start_rds_instance(instance.id)
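
To make the stop/start decision concrete, here is a rough sketch that works on the dictionaries returned in the DBInstances list of boto3's describe_db_instances call. The function names and the split into a planning step and an apply step are assumptions made for illustration; only the Environment tag check comes from the handler above.

```python
def tags_of(instance: dict) -> dict:
    """Flatten the TagList of a DescribeDBInstances entry into a plain dict."""
    return {t["Key"]: t["Value"] for t in instance.get("TagList", [])}

def plan_rds_actions(instances: list, off_hours: bool) -> list:
    """Return (action, identifier) pairs for non-production instances."""
    actions = []
    for inst in instances:
        if tags_of(inst).get("Environment") == "production":
            continue  # production databases stay up 24/7
        status = inst["DBInstanceStatus"]
        if off_hours and status == "available":
            actions.append(("stop", inst["DBInstanceIdentifier"]))
        elif not off_hours and status == "stopped":
            actions.append(("start", inst["DBInstanceIdentifier"]))
    return actions

def apply_rds_actions(client, actions: list) -> None:
    """Execute the planned actions through a boto3 RDS client passed in by the caller."""
    for action, db_id in actions:
        if action == "stop":
            client.stop_db_instance(DBInstanceIdentifier=db_id)
        else:
            client.start_db_instance(DBInstanceIdentifier=db_id)
```

Checking the current status first avoids API errors from stopping an instance that is already stopped. Note that, like the handler above, this treats untagged instances as non-production. RDS also restarts a stopped instance on its own after seven days, so a recurring schedule is what keeps long-idle databases stopped.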

3. EKS (Elastic Kubernetes Service) Scheduler

The EKS scheduler manages Kubernetes clusters:

def lambda_handler(event, context):
    """
    Manages EKS node groups:
    - Scales down during off-hours
    - Adjusts capacity based on workload patterns
    """
    for nodegroup in list_nodegroups():
        if should_scale_down(nodegroup):
            update_nodegroup_size(nodegroup, desired=0)
        else:
            restore_nodegroup_capacity(nodegroup)
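
Restoring capacity requires remembering what the node group looked like before it was scaled to zero. One possible approach, sketched below, is to save the original sizes as key/value pairs (for example as tags) before scaling down. The tag keys and function names are hypothetical; update_nodegroup_config and its scalingConfig fields are the boto3 EKS call for managed node groups.

```python
ORIGINAL_MIN_KEY = "scheduler/original-min"          # hypothetical tag keys
ORIGINAL_DESIRED_KEY = "scheduler/original-desired"

def scale_down(scaling: dict) -> tuple:
    """Return (new scalingConfig, values to save so the sizes can be restored)."""
    saved = {
        ORIGINAL_MIN_KEY: str(scaling["minSize"]),
        ORIGINAL_DESIRED_KEY: str(scaling["desiredSize"]),
    }
    # minSize must drop to 0 as well, otherwise desiredSize=0 is rejected
    new_scaling = {"minSize": 0, "maxSize": scaling["maxSize"], "desiredSize": 0}
    return new_scaling, saved

def restore(scaling: dict, saved: dict) -> dict:
    """Rebuild the working-hours scalingConfig from the saved values."""
    return {
        "minSize": int(saved[ORIGINAL_MIN_KEY]),
        "maxSize": scaling["maxSize"],
        "desiredSize": int(saved[ORIGINAL_DESIRED_KEY]),
    }

def update_nodegroup_size(client, cluster: str, nodegroup: str, scaling: dict) -> None:
    """Apply a scalingConfig through a boto3 EKS client passed in by the caller."""
    client.update_nodegroup_config(
        clusterName=cluster,
        nodegroupName=nodegroup,
        scalingConfig=scaling,
    )
```

Where the saved values live (tags, a parameter store, a small table) is left open here. A real scheduler would also want to skip node groups that are already at zero, so a second off-hours run does not overwrite the saved sizes with zeros.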

Incident Prevention System

CloudWatch Metrics Monitoring

Our system implements proactive monitoring of critical metrics:

Database Metrics
- Storage space utilization
- CPU usage
- Connection count
- IOPS utilization

Application Metrics
- Response times
- Error rates
- Queue lengths
- Memory usage
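
As an illustration of how one of these metrics could be turned into an alarm, the sketch below builds the keyword arguments for boto3's CloudWatch put_metric_alarm call for RDS storage. The alarm name, the five-minute period, the three evaluation periods, and the helper itself are assumptions chosen for the example, not values from our setup.

```python
def free_storage_alarm(db_id: str, allocated_gib: int, max_used_pct: float, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments that fire when used storage exceeds max_used_pct."""
    allocated_bytes = allocated_gib * 1024 ** 3
    # FreeStorageSpace is reported in bytes, so the utilization limit is
    # converted into a minimum number of free bytes.
    min_free_bytes = allocated_bytes * (100 - max_used_pct) / 100
    return {
        "AlarmName": f"{db_id}-free-storage-low",   # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,      # consecutive breaches before alarming (assumed)
        "Threshold": min_free_bytes,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

The resulting dictionary can be passed directly as cloudwatch_client.put_metric_alarm(**free_storage_alarm(...)). Because the metric counts free bytes rather than used percentage, the comparison is "less than" rather than "greater than".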

Automated Prevention Actions

Example of automated response to metrics:

def handle_metric_alarm(event, context):
    """
    Responds to CloudWatch alarms:
    - Executes database maintenance (VACUUM)
    - Adjusts resource capacity
    - Sends notifications
    """
    metric_name = event['detail']['metricName']
    if metric_name == 'FreeStorageSpace':
        execute_vacuum_maintenance()
    elif metric_name == 'CPUUtilization':
        scale_compute_resources()
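
As the number of handled metrics grows, the if/elif chain can be replaced by a lookup table. The following is a small sketch of that idea, assuming the same event shape as the handler above (event['detail']['metricName']); the route_alarm name and the decision to log unknown metrics rather than raise are choices made for this example.

```python
import logging

logger = logging.getLogger(__name__)

def route_alarm(event: dict, actions: dict):
    """Look up the action registered for the alarm's metric and run it."""
    metric_name = event.get("detail", {}).get("metricName")
    action = actions.get(metric_name)
    if action is None:
        # Unknown or missing metric: log it instead of failing with a KeyError
        logger.warning("No action registered for metric %r", metric_name)
        return None
    return action()
```

The table itself would map metric names to the existing functions, for example {"FreeStorageSpace": execute_vacuum_maintenance, "CPUUtilization": scale_compute_resources}, so adding a new response becomes a one-line change.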

Slack Notifications

Our system sends notifications to Slack channels. The handle_metric_alarm handler shown above is responsible for this: besides executing database maintenance (VACUUM) and adjusting resource capacity, it sends a notification for each CloudWatch alarm it responds to.
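
The notification code itself is not shown in this article. As a hedged sketch of one common way to do it, the example below posts a message to a Slack incoming webhook using only the standard library. The function names, the timeout, and the idea of passing the webhook URL as a parameter are assumptions; the only Slack-specific detail relied on is that incoming webhooks accept a JSON body with a "text" field.

```python
import json
import urllib.request

def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Prepare a POST request carrying the message as a JSON payload."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_notification(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a Lambda function the webhook URL would typically come from an environment variable or a secrets store rather than being hard-coded.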

IAM Security Configuration

Each Lambda function has specific IAM roles with least-privilege access:

EC2 Scheduler Role

resource "aws_iam_role" "ec2_scheduler_lambda" {
    name = "Ec2SchedulerLambda"
    # Permissions for EC2 management
}

RDS Scheduler Role

resource "aws_iam_role" "rds_scheduler_lambda" {
    name = "RDSSchedulerLambda"
    # Permissions for RDS management
}

EKS Scheduler Role

resource "aws_iam_role" "eks_scheduler_lambda" {
    name = "eksSchedulerLambda"
    # Permissions for EKS management
}
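
The permission sets are elided in the role definitions above. To give a sense of what least privilege could mean for the RDS scheduler, here is a sketch that builds a policy document as a Python dictionary. The statement IDs and the choice of actions are assumptions based on what a start/stop scheduler needs; the actual policies in the repository may differ.

```python
import json

def rds_scheduler_policy(region: str, account_id: str) -> dict:
    """Build an IAM policy document limited to listing, starting and stopping instances."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInstancesAndTags",
                "Effect": "Allow",
                "Action": ["rds:DescribeDBInstances", "rds:ListTagsForResource"],
                "Resource": "*",
            },
            {
                "Sid": "StartStopInstances",
                "Effect": "Allow",
                "Action": ["rds:StartDBInstance", "rds:StopDBInstance"],
                # Scoped to DB instances in one account and region
                "Resource": f"arn:aws:rds:{region}:{account_id}:db:*",
            },
        ],
    }

if __name__ == "__main__":
    # The account ID below is a placeholder
    print(json.dumps(rds_scheduler_policy("eu-west-1", "123456789012"), indent=2))
```

The point of the sketch is what is absent: no delete, modify, or snapshot actions, so a bug in the scheduler cannot do more than start or stop a database.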

Cost Optimization Features

1. Automated Resource Management 🤖

- Scheduled start/stop of development resources
- Capacity adjustment based on usage patterns
- Weekend and holiday scheduling
- ⏰ Shutdown of dev/stage/test EKS clusters during off-hours (~8-12 hours/day)
- 🛑 Stopping RDS instances for dev/stage environments
- 📉 Reducing the EC2 instance count in ASGs for dev/stage environments

2. Preventive Maintenance 🔧

- Automated database VACUUM operations
- Storage space monitoring
- Performance optimization

3. Resource Right-sizing 📊

- Regular utilization analysis
- Automatic scaling adjustments
- Cost-effective resource allocation

Benefits Achieved

1. Cost Reduction 💰

- 40-60% reduction in development environment costs through:
    - Shutting down EKS clusters during off-hours
    - Stopping RDS instances during non-working hours
    - Reducing the number of EC2 instances in ASGs
- Elimination of idle resource costs during weekends
- Monthly savings of approximately $5,000-7,000 on dev/stage environments
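
To show where a figure in that range can come from, here is a back-of-the-envelope calculation using the ASG schedule described earlier (2 instances during working hours, 1 off-hours, 0 on weekends) compared with running 2 instances around the clock. It is purely illustrative arithmetic under those assumptions, not the method used to measure the savings reported above.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def weekly_instance_hours(work_desired: int = 2, off_desired: int = 1, weekend_desired: int = 0,
                          work_hours_per_day: int = 9, workdays: int = 5) -> int:
    """Instance-hours consumed per week under a working/off-hours/weekend schedule."""
    working = workdays * work_hours_per_day * work_desired      # 5 * 9 * 2 = 90
    off = workdays * (24 - work_hours_per_day) * off_desired    # 5 * 15 * 1 = 75
    weekend = (7 - workdays) * 24 * weekend_desired             # 2 * 24 * 0 = 0
    return working + off + weekend

scheduled = weekly_instance_hours()      # 165 instance-hours
always_on = 2 * HOURS_PER_WEEK           # 336 instance-hours
reduction = 1 - scheduled / always_on
print(f"{reduction:.1%}")                # prints 50.9%
```

Instance-hours are only a proxy for cost. Storage, for example, is still billed while an RDS instance is stopped, so real savings depend on the mix of resources.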

2. Improved Reliability ⚡

- Zero downtime due to storage issues
- Proactive issue detection
- Automated maintenance procedures

3. Operational Efficiency 🎯

- Reduced manual intervention
- Consistent resource management
- Automated incident response

Implementation Details

Here's an example of our RDS maintenance implementation:

def maintenance_task():
    conn = None
    try:
        # Connect to database
        conn = connect_to_database()
        # Execute maintenance on the open connection
        execute_vacuum_full(conn)
        # Notify success
        send_notification("Maintenance completed successfully")
    finally:
        # Close the connection if it was opened
        if conn is not None:
            conn.close()
        # Auto-terminate instance
        terminate_instance()
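
The body of the vacuum step is not shown. A minimal sketch of how it might be written against a PostgreSQL connection follows, assuming a driver such as psycopg2 whose connection objects expose an autocommit attribute. The function names and the table-name validation are illustrative additions.

```python
import re

# Accept plain or schema-qualified identifiers only, e.g. "events" or "public.events"
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

def build_vacuum_statement(table: str = None, full: bool = False) -> str:
    """Compose a VACUUM statement, optionally FULL and optionally for one table."""
    statement = "VACUUM FULL" if full else "VACUUM"
    if table is not None:
        if not _IDENTIFIER.match(table):
            raise ValueError(f"Unsafe table name: {table!r}")
        statement += f" {table}"
    return statement

def run_vacuum(conn, table: str = None, full: bool = False) -> None:
    """Run VACUUM on an open connection."""
    # VACUUM cannot run inside a transaction block, so autocommit is required
    conn.autocommit = True
    cursor = conn.cursor()
    try:
        cursor.execute(build_vacuum_statement(table, full))
    finally:
        cursor.close()
```

VACUUM FULL rewrites the table and holds an exclusive lock while it does so, which is one reason to run it from a scheduled maintenance task rather than during working hours.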

Monitoring and Alerting

Critical Metrics
- Database storage utilization
- Application error rates
- Resource utilization patterns
- Performance metrics

Alert Thresholds
- Warning: 70% utilization
- Critical: 85% utilization
- Emergency: 95% utilization

Response Actions
- Automated maintenance
- Resource scaling
- Team notifications
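
The three thresholds above map naturally onto a small classification function. The sketch below assumes the limits are inclusive (exactly 70% already counts as a warning), which the list does not specify, and uses an "ok" level for everything below the warning threshold.

```python
# Ordered from most to least severe so the first match wins
THRESHOLDS = [
    (95, "emergency"),
    (85, "critical"),
    (70, "warning"),
]

def severity(utilization_pct: float) -> str:
    """Classify a utilization percentage into an alert level."""
    for limit, level in THRESHOLDS:
        if utilization_pct >= limit:
            return level
    return "ok"
```

For example, severity(72.5) returns "warning" and severity(96) returns "emergency". Keeping the limits in one ordered table makes it easy to adjust them as historical data accumulates.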

Best Practices

Resource Tagging

tags = {
    Name = "resource-name"
    Environment = "dev"
    Schedule = "business-hours"
}
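
On the Lambda side, tags like these can drive the decision of which resources a scheduler is allowed to touch. The following sketch assumes tags arrive in the Key/Value list form that many AWS APIs return, and that a resource opts in through the Schedule tag shown above; the function names and the opt-in rule are illustrative.

```python
def tags_to_dict(tag_list: list) -> dict:
    """Convert [{'Key': ..., 'Value': ...}, ...] into a plain dictionary."""
    return {tag["Key"]: tag["Value"] for tag in tag_list}

def is_schedulable(tag_list: list, schedule: str = "business-hours") -> bool:
    """A resource is managed only if it opts in via its Schedule tag and is not production."""
    tags = tags_to_dict(tag_list)
    return tags.get("Schedule") == schedule and tags.get("Environment") != "production"
```

Requiring an explicit Schedule tag means untagged resources are left alone, which is the safer default when a scheduler is first rolled out.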

Monitoring Configuration
- Set appropriate thresholds based on historical data
- Implement graduated response actions
- Maintain comprehensive monitoring documentation

Security Measures
- Use least-privilege IAM roles
- Implement proper error handling
- Maintain audit logs

Conclusion

Our AWS Lambda-based scheduling and monitoring system has proven highly effective in:
- Reducing operational costs through automated resource management
- Preventing incidents through proactive monitoring
- Improving system reliability through automated maintenance
- Reducing team workload through automation

The combination of scheduled resource management and proactive monitoring ensures optimal resource utilization while maintaining system stability and performance.

References

📚 GitHub repository - Complete implementation of AWS Lambda schedulers for cost optimization
AWS Lambda Documentation
CloudWatch Documentation
AWS Auto Scaling
AWS RDS Documentation