A manufacturing company learned this lesson the hard way: when cryptominers hit their AWS environment, it took 14 hours to detect and 3 days to contain. Logs vanished as instances terminated, manual forensics proved impossible in auto-scaling groups, and their on-prem IR playbooks were useless.
Cloud incident response requires fundamentally different approaches than traditional on-premises IR. Ephemeral resources, API-driven operations, distributed systems, and the shared responsibility model demand new playbooks, tools, and mindsets.
After building cloud IR capabilities for dozens of organizations, we've distilled a framework for creating playbooks that work in production.
Why traditional IR fails in the cloud
Traditional incident response assumes:
- Persistent infrastructure: Servers stay online for forensics
- Physical access: You can pull disk images and memory dumps
- Network boundaries: Clear perimeter to monitor and block
- Slow-moving attacks: Time to manually investigate before damage spreads
Cloud environments break all these assumptions:
Ephemeral resources mean evidence disappears:
- Auto-scaling terminates compromised instances
- Container pods restart and lose file system state
- Lambda functions execute and vanish in seconds
- Spot instances disappear with no warning
API-first operations require programmatic response:
- Manual console clicking is too slow for cloud-speed attacks
- Isolation requires API calls, not network cables
- Forensics happens via cloud APIs, not physical access
Distributed architecture complicates containment:
- Attacks span multiple regions, accounts, and cloud providers
- Microservices architectures mean lateral movement happens at the API layer
- Serverless functions can be attack vectors with no "host" to investigate
The cloud IR playbook framework
Effective cloud playbooks have five components:
1. Detection & Alert Trigger
Define the specific security events that trigger the playbook:
Playbook: AWS EC2 Cryptomining Response
Trigger Conditions:
- CloudWatch Alarm: EC2 CPU utilization >95% for >10 minutes
- GuardDuty Finding: CryptoCurrency:EC2/BitcoinTool.B!DNS
- Security Hub: Unusual outbound traffic to known mining pools
- Cost anomaly: Sudden spike in EC2 compute spend
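Wiring the trigger to automation keeps response times down. Here's a minimal sketch using an EventBridge rule that routes GuardDuty cryptomining findings to a response Lambda; the rule name and Lambda ARN are placeholders for your own playbook entry point:
import json
import boto3

events = boto3.client('events')

rule_name = 'ir-ec2-cryptomining-trigger'

# Match any GuardDuty finding whose type starts with the EC2 cryptomining prefix
events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        'source': ['aws.guardduty'],
        'detail-type': ['GuardDuty Finding'],
        'detail': {
            'type': [{'prefix': 'CryptoCurrency:EC2/'}]
        }
    }),
    State='ENABLED',
    Description='Trigger EC2 cryptomining IR playbook'
)

# Route matching findings to the playbook Lambda (placeholder ARN)
events.put_targets(
    Rule=rule_name,
    Targets=[{
        'Id': 'cryptomining-playbook',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:ir-cryptomining-playbook'
    }]
)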
2. Automated Evidence Collection
Before taking any containment actions, preserve evidence. In the cloud, this must happen immediately, before resources disappear.
AWS EC2 incident example:
import boto3
from datetime import datetime

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def collect_ec2_evidence(instance_id, region, incident_id, incident_start_time,
                         evidence_bucket, retention_date):
    """
    Automated evidence collection for an EC2 instance.
    Must complete in <5 minutes, before auto-scaling terminates the instance.
    query_cloudtrail, query_flow_logs and save_to_s3 are internal helpers that
    wrap CloudTrail LookupEvents, flow log queries, and S3 uploads.
    """
    evidence = {
        'instance_id': instance_id,
        'timestamp': datetime.utcnow().isoformat(),
        'collected_artifacts': [],
        'errors': []
    }

    # 1. Snapshot EBS volumes (most critical -- gone once the instance terminates)
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'attachment.instance-id', 'Values': [instance_id]}]
    )
    for volume in volumes['Volumes']:
        snapshot = ec2.create_snapshot(
            VolumeId=volume['VolumeId'],
            Description=f"IR evidence: {instance_id}",
            TagSpecifications=[{
                'ResourceType': 'snapshot',
                'Tags': [
                    {'Key': 'incident-id', 'Value': incident_id},
                    {'Key': 'evidence', 'Value': 'true'},
                    {'Key': 'retain-until', 'Value': retention_date}
                ]
            }]
        )
        evidence['collected_artifacts'].append({
            'type': 'ebs_snapshot',
            'id': snapshot['SnapshotId']
        })

    # 2. Capture a memory dump via the SSM agent, if it is installed and reachable.
    #    The LiME command below is illustrative; use whatever memory-acquisition
    #    tooling is baked into your forensics tooling or SSM document.
    try:
        ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    'sudo insmod /opt/forensics/lime.ko "path=/tmp/memory.lime format=lime"',
                    f'aws s3 cp /tmp/memory.lime s3://{evidence_bucket}/memory/{instance_id}/'
                ]
            }
        )
        evidence['collected_artifacts'].append({'type': 'memory_dump'})
    except Exception:
        evidence['errors'].append('memory_dump_failed')

    # 3. Collect instance metadata and configuration (also gives us the instance role)
    reservation = ec2.describe_instances(InstanceIds=[instance_id])['Reservations'][0]
    instance = reservation['Instances'][0]
    security_groups = ec2.describe_security_groups(
        GroupIds=[sg['GroupId'] for sg in instance['SecurityGroups']]
    )
    instance_role_arn = instance.get('IamInstanceProfile', {}).get('Arn')
    evidence['instance_config'] = {
        'instance_details': instance,
        'security_groups': security_groups['SecurityGroups'],
        'iam_role': instance_role_arn
    }

    # 4. Collect CloudTrail logs for API calls made with the instance's credentials
    cloudtrail_logs = query_cloudtrail(
        start_time=incident_start_time,
        filters={'userIdentity.arn': instance_role_arn}
    )
    save_to_s3(cloudtrail_logs, f'evidence/{instance_id}/cloudtrail.json')

    # 5. Collect VPC Flow Logs for the instance's network traffic
    flow_logs = query_flow_logs(
        instance_id=instance_id,
        start_time=incident_start_time
    )
    save_to_s3(flow_logs, f'evidence/{instance_id}/flow_logs.json')

    return evidence
Key evidence to collect automatically:
For compute (EC2, VMs):
- Disk snapshots/images
- Memory dumps (if agent available)
- Process lists and network connections
- System logs
- User accounts and cron jobs
For cloud-native logs:
- CloudTrail / Azure Activity Logs / GCP Cloud Audit Logs (IAM changes, API calls)
- VPC Flow Logs / NSG Flow Logs (network traffic)
- DNS query logs
- Application/service logs
For serverless (Lambda, Cloud Functions):
- Function code versions
- Execution logs
- IAM policies and environment variables
- Invocation history and triggers
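For the serverless case, collection is mostly API calls against the function itself plus its CloudWatch Logs group. A minimal sketch (S3 upload of the artifacts omitted):
import boto3

lam = boto3.client('lambda')
logs = boto3.client('logs')

def collect_lambda_evidence(function_name):
    """Sketch of serverless evidence collection for a single function."""
    evidence = {}

    # Published versions and current configuration (handler, role, env vars)
    evidence['versions'] = lam.list_versions_by_function(FunctionName=function_name)['Versions']
    evidence['config'] = lam.get_function_configuration(FunctionName=function_name)

    # Presigned URL to download the deployed code package for offline analysis
    evidence['code_location'] = lam.get_function(FunctionName=function_name)['Code']['Location']

    # Resource-based policy shows who and what can invoke the function
    try:
        evidence['policy'] = lam.get_policy(FunctionName=function_name)['Policy']
    except lam.exceptions.ResourceNotFoundException:
        evidence['policy'] = None

    # Recent execution logs from the function's CloudWatch Logs group
    evidence['recent_logs'] = logs.filter_log_events(
        logGroupName=f'/aws/lambda/{function_name}',
        limit=1000
    )['events']

    return evidence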
3. Containment Actions
After evidence is secured, contain the threat. In the cloud, containment is API-driven and can be automated.
Containment decision tree for EC2 cryptomining:
Is this a production instance?
├─ Yes → Can we afford downtime?
│ ├─ Yes → Isolate and terminate
│ └─ No → Isolate only, schedule maintenance
└─ No → Immediate isolation and termination
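Encoding the tree directly in the response automation keeps decisions consistent under pressure. A trivial sketch, assuming the production flag and downtime tolerance come from instance tags or a CMDB lookup:
def choose_containment_action(is_production, downtime_acceptable):
    # Mirrors the decision tree above; inputs are hypothetical lookups
    if not is_production:
        return 'isolate_and_terminate'
    if downtime_acceptable:
        return 'isolate_and_terminate'
    return 'isolate_only_and_schedule_maintenance'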
Isolation steps:
1. Modify security groups → Deny all inbound/outbound (except forensics access)
2. Detach IAM role → Prevent further AWS API abuse
3. Apply instance tag → "quarantined-do-not-delete"
4. Disable auto-scaling → Prevent termination before forensics complete
5. Alert stakeholders → Notify app owners, IR team, management
Example isolation automation:
def isolate_ec2_instance(instance_id, incident_id):
    """
    Isolate a compromised EC2 instance while preserving it for forensics.
    get_or_create_forensics_sg, get_instance_profile_association and
    update_autoscaling_group are internal helpers.
    """
    # Create the forensics-only security group if it doesn't already exist
    forensics_sg = get_or_create_forensics_sg()

    # Replace the instance's security groups -- only allows SSH from the forensics jump host
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[forensics_sg]
    )

    # Detach the IAM instance profile to prevent further AWS API abuse
    ec2.disassociate_iam_instance_profile(
        AssociationId=get_instance_profile_association(instance_id)
    )

    # Tag for tracking and protection
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[
            {'Key': 'security-status', 'Value': 'quarantined'},
            {'Key': 'incident-id', 'Value': incident_id},
            {'Key': 'do-not-delete', 'Value': 'forensics-required'},
            {'Key': 'isolated-at', 'Value': datetime.utcnow().isoformat()}
        ]
    )

    # Enable scale-in protection so auto-scaling can't terminate the instance
    # before forensics are complete
    update_autoscaling_group(instance_id, protect_from_scale_in=True)

    # Terminate any active SSM sessions on the instance
    sessions = ssm.describe_sessions(
        State='Active',
        Filters=[{'key': 'Target', 'value': instance_id}]
    )
    for session in sessions.get('Sessions', []):
        ssm.terminate_session(SessionId=session['SessionId'])

    return {
        'status': 'isolated',
        'instance_id': instance_id,
        'forensics_access': forensics_sg,
        'timestamp': datetime.utcnow().isoformat()
    }
4. Investigation & Analysis
With evidence collected and threat contained, investigate root cause and scope.
Key investigation questions for cloud incidents:
How did the attacker get in?
- Compromised credentials? (Search CloudTrail for unusual access patterns)
- Vulnerable application? (Review application logs, WAF alerts)
- Misconfigured resource? (Check public S3 buckets, open security groups)
- Supply chain? (Review third-party integrations, compromised dependencies)
What did the attacker do?
- API calls made? (CloudTrail analysis)
- Data accessed or exfiltrated? (S3 access logs, data transfer metrics)
- Resources created? (Search for new instances, Lambda functions, IAM users)
- Lateral movement? (Cross-account assume role activity)
What's the blast radius?
- Which accounts/regions affected?
- Which applications/services impacted?
- What data was exposed?
- Are there other compromised resources?
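Most of these questions start from CloudTrail. A minimal query sketch for pulling a suspect access key's recent API activity (CloudTrail LookupEvents only covers recent management events, so fall back to an Athena query over the CloudTrail S3 archive for anything older):
import boto3
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail')

def recent_activity_for_access_key(access_key_id, days=7):
    """Return recent API calls recorded for a given access key."""
    events = []
    paginator = cloudtrail.get_paginator('lookup_events')
    for page in paginator.paginate(
        LookupAttributes=[{'AttributeKey': 'AccessKeyId', 'AttributeValue': access_key_id}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow()
    ):
        events.extend(page['Events'])
    return events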
5. Remediation & Recovery
After understanding the incident, remediate vulnerabilities and recover operations.
Immediate remediation for cryptomining incident:
Remediation Checklist:
Identity & Access:
- Rotate all IAM access keys for affected accounts
- Review and remove any unauthorized IAM users/roles
- Enable MFA on all privileged accounts
- Audit IAM policies for overly permissive access
Network & Compute:
- Patch vulnerable instances
- Update security groups to least-privilege
- Enable VPC Flow Logs if not already enabled
- Review and remove any unauthorized instances
Detection & Monitoring:
- Enable GuardDuty in all regions and accounts
- Configure CloudWatch billing alarms
- Deploy runtime protection (Falcon, Wiz, Prisma Cloud)
- Create detection rules for cryptomining indicators
Prevention:
- Implement Service Control Policies to prevent unapproved instance types
- Require instance tagging and ownership
- Enable AWS Config rules for compliance monitoring
- Implement least-privilege IAM everywhere
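The access-key rotation item is worth scripting ahead of time. A minimal sketch that deactivates (rather than deletes) a user's keys so they can be re-enabled if the investigation clears them; issuing replacement keys goes through your normal provisioning path:
import boto3

iam = boto3.client('iam')

def deactivate_access_keys(user_name):
    """Deactivate all access keys for an IAM user during remediation."""
    keys = iam.list_access_keys(UserName=user_name)['AccessKeyMetadata']
    for key in keys:
        iam.update_access_key(
            UserName=user_name,
            AccessKeyId=key['AccessKeyId'],
            Status='Inactive'
        )
    return [k['AccessKeyId'] for k in keys]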
Example playbook: AWS S3 data exfiltration
Let's walk through a complete playbook:
Trigger: GuardDuty alert "Exfiltration:S3/AnomalousBehavior" or unusual S3 data transfer spike
1. Evidence Collection (automated):
# Collect S3 access logs for affected bucket
logs = query_s3_access_logs(
    bucket=affected_bucket,
    time_range=last_24_hours
)

# Collect CloudTrail for S3 API calls
cloudtrail = query_cloudtrail(
    event_names=['GetObject', 'ListBucket', 'PutBucketPolicy'],
    resources=[bucket_arn]
)

# Identify the accessing IAM principal
principal = cloudtrail['userIdentity']['arn']

# Collect all actions by this principal
all_principal_activity = query_cloudtrail(
    principal=principal,
    time_range=last_7_days
)
2. Containment (automated):
# Deny access to bucket from compromised principal
apply_bucket_policy_deny(bucket=affected_bucket, principal=principal)
# Revoke active sessions for compromised IAM user/role
if principal.type == 'IAM_USER':
    disable_iam_user(principal.name)
    revoke_access_keys(principal.name)
elif principal.type == 'IAM_ROLE':
    revoke_role_sessions(principal.name)
# Enable bucket versioning and object lock
enable_s3_versioning(bucket=affected_bucket)
enable_s3_object_lock(bucket=affected_bucket)
3. Investigation questions:
- How many objects were accessed? (S3 access logs)
- How much data was transferred? (CloudWatch metrics)
- Where was data sent? (VPC Flow Logs, S3 replication configs)
- Was the bucket policy modified? (CloudTrail)
- Are other buckets affected? (Search for similar access patterns)
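For the data-transfer question, a minimal CloudWatch sketch, assuming S3 request metrics are enabled on the bucket and the FilterId matches your metrics configuration:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def bytes_downloaded(bucket_name, hours=24, filter_id='EntireBucket'):
    """Total BytesDownloaded for a bucket over the given window."""
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BytesDownloaded',
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'FilterId', 'Value': filter_id}
        ],
        StartTime=datetime.utcnow() - timedelta(hours=hours),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Sum']
    )
    return sum(dp['Sum'] for dp in resp['Datapoints'])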
4. Remediation:
- Restore any deleted objects from backups
- Review and tighten S3 bucket policies
- Enable S3 Block Public Access
- Implement S3 Access Points for granular access control
- Enable MFA Delete for critical buckets
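Several of these items are a single API call each. A minimal sketch covering Block Public Access and versioning:
import boto3

s3 = boto3.client('s3')

def harden_bucket(bucket_name):
    """Apply Block Public Access and enable versioning on a bucket."""
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            'BlockPublicAcls': True,
            'IgnorePublicAcls': True,
            'BlockPublicPolicy': True,
            'RestrictPublicBuckets': True
        }
    )
    s3.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={'Status': 'Enabled'}
    )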
Building your cloud IR program
Start with these playbooks:
Must-have playbooks (build these first):
- EC2/VM compromise and cryptomining
- S3 data exfiltration
- IAM credential compromise
- Publicly exposed resource (RDS, S3, ElastiCache)
- Insider threat / privilege abuse
Advanced playbooks:
- Ransomware in cloud storage
- Serverless function compromise (Lambda/Cloud Functions)
- Container breakout and cluster compromise
- Cross-account assume role abuse
- Supply chain attack via compromised CI/CD
For each playbook, document:
- Trigger conditions and alert thresholds
- Automated evidence collection steps
- Containment decision tree
- Investigation procedures and queries
- Remediation checklist
- Communication plan (who to notify, what to say)
Testing your playbooks
Playbooks that aren't tested will fail when it matters. Implement regular testing:
Monthly: Tabletop exercises
- Walk through playbooks with IR team
- Identify gaps and unclear procedures
- Update documentation based on feedback
Quarterly: Purple team simulations
- Run realistic attack scenarios in non-prod
- Execute playbooks end-to-end
- Measure response times and automation effectiveness
- Fix any broken automation
Annually: Full IR simulation
- Simulate major incident across multiple accounts/regions
- Involve executive leadership
- Test communication procedures
- Validate disaster recovery and backup restoration
Conclusion
Cloud incident response requires purpose-built playbooks that account for ephemeral infrastructure, API-first operations, and distributed systems. The key principles:
- Automate evidence collection before resources disappear
- Use APIs for containment instead of manual console work
- Build decision trees for consistent response
- Test regularly with realistic scenarios
- Iterate continuously based on real incidents and testing
The manufacturing company that suffered the 3-day cryptomining incident? After implementing these cloud IR practices, they detected and contained a similar attack in 35 minutes with automated playbooks. The CISO told us: "We went from feeling helpless in the cloud to having faster response than our on-prem environment."
Ready to build modern cloud IR capabilities? Contact us to discuss how our Forward-Deployed Engineers can help.