Every SOC leader wants automation. But most automation projects fail because they try to automate everything at once, resulting in brittle workflows that break constantly and erode analyst trust.
Here's how to build runbooks that actually work.
Start with repetitive, low-risk tasks
Don't try to automate end-to-end incident response on day one. Start with the boring, repetitive tasks analysts do 50 times a day:
Good first automation targets:
- Enriching alerts with user/asset context
- Pulling historical activity for a user or host
- Collecting forensic artifacts (logs, registry keys, running processes)
- Checking threat intelligence feeds
- Creating tickets with standardized evidence
Bad first automation targets:
- Isolating hosts from the network
- Disabling user accounts
- Deleting files or killing processes
- Any action that could cause business disruption
The anatomy of a good runbook
A production-ready runbook has five components:
1. Trigger
What event kicks off this runbook?
- SIEM alert (e.g., "Suspicious PowerShell execution")
- Manual analyst action
- Scheduled job (e.g., daily threat hunt)
2. Inputs
What information does the runbook need?
- Alert ID
- Username or hostname
- Time window to investigate
- Severity level
3. Steps
Sequential actions the runbook performs:
1. Query SIEM for related alerts (same user, last 24h)
2. Pull user context from Active Directory
3. Check VirusTotal for any file hashes involved
4. Retrieve endpoint logs from EDR platform
5. Compile evidence into structured summary
4. Outputs
What does the runbook produce?
- Enriched ticket in Jira/ServiceNow
- Summary report (PDF or HTML)
- Slack notification to SOC channel
- Updated case notes in SIEM
5. Error handling
What happens when something breaks?
- API rate limits
- Missing data (no user found in AD)
- Timeouts or network failures
Always include a fallback: "If this step fails, notify analyst and continue with partial data."
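The five components above can be sketched as a minimal Python skeleton. This is an illustrative sketch, not a real SOAR API: the step functions (`pull_ad_context`, `check_threat_intel`) and the trigger string are hypothetical stand-ins for your own integrations.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Runbook:
    """Minimal runbook skeleton: trigger, inputs, steps, outputs, error handling."""
    trigger: str                   # e.g. 'siem_alert:suspicious_powershell'
    required_inputs: list[str]     # e.g. ['alert_id', 'username', 'time_window']
    steps: list[Callable[[dict], dict]] = field(default_factory=list)

    def run(self, inputs: dict) -> dict:
        missing = [k for k in self.required_inputs if k not in inputs]
        if missing:
            raise ValueError(f"Missing required inputs: {missing}")
        context = dict(inputs)
        errors = []
        for step in self.steps:
            try:
                context.update(step(context))
            except Exception as exc:
                # Fallback: record the failure and continue with partial data.
                errors.append(f"{step.__name__}: {exc}")
        context["errors"] = errors  # surfaced to the analyst, never swallowed
        return context

# Illustrative steps -- real ones would call your SIEM/AD/TI APIs.
def pull_ad_context(ctx):
    return {"user_dept": "Finance"}

def check_threat_intel(ctx):
    raise TimeoutError("TI feed unreachable")  # simulated failure

rb = Runbook(trigger="siem_alert", required_inputs=["alert_id", "username"],
             steps=[pull_ad_context, check_threat_intel])
result = rb.run({"alert_id": "A-123", "username": "jdoe"})
```

Note that a failing step doesn't abort the run; the analyst still gets partial enrichment plus an explicit list of what broke.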
Use decision trees, not magic wands
The goal isn't fully autonomous response. It's augmented decision-making. Runbooks should present options, not make irreversible choices:
Instead of: "Automatically disable user account"
Do this: "Provide analyst with evidence + three response options:
- Disable account (explain impact)
- Force password reset (lower impact)
- Monitor for 24h (gather more evidence)"
Let humans make the final call on disruptive actions.
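One way to encode "options, not actions" is to have the runbook emit a structured recommendation that a human must approve. A minimal sketch (the option names and impact wording are illustrative):

```python
def build_response_options(evidence: dict) -> list[dict]:
    """Return ranked options for the analyst; the runbook never executes them."""
    # A real version would use the evidence to reorder or annotate the options.
    return [
        {"action": "disable_account",
         "impact": "high - user loses all access immediately",
         "requires_approval": True},
        {"action": "force_password_reset",
         "impact": "medium - user is interrupted but keeps access",
         "requires_approval": True},
        {"action": "monitor_24h",
         "impact": "low - gather more evidence before acting",
         "requires_approval": False},
    ]

options = build_response_options({"alert": "impossible_travel"})
disruptive = [o for o in options if o["requires_approval"]]
```

Anything flagged `requires_approval` is routed to an analyst; only the passive monitoring option could ever run unattended.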
Measure runbook effectiveness
Track these metrics for each runbook:
- Execution count: How often is it triggered?
- Success rate: How often does it complete without errors?
- Time saved: Compare manual investigation time vs. automated time
- Analyst feedback: Do analysts trust the output?
If a runbook has <80% success rate or negative analyst feedback, iterate or retire it.
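The metrics above reduce to a few lines of arithmetic. A sketch of a health check that flags runbooks below the 80% success threshold (the execution-record fields are assumptions about what your platform logs):

```python
def runbook_health(executions: list[dict], threshold: float = 0.80) -> dict:
    """Summarise execution records and flag runbooks below the success threshold."""
    total = len(executions)
    ok = sum(1 for e in executions if e["status"] == "success")
    # Time saved = manual baseline minus automated runtime, per successful run.
    minutes_saved = sum(e.get("manual_minutes", 0) - e.get("automated_minutes", 0)
                        for e in executions if e["status"] == "success")
    rate = ok / total if total else 0.0
    return {"executions": total, "success_rate": rate,
            "minutes_saved": minutes_saved, "needs_review": rate < threshold}

runs = ([{"status": "success", "manual_minutes": 18, "automated_minutes": 1}] * 7
        + [{"status": "error"}] * 3)
health = runbook_health(runs)  # 7/10 successes -> below threshold, flag for review
```

Analyst feedback still has to be collected separately; no counter tells you whether people trust the output.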
Common pitfalls (and how to avoid them)
Pitfall 1: Over-reliance on APIs
Problem: Runbooks break when third-party APIs change or rate-limit you.
Solution: Build graceful degradation. If VirusTotal API fails, skip that step and continue with local analysis.
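Graceful degradation can be as simple as a try/except around the external call, with an explicit marker so the analyst knows the enrichment was partial. Here `query_virustotal` is a placeholder, not the real VirusTotal client:

```python
import logging

def query_virustotal(file_hash: str) -> dict:
    """Placeholder for a real VirusTotal lookup; may raise on rate limits."""
    raise ConnectionError("429 Too Many Requests")  # simulated rate limit

def enrich_hash(file_hash: str) -> dict:
    """Try external enrichment; fall back to local-only analysis on failure."""
    enrichment = {"hash": file_hash, "vt_verdict": None, "degraded": False}
    try:
        enrichment["vt_verdict"] = query_virustotal(file_hash)
    except Exception as exc:
        logging.warning("VirusTotal step skipped: %s", exc)
        enrichment["degraded"] = True  # analyst sees the enrichment was partial
    return enrichment

result = enrich_hash("d41d8cd98f00b204e9800998ecf8427e")
```

The key design choice is that the failure is visible in the output rather than silently dropped.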
Pitfall 2: No version control
Problem: Someone tweaks a runbook, it breaks, and nobody knows what changed.
Solution: Store runbooks in Git. Require pull requests for changes. Tag production versions.
Pitfall 3: Forgetting about compliance
Problem: Your runbook collects PII or accesses sensitive systems without proper logging.
Solution: Audit runbook actions. Log who triggered it, what data was accessed, and when. Ensure compliance with data handling policies.
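An audit trail can start as append-only JSON records: who triggered the runbook, what data it touched, and when. A minimal sketch (the field names and data-source labels are assumptions, not a standard schema):

```python
import datetime
import json

def audit_event(runbook: str, triggered_by: str, data_accessed: list[str]) -> str:
    """Emit one append-only JSON audit record: who, what, when."""
    record = {
        "runbook": runbook,
        "triggered_by": triggered_by,
        "data_accessed": data_accessed,  # e.g. which PII-bearing sources were read
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return json.dumps(record)

line = audit_event("suspicious_login", "analyst@example.com",
                   ["okta.user_profile", "hr.manager_lookup"])
```

Shipping these lines to your SIEM makes the runbook itself auditable with the same tooling you already use for everything else.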
Tools of the trade
You don't need a $500K SOAR platform to start. Here's our recommended progression:
Stage 1: Scripts (Weeks 1-4)
- Python scripts analysts run manually
- Store in Git, document inputs/outputs
- Validates the logic before investing in automation
Stage 2: Scheduled jobs (Months 2-3)
- Cron jobs or Lambda functions
- Automated enrichment of high-priority alerts
- Results written to shared location (S3, SIEM case notes)
Stage 3: SOAR integration (Months 4+)
- Move proven runbooks into SOAR platform
- Add interactive prompts and approval workflows
- Scale to more complex use cases
Start simple. Prove value. Then scale.
Real-world example: Suspicious login runbook
Let's walk through a runbook we built for a fintech client:
Trigger: SIEM alert "Impossible travel" (user logged in from two countries <1 hour apart)
Steps:
- Query SIEM for user's login history (last 30 days)
- Pull user details from Okta (MFA status, device trust)
- Check if source IPs are known VPN endpoints
- Retrieve user's manager from HR system
- Check for recent service desk tickets from this user
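A first-pass version of the enrichment steps above might look like this. Every lookup here is a stubbed stand-in for the real SIEM, Okta, and HR calls, and the VPN ranges are hypothetical:

```python
def investigate_impossible_travel(username: str, source_ips: list[str]) -> dict:
    """Enrich an 'impossible travel' alert; all lookups are stubbed placeholders."""
    known_vpn_ranges = {"198.51.100.10", "203.0.113.7"}  # hypothetical VPN egress IPs

    evidence = {
        "username": username,
        "login_history_days": 30,   # window pulled from the SIEM in the real version
        "mfa_enrolled": True,       # stand-in for an Okta profile lookup
        "all_ips_known_vpn": all(ip in known_vpn_ranges for ip in source_ips),
    }
    evidence["recommendation"] = (
        "Likely VPN hopping"
        if evidence["all_ips_known_vpn"] and evidence["mfa_enrolled"]
        else "Investigate further"
    )
    return evidence

ticket = investigate_impossible_travel("jdoe", ["198.51.100.10", "203.0.113.7"])
```

The recommendation stays advisory: the ticket tells the analyst which way the evidence leans, and the human decides whether to act.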
Outputs:
- Jira ticket with:
- Map of login locations
- User context (job role, manager, MFA status)
- Recommendation: "Likely VPN hopping" or "Investigate further"
Time saved: Manual investigation took 15-20 minutes. Runbook completes in <60 seconds.
Analyst feedback: "Game changer. I trust the data and can focus on real threats."
Conclusion
Good runbooks don't replace analysts; they free them from toil so they can focus on complex investigations and proactive threat hunting.
Start small, measure everything, and iterate based on analyst feedback. In 6 months, you'll have a suite of runbooks that actually work.
Need help building automation into your SOC? Talk to our team about embedded Forward-Deployed Engineers.