My Love/Hate Relationship with Cloud Custodian
I’m a huge fan of the Cloud Custodian tool. If you are hearing the name for the first time, it’s an open-source rules engine for cloud security, cost optimization, and governance.
I have run Custodian at work for over a year, which feels like long enough to describe the experience, especially my love/hate relationship with the tool.
Note: I use Custodian’s cloudtrail execution mode to protect AWS accounts. Some advantages and disadvantages might not apply to you when you use other modes or other cloud providers.
- Love Story
- Open Source and good community support
- Detects and can auto-mitigate issues in real-time
- Supports customizable policies
- No open source tool to compete with Custodian
- Hate Story
- Lacks documentation and examples
- More AWS accounts could mean more setup chaos
- Creating custom notification messages can be tricky
- Resource-based policies are not always fun
- Other open source software issues - maintenance, upgrade, monitoring, etc
Just imagine a tool in your cloud security arsenal that has the following:
- Open Source with community support
- Can detect misconfigurations in near real-time or at periodic intervals
- Supports auto-mitigation (both in real-time and periodic)
- Integrates with your enterprise messaging tool’s webhooks to send alerts
- Cost effective (<100 USD for 1000s of resources in 20+ AWS accounts)
- Customizable detection rules as per your requirement
If you liked the specifications, you might just fall in love with Cloud Custodian.
Cloud Custodian is a rules engine for cloud security, cost optimization, and governance. Since it’s an “engine,” its efficiency depends on the “driver” implementing it (pun intended 😛).
Custodian requires all misconfiguration checks to be defined as YAML files. Since these YAML files (aka policies) are plaintext files, you can have Compliance as Code along with your Infrastructure as Code. You can streamline the process of Custodian policy deployment using CI/CD pipelines.
Talking about misconfiguration detection in AWS, there are tons of open-source tools out there. Custodian stands out among its peers as it allows us to define how a misconfiguration must be mitigated (in the policy itself). The mitigation is automatic and done in near real-time when using Custodian’s Cloudtrail execution mode. This feature is a game changer.
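As a rough sketch of detection and mitigation living in the same policy, the example below reacts to two CloudTrail events and turns default encryption back on for the affected bucket. The policy name and role ARN are placeholders, and the choice of events is only an assumption about what you might want to watch.

```yaml
policies:
  - name: s3-enforce-default-encryption    # illustrative name
    resource: aws.s3
    mode:
      type: cloudtrail
      # Placeholder role; it needs permission to read and update bucket encryption
      role: arn:aws:iam::{account_id}:role/CustodianLambdaRole
      events:
        # Built-in shortcut for the S3 CreateBucket API call
        - CreateBucket
        # Explicit form: event source, event name, and where to find the bucket name
        - source: s3.amazonaws.com
          event: DeleteBucketEncryption
          ids: "requestParameters.bucketName"
    filters:
      # Match only buckets that have no default encryption configured
      - type: bucket-encryption
        state: false
    actions:
      # The mitigation: enable SSE-S3 (AES256) default encryption
      - type: set-bucket-encryption
        crypto: AES256
```

Everything under mode decides when the policy runs, filters decide which resources count as non-compliant, and actions decide what happens to them.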
To put it in a single sentence: Cloud Custodian is a great tool in your CSPM (Cloud Security Posture Management) arsenal.
Also, this tool has an unfair advantage - there’s no open-source alternative to it.
I have come across a few similar tools, but none of them came close to Custodian:
- airbnb/streamalert - a serverless real-time framework with alerts based on the detection logic you define. The tool is in maintenance mode without new releases. The creator of this tool has left Airbnb to found Panther.com.
- CRED-CLUB/DIAL - DIAL (Did I Alert Lambda?) is a centralized security misconfiguration detection framework. The tool has not received any updates, covers only a handful of AWS services, and is a bit tedious to set up.
Cloud Custodian’s policies are easier to write than AWS Config rules.
For example, to fetch all S3 buckets which don’t have encryption:
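A minimal sketch of such a policy, assuming the bucket-encryption filter of the aws.s3 resource (the policy name is illustrative):

```yaml
policies:
  - name: s3-bucket-encryption-missing    # illustrative name
    resource: aws.s3
    filters:
      # state: false selects buckets without default encryption
      - type: bucket-encryption
        state: false
```

Saved as policy.yml, it can be run on demand with custodian run -s output policy.yml, which writes the matching buckets to the output directory.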
The above love story shows the advantages of Cloud Custodian. But don’t let that fool you into believing it’s the best tool. It has its fair share of disadvantages.
After deploying the tool in all AWS accounts at work (both prod and non-prod) and maintaining it for over a year, I faced some situations that are neither documented nor talked about in the cloud security community.
Lack of documentation is by far the major concern. The existing documentation is good but not good enough. It covers the most common setup guides, sample policies, and tools like
If you are looking for simple misconfiguration policies like “S3 bucket encryption missing”, “IAM user having console access without MFA”, or “Public EBS snapshots”, you can find them online.
But if you are trying to implement any policy other than the most simple and obvious ones, you will have to figure it out yourself.
You need to check the schema of the resource using custodian schema aws.<service> and figure out which action or filter fits your requirement.
Also, you can’t find a single repository containing all Custodian policies for CIS benchmarks or similar. A repository like that would make the life of a Cloud Security Engineer much easier.
Tools like CloudQuery or Steampipe win over Custodian (in detection) because they have community-curated checks in a single repository.
If you have a single AWS account, setup is super easy.
If you have a few accounts under an AWS organization, the setup is still easy. You can use c7n-org to deploy lambdas across accounts for near real-time alerts and auto-mitigations.
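For reference, c7n-org reads the list of target accounts from a YAML file. The sketch below assumes two accounts; the names, account IDs, role ARNs, regions, and tags are all placeholders.

```yaml
accounts:
  - name: security-prod              # placeholder account name
    account_id: "111111111111"       # placeholder account ID
    # Role that c7n-org assumes in the target account
    role: arn:aws:iam::111111111111:role/CustodianRole
    regions:
      - us-east-1
    tags:
      - env:prod
  - name: sandbox
    account_id: "222222222222"
    role: arn:aws:iam::222222222222:role/CustodianRole
    regions:
      - us-east-1
    tags:
      - env:nonprod
```

With this saved as accounts.yml, a command like c7n-org run -c accounts.yml -s output -u policy.yml runs the policies in policy.yml against every listed account.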
The slight complexity comes with deploying c7n-mailer in a multi-account setup.
c7n-mailer is a utility that allows Custodian policies (deployed as lambdas) to send notifications when they detect misconfigurations.
The c7n-mailer utility needs:
- an SQS queue - to get messages from policies deployed as lambdas
- a lambda function - to parse the messages in the queue and send notifications
One needs to take care of the following to make a “secure” deployment of the security tool:
- To reduce complexity, have a central c7n-mailer SQS queue. You need to make sure the AWS accounts in your organization can send messages to the queue while AWS accounts outside your organization can’t.
- c7n-mailer stores your Slack webhook/token, Splunk token, etc. as plain text in the lambda function. You need to encrypt this sensitive data with a KMS key before storing it in code.
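The first point can be sketched as a CloudFormation template. This is only one way to do it and not something Custodian ships. The queue name and organization ID are placeholders, and the approach assumes the aws:PrincipalOrgID condition key is enough to separate your accounts from everyone else's.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Central c7n-mailer queue that only accepts messages from one AWS organization

Resources:
  MailerQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: c7n-mailer            # placeholder queue name

  MailerQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref MailerQueue             # Ref on an SQS queue returns its URL
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowSendFromOrgAccountsOnly
            Effect: Allow
            Principal: "*"
            Action: sqs:SendMessage
            Resource: !GetAtt MailerQueue.Arn
            Condition:
              StringEquals:
                # Placeholder organization ID; principals outside it are not allowed
                aws:PrincipalOrgID: o-exampleorgid
```

The wildcard principal is only acceptable here because the condition narrows it down to principals that belong to the organization.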
There’s no automated way to deploy the above-mentioned secure setup. One has to make an effort to search & read documentation, make mistakes when deploying, and finally get it working.
This issue occurs whenever you are implementing a central component that your Custodian lambda functions interact with.
Custodian supports sending notifications using c7n-mailer, as mentioned above. The notifications can be sent over email, Slack, Splunk, etc.
If you want to use the default messages, then you are fine.
You will have a hard time if you want to add some text formatting to the message or change the way non-compliant resources are displayed in the notification. Custom messages are written as Jinja2 templates.
Debugging Jinja2 templates is tricky and very tedious. Also, if you mess up the templates, you will not get notifications.
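For context, a policy picks its message template in the notify action. The sketch below assumes a custom Slack template called custom_slack_alert exists in one of the mailer's template folders. That name, the channel, and the queue URL are placeholders.

```yaml
policies:
  - name: s3-bucket-encryption-missing-notify    # illustrative name
    resource: aws.s3
    filters:
      - type: bucket-encryption
        state: false
    actions:
      - type: notify
        # Hypothetical custom template; c7n-mailer resolves the name to a Jinja2 file
        slack_template: custom_slack_alert
        violation_desc: "S3 bucket without default encryption"
        to:
          - "slack://#security-alerts"           # placeholder channel
        transport:
          type: sqs
          # Placeholder URL of the c7n-mailer queue
          queue: https://sqs.us-east-1.amazonaws.com/111111111111/c7n-mailer
```

The policy itself only names the template. The formatting logic lives in the Jinja2 file, which is where the debugging pain comes from.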
Cloud Custodian policies are resource-based. A resource refers to a cloud service or its component that can be managed through policies.
There can be a resource covering an AWS service and a few more resources covering specific parts of the service. For example, Custodian resources for S3 include
While resources give granularity to the policies we create, at times this can mean we have to create multiple policies for a simple task.
Let’s consider this simple task - alert me whenever there’s a GuardDuty finding.
Custodian doesn’t treat GuardDuty or its findings as a resource. Instead, it treats GuardDuty as an execution mode under each resource like EC2, S3, EKS, etc.
So if I need to alert on all GuardDuty findings, I have to create a separate policy for each AWS service - IAM, S3, EKS, etc.
If these policies are deployed as lambda functions, then there would be multiple lambda functions for a single task.
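A sketch of what that duplication looks like, using EC2 and IAM users as the two resources. It assumes the guard-duty mode is available for both in your Custodian version, so check the schema output before relying on it. The role ARN, channel, and queue URL are placeholders.

```yaml
policies:
  # One policy (and one lambda function) per resource type
  - name: guardduty-finding-ec2
    resource: aws.ec2
    mode:
      type: guard-duty
      role: arn:aws:iam::{account_id}:role/CustodianLambdaRole
    actions:
      - type: notify
        violation_desc: "GuardDuty finding on an EC2 instance"
        to:
          - "slack://#security-alerts"
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/111111111111/c7n-mailer

  # Same task, repeated for the next resource type
  - name: guardduty-finding-iam-user
    resource: aws.iam-user
    mode:
      type: guard-duty
      role: arn:aws:iam::{account_id}:role/CustodianLambdaRole
    actions:
      - type: notify
        violation_desc: "GuardDuty finding on an IAM user"
        to:
          - "slack://#security-alerts"
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/111111111111/c7n-mailer
```

The two policies differ only in the resource line and the description, yet each one becomes its own lambda function.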
The remaining issues are not specific to Cloud Custodian; they apply to open-source software in general. When using Custodian, you must:
- Maintain it - Regularly upgrade Custodian CLI to use new features and AWS services added to it. Write and test new policies and optimize existing ones (maybe even ignore a few cloud assets in policies where the business has accepted the risk). Onboard new AWS accounts and remove old ones.
- Monitor it - One bad notification template for c7n-mailer is all it takes to stop notifications. You need to set up monitoring to make sure your Custodian setup is working.
- Expect delayed or no support for issues - You might not get an answer for your issue immediately. You might not get a fix for your bug in the new release. (I admire Kapil Thangavelu’s effort to look into Custodian issues and respond, but I don’t think he alone can answer the queries of many people and fix bugs on priority.)
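The risk-acceptance case from the maintenance point can be sketched with a tag-based exemption. This assumes your team agrees on a tag that marks accepted risks. The tag key custodian-exempt is made up for this example.

```yaml
policies:
  - name: s3-bucket-encryption-missing-with-exemptions    # illustrative name
    resource: aws.s3
    filters:
      # Skip buckets that carry the (hypothetical) exemption tag
      - "tag:custodian-exempt": absent
      - type: bucket-encryption
        state: false
```

Filters in a list are combined with AND by default, so only untagged buckets without default encryption are reported.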
There is one issue specific to this tool.
If you want a management report, like how many misconfigurations were detected and mitigated in the last X months using Custodian, you can’t get it out of the box. Custodian doesn’t come with any UI, so you have to build the reporting on your own (which takes time).
Cloud Custodian is a great tool in your cloud security defense arsenal. I will continue to recommend the tool till there’s a better open-source alternative.
If you are looking for a tool that detects and auto-mitigates issues in less than a minute, there’s nothing like Custodian.
When I started exploring and using Custodian, no one told me about the disadvantages, nor did I find any blog posts like this one. I hope this blog post helps you get a better picture of the pros and cons of using Custodian.
Knowledge comes from reading documentation. Wisdom arises from hands-on experience.
I have spent months exploring, trying, and testing open-source tools to defend AWS and evaluating them in production environments. While guides and tutorials give you a partial picture of tools, I have had the opportunity to battle-test them in the real world.