My Love/Hate Relationship with Cloud Custodian

Badshah

2023-04-10

I’m a huge fan of the Cloud Custodian tool. If this is the first time you are hearing the name, it’s an open-source rules engine for cloud security, cost optimization, and governance.

I have run Custodian at work for over a year, which I think is long enough to describe the experience - especially my love/hate relationship with the tool.

Note: I use Custodian’s cloudtrail execution mode to protect AWS accounts. Some advantages and disadvantages might not apply to you when you use other modes or other cloud providers.

TL;DR:

Love Story
- Open Source and good community support
- Detects and can auto-mitigate issues in real-time
- Cost-effective
- Supports customizable policies
- No open source tool to compete with Custodian
Hate Story
- Lacks documentation and examples
- More AWS accounts could mean more setup chaos
- Creating custom notification messages can be tricky
- Resource-based policies are not always fun
- Other open source software issues - maintenance, upgrade, monitoring, etc

Love Story

Just imagine a tool in your cloud security arsenal that has the following:

- Open Source with community support
- Can detect misconfigurations in near real-time or at periodic intervals
- Supports auto-mitigation (both in real-time and periodic)
- Integrates with your enterprise messaging tool’s webhooks to send alerts
- Cost effective (<100 USD for 1000s of resources in 20+ AWS accounts)
- Customizable detection rules as per your requirement

If you liked the specifications, you might just fall in love with Cloud Custodian.

Cloud Custodian is a rules engine for cloud security, cost optimization, and governance. Since it’s an “engine,” its efficiency depends on the “driver” implementing it (pun intended 😛).

Custodian requires all misconfiguration checks to be defined as YAML files. Since these YAML files (aka policies) are plaintext, you can keep your Compliance as Code next to your Infrastructure as Code and roll out Custodian policies through CI/CD pipelines.

When it comes to misconfiguration detection in AWS, there are tons of open-source tools out there. Custodian stands out among its peers because it lets us define, in the policy itself, how a misconfiguration must be mitigated. The mitigation is automatic and happens in near real-time when using Custodian’s cloudtrail execution mode. This feature is a game changer.
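
A minimal sketch of such a policy, assuming a security group use case: it reacts to the AuthorizeSecurityGroupIngress CloudTrail event and removes any rule that opens SSH to the world. The policy name and the IAM role are hypothetical placeholders.

```yaml
policies:
  - name: sg-remove-open-ssh            # hypothetical policy name
    resource: aws.security-group
    mode:
      type: cloudtrail
      # Placeholder role; the Lambda that Custodian deploys runs with it.
      role: arn:aws:iam::{account_id}:role/custodian-lambda-role
      events:
        # Fire whenever an ingress rule is added to a security group.
        - source: ec2.amazonaws.com
          event: AuthorizeSecurityGroupIngress
          ids: "requestParameters.groupId"
    filters:
      # Match groups that allow SSH from anywhere.
      - type: ingress
        Ports: [22]
        Cidr:
          value: "0.0.0.0/0"
    actions:
      # Auto-mitigation: drop only the ingress rules matched above.
      - type: remove-permissions
        ingress: matched
```

Detection and mitigation live in the same file: the filters decide what counts as a misconfiguration, and the actions decide what happens to it.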

To put it in a single sentence: Cloud Custodian is a great tool in your Cloud Security CSPM arsenal.

Also, this tool has an unfair advantage - there’s no open-source alternative to it.

I have come across a few similar tools, but none of them came close to Custodian:

- airbnb/streamalert - a serverless real-time framework with alerts based on the detection logic you define. The tool is in maintenance mode without new releases. Its creator has left Airbnb to found Panther.com.
- CRED-CLUB/DIAL - DIAL (Did I Alert Lambda?) is a centralized security misconfiguration detection framework. The tool has not received any updates, covers only a handful of AWS services, and is a bit tedious to set up.

Cloud Custodian’s policies are easier to write than AWS Config rules.

For example, to fetch all S3 buckets which don’t have encryption:

Custodian Policy:

policies:
  - name: s3-bucket-encryption-off
    resource: s3
    filters:
      - type: bucket-encryption
        state: False

AWS Config rule: https://github.com/awslabs/aws-config-rules/blob/master/python/s3_bucket_default_encryption_enabled.py

Hate Story

The above love story shows the advantages of Cloud Custodian. But don’t let that fool you into believing it’s the best tool. It has its fair share of disadvantages.

After deploying the tool in all AWS accounts at work (both prod and non-prod) and maintaining it for over a year, I faced some situations that are neither documented nor talked about in the cloud security community.

1. Lack of documentation & examples

Lack of documentation is by far the biggest concern. The existing documentation is good but not good enough. It covers the most common setup guides, sample policies, and tools like c7n-org and c7n-mailer.

If you are looking for simple misconfiguration policies like “S3 bucket encryption missing”, “IAM user having console access without MFA”, or “Public EBS snapshots”, you can find them online.

But if you are trying to implement any policy other than the most simple and obvious ones, you will have to figure it out yourself.

You need to check the schema of the resource using custodian schema aws.<service> and figure out which action/filter would be helpful for your requirement.

Also, you can’t find a single repository containing all Custodian policies for CIS benchmarks or similar. A repository like that would make the life of a Cloud Security Engineer much easier.

Tools like CloudQuery or Steampipe win over Custodian (in detection) because they have community-curated checks in a single repository.

2. More AWS accounts can mean more setup chaos

If you have a single AWS account, setup is super easy.

If you have a few accounts under an AWS organization, the setup is easy. You can take the help of c7n-org to deploy lambdas across accounts for near real-time alerts/auto-mitigations.
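
For reference, c7n-org reads the list of target accounts from a YAML file. A rough sketch, with made-up account IDs, names, and role ARNs:

```yaml
# accounts.yml (all account IDs, names, and roles below are placeholders)
accounts:
  - account_id: "111111111111"
    name: prod-payments
    regions:
      - us-east-1
    # Role that c7n-org assumes in the member account.
    role: arn:aws:iam::111111111111:role/custodian-deploy
    tags:
      - env:prod
  - account_id: "222222222222"
    name: dev-sandbox
    regions:
      - us-east-1
    role: arn:aws:iam::222222222222:role/custodian-deploy
    tags:
      - env:dev
```

Running c7n-org run -c accounts.yml -s output -u policies.yml then applies the policies in policies.yml to every account in the file.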

The slight complexity lies in deploying c7n-mailer in a multi-account setup. c7n-mailer is a utility that allows Custodian policies (deployed as lambdas) to send notifications when they detect misconfigurations.

This c7n-mailer utility needs:

- an SQS queue - to get messages from policies deployed as lambdas
- a lambda function - to parse the messages in the queue and send notifications

One needs to take care of the following to make a “secure” deployment of the security tool:

- To reduce complexity, have a central c7n-mailer SQS queue. You need to make sure the AWS accounts in your organization can send messages to the queue while accounts outside your organization can’t.
- c7n-mailer stores your Slack webhook/token, Splunk token, etc. as plain text in the lambda function. You need to encrypt that sensitive data with a KMS key before storing it in code.
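
One way to handle the first point is a queue policy conditioned on aws:PrincipalOrgID. The CloudFormation fragment below is a sketch under that assumption; the resource names, the queue name, and the OrgId parameter are illustrative and would need adapting to your environment.

```yaml
Parameters:
  OrgId:
    Type: String
    Description: AWS Organizations ID, e.g. o-xxxxxxxxxx (placeholder)

Resources:
  MailerQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: custodian-mailer        # hypothetical queue name

  MailerQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref MailerQueue               # Ref returns the queue URL
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowOrgAccountsToSend
            Effect: Allow
            Principal: "*"
            Action: "sqs:SendMessage"
            Resource: !GetAtt MailerQueue.Arn
            # Only principals that belong to the organization may send.
            Condition:
              StringEquals:
                "aws:PrincipalOrgID": !Ref OrgId
```

The wildcard principal is safe only because of the condition; without it, any AWS account could push messages into the queue.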

There’s no automated way to deploy the above-mentioned secure setup. One has to make an effort to search & read documentation, make mistakes when deploying, and finally get it working.

This issue occurs whenever you are implementing a central component that your Custodian lambda functions interact with.

3. Creating custom notification messages can be tricky

Custodian supports sending notifications using c7n-mailer as mentioned above. The notifications can be sent over Email, Slack, Splunk, etc.

There are some sample notification templates in the code repository. Custodian uses the Jinja2 templating engine to create the notification messages.

If you want to use the default messages, then you are fine.

You will have a hard time if you want to add text formatting to the message or change the way non-compliant resources are displayed in the notification.

Debugging Jinja2 templates is tricky and very tedious. Also, if you mess up the templates, you will not get notifications.
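
For context, this is roughly how a policy points c7n-mailer at a custom Slack template. The template name custom-slack, the Slack channel, and the queue URL are made up; as far as I can tell, c7n-mailer then looks for a file named custom-slack.j2 in its template folders.

```yaml
policies:
  - name: s3-bucket-encryption-off-notify   # hypothetical policy name
    resource: s3
    filters:
      - type: bucket-encryption
        state: False
    actions:
      - type: notify
        # Name of the Jinja2 template, without the .j2 extension.
        slack_template: custom-slack
        violation_desc: S3 bucket without default encryption
        to:
          - "slack://#cloud-security-alerts"   # placeholder channel
        transport:
          type: sqs
          # Placeholder URL of the central c7n-mailer queue.
          queue: https://sqs.us-east-1.amazonaws.com/111111111111/custodian-mailer
```

The policy only puts a message on the queue. Rendering happens later in the mailer lambda, so a broken template does not fail the policy run; the notification simply never arrives.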

4. Resource-based policies are not always fun

Cloud Custodian policies are resource-based. A resource refers to a cloud service or its component that can be managed through policies.

There can be a resource covering an AWS service and a few more resources covering specific parts of the service. For example, Custodian resources for S3 include aws.s3, aws.s3-access-point and aws.s3-access-point-multi.

While resources give granularity to the policies we create, at times this can mean we have to create multiple policies for a simple task.

Let’s consider this simple task - alert me whenever there’s a GuardDuty finding.

Custodian doesn’t consider GuardDuty or its findings as a resource. Instead, it considers GuardDuty as an execution mode under each resource like EC2, S3, EKS, etc.

So if I need to alert on all GuardDuty findings, I have to create a separate policy for each AWS service - IAM, S3, EKS, etc.

If these policies are deployed as lambda functions, then there would be multiple lambda functions for a single task.
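
To illustrate the duplication, here is a sketch with two near-identical policies that differ only in the resource type. The resource types shown are ones I am reasonably sure support the guard-duty mode; the policy names, role, Slack channel, and queue URL are placeholders.

```yaml
vars:
  # Shared notify action, reused below through a YAML anchor.
  notify-finding: &notify-finding
    type: notify
    violation_desc: GuardDuty finding
    to:
      - "slack://#cloud-security-alerts"   # placeholder channel
    transport:
      type: sqs
      queue: https://sqs.us-east-1.amazonaws.com/111111111111/custodian-mailer

policies:
  # One policy (and one lambda) per resource type.
  - name: guardduty-ec2-findings
    resource: aws.ec2
    mode:
      type: guard-duty
      role: arn:aws:iam::{account_id}:role/custodian-lambda-role
    actions:
      - *notify-finding

  - name: guardduty-iam-user-findings
    resource: aws.iam-user
    mode:
      type: guard-duty
      role: arn:aws:iam::{account_id}:role/custodian-lambda-role
    actions:
      - *notify-finding
```

The anchor removes some repetition in the file, but each policy still becomes its own lambda function once deployed.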

5. Other open-source software issues

This issue is not something specific to Cloud Custodian but OSS in general. When using Custodian, you must:

- Maintain it - Regularly upgrade the Custodian CLI to use new features and AWS services added to it. Write and test new policies and optimize existing ones (maybe even ignore a few cloud assets in policies where the business has accepted the risk). Onboard new AWS accounts and remove old ones.
- Monitor it - One bad notification template for c7n-mailer is all it takes to stop notifications. You would need to set up monitoring to make sure your Custodian setup is working.
- Expect delayed or no support for issues - You might not get an answer for your issue immediately. You might not get a fix for your bug in the new release. (I admire Kapil Thangavelu’s effort to look into Custodian issues and respond, but I don’t think he alone can answer the queries of many people and fix bugs on priority.)
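
As an example of the accepted-risk exemptions mentioned under maintenance, a tag check can be added as the first filter. The tag key custodian-exempt is an invented convention, not something Custodian defines.

```yaml
policies:
  - name: s3-bucket-encryption-off
    resource: s3
    filters:
      # Skip buckets the business has explicitly exempted via a tag.
      - "tag:custodian-exempt": absent
      - type: bucket-encryption
        state: False
```

Filters are combined with AND by default, so a bucket is reported only if it lacks the exemption tag and has encryption turned off.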

There is one issue specific to this tool.

If you want a management report, such as how many misconfigurations Custodian detected and mitigated in the last X months, you can’t get one out of the box. Custodian doesn’t come with any UI, so you have to build the reporting yourself (which takes time).

Final Thoughts

Cloud Custodian is a great tool in your cloud security defense arsenal. I will continue to recommend the tool till there’s a better open-source alternative.

If you are looking for a tool that can detect and auto-mitigate issues in less than a minute, there’s nothing like Custodian.

When I started exploring and using Custodian, no one told me about the disadvantages, nor did I find any blog posts like this one. I hope this post gives you a better picture of the pros and cons of using Custodian.

If you have any doubts, suggestions, or alternatives to this tool - feel free to DM me on LinkedIn or Twitter.

Shameless Plug

Knowledge comes from reading documentation. Wisdom arises from hands-on experience.

I have spent months exploring, trying, and testing open-source tools to defend AWS and evaluating them in production environments. While guides and tutorials give you a partial picture of tools, I have had the opportunity to battle-test them in the real world.

Kumar Ashwin and I have condensed our real-world experiences on Attacking and Defending AWS into a three-day hands-on training - “AWS Security Masterclass”.

If you are trying to step up your AWS Security or even get to know the latest tools and techniques in the market, register for the training at x33fcon, Gdynia, Poland.