Tolga Talks Tech: Managing your AWS Environment with AWS Systems Manager

Tolga Tarhan

Managed Services
June 3, 2019

Tolga Talks Tech is a weekly video series in which Onica’s CTO Tolga Tarhan tackles technical topics related to AWS and cloud computing. This week, Tolga discusses AWS CloudFormation, AWS Lambda and AWS Systems Manager with Brandon Pierce, Engineering Director at Onica.

I want to talk about how we deal with customer environments that we inherit as part of onboarding to our MSP. They were obviously deployed in all kinds of different ways, with all kinds of different tools, and you as an MSP team have to manage them. What’s your strategy for that?

For these environments that we’re inheriting, we deploy into the accounts what we call a suite of overlay tools. Overlay tools allow us to manage backup, DR, monitoring and logging for existing workloads.

Give me an example of what that looks like. How do these tools work?

These tools are deployed into different environments. They’re off by default and non-intrusive by default. Customers opt in to the services these tools provide by adding tags or additional configuration parameters to their workloads.
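As an illustration of tag-based opt-in, here is a minimal Python sketch. The tag key `overlay:monitoring`, the accepted values, and the shape of the resource records are all assumptions made for the example; the interview does not name the actual tags or data structures Onica uses.

```python
from typing import Dict, List

# Hypothetical tag key; the real opt-in tag is not named in the interview.
OPT_IN_TAG = "overlay:monitoring"


def is_opted_in(tags: Dict[str, str], tag_key: str = OPT_IN_TAG) -> bool:
    """Return True only when the opt-in tag is present with a truthy value.

    A missing tag means "off", which matches the off-by-default behavior
    described above.
    """
    return tags.get(tag_key, "").strip().lower() in {"true", "yes", "enabled"}


def select_opted_in(resources: List[dict], tag_key: str = OPT_IN_TAG) -> List[str]:
    """Return the IDs of resources whose tags opt them in to the overlay tool."""
    return [r["id"] for r in resources if is_opted_in(r.get("tags", {}), tag_key)]
```

Because the check treats an absent tag as "off", a tool built this way leaves every untagged workload untouched until its owner explicitly opts in.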

You mentioned, for example, alarm enhancement. How would a tool like that work tangibly?

Since these environments have resources that have already been deployed into them, we scan the environments for certain types of EC2 or RDS resources, collect the out-of-the-box metrics that exist for them and then create alarms with reasonable defaults. We can then use AWS Systems Manager Parameter Store to define workload-specific thresholds or criteria for triggering these alarms.
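The defaults-plus-overrides idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the default values, the parameter path `/overlay/alarms/<workload>/cpu-threshold`, and the alarm naming scheme are hypothetical, and the `overrides` dictionary stands in for values that would be read from Parameter Store. The resulting dictionary uses the argument names of CloudWatch's `put_metric_alarm` call in boto3, so it could be passed to that call as keyword arguments.

```python
from typing import Dict

# Assumed "reasonable defaults" for an EC2 CPU alarm; the real values are not
# given in the interview.
DEFAULT_CPU_ALARM = {
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,
    "EvaluationPeriods": 3,
    "Threshold": 90.0,
    "ComparisonOperator": "GreaterThanThreshold",
}


def parameter_name(workload: str) -> str:
    """Hypothetical Parameter Store path holding a workload's CPU threshold."""
    return f"/overlay/alarms/{workload}/cpu-threshold"


def build_cpu_alarm(instance_id: str, workload: str, overrides: Dict[str, str]) -> dict:
    """Merge the defaults with an optional workload-specific threshold.

    `overrides` maps parameter names to string values, mirroring how
    Parameter Store returns values as strings.
    """
    alarm = dict(DEFAULT_CPU_ALARM)  # copy so the defaults stay untouched
    raw = overrides.get(parameter_name(workload))
    if raw is not None:
        alarm["Threshold"] = float(raw)
    alarm["AlarmName"] = f"overlay-cpu-{instance_id}"
    alarm["Dimensions"] = [{"Name": "InstanceId", "Value": instance_id}]
    return alarm
```

Keeping the thresholds in Parameter Store rather than in the tool's code means a workload owner can tune an alarm without anyone redeploying the tool.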

So you can take an environment that was maybe deployed without consideration for monitoring, or maybe that was hand-rolled, and overlay robust alarming and monitoring on it.

Correct. And we can do it without interrupting a customer’s deployment model, without redeploying anything, and without updating their AWS CloudFormation templates, Terraform or whatever it may be.

These alarms often follow similar patterns, like CPU or load average. Do we have anything that can take that further?

Yes. AWS does provide a great metric and alarming framework for us, but some of the context is lost when these alarms go off. Think of it this way: if you’re an on-call engineer or SRE and you receive an alert for some arbitrary instance ID or resource ID and some metric that’s off, that’s great, but what does it mean? You inevitably have to log into systems, collect additional information and try to figure out what’s going on. Sometimes by the time you’re able to react, you’ve lost a lot of that information. So some of these tools can determine that, given a certain type of alarm, they should go collect supplemental information. A really common pattern is a CPU alarm on an EC2 instance: we can react to that alarm, automatically log into the system, regardless of its OS, start collecting information immediately and then pass it on to the on-call engineer.
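One way to picture this enrichment step is a small Python function that maps an alarm to a diagnostic command request. This is a sketch under assumptions, not Onica's implementation: the simplified alarm dictionary, the choice of diagnostic commands, and the function name are hypothetical. The document names `AWS-RunShellScript` and `AWS-RunPowerShellScript` are AWS-managed Systems Manager documents, and the returned keys match the arguments of the Systems Manager `send_command` call in boto3.

```python
# Per-platform SSM document and illustrative commands for a CPU alarm.
# The commands themselves are assumptions chosen for the example.
CPU_DIAGNOSTICS = {
    "linux": (
        "AWS-RunShellScript",
        ["uptime", "ps aux --sort=-%cpu | head -n 15"],
    ),
    "windows": (
        "AWS-RunPowerShellScript",
        ["Get-Process | Sort-Object CPU -Descending | Select-Object -First 15"],
    ),
}


def build_diagnostic_command(alarm: dict, platform: str) -> dict:
    """Build Run Command arguments that collect context for a CPU alarm.

    `alarm` is a simplified, hypothetical record with the keys "metric" and
    "instance_id"; a real alarm notification carries more fields.
    """
    if alarm["metric"] != "CPUUtilization":
        raise ValueError(f"no diagnostics defined for metric {alarm['metric']!r}")
    if platform not in CPU_DIAGNOSTICS:
        raise ValueError(f"unsupported platform {platform!r}")
    document, commands = CPU_DIAGNOSTICS[platform]
    return {
        "InstanceIds": [alarm["instance_id"]],
        "DocumentName": document,
        "Parameters": {"commands": list(commands)},
        "Comment": f"CPU alarm diagnostics for {alarm['instance_id']}",
    }
```

Selecting the document by platform is what lets the same alarm handler work regardless of the instance's OS; the command output can then be attached to the notification sent to the on-call engineer.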

You get all this value and all this capability on top of existing environments by just deploying a few tools. Besides alarming, in what other areas do we build tools like this?

We have others around DR and backups. We can take existing snowflake instances, standalone EC2 resources that traditionally would not have HA or good DR capability, and provide that capability to those resources in place, including things like cross-region and cross-AZ replication.
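The interview does not say how the replication is implemented. As one possible illustration of the cross-region part, the Python sketch below plans EBS snapshot copies into a DR region. The function name and the description format are hypothetical; the keys of each request match the arguments of the EC2 `copy_snapshot` call in boto3, which is issued from a client created in the destination region.

```python
from typing import List


def plan_snapshot_copies(
    snapshot_ids: List[str], source_region: str, dr_region: str
) -> List[dict]:
    """Return one copy request per snapshot, to be issued in `dr_region`.

    This only plans the copies; it makes no AWS calls itself.
    """
    if source_region == dr_region:
        raise ValueError("DR region must differ from the source region")
    return [
        {
            "SourceRegion": source_region,
            "SourceSnapshotId": snapshot_id,
            "Description": f"DR copy of {snapshot_id} from {source_region}",
        }
        for snapshot_id in snapshot_ids
    ]
```

Separating the planning from the API calls keeps the logic easy to test and makes it straightforward to review what would be copied before anything runs.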

How does Onica build and release these tools?

Rather than using third-party solutions or rebranding existing solutions, we built all of this with Python, AWS CloudFormation and other AWS-native technologies like AWS Lambda and AWS Systems Manager.