
PostgreSQL Deadlock Monitoring in AWS

June 19, 2019

When it comes to a robust, open-source, and fairly secure database system, PostgreSQL on Amazon RDS is often a top choice for developers. That's why we wanted to get under the hood to better understand PostgreSQL monitoring tools and their performance.

Many businesses utilize AWS and, in turn, RDS. In order to ensure that your customers are getting the most out of their user experience, you have to make sure that things are running smoothly under the hood of your website or application. Let’s take a look at the current setup and see how our tools can help you establish the proper precautions while monitoring performance.

In our first extended look at monitoring Amazon's Relational Database Service (RDS), we will cover basic RDS monitors, Amazon Web Services (AWS) Lambda specifics, and our new monitoring tool, the RDS Alarms Lambda. In the next article, we will show how we deploy the RDS Alarms Lambda together with the pieces that glue it into our infrastructure.

Why look at PostgreSQL monitoring tools? One example:

Imagine this: It is a normal, quiet day with no deployments or planned maintenance. Everything is going smoothly.

Suddenly, you notice dramatically increasing latency on your sandbox app:

[Image: rds-monitoring-pic1 (latency spike on the sandbox app)]

You quickly figure out that the app is dropping DB connections:

[Image: rds-monitoring-pic2 (dropped DB connections)]

But there are no other alerts, and RDS is all green.

Later, you discover that the source was an engineer running pg_dump against the sandbox DB to check a data migration. And it caused a PostgreSQL deadlock. But how?

From the pg_dump man page:

pg_dump is a utility for backing up a PostgreSQL database. It makes consistent backups even if the database is being used concurrently. pg_dump does not block other users accessing the database (readers or writers).

Or does it?

You may have gotten this response:

pg_dump takes a reader's lock (AccessShareLock). This does not prevent reads or writes of the table. It does prevent dropping the table, making schema alterations to it (such as column add/drop), or anything else that wants an exclusive lock.

Gotcha! There must be something that takes an exclusive lock in the app, and it must have happened at the same moment as the pg_dump.
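Had you been able to peek inside the database at that moment, a quick look at pg_stat_activity would have shown the pile-up. Here is a minimal illustrative sketch (connection details are placeholders, and pg_blocking_pids() assumes PostgreSQL 9.6+):

```python
# Hypothetical diagnostic: list sessions stuck waiting on locks and the
# backends blocking them. Host, database, and credentials are made up.
import psycopg2

QUERY = """
SELECT pid,
       usename,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""

with psycopg2.connect(host="sandbox-db.example.com", dbname="app",
                      user="monitor", password="secret") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, user, wait_type, blocked_by, query in cur.fetchall():
            print(f"pid={pid} user={user} waiting ({wait_type}), "
                  f"blocked by {blocked_by}: {(query or '')[:80]}")
```

In the pg_dump scenario above, the chain looks like this: the app's exclusive lock waits behind pg_dump's AccessShareLock, and every subsequent reader and writer of that table queues behind the pending exclusive lock.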

You've learned your lesson: the exclusive locking was found and fixed, and pg_dump will never again be run against a customer-facing DB.

However, this has unveiled another issue: You still have no clue what's going on at the DB level. There was a PostgreSQL deadlock for about 20 minutes and no real-time alerts from PostgreSQL RDS/CloudWatch. How can you address that? Let's start with what you already have in AWS.

Basic PostgreSQL RDS Monitoring

Whenever you open your RDS, you get a standard dashboard, like this:
[Image: rds-monitoring-pic3 (standard RDS dashboard)]

You also get a set of CloudWatch metrics:

[Image: rds-monitoring-pic4 (RDS CloudWatch metrics)]

We use these to build CloudWatch alarms on performance metrics such as low storage, high CPU, connection count, disk usage, buffer cache, and more.

We also use RDS event notifications, which go through SNS to Slack:
[Image: rds-monitoring-pic5 (RDS event notifications in Slack)]

There's also Enhanced RDS Monitoring (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html) and RDS Performance Insights (https://aws.amazon.com/rds/performance-insights/).

A lot comes in the box, but it is clear that basic RDS performance monitoring is still not enough, as it cannot address deadlocks in real-time.

NOTE: RDS Performance Insights and metrics will actually show you slow locks, but we didn't find a way to provide alerts based on this analysis.

Anyway, this feature hadn't been released when this article was first published.

Extensions and Alternatives to CloudWatch

In our research, we did not look at heavyweight APM and monitoring solutions like Datadog and New Relic. We already use New Relic in a limited way, and extending the subscription or buying a new product for this single alert seemed less than optimal. So we needed something lightweight, preferably open-source, so that we could fork and tune it.

The first thing that came to mind was to parse the RDS logs. At VGS, we use fluent-based logging pipelines, so it should be easy to connect the RDS log stream to our log-forwarding stack. In RDS, we would need to tune "log_lock_waits", "log_statement", and "log_min_duration_statement". Some combination of these may work, and we would get the desired "ERROR: deadlock detected". However, we found out that making these more verbose also slows down the database. We also needed to figure out how to be alerted when the deadlock stops emitting log records.

So we then tried the statistics collector's pg_stat_activity view. With a couple of queries, this gives exactly what we need (a simplified sketch follows below). We didn't find any good open-source candidates to adopt, so we built our own. Here's what we've come up with so far: https://github.com/verygood-ops/rds-alarms-lambda/blob/master/rds_alarms_lambda/deadlocks.py. Let's pack this script into a nice-looking tool and run it via cron. Or should we use cron in 2019?
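The core of that check looks roughly like this; it's a simplified sketch with an assumed 30-second threshold, not the actual deadlocks.py code:

```python
# Simplified sketch of a lock-wait check against pg_stat_activity.
# The threshold and return format are assumptions; see deadlocks.py for
# the real implementation.
LOCK_WAIT_QUERY = """
SELECT pid, usename, now() - query_start AS waiting_for, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND now() - query_start > interval '30 seconds';
"""

def check_lock_waits(conn):
    """Return sessions that have been stuck waiting on a lock too long."""
    with conn.cursor() as cur:
        cur.execute(LOCK_WAIT_QUERY)
        return cur.fetchall()
```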


AWS Lambda

Why AWS Lambda?

  1. Because an AWS RDS monitoring tool doesn't have to be an instance. We're trying to keep the instance count low, because every instance is a maintenance burden.
  2. Because it's cheap: called once a minute, that's 43,200 invocations per month, well within the AWS free tier.
  3. Because it can have tightened access to RDS: Security Groups and Subnets let it connect to only a single RDS instance and nothing more.
  4. Because it's very transparent and easy to monitor. We'll show exactly what we mean in a minute.
  5. Because it has a Python runtime. Python has everything we need to build a framework for future extensions, and it's widely used at VGS.

Cool, but how do we pack a heavy PostgreSQL driver into a lambda, how many resources will it need, and how do we call it once a minute and make sure it keeps running?

Challenges

If you build the lambda archive on a Mac, the driver will be compiled for macOS, and the AWS Lambda runtime (Amazon Linux) will not accept it. So we came up with a Docker-based build: https://github.com/verygood-ops/rds-alarms-lambda/blob/master/ops/package.sh.

We also tweaked the main Python script to hide that huge dependency tree inside a "vendor" dir (hello, golang!): https://github.com/verygood-ops/rds-alarms-lambda/blob/master/main.py#L4-L5.
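The gist of that tweak is just prepending the vendored directory to the import path, roughly like this (a sketch, not the exact main.py lines):

```python
# Make the bundled "vendor" directory importable first, so heavy
# dependencies like psycopg2 resolve from the package shipped with the lambda.
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "vendor"))
```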

The next problem we noticed was the absence of Python logs:

[Image: rds-monitoring-pic6 (lambda output with no Python logs)]

The internet suggests that you should re-attach log handlers after lambda init. OK, we hacked this too: https://github.com/verygood-ops/rds-alarms-lambda/blob/master/rds_alarms_lambda/lib.py#L19-L23
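The hack boils down to reconfiguring the root logger after Lambda has already installed its own handler, along these lines (a sketch, not the exact lib.py code):

```python
# Re-attach logging after Lambda init: the runtime pre-installs a handler on
# the root logger, so a plain basicConfig() call is a no-op. Dropping the
# existing handlers and reconfiguring makes our log records show up.
import logging

def setup_logging(level=logging.INFO):
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
```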

[Image: rds-monitoring-pic7 (Python logs showing up after the fix)]

Nice!

In terms of resources and performance, we went ahead with the default 128 MB of RAM, whatever CPU share that allocates (Lambda calculates it automatically), and a 30-second max execution timeout.

And here it is:

[Image: rds-monitoring-pic8 (lambda resource usage and execution time)]

Even better than we expected.

Gluing Pieces: CloudWatch Timer and SNS

The next issue was periodic calls. Lambda itself does not have a schedule. However, you can create a CloudWatch Event Rule (which even supports cron notation) and set it as the lambda's trigger: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html.
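We set this up through our deployment stack (covered in the next article), but the wiring can be expressed with boto3 roughly as follows; the function name, rule name, and ARN are placeholders:

```python
# Sketch: invoke the lambda once a minute via a CloudWatch Event Rule.
# All names and ARNs below are hypothetical.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:rds-alarms-lambda"

rule = events.put_rule(
    Name="rds-alarms-every-minute",
    ScheduleExpression="rate(1 minute)",  # cron(...) expressions also work
    State="ENABLED",
)
lambda_client.add_permission(
    FunctionName="rds-alarms-lambda",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="rds-alarms-every-minute",
    Targets=[{"Id": "rds-alarms-lambda", "Arn": FUNCTION_ARN}],
)
```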

[Image: rds-monitoring-pic11 (CloudWatch Event Rule schedule)]
Cool.

Monitoring the monitor itself went easy too: we used two standard CloudWatch metrics for Lambda, Errors and Invocations.
[Image: rds-monitoring-pic9 (CloudWatch metrics for the lambda)]

Errors should always be 0, and Invocations should be 1 per minute. Easy.
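For reference, the two alarms could be expressed roughly like this; the alarm names, function name, and SNS topic ARN are placeholders:

```python
# Sketch of the two lambda health alarms. Names and ARNs are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"
DIMENSIONS = [{"Name": "FunctionName", "Value": "rds-alarms-lambda"}]

# Any error at all is a problem.
cloudwatch.put_metric_alarm(
    AlarmName="rds-alarms-lambda-errors",
    Namespace="AWS/Lambda", MetricName="Errors", Dimensions=DIMENSIONS,
    Statistic="Sum", Period=60, EvaluationPeriods=1,
    Threshold=0, ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC],
)

# Fewer than one invocation per minute means the schedule is broken.
cloudwatch.put_metric_alarm(
    AlarmName="rds-alarms-lambda-invocations",
    Namespace="AWS/Lambda", MetricName="Invocations", Dimensions=DIMENSIONS,
    Statistic="Sum", Period=60, EvaluationPeriods=1,
    Threshold=1, ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=[SNS_TOPIC],
)
```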

Now, alerts. The lambda log is great, but how do we report findings to Slack/PagerDuty? At VGS, we often use SNS for error routing, and we already have a bunch of topics connected to different destinations and alert levels. So, with SNS, how do we pass credentials? Similar to EC2, Lambda supports the AWS metadata service.

The metadata service is one of the credential providers in the boto library, and boto, in turn, is already included in the Lambda Python runtime, so you don't even need to vendor it. Given all of the above, once you set up the lambda's IAM role properly, you can simply call boto without giving it any secrets, and it can post to SNS.
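Publishing then boils down to something like this (the topic ARN is a placeholder; we use boto3, which the Python runtime bundles):

```python
# Minimal sketch: no credentials are passed explicitly. boto3 picks them up
# from the lambda's IAM role through its credential provider chain.
import boto3

def notify(subject, message,
           topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts"):
    sns = boto3.client("sns")
    sns.publish(TopicArn=topic_arn, Subject=subject, Message=message)
```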

Done. Important messages are logged and additionally sent to SNS. Also, both lambda monitoring alarms report to the same SNS topic. This means that, when something goes wrong, we see it in one place.

[Image: rds-monitoring-pic10 (deadlock alert delivered via SNS)]

Finally, we have a deadlock alert!

Next Steps

The lambda code is reusable for arbitrary queries, so it can be extended to run health checks, monitor the consistency of loosely coupled schemas, look for slow queries (which we already do in deadlocks.py), run DB replications, and do anything else you can express as a query against RDS.

The code is distributed under MIT and posted on our GitHub: https://github.com/verygood-ops/rds-alarms-lambda

With these AWS RDS monitoring tools, you can keep a closer eye on your database performance and help ensure that your site or app runs smoothly for both you and your customers. If something does go wrong, you'll be able to locate the problem and make the appropriate corrections quickly. Take proper precautions in terms of monitoring and alarms, because downtime isn't good for anyone.

In the next article, we will cover CloudFormation and the troposphere deploy process we use in VGS infra, as well as show the stack we created for this RDS Lambda monitor.

Max Lobur

DevOps Engineer

