Database Emergency Response Plan: The 3 Cs

Preventing and responding to database performance problems is a critical component of our responsibilities, and we need to have a solid game plan to be effective. This document is a general guide to dealing with database performance issues. It is a living document, and as we encounter and deal with emergency situations, we will continue to build up our toolbox and experience.

  1. Notice the CLUES
  2. Diagnose the CAUSE
  3. Attempt a CURE

Notice the clues

The first step is always to notice the clues that point to a performance issue; this will help us determine the appropriate response to the problem. In some cases several metrics show problems at once, so we need to build up a bank of responses based on both individual clues and combinations of them, and learn which ones represent urgent, critical problems.

Almost all signs of impending or current problems with a database's performance can be observed on the AWS Console (either through CloudWatch metrics or RDS Performance Insights). These metrics should have alarms configured to notify the team as soon as symptoms appear, allowing for a timely and effective response.

Signs that there is a problem

  • High CPU utilisation
  • High DB load (CPU vs non-CPU)
  • High connection count
  • High wait load
  • Worst-case scenario: noticeable performance degradation for users

These indicators are relative to the normal operating state of each specific project. Therefore, it's not always possible to prescribe exact thresholds. Each project team should analyse its baseline workload, factoring in expected peak periods, and determine sensible alarm limits for these metrics.

It's important to note that each symptom can stem from a variety of underlying causes. For instance, high connection counts often correlate with high CPU usage, but the relationship is not always causal. When one of these metrics triggers an alarm, it's essential to investigate thoroughly to determine the root cause, rather than treating the symptom in isolation.

Diagnose the cause

Once we have the various clues to work from, the next step is to find the underlying issue causing the problem. None of the clues are directly fixable themselves, so we can only fix the issue once we find the cause.

Each clue points to different possible causes. To diagnose effectively, we need to examine combinations of clues and eliminate irrelevant ones. In most cases, we'll run diagnostic queries on the database, although sometimes the root issue can be spotted without scripts.

Potential causes:

  • Lock contention
  • Long-running transactions
  • Expensive queries
  • Idle connections
  • Vacuum is not running or running continuously
  • Redis failing (e.g. out of memory)
  • PostgreSQL is in recovery mode

Lock contention

Lock contention occurs when multiple queries try to acquire the same or conflicting locks on rows. It's usually quick to diagnose using the following query:

SELECT activity.pid,
       activity.usename,
       activity.query,
       blocking.pid AS blocking_id,
       blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY(pg_blocking_pids(activity.pid))
ORDER BY blocking_id;

If one query is blocking others, you'll need to evaluate whether it's safe to cancel it. This action can result in data loss, so it should only be used as a last resort.
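
To see which specific locks are being waited on, the pg_locks view can be joined to pg_stat_activity as a complementary check (for lock types other than relation locks, the relation column will be blank):

-- granted = false means the backend is still waiting for this lock.
SELECT locks.pid,
       locks.locktype,
       locks.mode,
       locks.granted,
       locks.relation::regclass AS relation,
       activity.query
FROM pg_locks AS locks
JOIN pg_stat_activity AS activity ON activity.pid = locks.pid
WHERE NOT locks.granted
ORDER BY locks.relation;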

Long-running queries

To identify the longest-running active queries:

SELECT pid,
       datname,
       usename,
       application_name,
       client_hostname,
       client_port,
       backend_start,
       query_start,
       query,
       state
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC;

For slow queries, use EXPLAIN ANALYSE to inspect query plans. Note: ANALYSE executes the query, so use it only with SELECT statements. For UPDATE, INSERT, or DELETE, use EXPLAIN only.
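
For example, with a hypothetical orders table:

-- ANALYZE runs the statement; safe here because a SELECT has no side effects.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

-- For data-modifying statements, inspect the plan without executing it.
EXPLAIN UPDATE orders SET status = 'shipped' WHERE id = 123;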

Expensive queries

Expensive queries aren't always slow, but they consume excessive resources, which can starve other processes. In severe cases, they fill the instance's buffer memory and lead to a snowballing effect.

One past incident involved BufferIO wait events continuing to climb even after the problematic queries were cancelled. We eventually had to restart the reader instance, a last-resort measure.
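
Assuming the pg_stat_statements extension is available on the instance, a query like the following can surface the most expensive statements by total execution time:

-- Columns are total_exec_time/mean_exec_time on PostgreSQL 13+;
-- older versions use total_time/mean_time instead.
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;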

Idle connections

Idle connections still use memory, and too many of them can degrade performance. Use this query to find them:

SELECT pid,
       usename,
       query_start,
       now() - query_start AS query_time,
       query,
       state,
       wait_event_type,
       wait_event
FROM pg_stat_activity
WHERE (now() - query_start) > INTERVAL '5 minutes';

These will usually report as idle—indicating held but unused connections. This can happen for various reasons, including RDS Proxy pinning. A consistently higher idle count may signal a deeper issue.
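
A quick way to see where connections are coming from and what state they are in:

-- Connection counts per state and application; a large 'idle' count for one
-- application is a hint about where to look next.
SELECT state,
       application_name,
       count(*) AS connections
FROM pg_stat_activity
GROUP BY state, application_name
ORDER BY connections DESC;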

Vacuum not running or running continuously

Tables need regular vacuuming to avoid bloat and keep indexes working effectively. Busy or large tables may not be vacuumed often enough, or the vacuum may take too long to complete.

To see vacuum stats for each table:

SELECT relname AS TableName,
       n_live_tup AS LiveTuples,
       n_dead_tup AS DeadTuples,
       last_autovacuum AS Autovacuum,
       last_autoanalyze AS Autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

To check if vacuum processes are stuck:

SELECT datname,
       usename,
       pid,
       state,
       wait_event,
       current_timestamp - xact_start AS xact_runtime,
       query
FROM pg_stat_activity
WHERE upper(query) LIKE '%VACUUM%'
ORDER BY xact_start;
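
It can also help to review the instance's autovacuum configuration and, if a specific table looks badly bloated, vacuum it by hand (the table name below is illustrative):

-- Current autovacuum-related settings.
SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';

-- Manual vacuum of a hypothetical table; VERBOSE reports progress.
VACUUM (VERBOSE, ANALYZE) orders;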

Redis falling over

When Redis runs out of memory or crashes, connections may time out, slowing API calls and increasing load on the database. Use the CloudWatch Infra dashboard to monitor Redis memory usage.

PostgreSQL is in recovery mode

Run the following to check if an instance is in recovery:

SELECT pg_is_in_recovery();

If true, the instance is a standby (reader) or is replaying the write-ahead log (WAL) after a failure. Important: this query will always return true on a reader, so make sure you're querying the writer instance for accurate diagnostics.
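
On a standby that is replaying WAL, the approximate replay lag can also be checked. Note that Aurora replicas do not use standard WAL shipping and surface lag through the AuroraReplicaLag CloudWatch metric instead, so this may return NULL there:

-- NULL on a primary (nothing is being replayed); behaviour on Aurora
-- replicas may differ.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;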

Attempt a cure

The specific cure will depend heavily on the problem. Some will require a small response, while others might only be solved by restarting the database.

There are two types of responses to impaired database performance: immediate and delayed.

Immediate responses carry more risk and consequences, but sometimes they're unavoidable, such as when runaway transactions or memory pressure make the database unstable. Delayed responses address the root cause once things are stable again. Depending on the problem, we may need to:

  • Cancel or kill idle/blocking transactions
  • Restart the database (in extreme cases)
  • Optimise troublesome queries
  • Add missing indices
  • Rework application logic

Cancel a query

Cancelling or killing a query is a drastic step. Stopping a data-modifying query can lead to data loss, and even cancelling a SELECT query might cause related operations (like POST or PATCH) to fail. We should only cancel or kill a query when we are certain it relates to a GET request or something that won't compromise data integrity.

However, if a problematic data-modifying query is stuck and is causing others to queue behind it, we may need to consider terminating it. There are no hard rules—each case must be assessed on its own.

Always copy the query text before cancelling or terminating a process so we can trace affected objects afterwards.

To cancel a query, get the process ID and run:

SELECT pg_cancel_backend(<pid>);

If cancellation fails or takes too long, you may need to terminate the process instead:

SELECT pg_terminate_backend(<pid>);

Only use pg_terminate_backend if cancelling repeatedly fails. This command has fewer safeguards and should be used with caution.
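
As one sketch (the 15-minute threshold is an assumption; pick one that suits the project), sessions stuck idle in a transaction can be listed and, if necessary, terminated:

-- Sessions idle inside a transaction; these hold locks and snapshots open.
SELECT pid,
       usename,
       application_name,
       now() - state_change AS idle_for,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > INTERVAL '15 minutes'
ORDER BY idle_for DESC;

-- Cancelling has no effect on an idle session (there is no running query to
-- cancel), so clearing these requires terminating the backends.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > INTERVAL '15 minutes';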

Restart the database

In rare cases, restarting the database is the only way to restore functionality. This often happens when expensive queries fill up the buffer memory and IO wait events continue growing, even after the original queries are cancelled.

If you reach this point, ensure that the offending queries are identified and fixed as soon as possible — otherwise the issue will simply recur. By the time we consider a restart, root cause analysis should already be complete.

Optimise troublesome queries

Query optimisation is a deep topic. At a basic level:

  • Run EXPLAIN or EXPLAIN ANALYSE (note: ANALYSE executes the query)
  • Look for expensive operations in the execution plan
  • Rewrite the query to avoid costly steps, where possible

Sometimes there's no “better” way to write the query, in which case adding an index or adjusting the application logic may be the only solution.
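
As a simple illustration with hypothetical tables, a NOT IN subquery that forces a slow plan can often be rewritten as NOT EXISTS, which lets the planner use an anti-join:

-- Before: can produce a row-by-row plan (and behaves oddly if the subquery
-- returns NULLs).
SELECT * FROM orders
WHERE customer_id NOT IN (SELECT customer_id FROM blocked_customers);

-- After: same intent, planner-friendly.
SELECT o.*
FROM orders AS o
WHERE NOT EXISTS (
    SELECT 1 FROM blocked_customers AS b WHERE b.customer_id = o.customer_id
);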

Add indices

A well-placed index can drastically speed up a slow query. But be cautious: adding too many indices, or the wrong one, can confuse the planner and slow down writes.

Always inspect the query plan first using EXPLAIN ANALYSE before creating an index.

Important: Always use CREATE INDEX CONCURRENTLY to avoid blocking writes to the table, especially in high-traffic, write-heavy environments.
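
A minimal sketch with hypothetical names (note that CONCURRENTLY cannot run inside a transaction block, and a failed build leaves an INVALID index that must be dropped and recreated):

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
    ON orders (customer_id);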

For detailed guidance and examples, refer to the indexing best practices guide.

Rework application logic

In some edge cases, the only solution is to change how the application interacts with the database. This might mean batching queries, pre-calculating results (rollups), or caching frequently requested data.
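
For example, a rollup could be maintained as a materialized view (table, column, and view names here are hypothetical):

-- Pre-calculate daily order totals instead of aggregating on every request.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT date_trunc('day', created_at) AS day,
       count(*) AS order_count,
       sum(total) AS revenue
FROM orders
GROUP BY 1;

-- A unique index allows non-blocking refreshes.
CREATE UNIQUE INDEX ON daily_order_totals (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals;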

Each scenario requires a tailored approach, depending on the specific problem and data access patterns.