Database Connection Pool Exhaustion
Executive Summary
On January 10, 2026, our API experienced 29 minutes of degraded performance due to database connection pool exhaustion. Approximately 12,400 users were affected, with 23% of requests failing at the incident peak. The root cause was a connection leak introduced in a recent deployment of our user synchronization service: when errors occurred, database connections were not released, and the pool was exhausted under normal traffic load. We rolled back the affected service within 15 minutes of the first alert, and all systems returned to normal operation by 14:52 UTC. We are implementing additional safeguards, including automated connection leak detection and updated deployment checklists, to prevent recurrence.
Timeline
- 14:23 UTC: Monitoring alerts fired; API response times spiked above 5 seconds and the error rate climbed to 23%.
- ~14:35 UTC: On-call engineer traced the failures to a connection leak in the user-sync service (about 12 minutes after the first alert).
- ~14:38 UTC: Rollback of user-sync to v2.3.1 initiated.
- 14:52 UTC: Connection pool recovered; all systems returned to normal operation.
Root Cause
A connection leak in the user-sync service, introduced in PR #2847. The error-handling path failed to release PostgreSQL connections back to the pool, causing exhaustion under normal load. Specifically, the catch block in the syncUserData() function was missing a connection.release() call, so any failed sync operation permanently consumed a connection from the pool.
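For illustration, here is a minimal sketch of the leak and the fix, assuming the service uses node-postgres (pg) and that syncUserData() checks a client out of a shared pool. The query, table, and column names are hypothetical, not taken from the actual service.

```typescript
import { Pool, PoolClient } from "pg";

const pool = new Pool({ max: 20 }); // pool size is illustrative

// Leaky pattern (simplified): the client is released only on the success
// path, so every failed sync permanently consumes a pool slot.
async function syncUserDataLeaky(userId: string): Promise<void> {
  const client: PoolClient = await pool.connect();
  try {
    await client.query("UPDATE users SET synced_at = now() WHERE id = $1", [userId]);
    client.release();
  } catch (err) {
    console.error(`sync failed for ${userId}`, err);
    // BUG: no client.release() here, so the connection never returns to the pool.
  }
}

// Fixed pattern: release in a finally block so the client is returned
// to the pool on both the success and the error path.
async function syncUserData(userId: string): Promise<void> {
  const client: PoolClient = await pool.connect();
  try {
    await client.query("UPDATE users SET synced_at = now() WHERE id = $1", [userId]);
  } catch (err) {
    console.error(`sync failed for ${userId}`, err);
    throw err;
  } finally {
    client.release();
  }
}
```

For single statements, pool.query() is another option, since it checks out and returns the client automatically.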
Mitigations
- Rolled back user-sync service to previous stable version (v2.3.1)
- Increased connection pool timeout to 30 seconds to prevent cascading failures
- Manually cleared stuck connections via pg_terminate_backend()
- Enabled enhanced connection pool monitoring with alerts at 80% pool utilization (a sketch of the monitoring check and the cleanup query follows this list)
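As a rough illustration of the last two mitigations, the sketch below uses node-postgres pool counters to warn at 80% utilization and shows the general shape of a pg_terminate_backend() cleanup query. The pool size, the 10-second check interval, the 5-minute idle cutoff, and the reading of the 30-second timeout as the connection acquisition timeout are all assumptions, not details from the incident.

```typescript
import { Pool } from "pg";

// Assumed pool configuration; the report does not state the pool size,
// and we interpret the 30-second timeout as the acquisition timeout.
const POOL_MAX = 20;
const pool = new Pool({ max: POOL_MAX, connectionTimeoutMillis: 30_000 });

// Warn when more than 80% of pool slots are checked out. node-postgres
// exposes totalCount, idleCount, and waitingCount on the Pool instance.
function checkPoolUtilization(threshold = 0.8): void {
  const inUse = pool.totalCount - pool.idleCount;
  const utilization = inUse / POOL_MAX;
  if (utilization >= threshold) {
    console.warn(
      `connection pool at ${Math.round(utilization * 100)}% ` +
        `(${inUse}/${POOL_MAX} in use, ${pool.waitingCount} waiting)`
    );
  }
}
setInterval(checkPoolUtilization, 10_000);

// Shape of the manual cleanup run during the incident: terminate backends
// left idle in a transaction for more than five minutes (cutoff is ours).
async function terminateStuckBackends(): Promise<void> {
  await pool.query(`
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
      AND state_change < now() - interval '5 minutes'
      AND pid <> pg_backend_pid()
  `);
}
```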
Action Items
- Audit connection handling across all services
- Add connection pool monitoring to the deployment checklist
- Add automated CI checks for connection leaks
- Write a runbook for connection pool exhaustion
To Our Users
We sincerely apologize for the disruption to your service. We understand that reliable access to your data is critical to your operations, and we fell short of that expectation today. Our team is committed to implementing the action items above to prevent this from happening again. Thank you for your patience and continued trust in our platform.
Hacker News Post (Ready to Copy)
Show HN: Our database connection pool incident, full postmortem

Yesterday our API was degraded for 29 minutes. Root cause: a missing connection.release() in an error handler. Classic.

We're publishing our full postmortem because we believe in transparency. 12,400 users were affected when a code change in our user-sync service leaked database connections until the pool was exhausted.

Lessons learned:
- Always release connections in a finally block, not just on the success path
- Monitor connection pools, not just queries
- Roll back fast, investigate later

Full incident report with timeline, root cause, and action items: [link]

We hope this helps others avoid the same mistake.
Twitter Thread (Ready to Post)
Thread: Yesterday we had 29 minutes of degraded service affecting 12,400 users. Here's our full, transparent postmortem.
At 14:23 UTC, alerts fired. API response times spiked to 5+ seconds. Error rate hit 23%. Our on-call engineer jumped in immediately.
Root cause? A missing connection.release() in an error handler. One missing line of code, deployed the day before. When errors happened, connections leaked until the pool was exhausted.
We identified the issue in 12 minutes, rolled back in 3 more. By 14:52 UTC, all systems were nominal. Total: 29 minutes of degraded service.
Action items we're implementing:
- Connection handling audit
- Pool monitoring in the deploy checklist
- CI checks for connection leaks
- A new runbook for this failure mode
We're sharing this because transparency builds trust. Check out our full postmortem with complete timeline: [link]. Hope this helps other teams catch similar issues before they become outages.
Download Pack Includes
- postmortem.md
- timeline.svg
- hn-post.txt
- twitter-thread.txt
This report was generated in under 2 minutes.
Paste your incident timeline and get the same professional output.
Generate Your Postmortem
Generated with IncidentPost: professional incident postmortems in minutes.