On January 12th at 14:09 UTC, our engineers began responding to alerts for our click / open tracking services and primary MongoDB clusters located in one of our production regions. Within ten minutes, the engineering team had determined that our MongoDB secondary servers, which are used for most read operations, were under heavy load and failing to serve requests in a timely manner. This issue was also responsible for the elevated errors with our tracking services. While reviewing our metrics, we observed a gradual increase in connections to these servers that we determined to be the cause of the performance degradation.
This increase in connections closely coincided with several MongoDB configuration changes that were made in the environment during the previous day. The engineering team chose to rollback these changes beginning at 15:15 UTC by preparing and deploying several new pull requests.
Due to human error, we both reversed these changes and inadvertently introduced a new change that caused our API and SMTP services to move from using the secondaries to the primary. As a result, our MongoDB primary server was unable to keep up with the request volume and started experiencing stability issues. This caused our API, SMTP, and website to begin experiencing an elevated error rate beginning at 15:55 UTC.
Between 16:00 and 19:00 UTC, the engineering team continued troubleshooting and made several additional changes, including resizing our MongoDB servers to improve our ability to handle the larger number of connections and deployed connection limits at the server level to try to help stabilize the servers while searching for a root cause. These changes allowed us to serve approximately 75% of our typical request throughput for our API in this region while we continued to investigate the underlying issue.
After review of our MongoDB configurations, we deployed limits on the number of connections our services could establish to our primary MongoDB server. We prepared these changes and began deployment at 20:28 UTC along with re-deploying our configuration changes to prefer secondary servers for reads. These configuration changes stabilized our MongoDB server and resulted in our services in this region resuming normal operations at 20:35 UTC.
After the incident, the engineering team recovered the logs from our MongoDB cluster in order to determine a root cause for the increase in connections. Our logs indicated a broad degradation of the cluster performance and did not reveal any clear set of queries that caused the cluster’s performance to begin degrading or the connection count to rise. While this is unfortunate, we believe that the connection limits we imposed on each of our services that rely on this database will prevent a similar issue from occurring in the future.
Deployment of Redundant Tracking Infrastructure (Complete) - Our tracking services have previously operated out of a single region. As part of previously planned infrastructure upgrades, our tracking infrastructure now operates across two independent regions.
Connection Limits (Complete) – During this incident and after the incident had concluded, we rolled out changes across all of our services that limited the number of connections that could be opened to limit the damage a single service could cause to other services that rely on this database.
Increased Size of Primary Mongo Cluster (Complete) – We made a permanent increase in the size of our primary Mongo cluster and rolled this change out to our other data centers.
Improved Failure Handling for Tracking Services (Planned) – We will be deploying improvements to our tracking services that will allow click redirection to still operate properly in the event of the failure of downstream services.