RMM - Zinfandel - Degraded performance and Internal Server Errors when navigating/logging in to the Web Interface
Incident Report for Datto
Postmortem

On 18th April 2024 at 18:18 UTC, partners on the Zinfandel Platform (US West) experienced slow loading times and internal server errors when attempting to log in to or navigate the Web Interface.

The root cause of this service interruption was performance degradation of the main database serving the Web Interface.

This degradation was caused by a networking issue that stopped the routine process that monitors for and remediates load balancer instance issues from running. Because unhealthy load balancer instances remained in the pool instead of being replaced, all regular request volume was concentrated through the small number of remaining healthy instances, which overwhelmed the main database and caused it to run out of processing resources.

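As a rough illustration of the concentration effect described above, the following sketch computes how much traffic each healthy instance must absorb when unhealthy instances stay in a pool, and flags the pool when too few instances are healthy. It is not our production tooling: the `Instance` type, the function names, the pool size, the request rate, and the 50% threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str      # hypothetical instance identifier
    healthy: bool  # result of the most recent health check


def requests_per_healthy_instance(pool: list[Instance], total_rps: float) -> float:
    """Return the request rate each healthy instance must absorb."""
    healthy = [i for i in pool if i.healthy]
    if not healthy:
        raise ValueError("no healthy instances left in the pool")
    return total_rps / len(healthy)


def should_alert(pool: list[Instance], min_healthy_fraction: float = 0.5) -> bool:
    """Flag the pool when too few of its instances are healthy."""
    if not pool:
        return True  # an empty pool cannot serve traffic at all
    healthy = sum(1 for i in pool if i.healthy)
    return healthy / len(pool) < min_healthy_fraction


# Illustrative numbers only: 8 instances, of which 2 are still healthy.
pool = [Instance(f"lb-{n}", healthy=(n < 2)) for n in range(8)]
per_instance = requests_per_healthy_instance(pool, total_rps=4000.0)
alert = should_alert(pool)
```

With these example values, each healthy instance handles four times the traffic it would carry in a fully healthy pool, and the check raises an alert because only a quarter of the pool is healthy.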

Auto-scaling elsewhere in the infrastructure began shortly after the incident alerts were received and started recovering the service. Our R&D Team expedited this automatic process through manual intervention, and the issue was resolved by 19:07 UTC.

To reduce the risk of functional impact if a similar event occurs in the future, we are reviewing options for implementing further monitoring and alerting on the load balancer network itself.

Posted Apr 24, 2024 - 14:11 UTC

Resolved
This incident has been resolved.
Posted Apr 18, 2024 - 19:30 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 18, 2024 - 18:10 UTC
Investigating
Our teams are currently investigating 50X errors for Datto RMM on Zinfandel.

Thank you for your patience!
Posted Apr 18, 2024 - 17:42 UTC
This incident affected: Datto RMM (Zinfandel (US West)).