Monitoring EmpowerID SaaS
This article is an introductory overview of the availability monitoring processes EmpowerID follows for its SaaS environments. It does not cover every aspect of the Site Reliability Engineering or Security Information and Event Management work EmpowerID performs. Instead, it explains the processes the DevOps team uses to maintain a baseline level of service with minimal impact on end users.
The article focuses on availability monitoring. It aims to help SaaS customers understand how EmpowerID monitors their environments, and to help the in-house operations teams of non-SaaS customers achieve parity in their own deployments. EmpowerID DevOps divides availability monitoring into three areas: front-end services, back-end services, and the underlying infrastructure.
Front-End Monitoring
To monitor site availability, EmpowerID DevOps verifies that the main web applications load without errors. Azure Monitor checks up to three URLs every two minutes per Azure region:
- Core Login (https://<core-domain>/WebIdpForms/Login/Portal)
- IAM Shop (https://<iamshop-domain>), if applicable
- My Identity (https://<myid-domain>), if applicable
Each request must succeed. After three consecutive failures, a high-priority alert is raised and handled by the EmpowerID DevOps team.
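The consecutive-failure rule can be sketched as a small Python probe that tracks failure streaks per URL. This is an illustrative sketch, not EmpowerID's configuration: in production the checks are performed by Azure Monitor, and the names `probe`, `AvailabilityMonitor`, and `run_cycle`, the 10-second timeout, and treating any 2xx/3xx final status as success are assumptions. It also models a single probe location rather than one per Azure region.

```python
import http.client
import urllib.error
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Placeholder URLs copied from the list above; substitute real host names.
TARGETS: List[str] = [
    "https://<core-domain>/WebIdpForms/Login/Portal",
    "https://<iamshop-domain>",
    "https://<myid-domain>",
]

CHECK_INTERVAL_SECONDS = 120  # one probe cycle every two minutes
FAILURE_THRESHOLD = 3         # consecutive failures before a high-priority alert


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL loads with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError (4xx/5xx) are OSError subclasses;
        # malformed URLs raise ValueError.
        return False


@dataclass
class AvailabilityMonitor:
    threshold: int = FAILURE_THRESHOLD
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, url: str, succeeded: bool) -> bool:
        """Record one probe result; return True only when the alert should fire."""
        if succeeded:
            self.failures[url] = 0  # any success resets the streak
            return False
        self.failures[url] = self.failures.get(url, 0) + 1
        # Alert once, at the moment the streak reaches the threshold.
        return self.failures[url] == self.threshold


def run_cycle(monitor: AvailabilityMonitor,
              check: Callable[[str], bool] = probe) -> List[str]:
    """Probe every target once and return the URLs that need an alert."""
    return [url for url in TARGETS if monitor.record(url, check(url))]
```

A scheduler (for example, a loop calling `time.sleep(CHECK_INTERVAL_SECONDS)` between cycles) would call `run_cycle` every two minutes. Passing `check` as a parameter keeps the alerting rule testable without network access.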
In addition to active front-end monitoring, passive error-rate monitoring is optionally performed for large user bases where the EmpowerID UI is used heavily. Azure Application Gateway provides a failed-requests metric; if the error rate exceeds 5% and stays above that threshold for more than five minutes, a high-priority alert is raised.
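The sustained error-rate rule can be sketched as a small state machine fed with periodic metric samples. The sample shape (a timestamp in seconds plus total and failed request counts) and the class name `ErrorRateMonitor` are assumptions for illustration; in practice Azure Monitor evaluates the Application Gateway failed-requests metric itself.

```python
from dataclasses import dataclass
from typing import Optional

ERROR_RATE_THRESHOLD = 0.05  # 5% failed requests
SUSTAIN_SECONDS = 300        # must stay above the threshold for more than five minutes


@dataclass
class ErrorRateMonitor:
    threshold: float = ERROR_RATE_THRESHOLD
    sustain_seconds: float = SUSTAIN_SECONDS
    breach_started: Optional[float] = None
    alerted: bool = False

    def observe(self, timestamp: float, total: int, failed: int) -> bool:
        """Feed one metric sample; return True when a high-priority alert should fire."""
        rate = failed / total if total else 0.0
        if rate <= self.threshold:
            # Back at or under the threshold: reset the breach window.
            self.breach_started = None
            self.alerted = False
            return False
        if self.breach_started is None:
            self.breach_started = timestamp  # breach window opens
        duration = timestamp - self.breach_started
        if duration > self.sustain_seconds and not self.alerted:
            self.alerted = True  # fire once per breach
            return True
        return False
```

With one-minute samples, a breach that starts at t=0 first exceeds five minutes at t=360, so the alert fires on the seventh consecutive bad sample.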
EmpowerID DevOps previously relied on automated web tests that performed a sequence of actions, such as simulating a user login. This facet of front-end availability monitoring is being retired because it reported nothing beyond what the monitoring described above already provided. It is mentioned here in case a need for such UI-driven process monitoring arises.
Back-End Monitoring
EmpowerID's identity lifecycle automation is often the primary reason clients use the platform, so monitoring back-end processes is critical to keeping the system functional. EmpowerID stores all vital information, including process state, in a single database, which makes a simple yet effective mechanism for reporting process health possible.
A stored procedure called Z_EmpowerID_Health checks process state information against predefined criteria and outputs a list of problematic conditions requiring attention. A complete listing of these health checks and their configurations is available at EmpowerID HealthCheck SQL Procedure.
To run these checks, EmpowerID DevOps deploys a monitoring container that invokes the health-check procedure every five minutes and submits any reported problem conditions to Azure Monitor. A medium-priority alert is raised when the same problem condition is reported in consecutive polling intervals. In this way, EmpowerID DevOps continually monitors all of EmpowerID's back-end processes to maintain overall system health.
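The polling logic can be sketched as a tracker that counts how many consecutive polls have reported each problem condition. The article does not state how many consecutive intervals trigger an alert, so `CONSECUTIVE_POLLS_TO_ALERT = 2` is an assumption, as are the condition names used in the example, the class name `HealthCheckTracker`, and the exact `EXEC` statement mentioned in the comment. Invoking the procedure and forwarding results to Azure Monitor are left out.

```python
from typing import Dict, Iterable, List

POLL_INTERVAL_SECONDS = 300     # procedure invoked every five minutes
CONSECUTIVE_POLLS_TO_ALERT = 2  # assumption: the article does not state the count

# In a real container, each poll would run something like
# "EXEC dbo.Z_EmpowerID_Health" (schema name assumed) and read the
# reported problem conditions from the result set.


class HealthCheckTracker:
    def __init__(self, polls_to_alert: int = CONSECUTIVE_POLLS_TO_ALERT) -> None:
        self.polls_to_alert = polls_to_alert
        self.streaks: Dict[str, int] = {}

    def process(self, conditions: Iterable[str]) -> List[str]:
        """Take one poll's conditions; return those that now warrant a medium-priority alert."""
        current = set(conditions)
        # Conditions missing from this poll have cleared, so their streaks are dropped.
        self.streaks = {c: self.streaks.get(c, 0) + 1 for c in current}
        # Alert once, when a condition's streak reaches the threshold.
        return sorted(c for c, n in self.streaks.items() if n == self.polls_to_alert)
```

For example, with hypothetical conditions, a `"JobQueueBacklog"` reported in two polls in a row is returned on the second poll, while a condition seen only once is not.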
Infrastructure Monitoring
EmpowerID SaaS is hosted on Azure and uses several services, such as Azure Kubernetes Service (AKS) and Azure SQL Database (a database-as-a-service offering). EmpowerID DevOps monitors specific metrics for each service to detect issues proactively, before they affect front-end and back-end services.
Metrics for SQL Database Monitoring
Metrics monitored for Azure SQL Database include:
- Remaining free space: less than 15% raises a medium-severity alert
- Deadlocks: more than three deadlocks within ten minutes raises a high-severity alert
- CPU utilization: an average above 90% raises a medium-severity alert
Each metric has its own threshold, and the severity of the resulting alert (medium or high) determines how it is handled.
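The SQL Database thresholds can be expressed as data and evaluated against one metric sample. The metric keys and the `MetricRule`/`evaluate` names are hypothetical, and the sketch assumes values arrive pre-aggregated (free space as a percentage, deadlocks counted over the last ten minutes, CPU averaged over a window the article does not specify). In production these thresholds live in Azure Monitor alert rules rather than in code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class MetricRule:
    name: str                         # hypothetical metric key
    breached: Callable[[float], bool]  # threshold test for one value
    severity: str                     # "medium" or "high"


# Thresholds taken from the list above.
RULES: List[MetricRule] = [
    MetricRule("free_space_percent", lambda v: v < 15, "medium"),
    MetricRule("deadlocks_last_10_min", lambda v: v > 3, "high"),
    MetricRule("avg_cpu_percent", lambda v: v > 90, "medium"),
]


def evaluate(sample: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, severity) for every rule the sample breaches."""
    return [(r.name, r.severity)
            for r in RULES
            if r.name in sample and r.breached(sample[r.name])]
```

Keeping thresholds as data makes each rule's severity explicit and easy to review alongside the alert-handling policy.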
Alert Handling
EmpowerID uses Azure Monitor to aggregate metrics, evaluate rules, and raise alerts. Actions configured in Azure Monitor forward alerts to Atlassian Opsgenie, which pages EmpowerID DevOps personnel. EmpowerID handles alerts according to their severity:
- High-severity alerts: On-call personnel are paged at any time of day, and the alert is escalated if it is not acknowledged.
- Medium-severity alerts: Personnel are paged during waking hours, allowing timely follow-up.
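The severity-based paging policy can be sketched as two small decisions: whether to page now, and whether an unacknowledged page should escalate. The waking-hours window (08:00 to 22:00 in the on-call person's local time), the 15-minute acknowledgement timeout, and the function names are assumptions, since the article does not define them. In practice this behavior would typically be configured in Opsgenie (for example, through its schedules and escalation policies) rather than written as code.

```python
from datetime import datetime, time

# Assumed values; the article does not define waking hours or the ack timeout.
WAKING_START = time(8, 0)
WAKING_END = time(22, 0)
ACK_TIMEOUT_MINUTES = 15


def should_page_now(severity: str, now: datetime) -> bool:
    """Decide whether an alert of this severity pages someone at local time `now`."""
    if severity == "high":
        return True  # paged regardless of the time of day
    if severity == "medium":
        # Paged only inside the waking-hours window.
        return WAKING_START <= now.time() < WAKING_END
    raise ValueError(f"unknown severity: {severity!r}")


def needs_escalation(severity: str, acknowledged: bool,
                     minutes_since_page: float,
                     ack_timeout: float = ACK_TIMEOUT_MINUTES) -> bool:
    """Escalate an unacknowledged high-severity page once the timeout passes.

    The article mentions escalation only for high-severity alerts.
    """
    return (severity == "high" and not acknowledged
            and minutes_since_page >= ack_timeout)
```

The sketch does not model what happens to a medium-severity alert raised outside waking hours; how such alerts are queued is not described here.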