We are seeing some degradation in the India region and are actively working on it.
We are seeing some degradation in the India region and are actively working on it.
Customer updates
The failure was caused by Azure SQL worker/request exhaustion on the SQL elastic pool in the India region. The issue is resolved, and we are actively monitoring.
We are seeing some degradation in the India region and are actively working on it.
Post-incident report
The immediate cause of the incident was a failure in the internal autoscaling automation responsible for scaling regional
Azure SQL elastic pools. The automation service became unhealthy because one of its background maintenance
workers encountered memory exhaustion while processing a large recovery workload. This caused repeated application
process exits and health probe failures in the automation service runtime.
Because the autoscaling service was not healthy during the India incident window, scale-up actions that would normally
add database pool compute capacity were not completed in time. The affected pool therefore operated with insufficient
headroom while production traffic continued, causing elevated compute and worker utilization and resulting in slower
responses for some users.
The issue was not caused by a customer-side change, a specific tenant database, Azure SQL storage saturation,
transaction log saturation, or an Azure platform outage. The trigger was an internal automation service reliability issue
combined with insufficient regional pool headroom while the automation service was unavailable.
Optimize Autoscaler Background Recovery Processing: Avoid memory exhaustion by paging large recovery queries and preventing full materialization of large operation and log datasets. Status: In Progress.
Separate Autoscaling Control Path from Maintenance Workers: Ensure unrelated background maintenance workloads cannot impact the autoscaling function. Status: Planned.
Add High-Severity Alerts for Autoscaler Crash Loops and Failed Health Probes: Detect autoscaler unavailability before scale actions are missed. Status: In Progress.
Add Alerts for Skipped, Delayed, or Stuck Database Pool Scale Executions: Identify when expected scale actions are not completed within the required window. Status: Planned.
Increase Autoscaler Runtime Headroom and Review Replica Strategy: Reduce the risk of CPU and memory saturation in the automation service and improve service resilience. Status: Planned.
Define Regional Manual Scale Runbook and Escalation Path: Ensure rapid manual mitigation when automation is unavailable. Status: Completed.
Review Regional Pool Headroom Thresholds: Reduce the risk of recurrence during peak traffic while autoscaling is unavailable or delayed. Status: In Progress.
Post-Incident Monitoring Window: Continue monitoring the India pool after mitigation to confirm sustained stability. Status: In Progress.
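The first action item above (paging large recovery queries instead of materializing the full result set) could look roughly like the sketch below. It is an illustration only, not the autoscaler's actual code: the `operations` table, its columns, the helper names, and the page size are all hypothetical, and an in-memory SQLite database stands in for the real data store.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_pages(
    conn: sqlite3.Connection, page_size: int = 500
) -> Iterator[List[Tuple[int, str]]]:
    """Yield rows from the hypothetical `operations` table one page at a time.

    Uses keyset pagination: each query resumes after the last id seen, so at
    most `page_size` rows are held in memory at once.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM operations WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last row of this page


def build_demo_db(row_count: int) -> sqlite3.Connection:
    """Create an in-memory stand-in for the recovery backlog (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE operations (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO operations (payload) VALUES (?)",
        [(f"op-{i}",) for i in range(row_count)],
    )
    conn.commit()
    return conn
```

A recovery worker would then process each page inside `for page in iter_pages(conn): ...` and release it before fetching the next one, which keeps memory use bounded by the page size rather than by the backlog size.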
We are seeing some degradation in the SEA region and are actively working on it.
We are seeing some degradation in the SEA region and are actively working on it.
Customer updates
The failure was caused by Azure SQL worker/request exhaustion on the SQL elastic pool in the SEA region. The issue is resolved, and we are actively monitoring.
We are seeing some degradation in the SEA region and are actively working on it.
Post-incident report
The immediate cause of the incident was a failure in the internal autoscaling automation responsible for scaling regional
Azure SQL elastic pools. The automation service became unhealthy because one of its background maintenance
workers encountered memory exhaustion while processing a large recovery workload. This caused repeated application
process exits and health probe failures in the automation service runtime.
Because the autoscaling service was not healthy during the SEA incident window, scale-up actions that would normally
add database pool compute capacity were not completed in time. The affected pool therefore operated with insufficient
headroom while production traffic continued, causing elevated compute and worker utilization and resulting in slower
responses for some users.
The issue was not caused by a customer-side change, a specific tenant database, Azure SQL storage saturation,
transaction log saturation, or an Azure platform outage. The trigger was an internal automation service reliability issue
combined with insufficient regional pool headroom while the automation service was unavailable.
Optimize Autoscaler Background Recovery Processing
Action: Avoid memory exhaustion by paging large recovery queries and preventing full materialization of large operation and log datasets.
Status: In Progress
Separate Autoscaling Control Path from Maintenance Workers
Action: Ensure unrelated background maintenance workloads cannot impact the autoscaling function.
Status: Planned
Add High-Severity Alerts for Autoscaler Crash Loops and Failed Health Probes
Action: Detect autoscaler unavailability before scale actions are missed.
Status: In Progress
Add Alerts for Skipped, Delayed, or Stuck Database Pool Scale Executions
Action: Identify when expected scale actions are not completed within the required window.
Status: Planned
Increase Autoscaler Runtime Headroom and Review Replica Strategy
Action: Reduce the risk of CPU and memory saturation in the automation service and improve service resilience.
Status: Planned
Define Regional Manual Scale Runbook and Escalation Path
Action: Ensure rapid manual mitigation when automation is unavailable.
Status: Completed
Review Regional Pool Headroom Thresholds
Action: Reduce the risk of recurrence during peak traffic while autoscaling is unavailable or delayed.
Status: In Progress
Post-Incident Monitoring Window
Action: Continue monitoring the SEA regional pool after mitigation to confirm sustained stability.
Status: In Progress
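The alert for skipped, delayed, or stuck scale executions listed above amounts to a deadline check on each requested scale action. The sketch below is one possible shape for that check, under stated assumptions: the `ScaleExecution` record, its fields, and the `find_overdue` helper are hypothetical and do not reflect the automation service's real data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class ScaleExecution:
    """Hypothetical record of one requested pool scale action."""
    pool: str
    requested_at: datetime
    completed_at: Optional[datetime] = None  # None means still pending


def find_overdue(
    executions: Iterable[ScaleExecution],
    now: datetime,
    max_duration: timedelta,
) -> List[ScaleExecution]:
    """Return executions that took, or are taking, longer than max_duration.

    Pending executions are measured against `now`, so a stuck action is
    flagged even though it never reports completion.
    """
    overdue = []
    for execution in executions:
        end = execution.completed_at if execution.completed_at is not None else now
        if end - execution.requested_at > max_duration:
            overdue.append(execution)
    return overdue
```

Running this check from a component that does not share a process with the autoscaler matters here: during this incident the autoscaler itself was unhealthy, so an alert evaluated inside it would not have fired.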
Currently observing performance degradation in the India region.
We are seeing some degradation in the India region and are actively working on it.
Customer updates
The issue is mitigated. We are actively monitoring the instances.
We are seeing some degradation in the India region and are actively working on it.
Post-incident report
The root cause was an Azure platform-side availability degradation on the public Azure Load Balancer
alb-monolith-india-api, which fronts the backend service path used by the Application Gateway apipool
configuration. Azure Resource Health classified the event as Unavailable / Downtime, PlatformInitiated,
Unplanned, and Transient.
As the Load Balancer data path degraded, Application Gateway health and routing for the apipool
upstream became unavailable. The gateway therefore returned 502 responses with
ERRORINFO_UPSTREAM_NO_LIVE instead of forwarding those requests to the backend service
instances.
Backend VM availability metrics did not show a regional compute outage as the initiating condition.
VMSS autoscale activity occurred after the Load Balancer health event had already started, and
automatic VM repair activity occurred after service recovery was already underway. This timing
indicates that autoscale and repair activity were response or recovery signals, not the initial root cause.
The India region availability incident is resolved. Monitoring remains active for Application Gateway
failed requests, backend health, Azure Load Balancer availability, VMSS availability, and regional
synthetic probes.
Currently observing performance degradation in Indonesia region. Investigation is in progress.
Currently observing performance degradation in the Indonesia region container services.
Customer updates
Starting at 05:24 UTC on 21 May 2026, Azure experienced a service disruption affecting the Azure Container Registry GetToken API in the Southeast Asia region. This caused authentication failures or delays when obtaining access tokens for registry operations and affected our services. We temporarily changed the registry region, and the services are now up and working.
Services are back to normal, and we are actively monitoring them.
Currently observing performance degradation in the Indonesia region container services.
Post-incident report
The issue was caused by an Azure-side underlying storage dependency failure, which degraded backend operations and resulted in increased server errors for Azure Container Registry requests.
Azure has mitigated the issue, and we will monitor the service.
Service Degradation Impacting Some Microservices Across Regions
Some microservices may experience slowness, intermittent failures, or degraded performance across affected regions.
Customer updates
The SSL certificate pinning configuration has been updated for all services, and access is restored. The updated app has been submitted for review, and we will release it once it is approved.
Web access has been restored and is working normally across the affected services. We are continuing to work with the mobile team to resolve SSL pinning failures impacting some white-labeled mobile apps. Further updates will be shared as soon as the mobile fix is validated.
We have identified an additional impact affecting some mobile applications that use SSL certificate pinning. The affected endpoints are now serving the updated valid TLS certificate. However, some white-labeled mobile applications appear to have pinned the previous certificate or public key, causing SSL pinning validation failures after the certificate reissue. Browser access and non-pinned clients are recovering after the certificate binding refresh. Mobile applications with certificate pinning may continue to experience connectivity failures until their pin configuration is updated to trust the new certificate/public key. Our teams are working on remediation options for impacted white-labeled mobile apps and will provide further updates.
Some microservices may experience slowness, intermittent failures, or degraded performance across affected regions.
Post-incident report
The incident was caused by a mismatch between the newly issued TLS certificate served by the affected endpoints and the SSL certificate/public key pinned in some white-labelled mobile applications.
Although the updated valid TLS certificate was deployed successfully, certain mobile apps continued to trust the previously pinned certificate or public key. This resulted in SSL pinning validation failures for those apps, causing mobile connectivity issues until the pinning configuration was updated.
All remediation steps have been completed. The SSL certificate pinning configuration has been updated across the required services, and access has been restored.
The remaining actions are:
App review and release
The updated mobile app build has been submitted for review. Once approved, the app will be released.
Post-release monitoring
Continue monitoring application availability, SSL handshake errors, mobile login/access issues, and customer-reported incidents after the app release.
Client communication
Inform impacted clients once the updated build is approved and released, and request them to update/upload the latest white-labelled app version where applicable.
Certificate rotation checklist
Maintain a checklist of all services and mobile applications using SSL certificate pinning to ensure they are validated during future certificate updates.
Preventive improvement
Review the SSL pinning approach and consider using public key pinning or backup pins where supported, to reduce the risk of similar issues during future certificate renewals or reissues.
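The preventive improvement above (public key pinning with backup pins) can be illustrated with a small sketch. A public key pin is commonly computed as the Base64-encoded SHA-256 hash of the certificate's SubjectPublicKeyInfo (SPKI), so it stays valid when a certificate is reissued with the same key pair, and a backup pin for a standby key allows rotation without an app release. The function names below are hypothetical, and extracting the SPKI bytes from a served certificate is assumed to happen elsewhere.

```python
import base64
import hashlib
from typing import Iterable


def spki_pin(spki_der: bytes) -> str:
    """Return the Base64 SHA-256 pin for DER-encoded SubjectPublicKeyInfo bytes."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def is_pinned(served_spki_der: bytes, pins: Iterable[str]) -> bool:
    """Accept the connection if the served key matches any configured pin.

    `pins` would hold the current key's pin plus at least one backup pin
    for a key that is kept offline until it is needed (assumption).
    """
    return spki_pin(served_spki_der) in set(pins)
```

With this approach, a certificate reissue that keeps the existing key pair, or that switches to the pre-announced backup key, passes validation in already-released app builds, whereas pinning the certificate itself fails on every reissue.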
We are currently facing performance degradation in the India region and are investigating.
Currently observing performance degradation in the India region. Investigation is in progress.
Customer updates
Currently observing performance degradation in the India region. Investigation is in progress.
Performance Degradation in India Region
Users in the India region experienced temporary slowness on multiple pages, such as the module listing page and app login.
Customer updates
Performance is back to normal. We are analysing the reason for this degradation and are actively monitoring the service.
Users in the India region experienced temporary slowness on multiple pages, such as the module listing page and app login.
Post-incident report
The degradation was caused by an unexpected service instability triggered by changes in the runtime environment, which impacted the availability of a critical component in the production region.
1. Introduce additional safeguards and validation checks before runtime changes
2. Improve failover and recovery mechanisms to minimize impact duration
3. Schedule critical changes with enhanced validation and rollback strategies
SSO service is degraded. Users are unable to log in via SSO.
SSO service is degraded. Users are unable to log in via SSO. We are investigating.
Customer updates
The impacted component was removed from the production environment, following which the service was restored. The service is now up and running, and the system is being actively monitored to ensure stability.
A deviation from the established change management process was identified in the production environment, resulting in service downtime. Our team is working on mitigating it.
SSO service is degraded. Users are unable to log in via SSO. We are investigating.
Post-incident report
As part of ongoing efforts to deprecate the legacy SSO integration, a production change related to the older SSO configuration led to service instability for clients still using the legacy authentication mechanism. This resulted in temporary access issues for those clients.
• Affected clients will be advised to migrate from the legacy SSO to the new SSO integration to ensure continued stability and support.
• The legacy SSO will be progressively deprecated in a controlled manner to avoid future disruptions.
• Additional validation and monitoring will be implemented during the deprecation process to minimize client impact.
Performance Degradation in India Region
Users in the India region experienced temporary slowness on multiple pages, such as the module listing page and app login.
Customer updates
Users in the India region experienced temporary slowness on multiple pages, such as the module listing page and app login.
Post-incident report
During the recent release, a configuration issue in the new build resulted in an unintended app service plan change during the swap process.
As a mitigation measure, all pipelines were enhanced to perform app service swaps automatically, reducing the risk of manual configuration errors.