Monitoring Azure Functions at Scale

The Challenge of Serverless Observability

Traditional Application Performance Monitoring (APM) tools were built for a world of persistent processes. In that world, a server stays up for days or weeks, providing a stable environment for agents to collect data. Azure Functions operate in a different paradigm: instances are created on demand and torn down shortly after they go idle. This ephemeral nature makes standard observability difficult at our scale of more than 72 million checks per month.

We quickly discovered that verbose logging at this volume is prohibitively expensive. If we left Application Insights on its default settings, our cloud bill would be dominated by log ingestion costs rather than actual computation. Furthermore, cold starts create significant observability blind spots. If a function fails to start or times out during initialization, it often leaves no trace in traditional logs. You simply cannot alert on a failure you cannot see, which is why we had to build a monitoring layer that is architectural rather than just log-based.

Architecture: Decoupling with Service Bus

To handle the massive throughput required for global monitoring, we decoupled check scheduling from check execution using Azure Service Bus. In our architecture, a central scheduler writes check request messages to regional queues. Regional worker Functions then consume these messages independently and execute the actual HTTP requests. This fan-out model is what enables us to process 72M+ checks per month without a monolithic scheduler becoming a bottleneck.
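As an illustration, the fan-out can be sketched in a few lines of Python. This is a simplified model, not the production implementation: `RegionalQueues` is a hypothetical in-memory stand-in for one Service Bus queue per region, and the `CheckRequest` fields and region names are assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckRequest:
    check_id: str
    url: str
    region: str  # hypothetical region key, e.g. "paris"


class RegionalQueues:
    """In-memory stand-in for one Service Bus queue per region (assumption)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, request):
        # The scheduler only enqueues; it never executes the check itself.
        self._queues[request.region].append(request)

    def receive(self, region):
        # A regional worker reads only from its own queue.
        queue = self._queues[region]
        return queue.popleft() if queue else None

    def depth(self, region):
        return len(self._queues[region])


def schedule(queues, checks, regions):
    """Fan each (check_id, url) pair out to every region it is probed from."""
    for check_id, url in checks:
        for region in regions:
            queues.send(CheckRequest(check_id, url, region))
```

Because the scheduler's only job is to enqueue, its cost per check stays constant no matter how slow the monitored endpoints are.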

This decoupling also provides immense resilience. Each Azure region has its own dedicated queue, meaning a temporary worker outage in Singapore doesn't block workers in Paris or London. If a regional worker is overwhelmed, the messages simply stay in the queue until capacity is available. We also utilize dead-letter queues to catch failed checks that cannot be processed after multiple retries, ensuring that we never lose data and can investigate the root cause of execution failures.
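The retry-then-dead-letter behaviour can be modelled as follows. Service Bus applies this logic itself, based on the queue's maximum delivery count, so the sketch only illustrates the semantics. The limit of 3 deliveries and the `(message, deliveries)` tuple layout are assumptions for the example.

```python
from collections import deque

MAX_DELIVERIES = 3  # assumed limit for illustration


def drain(queue, dead_letters, handler, max_deliveries=MAX_DELIVERIES):
    """Process (message, deliveries) pairs until the queue is empty.

    A message whose handler keeps raising is moved to dead_letters once it
    has been delivered max_deliveries times, so it is kept for inspection
    instead of being dropped or blocking the queue.
    """
    while queue:
        message, deliveries = queue.popleft()
        try:
            handler(message)
        except Exception as exc:
            deliveries += 1
            if deliveries >= max_deliveries:
                # Keep the failure reason next to the message for later triage.
                dead_letters.append((message, repr(exc)))
            else:
                # Put it back at the end so other messages are not blocked.
                queue.append((message, deliveries))
```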

Cost Optimization Strategies

Running millions of executions on the Azure Functions Consumption plan means every millisecond and every log entry has a direct price tag. To keep costs sustainable, we implemented structured telemetry with sampling. Instead of logging every successful check in full detail, we only log complete request and response payloads when a check fails. Successes are recorded as lightweight metrics, providing the necessary data for uptime percentages without the storage overhead.
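The split between cheap success metrics and detailed failure records might look like the sketch below. The `Telemetry` class and its field names are hypothetical, and a real worker would forward these values to Application Insights rather than hold them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    # check_id -> (success count, summed latency in ms): a few bytes per check
    success_counts: dict = field(default_factory=dict)
    # Full request/response detail, stored only for failed checks
    failure_events: list = field(default_factory=list)

    def record(self, check_id, ok, latency_ms, request=None, response=None):
        if ok:
            # Success: update counters and discard the payloads.
            count, total = self.success_counts.get(check_id, (0, 0.0))
            self.success_counts[check_id] = (count + 1, total + latency_ms)
        else:
            # Failure: keep everything needed to debug the incident.
            self.failure_events.append(
                {
                    "check_id": check_id,
                    "latency_ms": latency_ms,
                    "request": request,
                    "response": response,
                }
            )

    def average_latency_ms(self, check_id):
        count, total = self.success_counts[check_id]
        return total / count
```

Uptime percentages and average latency can still be derived from the counters, while storage grows with the number of failures rather than the number of checks.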

Database interaction is another critical area for cost and performance. Writing a new row to Azure SQL for every single check would create massive IOPS pressure and latency. Instead, our workers batch-write CheckResult rows, buffering up to 50 results at a time before performing a bulk insert. Finally, choosing the right Service Bus tier was essential. While the Standard tier is sufficient for lower volumes, we transitioned to the Premium tier for high-volume production to ensure predictable latency and avoid the throughput throttling that can occur on shared infrastructure.
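A minimal sketch of the buffering logic, assuming a hypothetical `bulk_insert` callable that writes a list of rows in one round trip (in production this would be a bulk insert into Azure SQL):

```python
class BatchWriter:
    """Buffers rows and hands them to bulk_insert in groups of batch_size."""

    def __init__(self, bulk_insert, batch_size=50):
        self._bulk_insert = bulk_insert  # hypothetical callable: list of rows -> None
        self._batch_size = batch_size
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        # Also call this before the worker exits, or buffered rows are lost.
        if self._buffer:
            self._bulk_insert(list(self._buffer))
            self._buffer.clear()
```

The trade-off is durability: rows sitting in the buffer exist only in memory, so the worker needs an explicit `flush()` at the end of each invocation or on shutdown.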

Managing Regional Latency

Network latency is the enemy of accurate monitoring. To provide reliable data, we deploy our monitoring workers to the Azure regions closest to the monitored endpoints. This reduces network variance and ensures that the response times we report are representative of real-world user experience. To mitigate the impact of cold starts on critical queues, we run our most sensitive workers on the Azure Functions Premium plan, which keeps a minimum number of instances always ready and pre-warmed.

We also have to account for the latency added by the monitoring infrastructure itself. Our workers measure the exact time spent on the HTTP request and subtract the internal overhead before reporting the final result. Multi-region results are then aggregated in our central dashboard. For a check to be marked as "Healthy," we can configure rules where a minimum number of regions must respond within their specific thresholds, ensuring that a localized provider outage isn't mistaken for a global API failure.
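Both steps can be sketched in a few lines. The function names, the region keys, and the convention that `None` means "no response" are assumptions for this example, not the actual rule engine.

```python
def net_latency_ms(measured_ms, overhead_ms):
    """Subtract the worker's own overhead; never report a negative latency."""
    return max(0.0, measured_ms - overhead_ms)


def is_healthy(latencies_ms, thresholds_ms, min_regions):
    """Return True if at least min_regions responded within their thresholds.

    latencies_ms maps region -> latency in ms, or None if the region got
    no response. thresholds_ms maps region -> maximum acceptable latency.
    """
    passing = sum(
        1
        for region, latency in latencies_ms.items()
        if latency is not None and latency <= thresholds_ms[region]
    )
    return passing >= min_regions
```

With a quorum of two out of three regions, a failure seen only from one region leaves the check healthy, while a failure seen from two regions marks it as down.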

Lessons Learned

Building ContinuumNexus taught us that Application Insights custom events, rather than standard traces, are the right vehicle for structured check telemetry. They allow for much cleaner querying and better integration with our reporting dashboard. We also learned the hard way that exponential backoff retries are essential for transient network errors. Reporting a check as failed after a single 503 error leads to false positives; retrying twice over five seconds provides a much more accurate picture of true downtime.
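A retry loop along these lines is enough to filter out most transient errors. The set of retryable status codes and the base delay of 5/3 seconds are assumptions, chosen so that two retries with doubling delays (about 1.7 s, then 3.3 s) span roughly five seconds. `probe` is a hypothetical callable that performs the HTTP request and returns its status code.

```python
import time

TRANSIENT_STATUSES = {502, 503, 504}  # assumed set of retryable responses


def check_with_retries(probe, retries=2, base_delay=5 / 3, sleep=time.sleep):
    """Run probe(), retrying with exponential backoff on transient statuses.

    Returns the last status code observed. The sleep function is injectable
    so the loop can be tested without actually waiting.
    """
    status = probe()
    for attempt in range(retries):
        if status not in TRANSIENT_STATUSES:
            break
        # Delay doubles on each retry: base_delay, then 2 * base_delay, ...
        sleep(base_delay * 2 ** attempt)
        status = probe()
    return status
```

Only the final status is reported, so a single 503 followed by a 200 counts as a successful check.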

Our most important takeaway is a simple philosophy: "If you build a monitoring tool, monitor it the same way you monitor your customers' APIs." Continuum's own production API and dashboard are monitored by our own engine from multiple regions. This self-monitoring loop ensures that if our infrastructure experiences a hiccup, we are the first to know. By treating observability as a core architectural requirement rather than a post-deployment afterthought, we've built a platform that remains stable even as the execution count continues to climb.
