Beryl Analytics Blog

Anomaly Detection for Operations: Catching Problems Before Customers Do

The goal of operational anomaly detection is simple to state and hard to do well: find out something is wrong from your own systems before a customer finds out and tells you. A dropped payment provider, a stalled data pipeline, a sudden spike in failed logins, a quiet collapse in a conversion rate. Each of these has a window where you could have caught it for free. This article covers the methods that catch them, and just as importantly, the alert design that keeps your team from learning to ignore the alerts entirely.

Why naive monitoring fails

The instinct is to set a fixed threshold: alert if orders per hour drop below 100. This works until the first quiet Sunday night, when orders legitimately fall below 100 and the alert fires for no reason. Fix that by lowering the threshold, and now you miss a real 40 percent drop on a busy Tuesday because the number is still above your floor. Fixed thresholds fail because real business metrics have shape: they rise and fall with time of day, day of week, and season. Good anomaly detection has to know what normal looks like right now, not on average.
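To make that failure concrete, here is a minimal sketch of the fixed-floor rule. The order counts (80, 400, 240) and the function name are invented for illustration; only the floor of 100 and the 40 percent drop come from the example above.

```python
# All order counts below are invented for illustration.
def fixed_threshold_alert(orders_per_hour, floor=100):
    """Naive rule: alert whenever the metric drops below a fixed floor."""
    return orders_per_hour < floor

quiet_sunday_night = 80       # a normal value for that hour
busy_tuesday_usual = 400      # a normal value for a busy hour
busy_tuesday_incident = 240   # a real 40 percent drop from 400

relative_drop = (busy_tuesday_usual - busy_tuesday_incident) / busy_tuesday_usual

# Floor of 100: false alarm on Sunday, and the Tuesday incident is missed.
print(fixed_threshold_alert(quiet_sunday_night))      # True
print(fixed_threshold_alert(busy_tuesday_incident))   # False

# Floor lowered to 50: Sunday is quiet now, but Tuesday is still missed.
print(fixed_threshold_alert(quiet_sunday_night, floor=50))     # False
print(fixed_threshold_alert(busy_tuesday_incident, floor=50))  # False
```

No single floor gets both cases right, because the rule has no idea what hour or day it is looking at.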

The methods, from simplest to most powerful

1. Statistical thresholds done properly

The simplest method that actually works is a rolling statistical band. Instead of a fixed floor, compute a rolling mean and standard deviation over a recent window, and flag points that fall more than a chosen number of standard deviations from that mean (three is a common starting point). This adapts to the current level of the metric and is trivial to implement and explain. It is the right starting point for most metrics and will catch the majority of dramatic breaks. Its weakness is that it assumes the data is roughly stable within the window and that deviations are roughly symmetric around the mean, which is not true of many business metrics.
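A rolling band can be sketched in a few lines of standard-library Python. The function name, the window length, the three-sigma default, and the sample series are all assumptions chosen for illustration, not recommended settings.

```python
from statistics import mean, stdev

def rolling_band_anomalies(values, window=24, n_sigmas=3.0):
    """Return indices of points outside mean +/- n_sigmas * stdev of the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # trailing window, excludes the current point
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # perfectly flat history: the band has no width, so skip
        if abs(values[i] - mu) > n_sigmas * sigma:
            anomalies.append(i)
    return anomalies

# Invented series: a stable metric hovering around 100, with one injected drop.
series = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99] * 3
series[30] = 40

print(rolling_band_anomalies(series, window=12))  # [30]
```

One thing to watch: once the drop enters the trailing window it inflates the standard deviation, so the band is wider than usual for the next `window` points. Excluding flagged points from the history is one common way to soften that.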

2. Seasonal decomposition

Most operational metrics have strong daily and weekly cycles. Seasonal decomposition splits a series into three parts: the trend (the slow underlying movement), the seasonal component (the repeating daily and weekly pattern), and the residual (what is left over). You run your anomaly detection on the residual, because once you have removed the expected weekly rhythm, a genuine anomaly stands out cleanly. This is the single biggest upgrade for metrics like traffic, orders, or signups, where a Sunday low and a Monday high are completely normal and should never alert. Decomposition lets you say a Monday is low for a Monday, which is the question you actually care about.
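The sketch below shows one way to do this with a classical additive decomposition, written with the standard library only. It is a simplified illustration, not a production method: it assumes daily data with a seven-day cycle (an odd period), uses a centred moving average for the trend and a per-weekday median for the seasonal profile, and all names and numbers are invented.

```python
from statistics import mean, median, stdev

def decompose_additive(values, period=7):
    """Split a series into trend, seasonal and residual parts (additive model)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    n, half = len(values), period // 2
    # Trend: centred moving average spanning exactly one full cycle.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = mean(values[i - half:i + half + 1])
    # Seasonal: typical detrended value at each position in the cycle.
    # The median keeps one bad day from distorting the profile.
    by_position = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            by_position[i % period].append(values[i] - trend[i])
    profile = [median(bucket) for bucket in by_position]
    offset = mean(profile)
    profile = [s - offset for s in profile]  # centre the profile on zero
    seasonal = [profile[i % period] for i in range(n)]
    # Residual: what is left once trend and seasonal are removed.
    residual = [
        None if trend[i] is None else values[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual

def residual_anomalies(residual, n_sigmas=3.0):
    """Flag residuals more than n_sigmas standard deviations from the residual mean."""
    known = [r for r in residual if r is not None]
    mu, sigma = mean(known), stdev(known)
    return [i for i, r in enumerate(residual)
            if r is not None and abs(r - mu) > n_sigmas * sigma]

# Six invented weeks of daily orders, Monday to Sunday, with a gentle upward trend.
weekly_shape = [120, 110, 105, 100, 95, 60, 50]
orders = [weekly_shape[day % 7] + 0.5 * day for day in range(42)]
orders[21] -= 50  # a Monday at 80.5: low for a Monday, yet above every Sunday

_, _, residual = decompose_additive(orders, period=7)
print(residual_anomalies(residual))  # [21]
```

The flagged Monday sits above every normal Sunday in the series, so no fixed floor could separate the two. The first and last three days have no trend estimate and are left unscored, which is one reason seasonal methods need several full cycles of history.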

3. Isolation forest and multivariate methods

The methods above watch one metric at a time. Sometimes the anomaly only appears in the relationship between metrics: traffic is normal and conversion is normal, but the combination is impossible. Isolation forest is a machine learning method that handles this. It works by randomly partitioning the data and noticing that anomalous points get isolated in fewer splits, because they sit in sparse regions of the space. It needs no labelled examples of past failures, handles many dimensions at once, and is well suited to catching the subtle, multivariate problems that single-metric thresholds miss. The trade-off is that its alerts are harder to explain, so pair it with the simpler methods rather than replacing them.
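To show the mechanics, here is a small from-scratch isolation forest in standard-library Python. It is a teaching sketch under stated assumptions: the function names, tree count, sample size, and traffic and conversion figures are all invented, and in practice you would more likely use a maintained library implementation such as scikit-learn's `IsolationForest`.

```python
import math
import random

def _avg_path(n):
    """Expected path length of an unsuccessful search among n points (normaliser)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split on this dimension
    split = rng.uniform(lo, hi)
    return {
        "dim": dim,
        "split": split,
        "left": _build_tree([p for p in points if p[dim] < split], depth + 1, max_depth, rng),
        "right": _build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth, rng),
    }

def _path_length(point, node, depth=0):
    """Number of splits needed to reach a leaf, plus a correction for unsplit leaves."""
    if "split" not in node:
        return depth + _avg_path(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return _path_length(point, child, depth + 1)

def isolation_scores(points, n_trees=200, sample_size=64, seed=0):
    """Score each point in (0, 1); higher means easier to isolate, so more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))  # needs at least 2 points
    max_depth = math.ceil(math.log2(sample_size))
    trees = [_build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = _avg_path(sample_size)
    return [2.0 ** (-sum(_path_length(p, t) for t in trees) / n_trees / norm)
            for p in points]

# Invented data: conversions track traffic at roughly 3 percent.
points = [(1000 + 50 * i, 0.03 * (1000 + 50 * i) + (i * 7) % 5 - 2) for i in range(60)]
# Traffic and conversions are each inside their normal range, but not together.
points.append((3800, 35))

scores = isolation_scores(points)
ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
print(ranked[:3])  # indices of the three most anomalous points
```

With this toy data the appended point (index 60) should rank at or near the top, even though a per-metric band on traffic or on conversions alone would see nothing unusual. The two ends of the normal range may also score fairly high, which hints at why these alerts are harder to explain than a simple band.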

Alert design: the part everyone underinvests in

A detection method that fires twenty times a day is worse than no method at all, because it trains your team to swipe alerts away without reading them. Then the one real alert in the pile gets ignored with the rest. This is alarm fatigue, and it is the most common way anomaly detection projects quietly fail. The detector was fine. The alerting was hostile.

A practical rollout

Do not try to monitor everything on day one. Start with the handful of metrics whose failure costs the most: payment success rate, core conversion, pipeline freshness, error rates. Put rolling statistical bands and seasonal decomposition on those, tune the alerts until precision is high, and only then expand coverage and introduce multivariate methods for the subtle cases. The aim is a small number of alerts that people trust enough to act on immediately. If you want help building this into your operational stack, Beryl Analytics designs anomaly detection systems that prioritise actionability over coverage for its own sake.
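One way to make "tune until precision is high" measurable is to have whoever handles each alert record whether it was a real problem, then compute precision per metric. The sketch below assumes such a review log exists; the log format, the metric names, and the 0.7 target are hypothetical.

```python
def alert_precision(reviews):
    """Fraction of fired alerts that a reviewer marked as a real problem."""
    if not reviews:
        return None  # no alerts fired, so precision is undefined
    return sum(1 for was_real in reviews if was_real) / len(reviews)

# Hypothetical review log: one True/False entry per fired alert, grouped by metric.
review_log = {
    "payment_success_rate": [True, True, True, False],
    "pipeline_freshness": [False, True, False, False, False],
}
TARGET_PRECISION = 0.7  # assumed target; pick your own

needs_tuning = []
for metric, reviews in review_log.items():
    precision = alert_precision(reviews)
    if precision is not None and precision < TARGET_PRECISION:
        needs_tuning.append(metric)

print(needs_tuning)  # ['pipeline_freshness']
```

Precision only covers the alerts that fired. It says nothing about incidents the detector missed, so track those separately when reviewing outages.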

Takeaways

Frequently asked questions

How much historical data do I need? For seasonal methods, at least several full cycles of the longest season you care about: several weeks for weekly patterns, and at least a year, ideally more, if there is meaningful annual seasonality.

Should I build or buy? Statistical bands and decomposition are cheap to build and keep you in control. Buy when you need turnkey coverage across hundreds of metrics fast. Many teams build for their critical few and buy for the long tail. We can help you decide.

anomaly detection, anomaly detection methods, operational monitoring, outlier detection

© Beryl Analytics 2026. AI-powered data analytics for New Zealand and Australia.