When taking the inevitable step of adopting machine-learning-powered analysis, one might become concerned about the risk of "alert flood" – the scenario where the rate of alarms (false and true alike) overwhelms the personnel who handle them.
After all, more often than not this is already the situation – and that's with only human-defined rules and thresholds in play.
In fact, Loom field engineers have more than once installed Loom Ops in environments where the existing tools were listing hundreds (!) of live critical alerts (and countless more of lower severity).
It's very clear that firing an alert that will never be attended to is worse than not firing it in the first place.
The truth is, not only does Loom Ops not flood you with alerts – it actually reduces overall fatigue.

Surprised? Keep reading.

Alert fatigue is the result of two key factors:

  • The overall rate of incidents
  • The quality of each incident, which directly translates to its time-to-resolve – and hence the team's MTTR

Loom takes aim at both: (a) producing the right number of daily incidents, and (b) accompanying every incident with a detailed root-cause analysis report, along with human-language insights and recommendations.

To reduce the number of incidents, Loom Ops:

  • correlates, aggregates, and determines causality between abnormalities – then hides symptoms and emphasizes the root cause
  • applies machine-learning-based prioritization: every incident is analyzed using dozens of factors, which are weighted to derive a priority score. The weights are continuously updated according to user feedback, and the system auto-tunes to emit only the number of daily incidents that the responsible team can actually handle
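The prioritization loop described above can be sketched roughly as follows. This is an illustrative sketch only, not Loom Ops' actual implementation: the factor names (`anomaly_severity`, `blast_radius`, `novelty`), the learning rate, and the daily capacity are all assumptions made for the example.

```python
# Hypothetical sketch: weighted priority scoring with feedback-tuned weights
# and a daily incident cap. All names and values are illustrative.

def priority_score(factors, weights):
    """Weighted sum of an incident's factor values (each in [0, 1])."""
    return sum(weights[name] * value for name, value in factors.items())

def update_weights(weights, factors, feedback, lr=0.1):
    """Nudge each factor's weight toward user feedback:
    feedback = +1 if the user confirmed the incident mattered,
    -1 if it was dismissed as noise."""
    return {
        name: max(0.0, w + lr * feedback * factors.get(name, 0.0))
        for name, w in weights.items()
    }

def emit_daily(incidents, weights, capacity):
    """Surface only the top-`capacity` incidents by priority score."""
    ranked = sorted(incidents,
                    key=lambda f: priority_score(f, weights),
                    reverse=True)
    return ranked[:capacity]

weights = {"anomaly_severity": 0.5, "blast_radius": 0.3, "novelty": 0.2}
incidents = [
    {"anomaly_severity": 0.9, "blast_radius": 0.8, "novelty": 0.1},
    {"anomaly_severity": 0.2, "blast_radius": 0.1, "novelty": 0.9},
    {"anomaly_severity": 0.7, "blast_radius": 0.6, "novelty": 0.3},
]

# Only the two highest-priority incidents reach the team today.
top = emit_daily(incidents, weights, capacity=2)

# A dismissal ("noise") lowers the weights of that incident's dominant factors.
weights = update_weights(weights, incidents[1], feedback=-1)
```

The key design point is the cap itself: rather than forwarding every anomaly, the system ranks and truncates, so the daily volume tracks what the team can absorb.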

To reduce the MTTR, Loom Ops:

  • accompanies incidents with a root-cause analysis report, pointing to the element that should be attended to
  • attempts to match each incident with an insight and a recommendation – drawn from the crowd-sourced knowledge of Loom Ops users, and from past incidents you have resolved yourself
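One simple way to picture the matching step is similarity search against a knowledge base of past resolutions. The sketch below is an assumption for illustration – the tag-set signatures, Jaccard similarity, and threshold are not drawn from Loom Ops' documented internals.

```python
# Hypothetical sketch: match a new incident's tags against past resolved
# incidents and reuse the closest recommendation. Illustrative only.

def jaccard(a, b):
    """Similarity of two tag sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_recommendation(incident_tags, knowledge_base, threshold=0.5):
    """Return the recommendation of the most similar past incident,
    or None if nothing clears the similarity threshold."""
    best, best_sim = None, threshold
    for signature, recommendation in knowledge_base:
        sim = jaccard(incident_tags, signature)
        if sim >= best_sim:
            best, best_sim = recommendation, sim
    return best

# Knowledge base of (past incident signature, resolution note) pairs.
kb = [
    ({"mysql", "disk_full", "replication_lag"},
     "Free disk space on the primary DB host"),
    ({"nginx", "502", "upstream_timeout"},
     "Restart the stuck upstream worker pool"),
]

rec = best_recommendation({"mysql", "disk_full", "slow_queries"}, kb)
```

Here the new incident shares two of three tags with the first past incident, so its resolution note is surfaced alongside the root-cause report; an incident with no close match would simply arrive without a recommendation.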