Near Real-Time Service Monitoring Using High-Dimensional Time Series

We demonstrate a near real-time service monitoring
system for detecting and diagnosing issues from high-dimensional
time series data. For detection, we have implemented a learning
algorithm that constructs a hierarchy of detectors from data.
It is scalable, does not require labelled examples of issues for
learning, runs in near real-time, and identifies a subset of counter
time series as being relevant for a detected issue. For diagnosis,
we provide efficient algorithms as post-detection diagnosis aids
to find further relevant counter time series at issue times, a
SQL-like query language for writing flexible queries that apply
these algorithms on the time series data, and a graphical user
interface for visualizing the detection and diagnosis results. Our
solution has been deployed in production as an end-to-end system
for monitoring Microsoft’s internal distributed data storage and
computing platform, which consists of tens of thousands of machines,
and currently analyses about 12,000 counter time series.
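
To make the idea of a post-detection diagnosis aid concrete, the following
is a minimal illustrative sketch, not the algorithm used in the deployed
system. It assumes a simple heuristic: given an issue time, rank counters by
a robust z-score (median and median absolute deviation) of their value at
that time relative to a preceding baseline window. The function name
rank_counters_at_issue, its parameters, and the counter names are all
hypothetical.

```python
from statistics import median


def rank_counters_at_issue(series, issue_idx, baseline=20, top_k=3):
    """Rank counters by deviation at issue_idx from their recent baseline.

    series: dict mapping a (hypothetical) counter name to a list of values.
    Returns up to top_k (name, score) pairs, highest score first.
    """
    scores = {}
    for name, values in series.items():
        # Baseline window: the samples immediately before the issue time.
        window = values[max(0, issue_idx - baseline):issue_idx]
        if len(window) < 2:
            continue  # not enough history to estimate a baseline
        med = median(window)
        mad = median([abs(v - med) for v in window])
        # 1.4826 scales the MAD to match a standard deviation under
        # normality; fall back to a tiny scale when the window is constant.
        scale = (1.4826 * mad) or 1e-9
        scores[name] = abs(values[issue_idx] - med) / scale
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy data (fabricated for illustration only): an issue at index 9.
series = {
    "disk_latency_ms": [5, 6, 7, 5, 6, 7, 5, 6, 7, 40],
    "cpu_percent": [50, 52, 51, 49, 50, 51, 50, 52, 49, 51],
    "queue_length": [2, 3, 4, 2, 3, 4, 2, 3, 4, 9],
}
for name, score in rank_counters_at_issue(series, issue_idx=9):
    print(f"{name}: {score:.2f}")
```

On this toy data the latency and queue-length counters rank above the CPU
counter, which stays within its baseline range. A robust scale is used here
so that a few earlier outliers in the baseline window do not mask the issue.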