freshtracks

Monitoring Methodology

How we analyze the performance of your system.

We have chosen to use the USE method for monitoring your infrastructure metrics. USE stands for Utilization, Saturation, and Errors. It's a methodology for monitoring the performance of your system and its resources. The resources we have chosen are CPU, memory, Disk I/O, and network. You can find the USE graphs for these resources in the FreshTracks.io plugin.

Figure: Example graph with metrics

As the image above shows, and as you will see if you navigate to view "All" metrics, the charts for errors are omitted. While developing these metrics, we found that errors were not helpful as infrastructure metrics and are better suited to application metrics.

Utilization

Utilization is a measure of how busy a resource is doing work. Both spikes and dips in resource utilization can signal that something is going wrong in your system.

For CPU, it's the sum of the rates of total used CPU seconds across all containers over the last minute. This metric shows to what degree your CPU cores are being utilized.
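As a rough illustration of what "the sum of the rates over the last minute" means, the sketch below computes a per-second rate for each container's cumulative CPU-seconds counter and adds the rates together. The function names, the sample layout, and the example values are assumptions made for this illustration. The calculation is also deliberately simplified: it ignores counter resets and does no extrapolation, both of which a real monitoring backend would typically handle.

```python
from typing import Dict, List, Tuple

# (unix timestamp in seconds, cumulative CPU seconds used) -- hypothetical layout
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample], now: float, window: float = 60.0) -> float:
    """Average per-second increase of a counter over the last `window` seconds."""
    recent = sorted(s for s in samples if now - window <= s[0] <= now)
    if len(recent) < 2:
        return 0.0  # not enough data points to compute a rate
    (t0, v0), (t1, v1) = recent[0], recent[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def cpu_utilization(containers: Dict[str, List[Sample]], now: float) -> float:
    """Sum of per-container CPU rates, in CPU cores' worth of work."""
    return sum(per_second_rate(samples, now) for samples in containers.values())


# Made-up samples for two hypothetical containers.
example = {
    "web": [(0.0, 100.0), (30.0, 115.0), (60.0, 130.0)],  # 30 CPU-seconds in 60 s
    "db": [(0.0, 40.0), (30.0, 47.5), (60.0, 55.0)],      # 15 CPU-seconds in 60 s
}
print(cpu_utilization(example, now=60.0))  # 0.5 + 0.25 = 0.75 cores
```

A result of 0.75 here would mean the two containers together kept three quarters of one CPU core busy during that minute.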

For Memory, it's the sum of the working memory bytes across all containers. This metric shows the level of memory used in your system.

For Disk IO, it's the sum of the rates of all written bytes in the containers over the last minute. This shows disk write throughput in bytes per second.

For Network, it's the sum of the rates of bytes received on the network over the last minute, plus the sum of rates of bytes transmitted on the network over the last minute.
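If your metrics live in Prometheus, the utilization definitions above could be written roughly as the queries built below. This is a sketch under assumptions: the metric names are the standard cAdvisor container metrics, which this page does not confirm the plugin uses, and the `sum_rate` helper and dictionary name are hypothetical.

```python
def sum_rate(metric: str, window: str = "1m") -> str:
    """Build a PromQL expression: sum of per-series rates over `window`."""
    return f"sum(rate({metric}[{window}]))"


# Assumed cAdvisor metric names; adjust to whatever your cluster exports.
UTILIZATION_QUERIES = {
    "cpu": sum_rate("container_cpu_usage_seconds_total"),
    # Working-set memory is a gauge, so it is summed directly with no rate().
    "memory": "sum(container_memory_working_set_bytes)",
    "disk_io": sum_rate("container_fs_writes_bytes_total"),
    # Network utilization adds received and transmitted byte rates.
    "network": (
        sum_rate("container_network_receive_bytes_total")
        + " + "
        + sum_rate("container_network_transmit_bytes_total")
    ),
}

for resource, query in UTILIZATION_QUERIES.items():
    print(f"{resource}: {query}")
```

Memory is the one resource here that is not a rate: it is a point-in-time level rather than a cumulative counter.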

Saturation

Saturation is a measure of the degree to which the resource has extra work it cannot do.

For CPU, it's the sum of the rates of total CPU CFS throttle seconds across all containers over the last minute. The CPU will be throttled when Kubernetes CPU limits are exceeded or when configured CPU requests come into play during node CPU pressure.

For Memory, it's the sum of the rates of events in which memory usage hit its limit, over the last minute.

For Disk IO, it's the sum of the rates of memory limit reached events over the last minute.

For Network, it's the sum of the rates of dropped packets across all network traffic over the last minute.
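For readers using Prometheus, the CPU, memory, and network saturation signals could be approximated with queries like the ones built below. The metric names are assumed to be the standard cAdvisor ones and are not confirmed by this page, and the function and dictionary names are hypothetical. Disk IO is left out of this sketch.

```python
from typing import Iterable


def saturation_query(metrics: Iterable[str], window: str = "1m") -> str:
    """Build a PromQL expression adding the summed rates of each metric."""
    return " + ".join(f"sum(rate({m}[{window}]))" for m in metrics)


# Assumed cAdvisor metric names; adjust to whatever your cluster exports.
SATURATION_QUERIES = {
    # Seconds containers spent throttled by the CFS scheduler.
    "cpu": saturation_query(["container_cpu_cfs_throttled_seconds_total"]),
    # Count of times memory usage hit its limit.
    "memory": saturation_query(["container_memory_failcnt"]),
    # Dropped packets in both directions.
    "network": saturation_query([
        "container_network_receive_packets_dropped_total",
        "container_network_transmit_packets_dropped_total",
    ]),
}

for resource, query in SATURATION_QUERIES.items():
    print(f"{resource}: {query}")
```

A sustained non-zero value for any of these suggests the resource has work queued up that it cannot service.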