Eliminating NaNs in PromQL Histogram Quantile

Working with Prometheus and PromQL can be tricky. I found it challenging to define an SLO using sloth-sli for a service whose histogram quantile produced a lot of NaN (Not a Number) values. I searched around for an out-of-the-box solution and didn’t find what I was looking for.

The key to the solution was knowing that in PromQL a NaN doesn’t equal itself: NaN != NaN.
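The self-inequality of NaN comes from IEEE 754 floating point, so it can be seen outside Prometheus too. Here is a minimal Python sketch of the idea; the helper name and the sample values are made up for illustration and are not part of PromQL.

```python
# NaN is the only float value that is not equal to itself.
nan = float("nan")

def is_nan_by_self_comparison(x: float) -> bool:
    """Hypothetical helper: detect NaN using only the != operator."""
    return x != x

# Made-up sample values standing in for quantile results.
samples = [0.25, 7.5, float("inf"), nan]
flags = [is_nan_by_self_comparison(v) for v in samples]
print(flags)  # [False, False, False, True]
```

Only the NaN sample is flagged, which is the same test the query performs when it compares the quantile with itself.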

With this in mind I could use the “unless” operator to filter out any value that didn’t equal itself.

p50 Latency Error Query

histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{job="$job", namespace="$namespace"}[{{.window}}])) by (le)) > 5
unless (
  histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{job="$job", namespace="$namespace"}[{{.window}}])) by (le))
  !=
  histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{job="$job", namespace="$namespace"}[{{.window}}])) by (le))
)
or on() vector(0)



This query returns the p50 (0.5 quantile) latency when it is over 5 seconds. The inner comparison keeps only values that do not equal themselves (i.e., NaNs), and “unless” drops any matching series from the result, so a NaN yields an empty result instead of a value. Whenever nothing is left (the quantile was NaN or was not over 5), the empty result is OR’ed with vector(0), which fills in a 0 and creates a consistent graph with no gaps.
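To make the order of operations concrete, here is a toy Python model of the query. It is only a sketch under simplifying assumptions: an instant vector is modeled as a dict from a label set to a float, the function names are hypothetical, and none of this is Prometheus code. After "sum ... by (le)" and histogram_quantile the result carries no labels, so the empty tuple stands in for the label set.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]

NAN = float("nan")
NO_LABELS: Labels = ()

def filter_gt(v: Vector, threshold: float) -> Vector:
    """Model of `v > threshold`: keep samples above the threshold.
    Note that NaN > 5 is already false, so NaN never passes this filter."""
    return {k: x for k, x in v.items() if x > threshold}

def filter_ne(left: Vector, right: Vector) -> Vector:
    """Model of `left != right`: keep left samples that differ from their match."""
    return {k: x for k, x in left.items() if k in right and x != right[k]}

def unless(left: Vector, right: Vector) -> Vector:
    """Model of `left unless right`: drop left samples whose labels appear in right."""
    return {k: x for k, x in left.items() if k not in right}

def or_on_empty(left: Vector, fallback: float) -> Vector:
    """Model of `left or on() vector(fallback)`: the fallback shows up only
    when the left side is empty, because on() matches every series."""
    return dict(left) if left else {NO_LABELS: fallback}

def evaluate(quantile: Vector) -> Vector:
    over_limit = filter_gt(quantile, 5)
    nans = filter_ne(quantile, quantile)  # non-empty only for NaN samples
    return or_on_empty(unless(over_limit, nans), 0.0)

print(evaluate({NO_LABELS: 7.5}))  # {(): 7.5}
print(evaluate({NO_LABELS: 1.2}))  # {(): 0.0}
print(evaluate({NO_LABELS: NAN}))  # {(): 0.0}
```

In all three cases the model returns exactly one sample, which is why the resulting graph has no gaps.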

Hope this tidbit helps someone. SLOs are hard enough as it is without NaNs in your life!