Feat/uptime query v3 #1054

Open · wants to merge 14 commits into main

Conversation

ckbedwell
Contributor

Problem

Currently our uptime query is not only incredibly opaque but also inaccurate in a variety of situations. We need to document how our uptime query works for both internal and external users, ensuring the reasoning behind it, the explanations, and the reliability of the query are as clear as possible.

Previously we implemented a v2 of the uptime query behind a feature flag after an escalation, but we never enabled it for all stacks. We didn't have a testing strategy we were confident enough in when releasing it to our customers, and we were unsure whether we would be trading one bug for another set.

Solution

This PR adds an uptime query v3 to replace all existing uptime queries. It is not behind a feature flag, and it removes the v2 query and its feature flag. The query has been tested extensively in a variety of situations and has proven to be very accurate. Please thoroughly read the uptime.md internal documentation that this PR adds.

I am in the middle of documenting how to set up and use the Prometheus test data library, which is how I settled on uptime v3 and assessed how accurately it performs. I will add this to the documentation once it is ready.

This PR also addresses some tangentially related UX inconsistencies when assessing uptime across the application:

  1. All areas of the application displaying uptime now use the same query.
  2. All scenes dashboards default to showing the last 3 hours (the same as the check list page).

Both of these changes combined mean that uptime should remain consistent and not unexpectedly 'jump' around as you are navigating from one route to another.

This PR does not intend to address reachability or any other metric panels on our dashboards (that will come later 😄).

@ckbedwell ckbedwell requested a review from a team as a code owner January 30, 2025 16:32
@ckbedwell ckbedwell requested review from w1kman, VikaCep, mem and peterschretlen and removed request for a team January 30, 2025 16:32
Base automatically changed from chore/remove-unused-singlecheckmode to main January 30, 2025 21:43
Comment on lines +111 to +117
## sum(...)

Because of the `probe` label, we can have multiple results for a single time point. We sum all of the results for all probes given a time point to get a single result.

## clamp_max(..., 1)

Each assessment of uptime in a time point is a binary result: it is either 1 (success) or 0 (failure). In the previous step we sum all of the probe results for a given time point and in this step we 'clamp' the result to a maximum of 1. We only care if a time point is successful or not. If a time point has 3 successful probes, we still only want to count it as 1 successful time point (individual successes are represented by our reachability metric).
Contributor

you can replace this with

max by () (...)

where the grouping can be empty, to discard all labels, or it can be `(job, instance)` to drop all labels other than `job` and `instance`.

Contributor

it's not that your query is wrong, it's that this alternative has less noise to read through (and it's possibly more efficient)
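For reference, a minimal sketch of the suggested rewrite, assuming the same `$job`, `$instance` and `$frequencyInSeconds` variables as the documented query:

```
# Taking the max across the 0/1 per-probe series also yields 0 or 1 for the
# time point, so no clamp is needed, and the empty grouping drops all labels.
max by () (max_over_time(probe_success{job="$job", instance="$instance"}[$frequencyInSeconds]))
```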

Taking into account all of the above, our uptime calculation is the following:

```
clamp_max(sum(max_over_time(probe_success{job="$job", instance="$instance"}[$frequencyInSeconds])), 1)
```
Contributor

Something is missing in the explanation.

This query is computing whether the target is up for a single time step.

As a range query, it's doing that computation for the entire time range, but it's not producing a single number, it's producing a sequence of "up", "down", "up", "up", etc. In other words, it's producing a sequence of 0, 1, 1, 0, 1, 0, etc.

That sequence of numbers needs to be reduced to a single number, as explained above.

There are two ways to achieve that:

  1. avg_over_time((...)[$range:$frequency])
  2. A Grafana transformation that loads all the returned samples, and computes the average of all of them

The first option bumps into Mimir's (Grafana Cloud's) limits for how big a range can be for this kind of computation (90 days, IIRC). The second one will bump against how many samples Mimir is willing to return for a single query (somewhere in the 10000 range, IIRC).

To be more explicit: for the first option, in Grafana this needs to be set to an instant query, but that instant query will contain a subquery that is a range query.
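A sketch of what option 1 could look like when applied to the documented uptime expression. `$__range` (Grafana's dashboard time range variable) and the subquery step are assumptions here, not part of the PR:

```
# Evaluate the per-time-point uptime expression as a subquery over the
# selected range, then average the resulting 0/1 points into a single ratio.
# In Grafana this would be run as an instant query.
# ($__range is assumed; it is not used in the PR's query.)
avg_over_time(
  (
    clamp_max(sum(max_over_time(probe_success{job="$job", instance="$instance"}[$frequencyInSeconds])), 1)
  )[$__range:$frequencyInSeconds]
)
```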

Contributor

> The first option bumps into Mimir's (Grafana Cloud's) limits for how big a range can be for this kind of computation (90 days, IIRC). The second one will bump against how many samples Mimir is willing to return for a single query (somewhere in the 10000 range, IIRC).

The limit is expressed in terms of hours. It's 768 hours, which corresponds to 32 days. Either there's some padding in there to allow a 30-day query to work, or there's something about 768 that I'm missing.

Playing around with Grafana, I can see that if I try to do something like:

max by (instance, job) (probe_success{instance=..., job=...})

and specify a range query for 90 days, it will happily return some results, but they are 1 hour apart (1 * 24 * 90 = 2160 samples).

If I try with an instant query like:

max by (instance, job) (probe_success{instance="...", job="..."})[30d:2m]

it will return all the results (30 * 24 * 30 = 21600 samples).

This on the other hand returns an error:

max by (instance, job) (probe_success{instance="...", job="..."})[90d:2m]
execution: the query time range exceeds the limit (query length: 2160h3m45.551s, limit: 768h0m0s) (err-mimir-max-query-length). To adjust the related per-tenant limit, configure -querier.max-partial-query-length, or contact your service administrator.

Even something like:

max by (instance, job) (probe_success{instance="...", job="..."})[90d:1h]

which is equivalent to what Grafana is doing, does not work.


Uptime v3 showing 49.7% uptime for the `20p_shared_random` scenario. The underlying data has a 50% uptime so the result is only off by 0.3%, which is very impressive for a single query quickly evaluating 10,500,000 individual data points (525,000 minutes in a year x 20 probe results).

Honestly, no idea. Prometheus black magic is the only answer I can provide. I have gone over the whole testing process from start to finish with a fine-tooth comb to see whether there is any fault in the underlying data that is generated, the backfill process, or the upload to Prometheus, but it is all correct.
Contributor

It's hard to say without knowing how the query is getting reduced from the N samples returned by the range query above to a single result.


![](./images/logs_vs_visualization.png)

The above image demonstrates visualizing a 10-minute period approximately 11 months into the past of the dataset and cross-referencing it against the raw generated data that is inserted into Prometheus. Uptime v3 is precisely correct in its calculation; uptime calculations v1 and v2 are both _wildly_ incorrect.
Contributor

It took a bit of time for this to click for me, until I looked at this specific image.

The reason the existing queries prefer probe_success_all_sum / probe_success_all_count over probe_success is that it's possible for the agent to have the results but fail to publish them for whatever reason (the Internet sits between the agent and the database).

With probe_success, if the sample is lost, it's lost. Prometheus will do its best to fill in the gaps, but with the proposed query, it's essentially impossible to do that.

With probe_success_all_sum, the agent collects the data, fails to publish it, and at a later point, it publishes it again. Let's say the actual data is up, down, up. probe_success will be 1, , 1. For that range, the increment in probe_success_all_sum will be 2 and probe_success_all_count will be 3.

This is why the question of how the data is being reduced is important. One option is to treat 1, , 1 as 1, 1 (ignore missing data), and the other is 1, 0, 1 (missing data is zero). The first would produce (I suspect) 100% and the second 66%. The probe_success_all_sum way produces 2/3 = 66%.

You can do the whole exercise again assuming the real data is up, up, up, giving you 100%, 66% and 100% respectively. You can consider down, up, down, giving 0%, 0% and 33%.

In the case where the agent fails to collect the data, you have a gap. I suspect the proposed query looks at 1, , 1 and produces 100% or 66% depending on how the data is reduced. The probe_success_all_sum metric increases by 2 (same for the count) and it produces 100%. In this case, lacking any other piece of evidence, 100% is the correct result.

The problem with probe_success_all_sum is that there's no easy way to aggregate it across locations.

I guess this is a very long way of saying that it would be useful to know what happens in this case when there's missing data.
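For comparison, a minimal sketch of the counter-based calculation discussed above, using the metric names as written in the comment and an assumed `$__range` dashboard variable. Note this yields the fraction of successful executions, which only matches the per-time-point uptime definition in the single-probe case:

```
# Successful executions divided by total executions over the selected range.
# Late-published samples still advance both counters, so gaps in probe_success
# do not lose information here; the trade-off noted above is that this is
# hard to aggregate meaningfully across probe locations.
# ($__range is assumed; it is not part of the PR's query.)
sum(increase(probe_success_all_sum{job="$job", instance="$instance"}[$__range]))
/
sum(increase(probe_success_all_count{job="$job", instance="$instance"}[$__range]))
```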


![](./images/v1_v3_comparison.png)

The v1 query (bottom) shows both an incorrect single stat (94.9%) and a graph with a gradual increase in uptime at the beginning of the time period and a gradual decline at the end. Neither of these is correct; both are a misleading interpretation of the underlying data. It is impossible to have a gradual slope in uptime: a time point is either successful or it is not.
Contributor

This I can disagree with, at least in the form it's written.

While it's true that at any given instant the target must be either up or down, "uptime" is inherently a metric that has an associated range: something has been up x% of the time range. If you take 10 samples 1 minute apart, and the first sample is "down" and the rest are "up", your initial uptime is not 100% (we can debate what it should be) and after that the "uptime over the last 5 minutes" will eventually reach 100% (because the last 5 samples were "up"). Somewhere before that it must be 80% (because somewhere you will have down, up, up, up, up). Therefore it must be true that there's some interval where it's going from 80% to 100%.


### What happens if we remove the `max_over_time` aggregation from the query?

Prometheus has a feature where it has 'assumed continuity' of metrics. This means that if a metric is not reported for a given time point, Prometheus will assume that the metric has the same value as the previous time point for up to five minutes before it stops reporting a value. For Synthetic Monitoring, this is a terrible assumption to make and can have a detrimental effect on the accuracy of our uptime calculation.
Contributor

Can you provide a link for "Prometheus has a feature where it has 'assumed continuity' of metrics."?

I think I understand what you mean by this, but it would be useful to have a reference.

Contributor

I believe what you are referring to here is the "lookback delta" parameter: how far back can Prometheus go before it considers a metric to be stale. This does default to 5m.

https://prometheus.io/docs/prometheus/latest/command-line/prometheus/ where it talks about --query.lookback-delta.

Aggregation functions do look at actual samples, so what you are saying here is that by adding max_over_time you are making sure that the range includes an actual observation.

The problem with aligning this tightly with frequency is that if the scrape is delayed, the interval between one sample and the next might be larger than frequency. That means that, depending on how the data is treated, the target will be considered down. This is the reason why the general recommendation is to use more than twice the scrape interval for e.g. rate.

The problem with expanding the interval is that two samples might exist within that interval, and by asking for max, any downtime in that interval is ignored.
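To make the trade-off concrete, a sketch with a hypothetical widened window of twice the check frequency (`$doubleFrequencyInSeconds` is not an existing variable):

```
# A wider look-back window makes it more likely that at least one real sample
# lands in every step, but max_over_time will report 1 for a window that
# contains both a failed and a successful sample, hiding that downtime.
# ($doubleFrequencyInSeconds is hypothetical, standing in for 2x the frequency.)
clamp_max(sum(max_over_time(probe_success{job="$job", instance="$instance"}[$doubleFrequencyInSeconds])), 1)
```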

Contributor

What I'm saying is this:

If this is the ideal execution time:

t0      t1      t2      t3
*-------*-------*-------*---

and this is the actual time:

t0        t1      t2      t3
*---------*-------*-------*---

running this query at a time t between t1 and t2 effectively does this:

t0        t1   t  t2      t3
*-----[---*---+---*-------*---

where [ is the marker for t - frequency

you can see that t1 falls within that range.

But if you evaluate that query at a time t just before t1, what you get is:

t0        t1      t2      t3
*-[------+*-------*-------*---

and there's no sample within that range.

If you increase that range to compensate, you might end up with this:

t0        t1      t2      t3
*--------[*-------*+------*---

and now you have two samples within that range.



The most likely scenario for a metric to stop reporting results is when a check's configuration has been altered. This could be a change in frequency, the number of probes or the check's definition of uptime itself. The `config_version` label changes to a nanosecond Unix timestamp when the check's configuration changes, and this is the most likely time that uptime would be affected (e.g. by accidentally introducing a flaw in the uptime definition).
Contributor

While it is true that config_version is a timestamp, I don't think this is documented anywhere, and that's intentional. The value should be considered a monotonically increasing value without any further meaning. In other words A > B means the configuration A was applied at a point in time after B.

Note that because of the way the query is constructed, changes in config_version won't necessarily affect the results. The only change that is problematic is changing the frequency.

Comment on lines +71 to +81
select: (data) => {
const vals = data[0].values;
const total = vals.reduce((acc, [_, value]) => {
return acc + Number(value);
}, 0);

if (vals.length === 0) {
return null;
}

return total / vals.length;
Contributor

I'm guessing this will ignore NODATA.

if (!metricsDS) {
return Promise.reject(`You need to have a metrics datasource available.`);
}

return queryMetric<MetricCheckSuccess>(url, query, options);
return queryRangeMetric({ url, query: expr, ...getStartEnd(), step: interval });
Contributor

Using interval as the step means a value is returned for each time range where there should be a sample. With a single probe, there's some chance of missing data that was published "too late" (and, for the same reason, of intervals containing two samples from different executions).


![](./images/sm-architecture.png)

[A more detailed explanation of the Synthetic Monitoring architecture in our on-call training.](https://docs.google.com/presentation/d/1jWuv2VwVFIsdyXq4JQMlhmM6txHXLyHImYpnl50V5XQ/edit#slide=id.g308ff111cb5_0_35).
Contributor

Flagging that this is an internal-only link in a public repo, so the general public may click it thinking they can access it. It might be worth adding "(internal-only)" or something to set expectations that this won't be externally visible.

Suggested change
[A more detailed explanation of the Synthetic Monitoring architecture in our on-call training.](https://docs.google.com/presentation/d/1jWuv2VwVFIsdyXq4JQMlhmM6txHXLyHImYpnl50V5XQ/edit#slide=id.g308ff111cb5_0_35).
[A more detailed explanation of the Synthetic Monitoring architecture in our on-call training (Grafana Labs internal-only).](https://docs.google.com/presentation/d/1jWuv2VwVFIsdyXq4JQMlhmM6txHXLyHImYpnl50V5XQ/edit#slide=id.g308ff111cb5_0_35).


`max_over_time` is a [Prometheus aggregation function](https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time) that returns the maximum value of a metric over a given time range, in this case a check's frequency in seconds: `$frequencyInSeconds`.

We use `$frequencyInSeconds` to determine what the 'look back' time should be (which gives us a 'time point'). For example, if the check is running every 15 seconds, the step would be 15 seconds. If the check is running every minute, the step would be 60 seconds.
Contributor

I think Marcelo also covers this in his comments, but by using frequencyInSeconds do we create the possibility of getting no samples in a range?

I think the assumption of a data point in the range will hold much of the time, but what happens when it does not? I'm thinking of script-based tests in particular, there is variability because of script run duration, lag, and retries that can happen on certain types of failure.
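As an illustration of the empty-window case raised here, a sketch assuming a check with a 60-second frequency (the concrete 60s value is for illustration only):

```
# Each evaluation step only looks back 60s (assuming a 60s check frequency).
# If a script run overruns or a sample arrives late, the window can contain
# no samples at all; in that case max_over_time returns no point for the step
# (not 0), so the outcome depends on how the missing point is reduced downstream.
clamp_max(sum(max_over_time(probe_success{job="$job", instance="$instance"}[60s])), 1)
```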
