
Conversation

@arikalon1
Contributor

Summary

Fixes #490

When using simple-limit strategy with long history durations and small step sizes, the number of data points can exceed Prometheus's maximum resolution of 11,000 points per timeseries, causing queries to fail with:

PrometheusApiClientException: HTTP Status Code 400 (b'{"status":"error","errorType":"bad_data","error":"exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)"}')

Root Cause

The default settings are:

  • history_duration: 336 hours (2 weeks)
  • timeframe_duration: 1.25 minutes (75 seconds)

This results in 336 * 60 * 60 / 75 = 16,128 data points, which exceeds the 11,000-point limit.
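The arithmetic can be verified directly (a minimal sketch; the variable names are illustrative, not from the codebase):

```python
# Illustrative check of the default settings (variable names are ours).
history_hours = 336        # history_duration: 2 weeks
step_seconds = 75          # timeframe_duration: 1.25 minutes
prometheus_limit = 11_000  # hard cap on points per timeseries

points = history_hours * 60 * 60 / step_seconds
print(points)  # 16128.0 -> well over the 11,000-point cap
```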

The simple strategy works because it uses PercentileCPULoader with QueryType.Query (a simple instant query), while simple-limit uses CPULoader with QueryType.QueryRange (a range query that returns all data points).

Solution

This PR adds automatic step adjustment in the load_data method of PrometheusMetric:

  1. For QueryRange queries, calculate the expected number of data points
  2. If it exceeds MAX_PROMETHEUS_POINTS (10,000 - using a safety margin below the 11,000 hard limit), automatically increase the step size
  3. Log the adjustment at debug level for visibility
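For the default two-week window, the minimum safe step can be worked out numerically. This is an illustrative calculation using the PR's 10,000-point ceiling and its round-up-to-whole-seconds rule; the variable names are ours:

```python
# Worked example: adjusting the step for the default 2-week window.
period_seconds = 336 * 60 * 60   # 1,209,600 s of history
max_points = 10_000              # safety margin below Prometheus's 11,000

min_step = period_seconds / max_points   # 120.96 s
adjusted_step = int(min_step) + 1        # round up to whole seconds -> 121 s
points_after = period_seconds / adjusted_step

print(adjusted_step, round(points_after))  # 121 9997
```

So a 75 s step gets bumped to 121 s, bringing the query back under the limit with a comfortable margin.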

Changes

  • Added MAX_PROMETHEUS_POINTS constant (10,000)
  • Added _calculate_safe_step() method to compute adjusted step when needed
  • Modified load_data() to call _calculate_safe_step() for range queries
  • Added logging import for debug messages

Testing

This fix can be tested by running:

krr simple-limit --history_duration=336 --timeframe_duration=1.25

Before this fix: Fails with "exceeded maximum resolution" error
After this fix: Automatically adjusts step size and completes successfully

Considerations

  • Uses 10,000 as the limit instead of 11,000 for safety margin
  • Only affects QueryRange type queries (regular Query types don't have this issue)
  • The adjustment is transparent - users don't need to change their configuration
  • Debug logging shows when adjustment occurs for troubleshooting

…tion

When querying Prometheus with long history durations and small step sizes,
the number of data points can exceed Prometheus's maximum resolution of
11,000 points per timeseries.

This fix automatically increases the step size when the calculated number
of points would exceed 10,000 (using 10,000 as a safety margin below the
11,000 hard limit).

Fixes #490
@coderabbitai
Copy link

coderabbitai bot commented Jan 26, 2026

Walkthrough

Added a safety mechanism to Prometheus query step size calculation. Introduces a helper method that adjusts the query step size to ensure the resulting data point count stays under 10,000 points, preventing Prometheus errors from oversized queries.

Changes

Cohort / File(s) Summary
Prometheus Query Safety
robusta_krr/core/integrations/prometheus/metrics/base.py
Added module-level logger and MAX_PROMETHEUS_POINTS constant (10,000). Introduced _calculate_safe_step() method to dynamically adjust query step sizes based on query period. Method ensures data point count (period/step) does not exceed the limit by increasing step when necessary. Applied adjustment in data loading for QueryRange queries.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed — The title clearly and specifically describes the main change: auto-adjusting the Prometheus step size to prevent exceeding the maximum resolution limit, which is the core objective of this PR.
  • Description check ✅ Passed — The description is comprehensive and directly related to the changeset, explaining the problem, root cause, solution, and testing approach for the Prometheus max resolution fix.
  • Linked Issues check ✅ Passed — The PR fully addresses issue #490 by implementing automatic step adjustment to prevent exceeding Prometheus's 11,000-point limit, with transparent adjustment and debug logging as required.
  • Out of Scope Changes check ✅ Passed — All changes are directly scoped to fixing the Prometheus resolution limit issue in base.py, with no extraneous modifications or unrelated code changes.



@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/prometheus/metrics/base.py`:
- Around line 127-162: The _calculate_safe_step function currently uses the raw
step.total_seconds() while the actual query later rounds step to whole seconds,
which can still exceed MAX_PROMETHEUS_POINTS for fractional steps; modify
_calculate_safe_step to normalize the incoming step to the integer-second
resolution used for queries (e.g., round/truncate step.total_seconds() to an int
second value before computing expected_points and min_step_seconds) and return a
timedelta based on that normalized integer seconds, and ensure the same
normalized integer second value is used when constructing step_str so the safety
calculation and actual query step are identical (refer to _calculate_safe_step
and the code that builds step_str).

Comment on lines +127 to +162
```python
def _calculate_safe_step(self, period: datetime.timedelta, step: datetime.timedelta) -> datetime.timedelta:
    """
    Calculate a step size that won't exceed Prometheus's maximum resolution limit.

    If the number of data points (period / step) would exceed MAX_PROMETHEUS_POINTS,
    this function returns an increased step size that keeps the point count under the limit.

    Args:
        period: The time period for the query.
        step: The originally requested step size.

    Returns:
        A step size that keeps the number of data points under MAX_PROMETHEUS_POINTS.
    """
    period_seconds = period.total_seconds()
    step_seconds = step.total_seconds()

    # Calculate expected number of points
    expected_points = period_seconds / step_seconds

    if expected_points <= MAX_PROMETHEUS_POINTS:
        return step

    # Calculate the minimum step size needed to stay under the limit
    min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS

    # Round up to the nearest second to ensure we're under the limit
    adjusted_step_seconds = int(min_step_seconds) + 1

    logger.debug(
        f"Adjusting step from {step_seconds}s to {adjusted_step_seconds}s to avoid exceeding "
        f"Prometheus max resolution ({expected_points:.0f} points -> {period_seconds / adjusted_step_seconds:.0f} points)"
    )

    return datetime.timedelta(seconds=adjusted_step_seconds)
```


⚠️ Potential issue | 🟠 Major

Align safe-step math with the actual query step to avoid still exceeding the limit.

_calculate_safe_step uses the raw step.total_seconds(), but Line 218 rounds the step to whole seconds for the actual query. For fractional steps (e.g., 1.49s), rounding down can inflate the real point count and still hit Prometheus’s 11k cap. Normalize to the same integer-second resolution inside _calculate_safe_step and use that value when building step_str.

🔧 Proposed fix
```diff
@@
-import logging
+import logging
+import math
@@
-        step_seconds = step.total_seconds()
+        # Normalize to the same integer-second resolution used in the query
+        step_seconds = max(1, round(step.total_seconds()))
@@
-        if expected_points <= MAX_PROMETHEUS_POINTS:
-            return step
+        if expected_points <= MAX_PROMETHEUS_POINTS:
+            return datetime.timedelta(seconds=step_seconds)
@@
-        min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS
-
-        # Round up to the nearest second to ensure we're under the limit
-        adjusted_step_seconds = int(min_step_seconds) + 1
+        min_step_seconds = math.ceil(period_seconds / MAX_PROMETHEUS_POINTS)
+
+        # Round up to the nearest second to ensure we're under the limit
+        adjusted_step_seconds = max(step_seconds, min_step_seconds)
@@
-        step_str = f"{round(step.total_seconds())}s"
+        step_str = f"{int(step.total_seconds())}s"
```

Also applies to: 214-218
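The hazard the reviewer describes can be reproduced with illustrative numbers (not taken from the PR): a fractional step passes the safety check, but the whole-second step actually sent to Prometheus yields far more points:

```python
# Illustrative numbers: fractional step vs. the rounded step the query uses.
period_s = 14_000
requested_step = 1.49                       # fractional step in seconds

checked_points = period_s / requested_step  # ~9,396 -> passes the 10,000 check
query_step = round(requested_step)          # query rounds to whole seconds -> 1
actual_points = period_s / query_step       # 14,000 -> exceeds the 11,000 cap

print(round(checked_points), actual_points)
```

Normalizing to integer seconds before the check, as the proposed fix does, makes the safety calculation and the real query agree.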




Development

Successfully merging this pull request may close these issues.

Failed to gather resource history exceeded maximum resolution of 11,000 points per timeseries

2 participants