
Conversation

@arikalon1
Contributor

Summary

Fixes #490

When using simple-limit strategy with long history durations and small step sizes, the number of data points can exceed Prometheus's maximum resolution of 11,000 points per timeseries, causing queries to fail with:

PrometheusApiClientException: HTTP Status Code 400 (b'{"status":"error","errorType":"bad_data","error":"exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)"}')

Root Cause

The default settings are:

  • history_duration: 336 hours (2 weeks)
  • timeframe_duration: 1.25 minutes (75 seconds)

This results in 336 * 60 * 60 / 75 = 16,128 data points, which exceeds the 11,000-point limit.
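The arithmetic can be verified directly (a minimal sketch; the variable names are illustrative, not from the codebase):

```python
# Illustrative check of the default settings (variable names are ours).
history_hours = 336        # history_duration: 2 weeks
step_seconds = 75          # timeframe_duration: 1.25 minutes
prometheus_limit = 11_000  # hard cap on points per timeseries

points = history_hours * 60 * 60 / step_seconds
print(points)  # 16128.0 -> well over the 11,000-point cap
```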

The simple strategy works because it uses PercentileCPULoader with QueryType.Query (a simple instant query), while simple-limit uses CPULoader with QueryType.QueryRange (a range query that returns all data points).

Solution

This PR adds automatic step adjustment in the load_data method of PrometheusMetric:

  1. For QueryRange queries, calculate the expected number of data points
  2. If it exceeds MAX_PROMETHEUS_POINTS (10,000 - using a safety margin below the 11,000 hard limit), automatically increase the step size
  3. Log the adjustment at debug level for visibility
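For the default two-week window, the minimum safe step can be worked out numerically. This is an illustrative calculation using the PR's 10,000-point ceiling and its round-up-to-whole-seconds rule; the variable names are ours:

```python
# Worked example: adjusting the step for the default 2-week window.
period_seconds = 336 * 60 * 60   # 1,209,600 s of history
max_points = 10_000              # safety margin below Prometheus's 11,000

min_step = period_seconds / max_points   # 120.96 s
adjusted_step = int(min_step) + 1        # round up to whole seconds -> 121 s
points_after = period_seconds / adjusted_step

print(adjusted_step, round(points_after))  # 121 9997
```

So a 75 s step gets bumped to 121 s, bringing the query back under the limit with a comfortable margin.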

Changes

  • Added MAX_PROMETHEUS_POINTS constant (10,000)
  • Added _calculate_safe_step() method to compute adjusted step when needed
  • Modified load_data() to call _calculate_safe_step() for range queries
  • Added logging import for debug messages

Testing

This fix can be tested by running:

krr simple-limit --history_duration=336 --timeframe_duration=1.25

Before this fix: Fails with "exceeded maximum resolution" error
After this fix: Automatically adjusts step size and completes successfully

Considerations

  • Uses 10,000 as the limit instead of 11,000 for safety margin
  • Only affects QueryRange type queries (regular Query types don't have this issue)
  • The adjustment is transparent - users don't need to change their configuration
  • Debug logging shows when adjustment occurs for troubleshooting

…tion

When querying Prometheus with long history durations and small step sizes,
the number of data points can exceed Prometheus's maximum resolution of
11,000 points per timeseries.

This fix automatically increases the step size when the calculated number
of points would exceed 10,000 (using 10,000 as a safety margin below the
11,000 hard limit).

Fixes #490
@coderabbitai
Copy link

coderabbitai bot commented Jan 26, 2026

Walkthrough

Added a safety mechanism to Prometheus query step size calculation. Introduces a helper method that adjusts the query step size to ensure the resulting data point count stays under 10,000 points, preventing Prometheus errors from oversized queries.

Changes

Cohort / File(s) Summary
Prometheus Query Safety
robusta_krr/core/integrations/prometheus/metrics/base.py
Added module-level logger and MAX_PROMETHEUS_POINTS constant (10,000). Introduced _calculate_safe_step() method to dynamically adjust query step sizes based on query period. Method ensures data point count (period/step) does not exceed the limit by increasing step when necessary. Applied adjustment in data loading for QueryRange queries.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed — The title clearly and specifically describes the main change: auto-adjusting the Prometheus step size to prevent exceeding the maximum resolution limit, which is the core objective of this PR.
  • Description check ✅ Passed — The description is comprehensive and directly related to the changeset, explaining the problem, root cause, solution, and testing approach for the Prometheus max resolution fix.
  • Linked Issues check ✅ Passed — The PR fully addresses issue #490 by implementing automatic step adjustment to prevent exceeding Prometheus's 11,000-point limit, with transparent adjustment and debug logging as required.
  • Out of Scope Changes check ✅ Passed — All changes are directly scoped to fixing the Prometheus resolution limit issue in base.py, with no extraneous modifications or unrelated code changes.



@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/prometheus/metrics/base.py`:
- Around line 127-162: The _calculate_safe_step function currently uses the raw
step.total_seconds() while the actual query later rounds step to whole seconds,
which can still exceed MAX_PROMETHEUS_POINTS for fractional steps; modify
_calculate_safe_step to normalize the incoming step to the integer-second
resolution used for queries (e.g., round/truncate step.total_seconds() to an int
second value before computing expected_points and min_step_seconds) and return a
timedelta based on that normalized integer seconds, and ensure the same
normalized integer second value is used when constructing step_str so the safety
calculation and actual query step are identical (refer to _calculate_safe_step
and the code that builds step_str).

Comment on lines +127 to +162
```python
def _calculate_safe_step(self, period: datetime.timedelta, step: datetime.timedelta) -> datetime.timedelta:
    """
    Calculate a step size that won't exceed Prometheus's maximum resolution limit.

    If the number of data points (period / step) would exceed MAX_PROMETHEUS_POINTS,
    this function returns an increased step size that keeps the point count under the limit.

    Args:
        period: The time period for the query.
        step: The originally requested step size.

    Returns:
        A step size that keeps the number of data points under MAX_PROMETHEUS_POINTS.
    """
    period_seconds = period.total_seconds()
    step_seconds = step.total_seconds()

    # Calculate expected number of points
    expected_points = period_seconds / step_seconds

    if expected_points <= MAX_PROMETHEUS_POINTS:
        return step

    # Calculate the minimum step size needed to stay under the limit
    min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS

    # Round up to the nearest second to ensure we're under the limit
    adjusted_step_seconds = int(min_step_seconds) + 1

    logger.debug(
        f"Adjusting step from {step_seconds}s to {adjusted_step_seconds}s to avoid exceeding "
        f"Prometheus max resolution ({expected_points:.0f} points -> {period_seconds / adjusted_step_seconds:.0f} points)"
    )

    return datetime.timedelta(seconds=adjusted_step_seconds)
```


⚠️ Potential issue | 🟠 Major

Align safe-step math with the actual query step to avoid still exceeding the limit.

_calculate_safe_step uses the raw step.total_seconds(), but Line 218 rounds the step to whole seconds for the actual query. For fractional steps (e.g., 1.49s), rounding down can inflate the real point count and still hit Prometheus’s 11k cap. Normalize to the same integer-second resolution inside _calculate_safe_step and use that value when building step_str.

🔧 Proposed fix
```diff
@@
-import logging
+import logging
+import math
@@
-        step_seconds = step.total_seconds()
+        # Normalize to the same integer-second resolution used in the query
+        step_seconds = max(1, round(step.total_seconds()))
@@
-        if expected_points <= MAX_PROMETHEUS_POINTS:
-            return step
+        if expected_points <= MAX_PROMETHEUS_POINTS:
+            return datetime.timedelta(seconds=step_seconds)
@@
-        min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS
-
-        # Round up to the nearest second to ensure we're under the limit
-        adjusted_step_seconds = int(min_step_seconds) + 1
+        min_step_seconds = math.ceil(period_seconds / MAX_PROMETHEUS_POINTS)
+
+        # Round up to the nearest second to ensure we're under the limit
+        adjusted_step_seconds = max(step_seconds, min_step_seconds)
@@
-        step_str = f"{round(step.total_seconds())}s"
+        step_str = f"{int(step.total_seconds())}s"
```

Also applies to: 214-218
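The hazard the reviewer describes can be reproduced with illustrative numbers (not taken from the PR): a fractional step passes the safety check, but the whole-second step actually sent to Prometheus yields far more points:

```python
# Illustrative numbers: fractional step vs. the rounded step the query uses.
period_s = 14_000
requested_step = 1.49                       # fractional step in seconds

checked_points = period_s / requested_step  # ~9,396 -> passes the 10,000 check
query_step = round(requested_step)          # query rounds to whole seconds -> 1
actual_points = period_s / query_step       # 14,000 -> exceeds the 11,000 cap

print(round(checked_points), actual_points)
```

Normalizing to integer seconds before the check, as the proposed fix does, makes the safety calculation and the real query agree.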




Development

Successfully merging this pull request may close these issues.

Failed to gather resource history exceeded maximum resolution of 11,000 points per timeseries

2 participants