fix: Auto-adjust step size to prevent exceeding Prometheus max resolution #503
Conversation
When querying Prometheus with long history durations and small step sizes, the number of data points can exceed Prometheus's maximum resolution of 11,000 points per timeseries. This fix automatically increases the step size when the calculated number of points would exceed 10,000 (using 10,000 as a safety margin below the 11,000 hard limit).

Fixes #490
Walkthrough

Added a safety mechanism to Prometheus query step size calculation. Introduces a helper method that adjusts the query step size to ensure the resulting data point count stays under 10,000 points, preventing Prometheus errors from oversized queries.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/prometheus/metrics/base.py`:
- Around line 127-162: The _calculate_safe_step function currently uses the raw
step.total_seconds() while the actual query later rounds step to whole seconds,
which can still exceed MAX_PROMETHEUS_POINTS for fractional steps; modify
_calculate_safe_step to normalize the incoming step to the integer-second
resolution used for queries (e.g., round/truncate step.total_seconds() to an int
second value before computing expected_points and min_step_seconds) and return a
timedelta based on that normalized integer seconds, and ensure the same
normalized integer second value is used when constructing step_str so the safety
calculation and actual query step are identical (refer to _calculate_safe_step
and the code that builds step_str).
```python
def _calculate_safe_step(self, period: datetime.timedelta, step: datetime.timedelta) -> datetime.timedelta:
    """
    Calculate a step size that won't exceed Prometheus's maximum resolution limit.

    If the number of data points (period / step) would exceed MAX_PROMETHEUS_POINTS,
    this function returns an increased step size that keeps the point count under the limit.

    Args:
        period: The time period for the query.
        step: The originally requested step size.

    Returns:
        A step size that keeps the number of data points under MAX_PROMETHEUS_POINTS.
    """
    period_seconds = period.total_seconds()
    step_seconds = step.total_seconds()

    # Calculate expected number of points
    expected_points = period_seconds / step_seconds

    if expected_points <= MAX_PROMETHEUS_POINTS:
        return step

    # Calculate the minimum step size needed to stay under the limit
    min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS

    # Round up to the nearest second to ensure we're under the limit
    adjusted_step_seconds = int(min_step_seconds) + 1

    logger.debug(
        f"Adjusting step from {step_seconds}s to {adjusted_step_seconds}s to avoid exceeding "
        f"Prometheus max resolution ({expected_points:.0f} points -> {period_seconds / adjusted_step_seconds:.0f} points)"
    )

    return datetime.timedelta(seconds=adjusted_step_seconds)
```
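For reference, a standalone sketch of the same logic: the helper below is a module-level rewrite of the method above (dropping `self`), with `MAX_PROMETHEUS_POINTS` assumed to be 10,000 as stated in this PR, and the demo values taken from the PR's default settings.

```python
import datetime

MAX_PROMETHEUS_POINTS = 10_000  # safety margin below Prometheus's 11,000 hard limit


def calculate_safe_step(period: datetime.timedelta, step: datetime.timedelta) -> datetime.timedelta:
    """Standalone version of the _calculate_safe_step logic shown in the diff above."""
    period_seconds = period.total_seconds()
    step_seconds = step.total_seconds()
    expected_points = period_seconds / step_seconds
    if expected_points <= MAX_PROMETHEUS_POINTS:
        return step
    # Smallest whole-second step that keeps period / step under the limit
    adjusted_step_seconds = int(period_seconds / MAX_PROMETHEUS_POINTS) + 1
    return datetime.timedelta(seconds=adjusted_step_seconds)


# Default settings from the PR: 336h history with a 75s step -> 16,128 points, over the limit
adjusted = calculate_safe_step(datetime.timedelta(hours=336), datetime.timedelta(seconds=75))
print(adjusted.total_seconds())  # 121.0 -> 336*3600/121 ≈ 9,998 points, under 10,000
```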
Align safe-step math with the actual query step to avoid still exceeding the limit.
_calculate_safe_step uses the raw step.total_seconds(), but Line 218 rounds the step to whole seconds for the actual query. For fractional steps (e.g., 1.49s), rounding down can inflate the real point count and still hit Prometheus’s 11k cap. Normalize to the same integer-second resolution inside _calculate_safe_step and use that value when building step_str.
🔧 Proposed fix

```diff
@@
-import logging
+import logging
+import math
@@
-    step_seconds = step.total_seconds()
+    # Normalize to the same integer-second resolution used in the query
+    step_seconds = max(1, round(step.total_seconds()))
@@
-    if expected_points <= MAX_PROMETHEUS_POINTS:
-        return step
+    if expected_points <= MAX_PROMETHEUS_POINTS:
+        return datetime.timedelta(seconds=step_seconds)
@@
-    min_step_seconds = period_seconds / MAX_PROMETHEUS_POINTS
-
-    # Round up to the nearest second to ensure we're under the limit
-    adjusted_step_seconds = int(min_step_seconds) + 1
+    min_step_seconds = math.ceil(period_seconds / MAX_PROMETHEUS_POINTS)
+
+    # Round up to the nearest second to ensure we're under the limit
+    adjusted_step_seconds = max(step_seconds, min_step_seconds)
@@
-    step_str = f"{round(step.total_seconds())}s"
+    step_str = f"{int(step.total_seconds())}s"
```

Also applies to: 214-218
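To illustrate the fractional-step issue the reviewer describes, here is a small standalone sketch (the 1.49s step and 4-hour period are hypothetical values chosen for illustration, not from the PR):

```python
import datetime

MAX_PROMETHEUS_POINTS = 10_000

period = datetime.timedelta(hours=4)
step = datetime.timedelta(seconds=1.49)  # hypothetical fractional step

# The safety check on the raw float value passes...
raw_points = period.total_seconds() / step.total_seconds()
print(raw_points)  # ~9664, under the 10,000 safety limit, so the step is returned unchanged

# ...but building the query string rounds the step down to whole seconds ("1s"),
# so Prometheus actually evaluates far more points than the check assumed:
query_step_seconds = round(step.total_seconds())  # round(1.49) == 1
actual_points = period.total_seconds() / query_step_seconds
print(actual_points)  # 14400.0, well over the 11,000 hard limit
```

Normalizing to integer seconds before the safety check, as the proposed fix does, makes the checked value and the queried value identical.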
Summary
Fixes #490
When using the `simple-limit` strategy with long history durations and small step sizes, the number of data points can exceed Prometheus's maximum resolution of 11,000 points per timeseries, causing queries to fail with an "exceeded maximum resolution" error.

Root Cause

The default settings are:

- `history_duration`: 336 hours (2 weeks)
- `timeframe_duration`: 1.25 minutes (75 seconds)

This results in `336 * 60 * 60 / 75 = 16,128` data points, which exceeds the 11,000 limit.

The `simple` strategy works because it uses `PercentileCPULoader` with `QueryType.Query` (a simple instant query), while `simple-limit` uses `CPULoader` with `QueryType.QueryRange` (a range query that returns all data points).

Solution
This PR adds automatic step adjustment in the `load_data` method of `PrometheusMetric`:

- For `QueryRange` queries, calculate the expected number of data points
- If the count would exceed `MAX_PROMETHEUS_POINTS` (10,000, used as a safety margin below the 11,000 hard limit), automatically increase the step size

Changes

- Added a `MAX_PROMETHEUS_POINTS` constant (10,000)
- Added a `_calculate_safe_step()` method to compute the adjusted step when needed
- Updated `load_data()` to call `_calculate_safe_step()` for range queries

Testing
This fix can be tested by running:
Before this fix: Fails with "exceeded maximum resolution" error
After this fix: Automatically adjusts step size and completes successfully
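The before/after point counts can be checked with a quick sketch (default settings taken from the Root Cause section; the step arithmetic mirrors `_calculate_safe_step`):

```python
MAX_PROMETHEUS_POINTS = 10_000           # safety margin below the 11,000 hard limit

history_seconds = 336 * 60 * 60          # history_duration: 336 hours
original_step = 75                       # timeframe_duration: 75 seconds

before_points = history_seconds / original_step
print(before_points)  # 16128.0 -> "exceeded maximum resolution" before the fix

# Step the fix would choose: smallest whole second keeping the count under the limit
adjusted_step = int(history_seconds / MAX_PROMETHEUS_POINTS) + 1
after_points = history_seconds / adjusted_step
print(adjusted_step, after_points)  # 121, ~9998 -> under the limit after the fix
```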
Considerations

- Applies only to `QueryRange` type queries (regular `Query` types don't have this issue)
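A minimal sketch of that gating, using simplified stand-ins for `QueryType` and the step calculation (the real enum and method live in the robusta_krr codebase; this standalone version is an assumption for illustration):

```python
import datetime
import enum

MAX_PROMETHEUS_POINTS = 10_000


class QueryType(enum.Enum):
    # Simplified stand-in for the QueryType enum used in robusta_krr
    Query = "query"
    QueryRange = "query_range"


def safe_step_for(query_type: QueryType, period: datetime.timedelta, step: datetime.timedelta) -> datetime.timedelta:
    """Adjust the step only for range queries; instant queries return a single sample."""
    if query_type is not QueryType.QueryRange:
        return step  # instant queries never hit the resolution limit
    expected_points = period.total_seconds() / step.total_seconds()
    if expected_points <= MAX_PROMETHEUS_POINTS:
        return step
    return datetime.timedelta(seconds=int(period.total_seconds() / MAX_PROMETHEUS_POINTS) + 1)


period = datetime.timedelta(hours=336)
step = datetime.timedelta(seconds=75)
print(safe_step_for(QueryType.Query, period, step))       # unchanged: 0:01:15
print(safe_step_for(QueryType.QueryRange, period, step))  # adjusted: 0:02:01
```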