Changelog - AI Strategy & GCP/Anthos Support #501

base: main

Conversation
Walkthrough

Adds an AI-Assisted resource recommendation strategy (multi-provider support: OpenAI, Gemini, Anthropic, Ollama), extensive AI provider integrations, prompt/statistics utilities, GCP Managed Prometheus and Anthos metric loaders/services, CLI/config flags, docs, examples, and comprehensive tests.

Changes
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as CLI/Config
participant Runner as KRR Runner
participant Strategy as AiAssistedStrategy
participant PromService as Prometheus Service
participant AIProvider as AI Provider
participant Metrics as Metric Loaders
User->>CLI: Run with AI strategy options
CLI->>Runner: Create Config & invoke run
Runner->>Strategy: Instantiate AiAssistedStrategy
Strategy->>Strategy: _detect_provider() (config/env)
Strategy->>AIProvider: Initialize chosen provider
Runner->>PromService: gather metrics for workload
PromService->>Metrics: Query CPU/Memory loaders
Metrics-->>PromService: Return time series
PromService-->>Strategy: Provide MetricsPodData
Strategy->>Strategy: extract_comprehensive_stats()
Strategy->>Strategy: format_messages()
Strategy->>AIProvider: analyze_metrics(messages, temp, max_tokens)
AIProvider->>AIProvider: HTTP POST -> provider API
AIProvider-->>Strategy: recommendations + reasoning
Strategy->>Strategy: _sanity_check() & clamp values
Strategy-->>Runner: Return ResourceRecommendation
Runner-->>User: Output recommendations
sequenceDiagram
actor User
participant Loader as PrometheusLoader
participant ServiceSelector as Loader.selector
participant PromService as PrometheusMetricsService
participant GCPService as GcpManagedPrometheusMetricsService
participant AnthosService as AnthosMetricsService
User->>Loader: provide --prometheus-url (monitoring.googleapis.com)
Loader->>ServiceSelector: inspect URL & gcp_anthos flag
alt gcp_anthos = true
ServiceSelector->>AnthosService: select AnthosMetricsService
else gcp_anthos = false
ServiceSelector->>GCPService: select GcpManagedPrometheusMetricsService
end
ServiceSelector->>PromService: (default) select PrometheusMetricsService if not GCP
Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/ai/gemini_provider.py`:
- Around line 10-21: Update _get_endpoint to stop embedding the API key in the
URL (remove the ?key={self.api_key} fragment) and modify _get_headers to include
the header "x-goog-api-key": self.api_key in addition to "Content-Type":
"application/json"; specifically change the methods _get_endpoint and
_get_headers so endpoints use f".../{self.model}:generateContent" and headers
return {"Content-Type": "application/json", "x-goog-api-key": self.api_key} to
move authentication into a header.
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py`:
- Around line 156-187: The current query in variable query multiplies memory
limit by restart_count to infer OOMs, which can inflate values; instead, check
for and prefer direct OOM-related metrics from GCP Managed Prometheus (e.g.,
container_oom_events_total, container_memory_failcnt,
container_start_time_seconds) and update the metric selection logic in memory.py
to use those when present (fall back to the existing limit-based approach only
if none are available). Concretely: detect availability of each metric via the
Prometheus API or by attempting a short query, then modify the query
construction around query / the code that uses object, pods_selector,
cluster_label, duration, step to use the direct OOM metrics; if none exist,
simplify the fallback (e.g., treat any restart_count>0 as a boolean signal or
return the raw memory limit instead of multiplying by restart_count) and update
the class/function docstring to explicitly document this limitation and the
chosen fallback.
In `@robusta_krr/core/models/config.py`:
- Around line 154-157: The current debug call logs raw other_args which may
contain secrets; change the logging in the method that creates strategy settings
(the logger.debug before calling StrategySettingsType and the subsequent debug)
to avoid printing raw values from other_args—either log only other_args.keys()
or construct a masked copy that replaces values for sensitive keys (eg.
"api_key", "token", "password", "secret") before passing to logger.debug; keep
creation via StrategySettingsType(**self.other_args) and continue returning
StrategyType(settings) but ensure only the safe representation (keys or masked
dict) is logged, not the original other_args.
In `@robusta_krr/strategies/ai_assisted.py`:
- Around line 302-323: The HPA skip logic in the block that checks
object_data.hpa and self.settings.allow_hpa is incorrect because it only returns
when both cpu_rec and memory_rec are set, allowing AI recommendations to
continue when only one HPA target exists; change the logic to immediately skip
AI by returning a map for both ResourceType.CPU and ResourceType.Memory using
ResourceRecommendation.undefined(info="HPA detected") (or None only if your
callers expect None for absent recs) whenever object_data.hpa is not None and
not self.settings.allow_hpa so that object_data.hpa, self.settings.allow_hpa,
ResourceRecommendation, and ResourceType are used to build and return the skip
result.
In `@robusta_krr/strategies/ai_prompts.py.backup`:
- Around line 17-97: The file contains a backup version where
extract_comprehensive_stats builds prompt_parts and returns a string instead of
returning the statistics dict expected by AiAssistedStrategy; remove or fix this
artifact by restoring extract_comprehensive_stats to return the stats dict (use
the later-constructed stats object) and move the prompt assembly (prompt_parts,
joins, and human-readable sections) into a separate function like
build_prompt_from_stats so AiAssistedStrategy can import
extract_comprehensive_stats (which should return dict) and call
build_prompt_from_stats(stats) when a string prompt is needed; alternatively
delete the .py.backup file if it is unintended to ship.
- Around line 100-133: The code in format_messages uses
settings.ai_include_simple_reference which doesn't exist on
AiAssistedStrategySettings; replace this with the inverse of the existing flag
by passing include_simple_ref=not settings.ai_exclude_simple_reference into
get_system_prompt (or compute a local include_simple_ref variable from
settings.ai_exclude_simple_reference and use that) so the system prompt receives
the correct boolean; update references in format_messages (and any related
callers) to use the corrected flag logic.
In `@test_gcp_quick.sh`:
- Line 74: The script echoes and passes CPU_PERCENTILE but its definition is
commented out; restore or set a default for CPU_PERCENTILE so it is never empty
when used by krr.py and in the echo. Re-enable the original CPU_PERCENTILE
export (the commented block that sets CPU_PERCENTILE) or add a fallback
assignment (e.g., set CPU_PERCENTILE to a sensible default using parameter
expansion) so references to CPU_PERCENTILE in the echo and the krr.py
--cpu-percentile flags receive a valid value.
In `@tests/test_gcp_loaders.py`:
- Around line 205-225: The test test_loader_mapping incorrectly expects
GcpManagedPrometheusMetricsService.LOADER_MAPPING["MaxOOMKilledMemoryLoader"] to
be None while the implementation maps it to GcpMaxOOMKilledMemoryLoader; update
the assertion in test_loader_mapping to assert that "MaxOOMKilledMemoryLoader"
maps to GcpMaxOOMKilledMemoryLoader (or otherwise reflect the intended contract)
by referencing GcpManagedPrometheusMetricsService.LOADER_MAPPING and the
GcpMaxOOMKilledMemoryLoader symbol.
🟡 Minor comments (16)
test_gcp_quick.sh-30-40 (1)
30-40: Color variables used before definition.
`${RED}` and `${NC}` are referenced on lines 31-32 but defined on lines 37-40. This will cause the error message to appear without color formatting.
🐛 Proposed fix: move color definitions earlier
+# Colors
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m'
+
 if [ -z "${PROJECT_ID:-}" ] || [ -z "${CLUSTER_NAME:-}" ]; then
     echo -e "${RED}Error: PROJECT_ID and CLUSTER_NAME must be defined in .env or via environment variables.${NC}"
     exit 1
 fi
-
-
-# Colors
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-NC='\033[0m'

robusta_krr/core/integrations/prometheus/metrics/gcp/README.md-97-100 (1)
97-100: Documentation inconsistency: `MaxOOMKilledMemoryLoader` is actually implemented.
The limitations section states that `MaxOOMKilledMemoryLoader` is not implemented, but `memory.py` in this same package does implement `GcpMaxOOMKilledMemoryLoader` using an inference-based approach (combining `memory/limit_bytes` and `restart_count`). Consider updating this section to reflect the actual implementation and its inference-based nature:
📝 Suggested documentation update
 ## Limitations
-- **MaxOOMKilledMemoryLoader**: Not implemented because it depends on `kube-state-metrics` which may not be available in GCP Managed Prometheus.
+- **MaxOOMKilledMemoryLoader**: Uses an inference-based approach combining `kubernetes.io/container/memory/limit_bytes` and `kubernetes.io/container/restart_count` since `kube-state-metrics` is not available in GCP Managed Prometheus. This may produce false positives if containers restart for reasons other than OOM.

test_gcp_quick.sh-140-147 (1)
140-147: Exit code capture is unreachable on failure due to `set -e`.
With `set -e` enabled (line 2), the script exits immediately if `krr.py` returns a non-zero exit code. Thus, `EXIT_CODE=$?` on line 140 will only ever capture `0`, making the failure branch (lines 146-147) unreachable.
🐛 Proposed fix: disable errexit for the krr.py command
if [ "${AI_MODE:-false}" = "true" ]; then echo -e "${YELLOW}AI Mode enabled: Using AI-assisted strategy with Gemini 3 Flash Preview model.${NC}" - $PYTHON_CMD krr.py ai-assisted \ + set +e + $PYTHON_CMD krr.py ai-assisted \ ... + EXIT_CODE=$? + set -e - else echo -e "${YELLOW}AI Mode disabled: Using standard KRR strategies.${NC}" - $PYTHON_CMD krr.py simple \ + set +e + $PYTHON_CMD krr.py simple \ ... + EXIT_CODE=$? + set -e fi - -EXIT_CODE=$?Alternatively, capture exit code inline:
$PYTHON_CMD krr.py ... || EXIT_CODE=$?

CHANGES_GCP.md-44-87 (1)
44-87: Add language identifiers to fenced code blocks
markdownlint reports MD040 at multiple fences. Adding a language (e.g., `text`, `bash`, `promql`) will fix it.
🛠️ Example tweak
-```
+```text
 ============================== 75 passed in 5.20s ==============================

Also applies to: 337-353, 558-567

CHANGES_GCP.md-610-667 (1)

610-667: Remove or rename duplicate headings (MD024)
Headings like “Usage Examples”, “Technical Highlights”, “Debugging”, and “Changelog” appear multiple times. Consolidate or rename to satisfy markdownlint.

robusta_krr/core/integrations/prometheus/metrics_service/gcp_metrics_service.py-43-52 (1)

43-52: Use correct parameter name to maintain interface consistency
The parameter should be named `api_client` (matching the base class and all other implementations), not `_api_client`. While the current codebase doesn't call this method with keyword arguments, keeping the signature consistent with `MetricsServiceDiscovery.find_metrics_url` is important for maintainability and prevents future issues.
🛠️ Proposed fix

```diff
- def find_metrics_url(self, *, _api_client: Optional[ApiClient] = None) -> Optional[str]:
+ def find_metrics_url(self, *, api_client: Optional[ApiClient] = None) -> Optional[str]:
+     _ = api_client  # unused
      """
      GCP Managed Prometheus is typically accessed via a known URL pattern:
```
302-311: Handle missing `gcloud` before token fetch
With `set -e`, a missing `gcloud` binary will cause the script to exit before your error handling executes. Add an explicit check so users get a clear message instead of an abrupt failure.
🛠️ Proposed fix
 # Get GCP access token
 echo ""
 echo -e "${YELLOW}Getting GCP access token...${NC}"
+if ! command -v gcloud >/dev/null 2>&1; then
+    echo -e "${RED}ERROR: gcloud CLI not found${NC}"
+    echo "Install the Google Cloud SDK and run: gcloud auth login"
+    return
+fi
 TOKEN=$(gcloud auth print-access-token 2>/dev/null)

robusta_krr/core/integrations/ai/README.md-7-19 (1)
7-19: Add language identifiers to fenced code blocks (markdownlint MD040).
These fences are missing language tags, which will fail markdownlint and reduce readability.
🔧 Example fix (apply similarly to other blocks)
-```
+```text
 robusta_krr/
 ├── core/integrations/ai/
 ...
-```
+```

Also applies to: 264-272, 275-283, 286-289, 297-305
docs/gcp-managed-prometheus-integration.md-146-155 (1)
146-155: Deduplicate loader mapping bullets.
`MemoryLoader` and `MaxMemoryLoader` are listed twice; this reads like a copy/paste error.
📝 Proposed fix
 - `MemoryLoader` → `GcpMemoryLoader`
 - `MaxMemoryLoader` → `GcpMaxMemoryLoader`
-- `MemoryLoader` → `GcpMemoryLoader`
-- `MaxMemoryLoader` → `GcpMaxMemoryLoader`
 - `MemoryAmountLoader` → `GcpMemoryAmountLoader`

robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/__init__.py-22-29 (1)
22-29: Sort `__all__` to satisfy Ruff RUF022.
🔧 Proposed fix

 __all__ = [
-    "AnthosCPULoader",
-    "AnthosPercentileCPULoader",
-    "AnthosCPUAmountLoader",
-    "AnthosMemoryLoader",
-    "AnthosMaxMemoryLoader",
-    "AnthosMemoryAmountLoader",
-    "AnthosMaxOOMKilledMemoryLoader",
+    "AnthosCPUAmountLoader",
+    "AnthosCPULoader",
+    "AnthosMaxMemoryLoader",
+    "AnthosMaxOOMKilledMemoryLoader",
+    "AnthosMemoryAmountLoader",
+    "AnthosMemoryLoader",
+    "AnthosPercentileCPULoader",
 ]

docs/ai-assisted-strategy.md-152-156 (1)
152-156: Add language identifiers to fenced blocks.
markdownlint MD040 flags these fences. Suggest `text` for output snippets.
✏️ Suggested edits
-```
+```text
 | Namespace | Name           | Container | CPU Request | CPU Limit | Memory Request | Memory Limit | Info                                                        |
 |-----------|----------------|-----------|-------------|-----------|----------------|--------------|-------------------------------------------------------------|
 | default   | nginx-deploy   | nginx     | 250m        | -         | 512Mi          | 512Mi        | AI: Based on p95 CPU at 0.18 cores with... (conf: 85%)      |
-```
+```

-```
+```text
 Error: No AI provider API key found. Set OPENAI_API_KEY, GEMINI_API_KEY, or ANTHROPIC_API_KEY
-```
+```

-```
+```text
 Error: Rate limit exceeded for API calls
-```
+```

-```
+```text
 AI: Insufficient data for reliable... (conf: 30%)
-```
+```

-```
+```text
 Error: Failed to connect to Ollama at http://localhost:11434
-```
+```

Also applies to: 219-221, 230-232, 241-243, 252-254
docs/ai-assisted-strategy.md-234-237 (1)
234-237: Hyphenate the compound modifier.
Use “higher‑tier API plan” (per LanguageTool hint).
✏️ Suggested edit
-- Switch to a higher tier API plan
+- Switch to a higher-tier API plan

robusta_krr/core/integrations/ai/openai_provider.py-21-47 (1)
21-47: Add model validation for JSON mode support to prevent 4xx errors.
`response_format={"type": "json_object"}` is only supported by certain OpenAI models (the gpt-3.5-turbo and gpt-4-* family). If a user selects an unsupported model, the request will fail with a 400 error. Consider either validating the model against a supported list before requesting, or adding a fallback that omits JSON mode for unsupported models.
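One possible shape for that fallback, sketched here with an assumed helper name and model-prefix list (neither is part of the current provider code):

```python
# Hypothetical sketch: only request JSON mode for models known to support it.
# The prefix list and helper name are assumptions for illustration, not the provider's real API.
JSON_MODE_PREFIXES = ("gpt-3.5-turbo", "gpt-4")

def build_request_body(model: str, messages: list, temperature: float, max_tokens: int) -> dict:
    body = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if model.startswith(JSON_MODE_PREFIXES):
        # Only ask for JSON mode when the model is known to support it,
        # so unsupported models don't get a 400 from the API.
        body["response_format"] = {"type": "json_object"}
    return body
```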
275-289: File count mismatch in documentation.
The heading states "Created (11 files)" but the list contains 12 items (numbered 1-12). Update the count to match.
📝 Suggested fix
-### Created (11 files):
+### Created (12 files):

robusta_krr/core/integrations/ai/anthropic_provider.py-44-56 (1)
44-56: Handle edge case when all messages are system messages.
If all messages in the input list have `role: "system"`, `conversation_messages` will be empty. The Anthropic API requires at least one non-system message, which would cause an API error.
🛡️ Suggested defensive check
 for msg in messages:
     if msg.get("role") == "system":
         system_message = msg["content"]
     else:
         conversation_messages.append(msg)
+    # Ensure at least one user message exists
+    if not conversation_messages:
+        conversation_messages = [{"role": "user", "content": "Analyze the provided data."}]
+
 body = {
     "model": self.model,

robusta_krr/strategies/ai_assisted.py-166-191 (1)
166-191: Avoid mismatching provider and API key when `ai_provider` is explicit.
If `ai_provider` is set but `ai_api_key` is not, the loop still picks the first available env var (e.g., `OPENAI_API_KEY`) and then overrides the provider to the explicit setting, potentially pairing the wrong key with the chosen provider. Consider restricting env lookup to the explicit provider or raising if its key is missing.
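A rough sketch of the restricted lookup, assuming a provider-to-env-var map; the function and variable names are illustrative rather than the strategy's actual code:

```python
import os
from typing import Optional, Tuple

# Assumed provider-to-env-var map for illustration; the strategy may organize this differently.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_provider_and_key(explicit_provider: Optional[str], explicit_key: Optional[str]) -> Tuple[str, str]:
    """Pair a provider with its own key; never borrow another provider's env var."""
    if explicit_provider:
        env_var = PROVIDER_ENV_VARS.get(explicit_provider.lower(), "")
        key = explicit_key or os.environ.get(env_var, "")
        if not key:
            raise ValueError(f"ai_provider={explicit_provider!r} is set but no matching API key was found")
        return explicit_provider, key
    # No explicit provider: fall back to the first configured env var.
    for provider, env_var in PROVIDER_ENV_VARS.items():
        if os.environ.get(env_var):
            return provider, os.environ[env_var]
    raise ValueError("No AI provider API key found")
```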
🧹 Nitpick comments (14)
robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py (1)
188-194: Inconsistent log level: use `logger.debug()` for consistency.
Other loaders in this file use `logger.debug()` for query logging, but this one uses `logger.info()`. Unless there's a specific reason to log OOM queries at a higher level, consider using `debug` for consistency.
♻️ Suggested fix
-logger.info(
+logger.debug(
     "GCP OOM detection query (inference-based using restart_count + memory limit) for %s/%s/%s:",
     object.namespace,
     object.name,
     object.container,
 )

robusta_krr/core/integrations/prometheus/metrics_service/gcp_metrics_service.py (1)
95-116: Prefer `logger.exception` to capture stack traces
Since this is inside an exception handler, `logger.exception(...)` will capture the traceback automatically and usually removes the need for manual “Cause” formatting.
♻️ Suggested tweak
-        except MetricsNotFound as e:
-            logger.error(
-                "Failed to connect to GCP Managed Prometheus at %s. Verify the URL, "
-                "authentication token, and that Managed Service for Prometheus is enabled."
-                " Cause: %s: %s",
-                self.url,
-                e.__class__.__name__,
-                e,
-            )
+        except MetricsNotFound as e:
+            logger.exception(
+                "Failed to connect to GCP Managed Prometheus at %s. Verify the URL, "
+                "authentication token, and that Managed Service for Prometheus is enabled.",
+                self.url,
+            )
             raise MetricsNotFound(
                 f"Couldn't connect to GCP Managed Prometheus at {self.url}. See logs for details."
             ) from e

robusta_krr/core/integrations/ai/ollama_provider.py (1)
20-25: Normalize `OLLAMA_HOST` to avoid double slashes.
If users set `OLLAMA_HOST` with a trailing `/`, the endpoint becomes `//api/generate`. It’s usually accepted, but easy to normalize.
♻️ Proposed tweak
-self.host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
+self.host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")

AI_STRATEGY_IMPLEMENTATION.md (1)
221-236: Add language specifier to fenced code block.
The code block at line 221 is missing a language specifier. This helps with syntax highlighting and markdown linting compliance.
📝 Suggested fix
-```
+```text
 ╭─ Strategy Settings ─────────────────────────────╮

robusta_krr/core/integrations/prometheus/metrics/gcp/__init__.py (1)
13-21: Consider sorting `__all__` for consistency.
Static analysis suggests sorting `__all__` alphabetically for easier maintenance.
♻️ Suggested fix
__all__ = [ + "GcpCPUAmountLoader", "GcpCPULoader", + "GcpMaxMemoryLoader", + "GcpMaxOOMKilledMemoryLoader", + "GcpMemoryAmountLoader", + "GcpMemoryLoader", "GcpPercentileCPULoader", - "GcpCPUAmountLoader", - "GcpMemoryLoader", - "GcpMaxMemoryLoader", - "GcpMemoryAmountLoader", - "GcpMaxOOMKilledMemoryLoader", ]tests/test_ai_strategy.py (3)
3-6: Remove unused imports.
`MagicMock` and `HPAData` are imported but never used in this test file.
♻️ Suggested fix
 import json
-from unittest.mock import Mock, patch, MagicMock
+from unittest.mock import Mock, patch
 import numpy as np
 import pytest
 from robusta_krr.core.abstract.strategies import MetricsPodData, ResourceType
-from robusta_krr.core.models.objects import K8sObjectData, PodData, HPAData
+from robusta_krr.core.models.objects import K8sObjectData, PodData
233-248: Remove unused variables in test_compact_mode.
`settings_full` and `settings_compact` are created but never used. The test calls `get_user_prompt` directly with the `compact` parameter instead of using the settings objects.
♻️ Suggested fix
 def test_compact_mode(self, sample_history_data, sample_object_data):
     """Test compact mode reduces prompt length."""
     stats = ai_prompts.extract_comprehensive_stats(
         sample_history_data, sample_object_data
     )
-    settings_full = AiAssistedStrategySettings(ai_compact_mode=False)
-    settings_compact = AiAssistedStrategySettings(ai_compact_mode=True)
-
     full_prompt = ai_prompts.get_user_prompt(stats, compact=False)
     compact_prompt = ai_prompts.get_user_prompt(stats, compact=True)
30-61: Test data uses random values, which may cause flaky tests.
The fixture uses `np.random.normal()` without a fixed seed, which could theoretically cause test flakiness if assertions depend on specific statistical properties. Consider setting a random seed for reproducibility.
♻️ Suggested fix
 @pytest.fixture
 def sample_history_data() -> MetricsPodData:
     """Create sample Prometheus metrics data."""
+    np.random.seed(42)  # Fixed seed for reproducibility
     # CPU data: 100 time points with values around 0.2 cores
     cpu_timestamps = np.linspace(0, 3600, 100)
     cpu_values = np.random.normal(0.2, 0.05, 100)

robusta_krr/core/integrations/ai/base.py (1)
158-171: Improve exception logging and chaining.
Several exception handlers could be improved:
- Use `logger.exception()` instead of `logger.error()` to include stack traces
- Chain exceptions with `raise ... from e` for better debugging
- The timeout exception variable `e` is unused
 except requests.HTTPError as e:
-    logger.error(
+    logger.exception(
         f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
     )
     raise
-except requests.Timeout as e:
-    logger.error(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
+except requests.Timeout:
+    logger.exception(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
     raise
 except requests.RequestException as e:
-    logger.error(f"Request error calling {self.__class__.__name__}: {e}")
+    logger.exception(f"Request error calling {self.__class__.__name__}: {e}")
     raise
 except (KeyError, IndexError) as e:
-    logger.error(f"Failed to parse response from {self.__class__.__name__}: {e}")
-    raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}")
+    logger.exception(f"Failed to parse response from {self.__class__.__name__}: {e}")
+    raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}") from e

robusta_krr/strategies/ai_prompts.py (4)
91-106: Use `next(iter(...))` instead of `list(...)[0]` for efficiency.
Converting to a list just to get the first element is inefficient. Use `next(iter(...))` instead.
♻️ Suggested fix
 try:
     # Use timestamps and values from first pod for trend
-    first_pod_data = list(cpu_data.values())[0]
+    first_pod_data = next(iter(cpu_data.values()))
     if len(first_pod_data) > 1:
347-355: Remove extraneous f-string prefixes.
Lines 350 and 351 have `f` prefixes on strings without placeholders.
♻️ Suggested fix
prompt_parts = [ f"## Workload: {workload['kind']} {workload['namespace']}/{workload['name']}", f"Container: {workload['container']}", - f"", - f"## Pod Information", + "", + "## Pod Information", f"- Current pods: {pods['current_count']}",
214-236: Unused `provider` parameter in `get_system_prompt`.
The `provider` parameter is defined but never used in the function body. If it's intended for future provider-specific prompts, consider documenting this or removing it.
If this is intentional for future extensibility, add a comment or use it for provider-specific customization. Otherwise, remove the unused parameter.
491-523: Unused `object_data` parameter in `format_messages`.
The `object_data` parameter is passed but never used. The function signature accepts it, suggesting it may be intended for future enhancements (e.g., provider-specific formatting based on workload type).
Consider either:
- Using `object_data` for provider-specific customizations
- Adding a `# noqa: ARG001` comment to suppress the warning if kept for API consistency
- Removing it if not needed
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py (1)
123-183: Verify restart_count multiplication doesn’t inflate OOM “max memory.”
`restart_count` is cumulative; multiplying by `memory/limit_bytes` scales the value by the number of restarts, while downstream treats it as a memory amount. Consider gating with `clamp_max(..., 1)` or a boolean `> 0` check (or `increase` over the window) to avoid exaggerated values.
def _get_endpoint(self) -> str:
    """Get Gemini API endpoint with API key in URL."""
    return (
        f"https://generativelanguage.googleapis.com/v1beta/models/"
        f"{self.model}:generateContent?key={self.api_key}"
    )

def _get_headers(self) -> dict:
    """Get headers for Gemini API."""
    return {
        "Content-Type": "application/json"
    }
🧩 Analysis chain
🌐 Web query:
Gemini API authentication methods x-goog-api-key header vs key query parameter
💡 Result:
Short answer: use the x-goog-api-key header. Gemini (Generative Language) documentation shows the API key in the X-goog-api-key header and Google Cloud docs say the header is preferred; the key= query parameter is supported for compatibility but is less secure (exposes the key in URLs, logs, browser history). Use header or a server-side proxy, and restrict/regenerate keys if leaked. (ai.google.dev)
Move Gemini API key from URL query parameter to header.
The current implementation embeds the API key in the URL query parameter, which exposes it to logs, proxies, and browser history. Gemini API supports header‑based authentication via the x-goog-api-key header, which is the preferred approach.
Proposed fix
def _get_endpoint(self) -> str:
- """Get Gemini API endpoint with API key in URL."""
+ """Get Gemini API endpoint."""
return (
f"https://generativelanguage.googleapis.com/v1beta/models/"
- f"{self.model}:generateContent?key={self.api_key}"
+ f"{self.model}:generateContent"
)
def _get_headers(self) -> dict:
"""Get headers for Gemini API."""
return {
- "Content-Type": "application/json"
+ "Content-Type": "application/json",
+ "x-goog-api-key": self.api_key,
}🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/gemini_provider.py` around lines 10 - 21,
Update _get_endpoint to stop embedding the API key in the URL (remove the
?key={self.api_key} fragment) and modify _get_headers to include the header
"x-goog-api-key": self.api_key in addition to "Content-Type":
"application/json"; specifically change the methods _get_endpoint and
_get_headers so endpoints use f".../{self.model}:generateContent" and headers
return {"Content-Type": "application/json", "x-goog-api-key": self.api_key} to
move authentication into a header.
| query = f""" | ||
| label_replace( | ||
| label_replace( | ||
| max_over_time( | ||
| max( | ||
| max( | ||
| {{"__name__"="kubernetes.io/container/memory/limit_bytes", | ||
| "monitored_resource"="k8s_container", | ||
| "namespace_name"="{object.namespace}", | ||
| "pod_name"=~"{pods_selector}", | ||
| "container_name"="{object.container}"{cluster_label} | ||
| }} | ||
| ) by (pod_name, container_name, job) | ||
| * on(pod_name, container_name, job) group_left() | ||
| max( | ||
| {{"__name__"="kubernetes.io/container/restart_count", | ||
| "monitored_resource"="k8s_container", | ||
| "namespace_name"="{object.namespace}", | ||
| "pod_name"=~"{pods_selector}", | ||
| "container_name"="{object.container}"{cluster_label} | ||
| }} | ||
| ) by (pod_name, container_name, job) | ||
| ) by (container_name, pod_name, job) | ||
| [{duration}:{step}] | ||
| ), | ||
| "pod", "$1", "pod_name", "(.+)" | ||
| ), | ||
| "container", "$1", "container_name", "(.+)" | ||
| ) | ||
| """ |
🧩 Analysis chain
🏁 Script executed:
cd robusta_krr && head -n 200 core/integrations/prometheus/metrics/gcp/memory.py | tail -n 100

Repository: robusta-dev/krr
Length of output: 4401
🏁 Script executed:
wc -l robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py

Repository: robusta-dev/krr
Length of output: 125
🏁 Script executed:
cat -n robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py | sed -n '130,200p'

Repository: robusta-dev/krr
Length of output: 3496
🏁 Script executed:
cat -n robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py | sed -n '1,50p'

Repository: robusta-dev/krr
Length of output: 2258
🌐 Web query:
GCP Managed Prometheus OOM detection metrics kubernetes
💡 Result:
Short answer — use Managed Service for Prometheus to query the usual Kubernetes / cAdvisor / kube-state-metrics signals (PromQL) and combine them for reliable OOM detection. Example queries and caveats below.
Useful metrics
- kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} (kube-state-metrics) — shows last termination reason. [1][2]
- kube_pod_container_status_restarts_total (kube-state-metrics) — restart counter to correlate with termination reason. [2][4]
- container_oom_events_total or container_memory_oom_events_total (cAdvisor/cgroup) — monotonic OOM counters from the container runtime/cAdvisor. [3][4]
- node_vmstat_oom_kill (node-level) — kernel/node OOM kill events (useful when container metrics are missing). [4]
Example PromQL (run in Cloud Monitoring / Managed Prometheus)
- Alert when a pod has been OOMKilled multiple times recently:
  increase(kube_pod_container_status_restarts_total{container!=""}[15m])
  * on(namespace,pod) group_left(reason)
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0. [4]
- Count OOM events per container over 1h (if container_oom_events_total is available):
  increase(container_oom_events_total[1h]). [3][4]
- Node-level OOM spikes:
  increase(node_vmstat_oom_kill[5m]) > 0. [4]
Caveats / recommendations
- kube_pod_container_status_last_terminated_reason is a gauge (shows last reason) and can miss events between scrapes; combine it with restart counters (or use monotonic counters) for reliable alerts. There are known gaps/edge-cases in kube-state-metrics. [2][5]
- cAdvisor/container counters (container_oom_events_total) are most reliable for counts but may not be present in every collection setup — validate your collector (managed/self-deployed) scrapes cAdvisor metrics. [3]
- Managed Service for Prometheus supports PromQL in Cloud Monitoring / Grafana — use it to run these queries and create alerts. [1][2]
Sources
- Managed Service for Prometheus / PromQL in Cloud Monitoring (overview + PromQL support). [Google Cloud Managed Prometheus / PromQL docs]. [1][2]
- GKE troubleshooting OOM events (how Kubernetes reports OOMKilled). [GKE Troubleshoot OOM events]. [4]
- cAdvisor/container OOM counters and usage in alerts. [cadvisor / forum & examples]. [3][4]
- Community examples and recommended query patterns (kube-state-metrics, container_oom_events_total, node_vmstat_oom_kill). [StackOverflow / blog posts / GH issue]. [4][5]
If you want, I can: (a) produce ready-to-add Cloud Monitoring alerting rules from one of the example queries, or (b) check which of those metrics your Managed Prometheus setup is actually ingesting and give a tailored query. Which would you like?
🌐 Web query:
Prometheus OOM detection without kube-state-metrics memory limit restart count
💡 Result:
Short answer: you can detect OOMs without kube‑state‑metrics by (a) parsing kernel OOM messages on the node (oom‑exporter / node logs), (b) using cAdvisor/cadvisored metrics (memory failcnt / RSS / start time) and (c) watching container start_time/restart from the runtime/cAdvisor metrics. Below are practical options + example PromQL.
Options
- Deploy an oom‑kill exporter (parses kernel logs like /dev/kmsg) and expose a metric such as klog_pod_oomkill; alert on changes. [1]
- Use node/cAdvisor metrics: watch container_memory_failcnt, container_memory_rss (or container_memory_working_set_bytes vs RSS) and detect sudden increases or failcnt > 0. Use these to infer hitting memory limits or OOM activity. [2][4]
- Detect container restarts directly from cAdvisor/runtime metrics by observing container_start_time_seconds (or other per‑container start timestamp) increasing — a restart implies the container was killed (could be OOM). You can alert on increases to start time or on increase(container_start_time_seconds[5m]) > 0. (This method does not require kube‑state‑metrics). [2][3]
- Collect node kernel/journal logs (journald/fluentd) and export OOM messages to Prometheus (custom exporter or existing projects). [1][2]
Example PromQL (adapt to your metric names/labels)
- Alert on kernel‑detected OOMs (oom exporter):
  sum by(namespace,pod) (changes(klog_pod_oomkill[30m])) > 0. [1]
- Alert on memory fail hits from cAdvisor:
  increase(container_memory_failcnt{job="kubelet"}[10m]) > 0. [4]
- Alert on container restarts using start time (detect restart):
  increase(container_start_time_seconds{job="kubelet"}[10m]) > 0. [3]
- Combine indicators (more robust):
(increase(container_start_time_seconds[10m]) > 0) and (rate(container_memory_rss_bytes[5m]) > some_threshold)
Notes / tradeoffs
- container_memory_failcnt indicates hits against memory limits, not necessarily kernel OOM kills; RSS is closer to actual resident memory that triggers OOM. [4]
- Kernel OOM events are most definitive but require node‑level log access or an exporter; kubelet/cAdvisor pipeline surfaces some OOM info but may miss kernel messages unless cadvisor/kubelet capture them. [2][3]
- If you can deploy kube‑state‑metrics later, kube_pod_container_status_restarts_total and last_terminated_reason (OOMKilled) give simpler pod‑level alerts.
References
- kubernetes-oomkill-exporter (parses kernel log, exposes klog_pod_oomkill). [1]
- LINE Engineering writeup on how OOM events are produced/watched (cadvisor, kernel logs, oomWatcher). [2]
- Kubernetes metrics reference (kubelet / cadvisor metrics like start times). [3]
- Discussion of container_memory_failcnt vs OOM behavior (StackOverflow). [4]
If you want, tell me which metric names you already scrape (cadvisor / node_exporter / journald) and I’ll give exact PromQL alerts for your setup.
The inference-based OOM detection approach is already documented as limited—investigate if better GCP metrics are available before accepting this trade-off.
The code's own docstring (lines 136–146) already acknowledges that this inference-based approach "may produce false positives if containers restart for reasons other than OOM while memory usage is high." However, the concern about multiplying limit_bytes × restart_count remains valid: this produces an inflated value proportional to restart count regardless of actual memory pressure, which could mislead the recommendation algorithm.
Before accepting this limitation, verify whether GCP Managed Prometheus exposes better metrics for OOM detection such as:
- container_oom_events_total (cAdvisor — direct OOM event counter)
- container_memory_failcnt (cAdvisor — memory limit hits)
- container_start_time_seconds (runtime — detect restarts from timestamp changes)
If any of these are available in your GCP Prometheus setup, they would provide more reliable OOM signals than the current inference approach. If GCP truly lacks all alternatives, document this constraint explicitly in the class docstring and consider a simpler fallback (e.g., returning the memory limit when restarts exist) rather than multiplication.
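If the direct metrics turn out not to be ingested, here is a rough sketch of the simpler fallback described above: gate the memory limit on a 0/1 restart signal (via `clamp_max`) instead of multiplying by the cumulative restart count. The function name and parameters below are illustrative, not the loader's actual interface:

```python
def build_oom_fallback_query(namespace: str, pods_selector: str, container: str,
                             cluster_label: str, duration: str, step: str) -> str:
    # clamp_max(restart_count, 1) turns the cumulative counter into a boolean-like 0/1 signal,
    # so the result is "memory limit if any restart was observed" rather than limit * restarts.
    return f"""
        max_over_time(
            (
                max(
                    {{"__name__"="kubernetes.io/container/memory/limit_bytes",
                      "namespace_name"="{namespace}",
                      "pod_name"=~"{pods_selector}",
                      "container_name"="{container}"{cluster_label}}}
                ) by (pod_name, container_name)
                * on(pod_name, container_name) group_left()
                clamp_max(
                    max(
                        {{"__name__"="kubernetes.io/container/restart_count",
                          "namespace_name"="{namespace}",
                          "pod_name"=~"{pods_selector}",
                          "container_name"="{container}"{cluster_label}}}
                    ) by (pod_name, container_name),
                    1
                )
            )[{duration}:{step}]
        )
    """
```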
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py` around lines
156 - 187, The current query in variable query multiplies memory limit by
restart_count to infer OOMs, which can inflate values; instead, check for and
prefer direct OOM-related metrics from GCP Managed Prometheus (e.g.,
container_oom_events_total, container_memory_failcnt,
container_start_time_seconds) and update the metric selection logic in memory.py
to use those when present (fall back to the existing limit-based approach only
if none are available). Concretely: detect availability of each metric via the
Prometheus API or by attempting a short query, then modify the query
construction around query / the code that uses object, pods_selector,
cluster_label, duration, step to use the direct OOM metrics; if none exist,
simplify the fallback (e.g., treat any restart_count>0 as a boolean signal or
return the raw memory limit instead of multiplying by restart_count) and update
the class/function docstring to explicitly document this limitation and the
chosen fallback.
| logger.debug(f"Creating strategy '{self.strategy}' with other_args: {self.other_args}") | ||
| settings = StrategySettingsType(**self.other_args) | ||
| logger.debug(f"Strategy settings created with use_oomkill_data={getattr(settings, 'use_oomkill_data', 'NOT_FOUND')}") | ||
| return StrategyType(settings) # type: ignore |
Avoid logging raw strategy args (may contain secrets)
other_args can include API keys or tokens. Logging the dict can leak secrets to logs. Log only keys or mask sensitive fields.
🛠️ Proposed fix
- logger.debug(f"Creating strategy '{self.strategy}' with other_args: {self.other_args}")
+ logger.debug(
+ "Creating strategy '%s' with other_args keys: %s",
+ self.strategy,
+ list(self.other_args.keys()),
+ )🤖 Prompt for AI Agents
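If masking (rather than key-only logging) is preferred, a small helper along these lines could be used; the key-fragment list is an assumption, not part of the existing config module:

```python
SENSITIVE_KEY_FRAGMENTS = ("api_key", "token", "password", "secret")  # assumed list of sensitive key names

def _mask_sensitive(args: dict) -> dict:
    """Return a copy of args with likely-sensitive values replaced, for safe logging."""
    return {
        key: "***" if any(fragment in key.lower() for fragment in SENSITIVE_KEY_FRAGMENTS) else value
        for key, value in args.items()
    }

# Example usage (sketch):
# logger.debug("Creating strategy '%s' with other_args: %s", self.strategy, _mask_sensitive(self.other_args))
```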
In `@robusta_krr/core/models/config.py` around lines 154 - 157, The current debug
call logs raw other_args which may contain secrets; change the logging in the
method that creates strategy settings (the logger.debug before calling
StrategySettingsType and the subsequent debug) to avoid printing raw values from
other_args—either log only other_args.keys() or construct a masked copy that
replaces values for sensitive keys (eg. "api_key", "token", "password",
"secret") before passing to logger.debug; keep creation via
StrategySettingsType(**self.other_args) and continue returning
StrategyType(settings) but ensure only the safe representation (keys or masked
dict) is logged, not the original other_args.
# Check HPA if not allowed
if object_data.hpa is not None and not self.settings.allow_hpa:
    logger.info(
        f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
        f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
    )
    if object_data.hpa.target_cpu_utilization_percentage is not None:
        cpu_rec = ResourceRecommendation.undefined(info="HPA detected")
    else:
        cpu_rec = None

    if object_data.hpa.target_memory_utilization_percentage is not None:
        memory_rec = ResourceRecommendation.undefined(info="HPA detected")
    else:
        memory_rec = None

    if cpu_rec and memory_rec:
        return {
            ResourceType.CPU: cpu_rec,
            ResourceType.Memory: memory_rec,
        }
HPA skip logic doesn’t actually prevent AI recommendations.
When HPA targets only one resource, execution continues and AI recommendations are still produced for both resources, even though allow_hpa is false and the log says “skipping.” This violates the intended behavior of allow_hpa.
🛠️ Suggested fix (skip AI when HPA is present and allow_hpa is false)
- if object_data.hpa is not None and not self.settings.allow_hpa:
- logger.info(
- f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
- f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
- )
- if object_data.hpa.target_cpu_utilization_percentage is not None:
- cpu_rec = ResourceRecommendation.undefined(info="HPA detected")
- else:
- cpu_rec = None
-
- if object_data.hpa.target_memory_utilization_percentage is not None:
- memory_rec = ResourceRecommendation.undefined(info="HPA detected")
- else:
- memory_rec = None
-
- if cpu_rec and memory_rec:
- return {
- ResourceType.CPU: cpu_rec,
- ResourceType.Memory: memory_rec,
- }
+ if object_data.hpa is not None and not self.settings.allow_hpa:
+ logger.info(
+ f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
+ f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
+ )
+ return {
+ ResourceType.CPU: ResourceRecommendation.undefined(info="HPA detected"),
+ ResourceType.Memory: ResourceRecommendation.undefined(info="HPA detected"),
+ }🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_assisted.py` around lines 302 - 323, The HPA skip
logic in the block that checks object_data.hpa and self.settings.allow_hpa is
incorrect because it only returns when both cpu_rec and memory_rec are set,
allowing AI recommendations to continue when only one HPA target exists; change
the logic to immediately skip AI by returning a map for both ResourceType.CPU
and ResourceType.Memory using ResourceRecommendation.undefined(info="HPA
detected") (or None only if your callers expect None for absent recs) whenever
object_data.hpa is not None and not self.settings.allow_hpa so that
object_data.hpa, self.settings.allow_hpa, ResourceRecommendation, and
ResourceType are used to build and return the skip result.
def extract_comprehensive_stats(
    history_data: MetricsPodData,
    object_data: K8sObjectData
) -> dict:
    """Extract comprehensive statistics from Prometheus metrics data.

    This function analyzes the historical data and extracts:
    - CPU statistics (percentiles, mean, std, trend, spikes)
    - Memory statistics (max, mean, std, OOMKills)
    - Pod information (count, names, health)
    - Workload context (HPA, allocations, labels)
    - Temporal context (duration, data points)

    Args:
        history_data: Dictionary of metric loaders -> pod data
        object_data: Kubernetes object metadata
    ])

    if memory.get('oomkill_detected'):
        oomkill_value = memory.get('oomkill_max_value', 0)
        prompt_parts.append(
            f"- OOM Kill max memory: {oomkill_value:.0f} bytes ({oomkill_value / 1024**2:.1f} Mi)"
        )

    # Per-pod stats (first 3 pods only in full mode)
    per_pod = memory.get("per_pod", {})
    if per_pod:
        prompt_parts.append("\nPer-pod Memory (sample):")
        for pod_name, pod_stats in list(per_pod.items())[:3]:
            prompt_parts.append(
                f"- {pod_name}: max={pod_stats['max']:.0f} bytes "
                f"({pod_stats['max'] / 1024**2:.1f} Mi)"
            )

    # Current Allocations
    if allocations:
        prompt_parts.append("\n## Current Resource Allocations")
        cpu_alloc = allocations.get("cpu", {})
        mem_alloc = allocations.get("memory", {})

        cpu_req = cpu_alloc.get("request")
        cpu_lim = cpu_alloc.get("limit")
        mem_req = mem_alloc.get("request")
        mem_lim = mem_alloc.get("limit")

        prompt_parts.extend([
            f"CPU Request: {cpu_req if cpu_req else 'unset'}",
            f"CPU Limit: {cpu_lim if cpu_lim else 'unset'}",
            f"Memory Request: {mem_req if mem_req else 'unset'}",
            f"Memory Limit: {mem_lim if mem_lim else 'unset'}",
        ])

    # HPA Information
    if hpa:
        prompt_parts.append("\n## Horizontal Pod Autoscaler (HPA) Detected")
        prompt_parts.extend([
            f"- Min replicas: {hpa['min_replicas']}",
            f"- Max replicas: {hpa['max_replicas']}",
            f"- Current replicas: {hpa['current_replicas']}",
        ])
        if hpa['target_cpu_utilization']:
            prompt_parts.append(f"- Target CPU utilization: {hpa['target_cpu_utilization']}%")
        if hpa['target_memory_utilization']:
            prompt_parts.append(f"- Target memory utilization: {hpa['target_memory_utilization']}%")
        prompt_parts.append(
            "NOTE: With HPA, be conservative with limits to allow scaling to work properly."
        )

    # Warnings
    warnings = stats.get("warnings", [])
    if warnings:
        prompt_parts.append("\n## Warnings")
        for warning in warnings:
            prompt_parts.append(f"- {warning}")

    prompt_parts.append("\n## Your Task")
    prompt_parts.append(
        "Based on the above statistics, provide your resource recommendations in JSON format."
    )

    return "\n".join(prompt_parts)
extract_comprehensive_stats appears interleaved with prompt-building and returns a string.
The function currently builds prompt_parts and returns a concatenated string, while a separate stats dict is constructed later. If this file is imported, AiAssistedStrategy will break when it expects a dict from extract_comprehensive_stats. This also reads like a merge/copy artifact, and the .py.backup filename suggests it may be unintended. Please clean this up or remove the backup file if it shouldn’t ship.
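As a sketch of the suggested split (names follow the review's suggestion; the stats fields shown are abbreviated placeholders, not the full set the module computes):

```python
def extract_comprehensive_stats(history_data, object_data) -> dict:
    """Return the statistics dict that AiAssistedStrategy consumes (fields abbreviated here)."""
    stats: dict = {
        "workload": {},   # kind / namespace / name / container
        "cpu": {},        # percentiles, mean, trend, spikes
        "memory": {},     # max, mean, OOM info
        "warnings": [],
    }
    # ... populate from history_data / object_data ...
    return stats


def build_prompt_from_stats(stats: dict) -> str:
    """Assemble the human-readable prompt from an already-computed stats dict."""
    prompt_parts = [f"## Workload: {stats.get('workload')}"]
    # ... append CPU / memory / allocation / HPA sections here ...
    prompt_parts.append("Based on the above statistics, provide your resource recommendations in JSON format.")
    return "\n".join(prompt_parts)
```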
🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_prompts.py.backup` around lines 17 - 97, The file
contains a backup version where extract_comprehensive_stats builds prompt_parts
and returns a string instead of returning the statistics dict expected by
AiAssistedStrategy; remove or fix this artifact by restoring
extract_comprehensive_stats to return the stats dict (use the later-constructed
stats object) and move the prompt assembly (prompt_parts, joins, and
human-readable sections) into a separate function like build_prompt_from_stats
so AiAssistedStrategy can import extract_comprehensive_stats (which should
return dict) and call build_prompt_from_stats(stats) when a string prompt is
needed; alternatively delete the .py.backup file if it is unintended to ship.
def format_messages(
    provider: str,
    stats: dict,
    object_data: K8sObjectData,
    settings: "AiAssistedStrategySettings"
) -> Union[list, str]:
    """Format messages for the specific AI provider.

    Args:
        provider: AI provider name (openai, gemini, anthropic, ollama)
        stats: Statistics dictionary
        object_data: Kubernetes object data
        settings: Strategy settings

    Returns:
        Messages in provider-specific format (list of dicts or string)
    """
    system_prompt = get_system_prompt(
        provider,
        include_simple_ref=settings.ai_include_simple_reference
    )
    user_prompt = get_user_prompt(stats, compact=settings.ai_compact_mode)

    # OpenAI and Anthropic use message list format
    if provider.lower() in ["openai", "anthropic"]:
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

    # Gemini and Ollama use concatenated string format
    else:
        return f"{system_prompt}\n\n{user_prompt}"
Fix ai_include_simple_reference attribute mismatch.
AiAssistedStrategySettings defines ai_exclude_simple_reference, so accessing settings.ai_include_simple_reference will raise at runtime. Use the inverse flag.
🛠️ Proposed fix
- system_prompt = get_system_prompt(
- provider,
- include_simple_ref=settings.ai_include_simple_reference
- )
+ system_prompt = get_system_prompt(
+ provider,
+ include_simple_ref=not settings.ai_exclude_simple_reference
+ )🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_prompts.py.backup` around lines 100 - 133, The code
in format_messages uses settings.ai_include_simple_reference which doesn't exist
on AiAssistedStrategySettings; replace this with the inverse of the existing
flag by passing include_simple_ref=not settings.ai_exclude_simple_reference into
get_system_prompt (or compute a local include_simple_ref variable from
settings.ai_exclude_simple_reference and use that) so the system prompt receives
the correct boolean; update references in format_messages (and any related
callers) to use the corrected flag logic.
else
    echo "Mode: GKE Cloud"
fi
echo "CPU Percentile: ${CPU_PERCENTILE}"
CPU_PERCENTILE is referenced but not defined.
Line 74 echoes ${CPU_PERCENTILE}, and lines 117/133 use it in the krr.py command, but the variable definition is commented out (lines 25-28). This will result in an empty value being passed to --cpu-percentile.
🐛 Proposed fix: uncomment or provide default
-# CPU_PERCENTILE="${CPU_PERCENTILE:-95}"
-# if [ -n "${4:-}" ]; then
-# CPU_PERCENTILE="${4}"
-# fi
+CPU_PERCENTILE="${CPU_PERCENTILE:-95}"
+if [ -n "${4:-}" ]; then
+ CPU_PERCENTILE="${4}"
+fi🤖 Prompt for AI Agents
In `@test_gcp_quick.sh` at line 74, The script echoes and passes CPU_PERCENTILE
but its definition is commented out; restore or set a default for CPU_PERCENTILE
so it is never empty when used by krr.py and in the echo. Re-enable the original
CPU_PERCENTILE export (the commented block that sets CPU_PERCENTILE) or add a
fallback assignment (e.g., set CPU_PERCENTILE to a sensible default using
parameter expansion) so references to CPU_PERCENTILE in the echo and the krr.py
--cpu-percentile flags receive a valid value.
def test_loader_mapping(self):
    """Test that all expected loaders are mapped."""
    from robusta_krr.core.integrations.prometheus.metrics_service.gcp_metrics_service import (
        GcpManagedPrometheusMetricsService
    )

    mapping = GcpManagedPrometheusMetricsService.LOADER_MAPPING

    # Verify CPU loaders are mapped
    assert "CPULoader" in mapping
    assert "PercentileCPULoader" in mapping
    assert "CPUAmountLoader" in mapping

    # Verify Memory loaders are mapped
    assert "MemoryLoader" in mapping
    assert "MaxMemoryLoader" in mapping
    assert "MemoryAmountLoader" in mapping

    # Verify unsupported loader is marked as None
    assert "MaxOOMKilledMemoryLoader" in mapping
    assert mapping["MaxOOMKilledMemoryLoader"] is None
Mismatch with implementation: MaxOOMKilledMemoryLoader is not None.
GcpManagedPrometheusMetricsService.LOADER_MAPPING maps MaxOOMKilledMemoryLoader to GcpMaxOOMKilledMemoryLoader, so this test currently asserts the opposite and will fail (or encode the wrong contract).
✅ Proposed fix
from robusta_krr.core.integrations.prometheus.metrics.gcp.memory import (
GcpMemoryLoader,
GcpMaxMemoryLoader,
GcpMemoryAmountLoader,
+ GcpMaxOOMKilledMemoryLoader,
)
...
- assert "MaxOOMKilledMemoryLoader" in mapping
- assert mapping["MaxOOMKilledMemoryLoader"] is None
+ assert "MaxOOMKilledMemoryLoader" in mapping
+ assert mapping["MaxOOMKilledMemoryLoader"] is GcpMaxOOMKilledMemoryLoader🤖 Prompt for AI Agents
In `@tests/test_gcp_loaders.py` around lines 205 - 225, The test
test_loader_mapping incorrectly expects
GcpManagedPrometheusMetricsService.LOADER_MAPPING["MaxOOMKilledMemoryLoader"] to
be None while the implementation maps it to GcpMaxOOMKilledMemoryLoader; update
the assertion in test_loader_mapping to assert that "MaxOOMKilledMemoryLoader"
maps to GcpMaxOOMKilledMemoryLoader (or otherwise reflect the intended contract)
by referencing GcpManagedPrometheusMetricsService.LOADER_MAPPING and the
GcpMaxOOMKilledMemoryLoader symbol.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/ai/base.py`:
- Around line 169-172: The HTTP error handler is currently logging the full
upstream response body which may leak sensitive data; update the except
requests.HTTPError block in robusta_krr/core/integrations/ai/base.py to first
guard that e.response exists, then log the status code and a redacted or
truncated version of the body (e.g., show only the first N characters and append
"[TRUNCATED]" or replace body with "[REDACTED]"), and avoid including
e.response.text verbatim; keep the log message contextual (include
self.__class__.__name__) and ensure no raw prompt/output content is printed.
- Around line 86-90: The retry decorator currently retries on all
requests.RequestException (including requests.HTTPError raised by
response.raise_for_status on line 131), which causes retries for non-transient
4xx errors; replace the retry condition to only retry transient errors by: keep
explicit exception types like requests.Timeout and requests.ConnectionError, and
replace retry_if_exception_type((requests.RequestException, ...)) with a custom
predicate (retry_if_exception) that returns True for Timeout/ConnectionError and
for HTTPError only when exc.response.status_code is 5xx or 429; update the
decorator arguments (stop_after_attempt, wait_exponential remain) and ensure the
predicate references requests.HTTPError and inspects exc.response.status_code to
avoid retrying 4xx.
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py`:
- Around line 86-89: The docstring for class AnthosMemoryAmountLoader
incorrectly states it returns a "count of containers" while the implementation
uses Prometheus' count_over_time which counts samples/data points; update the
class docstring (AnthosMemoryAmountLoader) to clarify that it measures the
number of metric samples (data points) in the time window produced by
count_over_time rather than counting distinct containers or pods, and mention
how that maps to container activity if relevant (e.g., more samples indicate
more active containers or higher scrape frequency).
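For example, a docstring along these lines would capture the distinction (the base class is omitted and the exact wording is a placeholder, not the text to ship verbatim):

```python
class AnthosMemoryAmountLoader:  # base class omitted in this sketch
    """Loader counting memory metric samples for a container on Anthos.

    Note: the underlying query uses count_over_time, so the value is the number
    of data points (samples) observed in the time window, not a count of distinct
    containers or pods. A higher value generally reflects a longer-running or more
    frequently scraped container rather than more replicas.
    """
```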
🧹 Nitpick comments (2)
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py (1)
176-182: Consider using `logger.debug` for consistency with other loaders.
Other loaders in this file use `logger.debug` for query logging. Using `logger.info` here will produce output at default log levels, which may be noisy in production.
📝 Suggested change
-logger.info(
+logger.debug(
     "Anthos OOM detection query (inference-based using restart_count + memory limit) for %s/%s/%s",
     object.namespace,
     object.name,
     object.container,
 )

robusta_krr/core/integrations/ai/base.py (1)
174-182: Preserve exception context; drop unused variable.
Use `logger.exception` and chain the raised `ValueError`; remove the unused `e` in the timeout block.
♻️ Proposed refactor
-except requests.Timeout as e:
-    logger.error(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
+except requests.Timeout:
+    logger.exception(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
     raise
 ...
-except (KeyError, IndexError) as e:
-    logger.error(f"Failed to parse response from {self.__class__.__name__}: {e}")
-    raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}")
+except (KeyError, IndexError) as e:
+    logger.exception(f"Failed to parse response from {self.__class__.__name__}: {e}")
+    raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}") from e
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((requests.RequestException, requests.Timeout))
)
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and read the target file
find . -name "base.py" -path "*/ai/*" | head -20
```

Repository: robusta-dev/krr
Length of output: 101

🏁 Script executed:

```shell
# Get the specific lines from the file
if [ -f "robusta_krr/core/integrations/ai/base.py" ]; then
  sed -n '80,95p' robusta_krr/core/integrations/ai/base.py
fi
```

Repository: robusta-dev/krr
Length of output: 503

🏁 Script executed:

```shell
# Check imports and broader context
if [ -f "robusta_krr/core/integrations/ai/base.py" ]; then
  head -30 robusta_krr/core/integrations/ai/base.py
fi
```

Repository: robusta-dev/krr
Length of output: 923

🏁 Script executed:

```shell
# Let me check the requests library exception hierarchy
python3 << 'EOF'
import requests
# Print the exception hierarchy
print("requests.RequestException MRO:")
print(requests.RequestException.__mro__)
print("\nrequests.HTTPError MRO:")
print(requests.HTTPError.__mro__)
print("\nIs HTTPError a RequestException?")
print(issubclass(requests.HTTPError, requests.RequestException))
print("\nIs Timeout a RequestException?")
print(issubclass(requests.Timeout, requests.RequestException))
EOF
```

Repository: robusta-dev/krr
Length of output: 495

🏁 Script executed:

```shell
# Let me also check if the analyze_metrics method actually gets HTTPErrors
# by looking at how requests.post or similar methods are called in this file
grep -n "requests\." robusta_krr/core/integrations/ai/base.py | head -20
```

Repository: robusta-dev/krr
Length of output: 468

🏁 Script executed:

```shell
# Check if there's a raise_for_status() call or similar that would convert responses to HTTPError
sed -n '86,150p' robusta_krr/core/integrations/ai/base.py
```

Repository: robusta-dev/krr
Length of output: 2692
Avoid retrying non-transient 4xx responses.
The retry policy catches requests.RequestException, which includes requests.HTTPError. Since response.raise_for_status() on line 131 raises HTTPError for both 4xx and 5xx responses, the current retry decorator will unnecessarily retry client errors (401/403/404). These are non-recoverable and retrying wastes resources and can worsen rate limits. Restrict retries to transient errors only: timeouts, connection errors, 5xx responses, and 429 (rate limit).
🔧 Proposed fix (narrow retry policy)
```diff
-import requests
-from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
+import requests
+from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception
+
+def _is_retryable_exception(exc: Exception) -> bool:
+    if isinstance(exc, requests.HTTPError) and exc.response is not None:
+        return exc.response.status_code >= 500 or exc.response.status_code == 429
+    return isinstance(exc, (requests.Timeout, requests.ConnectionError))
 ...
-    @retry(
-        stop=stop_after_attempt(3),
-        wait=wait_exponential(multiplier=1, min=2, max=10),
-        retry=retry_if_exception_type((requests.RequestException, requests.Timeout))
-    )
+    @retry(
+        stop=stop_after_attempt(3),
+        wait=wait_exponential(multiplier=1, min=2, max=10),
+        retry=retry_if_exception(_is_retryable_exception),
+    )
```

🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/base.py` around lines 86 - 90, The retry
decorator currently retries on all requests.RequestException (including
requests.HTTPError raised by response.raise_for_status on line 131), which
causes retries for non-transient 4xx errors; replace the retry condition to only
retry transient errors by: keep explicit exception types like requests.Timeout
and requests.ConnectionError, and replace
retry_if_exception_type((requests.RequestException, ...)) with a custom
predicate (retry_if_exception) that returns True for Timeout/ConnectionError and
for HTTPError only when exc.response.status_code is 5xx or 429; update the
decorator arguments (stop_after_attempt, wait_exponential remain) and ensure the
predicate references requests.HTTPError and inspects exc.response.status_code to
avoid retrying 4xx.
```python
        except requests.HTTPError as e:
            logger.error(
                f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
            )
```
Avoid logging full upstream response bodies.
HTTP error logs can leak sensitive prompt/output data. Truncate or redact the body and guard missing responses.
🔒 Proposed fix (truncate/redact body)
```diff
-        except requests.HTTPError as e:
-            logger.error(
-                f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
-            )
+        except requests.HTTPError as e:
+            status = getattr(e.response, "status_code", "unknown")
+            body_preview = (e.response.text or "")[:500] if e.response is not None else ""
+            logger.error(
+                f"HTTP error from {self.__class__.__name__}: {status} - {body_preview}"
+            )
             raise
```

🧰 Tools
🪛 Ruff (0.14.13)
170-172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/base.py` around lines 169 - 172, The HTTP
error handler is currently logging the full upstream response body which may
leak sensitive data; update the except requests.HTTPError block in
robusta_krr/core/integrations/ai/base.py to first guard that e.response exists,
then log the status code and a redacted or truncated version of the body (e.g.,
show only the first N characters and append "[TRUNCATED]" or replace body with
"[REDACTED]"), and avoid including e.response.text verbatim; keep the log
message contextual (include self.__class__.__name__) and ensure no raw
prompt/output content is printed.
```python
class AnthosMemoryAmountLoader(PrometheusMetric):
    """
    Loads memory amount (count of containers) for Anthos.
    """
```
Clarify the docstring - count_over_time counts samples, not containers.
The docstring states "count of containers" but count_over_time counts the number of data points/samples in the time range, not the number of containers. Consider clarifying the purpose:
📝 Suggested docstring fix
```diff
 class AnthosMemoryAmountLoader(PrometheusMetric):
     """
-    Loads memory amount (count of containers) for Anthos.
+    Loads memory sample count over time for Anthos containers.
+    Used to determine data availability and sample density.
     """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
class AnthosMemoryAmountLoader(PrometheusMetric):
    """
    Loads memory sample count over time for Anthos containers.
    Used to determine data availability and sample density.
    """
```
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py` around
lines 86 - 89, The docstring for class AnthosMemoryAmountLoader incorrectly
states it returns a "count of containers" while the implementation uses
Prometheus' count_over_time which counts samples/data points; update the class
docstring (AnthosMemoryAmountLoader) to clarify that it measures the number of
metric samples (data points) in the time window produced by count_over_time
rather than counting distinct containers or pods, and mention how that maps to
container activity if relevant (e.g., more samples indicate more active
containers or higher scrape frequency).
Changelog - AI Strategy & GCP/Anthos Support
Changes from base commit: `9ce49cd6539a1808c3dd60a3e3d3fa2c99c667ab`

📋 Summary
This fork introduces two main features:
🤖 AI-Assisted Strategy
New Files
Core AI Integration (`robusta_krr/core/integrations/ai/`)

- `base.py` - `BaseAIProvider` interface with abstract methods for all providers (sketched below)
- `openai_provider.py` - OpenAI provider (GPT-3.5/4/4o)
- `gemini_provider.py` - Google Gemini provider (1.5 Flash/Pro, 2.0)
- `anthropic_provider.py` - Anthropic Claude provider (3/3.5 Sonnet/Opus)
- `ollama_provider.py` - Ollama provider for local models
- `__init__.py` - Factory pattern for auto-detection from env vars
- `README.md` - AI providers architecture documentation
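For orientation, here is a minimal sketch of what a `BaseAIProvider`-style interface and env-var factory could look like. Only the class name, the `analyze_metrics` entry point, and the idea of env-var auto-detection come from this changelog; the exact signatures, the `EchoProvider` stand-in, and the `AI_API_KEY`/`AI_MODEL` variable names are illustrative assumptions.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseAIProvider(ABC):
    """Common interface every AI provider implements (illustrative sketch, not the PR's exact API)."""

    def __init__(self, api_key: str, model: str) -> None:
        self.api_key = api_key
        self.model = model

    @abstractmethod
    def analyze_metrics(self, messages: List[Dict[str, str]], temperature: float, max_tokens: int) -> str:
        """Send prompt messages to the provider API and return the completion text."""


class EchoProvider(BaseAIProvider):
    """Stand-in provider used only to show how the interface is consumed."""

    def analyze_metrics(self, messages, temperature, max_tokens) -> str:
        # A real provider would POST to its HTTP API here (OpenAI, Gemini, Anthropic, Ollama).
        return "cpu_request=100m, memory_request=256Mi"


def provider_from_env() -> BaseAIProvider:
    # Hypothetical auto-detection: the real factory picks a provider based on which env vars are set.
    api_key = os.getenv("AI_API_KEY", "dummy")
    return EchoProvider(api_key=api_key, model=os.getenv("AI_MODEL", "echo"))
```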
Strategy Implementation (`robusta_krr/strategies/`)

- `ai_assisted.py` (440 lines) - `AIAssistedStrategy` implementation
- `ai_prompts.py` (490 lines) - Prompt engineering system:
  - `extract_comprehensive_stats()` - Extracts 20+ metrics from `history_data` (see the sketch after this list)
  - `format_messages()` - Generates provider-specific prompts
  - `get_user_prompt()` - Prompt template with HPA, OOM, trends context
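The statistics step can be pictured with a small, self-contained example. The real `extract_comprehensive_stats()` in this PR covers 20+ metrics (trends, OOM and HPA context, and more); the helper below only shows the flavour of the aggregation that ends up in the prompt, and its field names are illustrative.

```python
# Minimal stand-in for the kind of aggregation that feeds the AI prompt.
import numpy as np


def summarize_usage(samples: list) -> dict:
    data = np.asarray(samples, dtype=float)
    return {
        "min": float(data.min()),
        "mean": float(data.mean()),
        "p50": float(np.percentile(data, 50)),
        "p95": float(np.percentile(data, 95)),
        "p99": float(np.percentile(data, 99)),
        "max": float(data.max()),
        "samples": int(data.size),
    }


if __name__ == "__main__":
    cpu_cores = [0.05, 0.07, 0.06, 0.21, 0.08, 0.05, 0.35, 0.06]
    # A summary like this, rather than the raw series, is what the prompt would embed.
    print(summarize_usage(cpu_cores))
```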
Modified Files

- `robusta_krr/strategies/simple.py`: `allow_hpa` flag for HPA override
- `robusta_krr/strategies/__init__.py`: `AIAssistedStrategy`, `"ai-assisted"` added to `StrategyType`
- `robusta_krr/main.py`: `--strategy`, `--ai-provider`, `--ai-model`, `--ai-api-key`, `--ai-temperature`, `--ai-max-tokens`, `--ai-compact-mode`, and the `--allow-hpa` flag for both strategies
- `robusta_krr/core/models/config.py`: `AISettings` class with all AI parameters; `StrategySettings` with `allow_hpa: bool` (a sketch follows this list)
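Assuming pydantic models like the rest of the KRR configuration, the new settings could look roughly as follows. Field names mirror the CLI flags listed above; the types, defaults, and comments are assumptions, not the PR's exact definitions.

```python
from typing import Optional

from pydantic import BaseModel


class AISettings(BaseModel):
    ai_provider: Optional[str] = None   # "openai" | "gemini" | "anthropic" | "ollama"
    ai_model: Optional[str] = None      # provider-specific model name
    ai_api_key: Optional[str] = None    # when unset, env-var auto-detection is used
    ai_temperature: float = 0.1         # lower values keep the output more deterministic
    ai_max_tokens: int = 1024
    ai_compact_mode: bool = False       # presumably trims the prompt to reduce token usage


class StrategySettings(BaseModel):
    allow_hpa: bool = False             # permit recommendations for HPA-configured workloads
```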
Documentation

- `docs/ai-assisted-strategy.md` - Complete guide
- `AI_STRATEGY_IMPLEMENTATION.md` - Technical documentation
- `ai_strategy_examples.sh` - Script with 15+ practical examples
Tests

- `tests/test_ai_strategy.py` - Unit tests for AI providers and strategy logic

☁️ GCP/Anthos Support
New Files
Metrics Service (`robusta_krr/core/integrations/prometheus/metrics_service/`)

- `gcp_metrics_service.py` - GCP Managed Prometheus service (a query sketch follows this list):
  - Endpoint: `https://monitoring.googleapis.com/v1/projects/{project}/location/global/prometheus`
  - Metrics: `kubernetes.io/container/cpu/core_usage_time`, `kubernetes.io/container/memory/used_bytes`
- `anthos_metrics_service.py` - Anthos on-prem cluster service (extends `GCPMetricsService`)
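As a point of reference, a query against the GCP Managed Prometheus endpoint above can be issued like this. The `/api/v1/query` path is the standard Prometheus HTTP API; the bearer-token authentication and the helper itself are illustrative assumptions rather than the exact code of `gcp_metrics_service.py`.

```python
import requests

PROJECT = "my-project"  # hypothetical project id
BASE_URL = f"https://monitoring.googleapis.com/v1/projects/{PROJECT}/location/global/prometheus"


def query_gmp(promql: str, token: str) -> dict:
    # Standard Prometheus HTTP API exposed by Google Managed Prometheus; auth is a plain OAuth bearer token.
    resp = requests.get(
        f"{BASE_URL}/api/v1/query",
        params={"query": promql},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Example: query_gmp("up", token="...").
# GCP metric names such as kubernetes.io/container/cpu/core_usage_time reach PromQL through
# Managed Prometheus' own name mapping, which the new loaders are responsible for.
```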
GCP/Anthos Loaders (`robusta_krr/core/integrations/prometheus/metrics/gcp/`)
cpu.py:GCPCPULoader- rate + quantile with GCP namingmemory.py:GCPMaxMemoryLoader,GCPMemoryAmountLoader- max/avg memoryAnthos-Specific Loaders (
gcp/anthos/):cpu.py:AnthosCPULoader- CPU with Anthos-specific queriesAnthosCPUAmountLoader- Total CPU usage per podmemory.py:AnthosMaxMemoryLoader- Max memory with GCP namingAnthosMemoryAmountLoader- Total memory per podAnthos OOM Detection:
increase(kube_pod_container_status_restarts_total{...}[2h])kube_pod_container_status_last_terminated_reasonunavailableModified Files
robusta_krr/core/integrations/prometheus/loader.py:--anthosmode detectionrobusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py:robusta_krr/core/integrations/prometheus/metrics/cpu.py:Documentation
docs/gcp-managed-prometheus-integration.md- GCP guide:CHANGES_GCP.md- GCP/Anthos-specific changelogrobusta_krr/core/integrations/prometheus/metrics/gcp/README.md- Loaders architectureTest Scripts
test_gcp_quick.sh- Anthos integration script:datev-svc-prdnamespace onprd-user-cluster-01clusterTests
tests/test_gcp_loaders.py- Standard GCP loaders teststests/test_anthos_loaders.py- Anthos-specific loaders tests📦 Dependencies
Added to
requirements.txt/pyproject.toml🔧 Configuration
AI Environment Variables
GCP Environment Variables
📊 Usage Examples
AI-Assisted with Gemini
GCP Anthos Cluster
python krr.py simple \ --prometheus-url "https://monitoring.googleapis.com/v1/projects/my-project/location/global/prometheus" \ --anthos \ --clusters connectgateway_my-project_global_my-cluster \ --namespace my-namespaceCombo AI + Anthos
./test_gcp_quick.sh # Uses AI-Assisted on Anthos cluster🎯 Performance Metrics
AI Strategy
GCP/Anthos
🔄 CI/CD Workflow
.github/workflows/pytest-on-push.yml📝 Implementation Notes
Architectural Choices
Compatibility
--allow-hpaHPA Behavior Notes
When HPA is detected without
--allow-hpaflag:--allow-hpaenabled for HPA-configured workloads