Skip to content

Conversation

@IonBoleac
Copy link

Changelog - AI Strategy & GCP/Anthos Support

Changes from base commit: 9ce49cd6539a1808c3dd60a3e3d3fa2c99c667ab

📋 Summary

This fork introduces two main features:

  1. AI-Assisted Strategy - AI-based strategy (OpenAI, Gemini, Claude, Ollama) for intelligent recommendations
  2. GCP/Anthos Support - Full support for GCP Managed Prometheus and Anthos on-prem clusters

🤖 AI-Assisted Strategy

New Files

Core AI Integration (robusta_krr/core/integrations/ai/)

  • base.py - Base AIProvider interface with abstract methods for all providers
  • openai_provider.py - OpenAI provider (GPT-3.5/4/4o)
  • gemini_provider.py - Google Gemini provider (1.5 Flash/Pro, 2.0)
  • anthropic_provider.py - Anthropic Claude provider (3/3.5 Sonnet/Opus)
  • ollama_provider.py - Ollama provider for local models
  • __init__.py - Factory pattern for auto-detection from env vars
  • README.md - AI providers architecture documentation

Strategy Implementation (robusta_krr/strategies/)

  • ai_assisted.py (440 lines) - AIAssistedStrategy implementation:

    • Partial HPA check (CPU/Memory separated)
    • Comprehensive statistics extraction (percentiles, trends, spikes)
    • Integration with all AI providers
    • Sanity check vs Simple strategy with warnings
    • OOM events handling and confidence scoring
  • ai_prompts.py (490 lines) - Prompt engineering system:

    • extract_comprehensive_stats() - Extracts 20+ metrics from history_data
    • format_messages() - Generates provider-specific prompts
    • get_user_prompt() - Prompt template with HPA, OOM, trends context
    • Compact mode support to reduce token usage (~60%)

Modified Files

  • robusta_krr/strategies/simple.py:

    • Added allow_hpa flag for HPA override
    • Initialization log: "HPA override enabled" when active
    • Maintained original logic: HPA as ON/OFF gate
  • robusta_krr/strategies/__init__.py:

    • Import AIAssistedStrategy
    • Added "ai-assisted" to StrategyType
  • robusta_krr/main.py:

    • New CLI args: --strategy, --ai-provider, --ai-model, --ai-api-key
    • AI parameters: --ai-temperature, --ai-max-tokens, --ai-compact-mode
    • --allow-hpa flag for both strategies
  • robusta_krr/core/models/config.py:

    • AISettings class with all AI parameters
    • Extended StrategySettings with allow_hpa: bool

Documentation

  • docs/ai-assisted-strategy.md - Complete guide:

    • Setup for each provider (API keys, models)
    • Usage examples with cost analysis
    • Best practices and troubleshooting
  • AI_STRATEGY_IMPLEMENTATION.md - Technical documentation:

    • Detailed architecture
    • Execution flow
    • Comparison with Simple strategy
  • ai_strategy_examples.sh - Script with 15+ practical examples

Tests

  • tests/test_ai_strategy.py - Unit tests for AI providers and strategy logic

☁️ GCP/Anthos Support

New Files

Metrics Service (robusta_krr/core/integrations/prometheus/metrics_service/)

  • gcp_metrics_service.py - GCP Managed Prometheus service:

    • API connection https://monitoring.googleapis.com/v1/projects/{project}/location/global/prometheus
    • Auto-detection of GCP credentials (ADC, Service Account)
    • GCP naming support: kubernetes.io/container/cpu/core_usage_time, kubernetes.io/container/memory/used_bytes
  • anthos_metrics_service.py - Anthos on-prem cluster service:

    • Inherits from GCPMetricsService
    • Connect Gateway handling for on-prem clusters
    • Fallback to Kubernetes API when cluster metrics unavailable

GCP/Anthos Loaders (robusta_krr/core/integrations/prometheus/metrics/gcp/)

Standard GCP Loaders:

  • cpu.py: GCPCPULoader - rate + quantile with GCP naming
  • memory.py: GCPMaxMemoryLoader, GCPMemoryAmountLoader - max/avg memory

Anthos-Specific Loaders (gcp/anthos/):

  • cpu.py:

    • AnthosCPULoader - CPU with Anthos-specific queries
    • AnthosCPUAmountLoader - Total CPU usage per pod
  • memory.py:

    • AnthosMaxMemoryLoader - Max memory with GCP naming
    • AnthosMemoryAmountLoader - Total memory per pod

Anthos OOM Detection:

  • Inference-based using restart_count + memory limit
  • Query: increase(kube_pod_container_status_restarts_total{...}[2h])
  • Fallback when kube_pod_container_status_last_terminated_reason unavailable

Modified Files

  • robusta_krr/core/integrations/prometheus/loader.py:

    • Added --anthos mode detection
    • Automatic switch between standard/GCP/Anthos loaders
    • Active mode log: "GCP Anthos mode enabled"
  • robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py:

    • GCP-specific error handling
    • Warnings when GCP metrics return empty
  • robusta_krr/core/integrations/prometheus/metrics/cpu.py:

    • Kept standard loader intact
    • Conditional import for GCP loaders

Documentation

  • docs/gcp-managed-prometheus-integration.md - GCP guide:

    • GCP credentials and permissions setup
    • Managed Prometheus configuration
    • Examples with project ID and location
  • CHANGES_GCP.md - GCP/Anthos-specific changelog

  • robusta_krr/core/integrations/prometheus/metrics/gcp/README.md - Loaders architecture

Test Scripts

  • test_gcp_quick.sh - Anthos integration script:
    • Tests datev-svc-prd namespace on prd-user-cluster-01 cluster
    • Tests both Simple and AI-Assisted strategies
    • Auto-fetch GCP token with gcloud

Tests

  • tests/test_gcp_loaders.py - Standard GCP loaders tests
  • tests/test_anthos_loaders.py - Anthos-specific loaders tests

📦 Dependencies

Added to requirements.txt / pyproject.toml

# AI Providers
openai = "^1.0.0"              # OpenAI GPT
google-generativeai = "^0.3.0" # Google Gemini
anthropic = "^0.8.0"           # Anthropic Claude

# GCP Support  
google-auth = "^2.0.0"         # GCP Authentication
google-cloud-monitoring = "^2.0.0" # GCP Monitoring API


🔧 Configuration

AI Environment Variables

# Auto-detection (priority order)
OPENAI_API_KEY=sk-...         # → openai
GEMINI_API_KEY=...            # → gemini  
ANTHROPIC_API_KEY=...         # → anthropic
OLLAMA_HOST=http://...        # → ollama

# Manual override
export KRR_AI_PROVIDER=gemini
export KRR_AI_MODEL=gemini-2.0-flash-exp

GCP Environment Variables

# GCP Authentication
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# Or use Application Default Credentials
gcloud auth application-default login

# For Anthos
gcloud container fleet memberships get-credentials CLUSTER_NAME \
  --location=global --project=PROJECT_ID

📊 Usage Examples

AI-Assisted with Gemini

python krr.py simple --strategy ai-assisted \
  --ai-provider gemini \
  --ai-model gemini-2.0-flash-exp \
  --ai-temperature 0.3 \
  --namespace production \
  --allow-hpa

GCP Anthos Cluster

python krr.py simple \
  --prometheus-url "https://monitoring.googleapis.com/v1/projects/my-project/location/global/prometheus" \
  --anthos \
  --clusters connectgateway_my-project_global_my-cluster \
  --namespace my-namespace

Combo AI + Anthos

./test_gcp_quick.sh  # Uses AI-Assisted on Anthos cluster

🎯 Performance Metrics

AI Strategy

  • Average token usage: 1500-2000 per workload (standard), 600-800 (compact mode)
  • Latency: ~2-5s per workload (depends on provider)
  • Confidence: 85-95% typical for stable workloads

GCP/Anthos

  • Default history: 230 hours (vs 168 hours standard)
  • Step: 2 minutes
  • OOM Detection: Inference-based when terminate reason unavailable

🔄 CI/CD Workflow

.github/workflows/pytest-on-push.yml

  • Added tests for AI providers (mocked)
  • GCP/Anthos loaders tests
  • API keys secrets management (optional)

📝 Implementation Notes

Architectural Choices

  1. AI Provider Factory: Auto-detection from env vars for seamless UX
  2. Prompt Engineering: Modular template with context injection (HPA, OOM, trends)
  3. GCP Loaders Separation: Maintained backward compatibility with standard loaders
  4. Anthos as Extension: Inherits from GCP, overrides only on-prem specific queries

Compatibility

  • Backward compatible: All existing features work unchanged
  • Strategy fallback: If AI fails, doesn't crash (returns undefined)
  • GCP optional: Requires GCP credentials only when used
  • HPA handling: Both strategies support --allow-hpa

HPA Behavior Notes

When HPA is detected without --allow-hpa flag:

  • AI-Assisted: Logs "skipping" but still analyzes all resources with AI (by design - AI can handle HPA context well)
  • Simple: Completely skips resources with HPA (binary gate)
  • Both strategies work best with --allow-hpa enabled for HPA-configured workloads

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai
Copy link

coderabbitai bot commented Jan 20, 2026

Walkthrough

Adds an AI-Assisted resource recommendation strategy (multi-provider support: OpenAI, Gemini, Anthropic, Ollama), extensive AI provider integrations, prompt/statistics utilities, GCP Managed Prometheus and Anthos metric loaders/services, CLI/config flags, docs, examples, and comprehensive tests.

Changes

Cohort / File(s) Summary
Gitignore Updates
\.gitignore``
Added ignore patterns: /examples/algorithm_demos/**, **/*.table, **/krr-*.
AI provider core & factory
robusta_krr/core/integrations/ai/base.py, .../ai/__init__.py
New AIProvider base with HTTP/retry/JSON extraction and get_provider() factory exporting providers.
Provider implementations
robusta_krr/core/integrations/ai/openai_provider.py, .../gemini_provider.py, .../anthropic_provider.py, .../ollama_provider.py
Concrete providers for OpenAI, Gemini, Anthropic, Ollama implementing endpoint, headers, request formatting, and response parsing.
AI strategy & prompts
robusta_krr/strategies/ai_assisted.py, robusta_krr/strategies/ai_prompts.py, robusta_krr/strategies/__init__.py
New AiAssistedStrategy and AiAssistedStrategySettings, provider detection, run flow, sanity checks vs Simple strategy, prompt builders and stats extraction; exported via strategies package.
AI docs & README
docs/ai-assisted-strategy.md, robusta_krr/core/integrations/ai/README.md
New comprehensive documentation for AI strategy, providers, configuration, prompts, and troubleshooting.
AI examples & tests
ai_strategy_examples.sh, tests/test_ai_strategy.py
Example script demonstrating providers/modes and extensive unit tests for stats, prompts, provider integration, validation, and error handling.
GCP/Anthos metrics package
robusta_krr/core/integrations/prometheus/metrics/gcp/__init__.py, .../gcp/*
New GCP loader package and exports.
GCP CPU loaders
robusta_krr/core/integrations/prometheus/metrics/gcp/cpu.py
GcpCPULoader, GcpPercentileCPULoader(percentile), GcpCPUAmountLoader with GCP metric names and cluster-label support.
GCP memory loaders
robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py
GcpMemoryLoader, GcpMaxMemoryLoader, GcpMemoryAmountLoader, GcpMaxOOMKilledMemoryLoader (OOM inference).
Anthos loaders package
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/__init__.py, .../anthos/*
Anthos-specific CPU and memory loaders mirroring GCP variants using kubernetes.io/anthos/* metrics.
GCP & Anthos metrics services
robusta_krr/core/integrations/prometheus/metrics_service/gcp_metrics_service.py, .../anthos_metrics_service.py
GcpManagedPrometheusMetricsService and AnthosMetricsService with loader mapping, connection checks, and gathering logic.
Prometheus loader routing
robusta_krr/core/integrations/prometheus/loader.py
When a Prometheus URL is provided, route to GCP or Anthos metrics service based on URL and gcp_anthos flag.
Config & CLI
robusta_krr/core/models/config.py, robusta_krr/main.py
Added gcp_anthos: bool to Config and --gcp-anthos CLI flag; create_strategy uses settings object for strategy construction.
Runner & Simple strategy logging
robusta_krr/core/runner.py, robusta_krr/strategies/simple.py
Downgraded Prometheus pod-discovery fallback log to DEBUG; added HPA detection logging and CPU-percentile backend usage logging in Simple strategy.
Minor loader tweak
robusta_krr/core/integrations/prometheus/metrics/cpu.py
Added _percentile attribute to dynamic PercentileCPULoader for internal tracking.
GCP docs & quick test
docs/gcp-managed-prometheus-integration.md, robusta_krr/core/integrations/prometheus/metrics/gcp/README.md, test_gcp_quick.sh
Documentation for GCP Managed Prometheus/Anthos integration and a quick-test shell script.
GCP/Anthos tests
tests/test_gcp_loaders.py, tests/test_anthos_loaders.py
Tests validating GCP and Anthos loader queries, label handling, percentile validation, and LOADER_MAPPING.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as CLI/Config
    participant Runner as KRR Runner
    participant Strategy as AiAssistedStrategy
    participant PromService as Prometheus Service
    participant AIProvider as AI Provider
    participant Metrics as Metric Loaders

    User->>CLI: Run with AI strategy options
    CLI->>Runner: Create Config & invoke run
    Runner->>Strategy: Instantiate AiAssistedStrategy
    Strategy->>Strategy: _detect_provider() (config/env)
    Strategy->>AIProvider: Initialize chosen provider

    Runner->>PromService: gather metrics for workload
    PromService->>Metrics: Query CPU/Memory loaders
    Metrics-->>PromService: Return time series
    PromService-->>Strategy: Provide MetricsPodData

    Strategy->>Strategy: extract_comprehensive_stats()
    Strategy->>Strategy: format_messages()
    Strategy->>AIProvider: analyze_metrics(messages, temp, max_tokens)
    AIProvider->>AIProvider: HTTP POST -> provider API
    AIProvider-->>Strategy: recommendations + reasoning
    Strategy->>Strategy: _sanity_check() & clamp values
    Strategy-->>Runner: Return ResourceRecommendation
    Runner-->>User: Output recommendations
Loading
sequenceDiagram
    actor User
    participant Loader as PrometheusLoader
    participant ServiceSelector as Loader.selector
    participant PromService as PrometheusMetricsService
    participant GCPService as GcpManagedPrometheusMetricsService
    participant AnthosService as AnthosMetricsService

    User->>Loader: provide --prometheus-url (monitoring.googleapis.com)
    Loader->>ServiceSelector: inspect URL & gcp_anthos flag
    alt gcp_anthos = true
        ServiceSelector->>AnthosService: select AnthosMetricsService
    else gcp_anthos = false
        ServiceSelector->>GCPService: select GcpManagedPrometheusMetricsService
    end
    ServiceSelector->>PromService: (default) select PrometheusMetricsService if not GCP
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • arikalon1
  • aantn
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 79.26% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Changelog - AI Strategy & GCP/Anthos Support' clearly summarizes the two main features added in this pull request: AI-Assisted Strategy and GCP/Anthos support.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing AI-Assisted Strategy components, GCP/Anthos support features, modified files, tests, and configuration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/ai/gemini_provider.py`:
- Around line 10-21: Update _get_endpoint to stop embedding the API key in the
URL (remove the ?key={self.api_key} fragment) and modify _get_headers to include
the header "x-goog-api-key": self.api_key in addition to "Content-Type":
"application/json"; specifically change the methods _get_endpoint and
_get_headers so endpoints use f".../{self.model}:generateContent" and headers
return {"Content-Type": "application/json", "x-goog-api-key": self.api_key} to
move authentication into a header.

In `@robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py`:
- Around line 156-187: The current query in variable query multiplies memory
limit by restart_count to infer OOMs, which can inflate values; instead, check
for and prefer direct OOM-related metrics from GCP Managed Prometheus (e.g.,
container_oom_events_total, container_memory_failcnt,
container_start_time_seconds) and update the metric selection logic in memory.py
to use those when present (fall back to the existing limit-based approach only
if none are available). Concretely: detect availability of each metric via the
Prometheus API or by attempting a short query, then modify the query
construction around query / the code that uses object, pods_selector,
cluster_label, duration, step to use the direct OOM metrics; if none exist,
simplify the fallback (e.g., treat any restart_count>0 as a boolean signal or
return the raw memory limit instead of multiplying by restart_count) and update
the class/function docstring to explicitly document this limitation and the
chosen fallback.

In `@robusta_krr/core/models/config.py`:
- Around line 154-157: The current debug call logs raw other_args which may
contain secrets; change the logging in the method that creates strategy settings
(the logger.debug before calling StrategySettingsType and the subsequent debug)
to avoid printing raw values from other_args—either log only other_args.keys()
or construct a masked copy that replaces values for sensitive keys (eg.
"api_key", "token", "password", "secret") before passing to logger.debug; keep
creation via StrategySettingsType(**self.other_args) and continue returning
StrategyType(settings) but ensure only the safe representation (keys or masked
dict) is logged, not the original other_args.

In `@robusta_krr/strategies/ai_assisted.py`:
- Around line 302-323: The HPA skip logic in the block that checks
object_data.hpa and self.settings.allow_hpa is incorrect because it only returns
when both cpu_rec and memory_rec are set, allowing AI recommendations to
continue when only one HPA target exists; change the logic to immediately skip
AI by returning a map for both ResourceType.CPU and ResourceType.Memory using
ResourceRecommendation.undefined(info="HPA detected") (or None only if your
callers expect None for absent recs) whenever object_data.hpa is not None and
not self.settings.allow_hpa so that object_data.hpa, self.settings.allow_hpa,
ResourceRecommendation, and ResourceType are used to build and return the skip
result.

In `@robusta_krr/strategies/ai_prompts.py.backup`:
- Around line 17-97: The file contains a backup version where
extract_comprehensive_stats builds prompt_parts and returns a string instead of
returning the statistics dict expected by AiAssistedStrategy; remove or fix this
artifact by restoring extract_comprehensive_stats to return the stats dict (use
the later-constructed stats object) and move the prompt assembly (prompt_parts,
joins, and human-readable sections) into a separate function like
build_prompt_from_stats so AiAssistedStrategy can import
extract_comprehensive_stats (which should return dict) and call
build_prompt_from_stats(stats) when a string prompt is needed; alternatively
delete the .py.backup file if it is unintended to ship.
- Around line 100-133: The code in format_messages uses
settings.ai_include_simple_reference which doesn't exist on
AiAssistedStrategySettings; replace this with the inverse of the existing flag
by passing include_simple_ref=not settings.ai_exclude_simple_reference into
get_system_prompt (or compute a local include_simple_ref variable from
settings.ai_exclude_simple_reference and use that) so the system prompt receives
the correct boolean; update references in format_messages (and any related
callers) to use the corrected flag logic.

In `@test_gcp_quick.sh`:
- Line 74: The script echoes and passes CPU_PERCENTILE but its definition is
commented out; restore or set a default for CPU_PERCENTILE so it is never empty
when used by krr.py and in the echo. Re-enable the original CPU_PERCENTILE
export (the commented block that sets CPU_PERCENTILE) or add a fallback
assignment (e.g., set CPU_PERCENTILE to a sensible default using parameter
expansion) so references to CPU_PERCENTILE in the echo and the krr.py
--cpu-percentile flags receive a valid value.

In `@tests/test_gcp_loaders.py`:
- Around line 205-225: The test test_loader_mapping incorrectly expects
GcpManagedPrometheusMetricsService.LOADER_MAPPING["MaxOOMKilledMemoryLoader"] to
be None while the implementation maps it to GcpMaxOOMKilledMemoryLoader; update
the assertion in test_loader_mapping to assert that "MaxOOMKilledMemoryLoader"
maps to GcpMaxOOMKilledMemoryLoader (or otherwise reflect the intended contract)
by referencing GcpManagedPrometheusMetricsService.LOADER_MAPPING and the
GcpMaxOOMKilledMemoryLoader symbol.
🟡 Minor comments (16)
test_gcp_quick.sh-30-40 (1)

30-40: Color variables used before definition.

${RED} and ${NC} are referenced on lines 31-32 but defined on lines 37-40. This will cause the error message to appear without color formatting.

🐛 Proposed fix: move color definitions earlier
+# Colors
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m'
+
 if [ -z "${PROJECT_ID:-}" ] || [ -z "${CLUSTER_NAME:-}" ]; then
     echo -e "${RED}Error: PROJECT_ID and CLUSTER_NAME must be defined in .env or via environment variables.${NC}"
     exit 1
 fi
-
-
-# Colors
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-NC='\033[0m'
robusta_krr/core/integrations/prometheus/metrics/gcp/README.md-97-100 (1)

97-100: Documentation inconsistency: MaxOOMKilledMemoryLoader is actually implemented.

The limitations section states that MaxOOMKilledMemoryLoader is not implemented, but memory.py in this same package does implement GcpMaxOOMKilledMemoryLoader using an inference-based approach (combining memory/limit_bytes and restart_count).

Consider updating this section to reflect the actual implementation and its inference-based nature:

📝 Suggested documentation update
 ## Limitations
 
-- **MaxOOMKilledMemoryLoader**: Not implemented because it depends on `kube-state-metrics` which may not be available in GCP Managed Prometheus.
+- **MaxOOMKilledMemoryLoader**: Uses an inference-based approach combining `kubernetes.io/container/memory/limit_bytes` and `kubernetes.io/container/restart_count` since `kube-state-metrics` is not available in GCP Managed Prometheus. This may produce false positives if containers restart for reasons other than OOM.
test_gcp_quick.sh-140-147 (1)

140-147: Exit code capture is unreachable on failure due to set -e.

With set -e enabled (line 2), the script exits immediately if krr.py returns a non-zero exit code. Thus, EXIT_CODE=$? on line 140 will only ever capture 0, making the failure branch (lines 146-147) unreachable.

🐛 Proposed fix: disable errexit for the krr.py command
 if [ "${AI_MODE:-false}" = "true" ]; then
     echo -e "${YELLOW}AI Mode enabled: Using AI-assisted strategy with Gemini 3 Flash Preview model.${NC}"
 
-    $PYTHON_CMD krr.py ai-assisted \
+    set +e
+    $PYTHON_CMD krr.py ai-assisted \
         ...
+    EXIT_CODE=$?
+    set -e
-
 else
     echo -e "${YELLOW}AI Mode disabled: Using standard KRR strategies.${NC}"
-    $PYTHON_CMD krr.py simple \
+    set +e
+    $PYTHON_CMD krr.py simple \
         ...
+    EXIT_CODE=$?
+    set -e
 fi
-
-EXIT_CODE=$?

Alternatively, capture exit code inline: $PYTHON_CMD krr.py ... || EXIT_CODE=$?

CHANGES_GCP.md-44-87 (1)

44-87: Add language identifiers to fenced code blocks

markdownlint reports MD040 at multiple fences. Adding a language (e.g., text, bash, promql) will fix it.

🛠️ Example tweak
-```
+```text
 ============================== 75 passed in 5.20s ==============================
</details>




Also applies to: 337-353, 558-567

</blockquote></details>
<details>
<summary>CHANGES_GCP.md-610-667 (1)</summary><blockquote>

`610-667`: **Remove or rename duplicate headings (MD024)**

Headings like “Usage Examples”, “Technical Highlights”, “Debugging”, and “Changelog” appear multiple times. Consolidate or rename to satisfy markdownlint.

</blockquote></details>
<details>
<summary>robusta_krr/core/integrations/prometheus/metrics_service/gcp_metrics_service.py-43-52 (1)</summary><blockquote>

`43-52`: **Use correct parameter name to maintain interface consistency**

The parameter should be named `api_client` (matching the base class and all other implementations), not `_api_client`. While the current codebase doesn't call this method with keyword arguments, keeping the signature consistent with `MetricsServiceDiscovery.find_metrics_url` is important for maintainability and prevents future issues.

<details>
<summary>🛠️ Proposed fix</summary>

```diff
-    def find_metrics_url(self, *, _api_client: Optional[ApiClient] = None) -> Optional[str]:
+    def find_metrics_url(self, *, api_client: Optional[ApiClient] = None) -> Optional[str]:
+        _ = api_client  # unused
         """
         GCP Managed Prometheus is typically accessed via a known URL pattern:
ai_strategy_examples.sh-302-311 (1)

302-311: Handle missing gcloud before token fetch

With set -e, a missing gcloud binary will cause the script to exit before your error handling executes. Add an explicit check so users get a clear message instead of an abrupt failure.

🛠️ Proposed fix
     # Get GCP access token
     echo ""
     echo -e "${YELLOW}Getting GCP access token...${NC}"
+    if ! command -v gcloud >/dev/null 2>&1; then
+        echo -e "${RED}ERROR: gcloud CLI not found${NC}"
+        echo "Install the Google Cloud SDK and run: gcloud auth login"
+        return
+    fi
     TOKEN=$(gcloud auth print-access-token 2>/dev/null)
robusta_krr/core/integrations/ai/README.md-7-19 (1)

7-19: Add language identifiers to fenced code blocks (markdownlint MD040).

These fences are missing language tags, which will fail markdownlint and reduce readability.

🔧 Example fix (apply similarly to other blocks)
-```
+```text
robusta_krr/
├── core/integrations/ai/
...
-```
+```

Also applies to: 264-272, 275-283, 286-289, 297-305

docs/gcp-managed-prometheus-integration.md-146-155 (1)

146-155: Deduplicate loader mapping bullets.

MemoryLoader and MaxMemoryLoader are listed twice; this reads like a copy/paste error.

📝 Proposed fix
 - `MemoryLoader` → `GcpMemoryLoader`
 - `MaxMemoryLoader` → `GcpMaxMemoryLoader`
-- `MemoryLoader` → `GcpMemoryLoader`
-- `MaxMemoryLoader` → `GcpMaxMemoryLoader`
 - `MemoryAmountLoader` → `GcpMemoryAmountLoader`
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/__init__.py-22-29 (1)

22-29: Sort __all__ to satisfy Ruff RUF022.

🔧 Proposed fix
 __all__ = [
-    "AnthosCPULoader",
-    "AnthosPercentileCPULoader",
-    "AnthosCPUAmountLoader",
-    "AnthosMemoryLoader",
-    "AnthosMaxMemoryLoader",
-    "AnthosMemoryAmountLoader",
-    "AnthosMaxOOMKilledMemoryLoader",
+    "AnthosCPUAmountLoader",
+    "AnthosCPULoader",
+    "AnthosMaxMemoryLoader",
+    "AnthosMaxOOMKilledMemoryLoader",
+    "AnthosMemoryAmountLoader",
+    "AnthosMemoryLoader",
+    "AnthosPercentileCPULoader",
 ]
docs/ai-assisted-strategy.md-152-156 (1)

152-156: Add language identifiers to fenced blocks.

markdownlint MD040 flags these fences. Suggest text for output snippets.

✏️ Suggested edits
-```
+```text
 | Namespace | Name           | Container | CPU Request | CPU Limit | Memory Request | Memory Limit | Info                                                      |
 |-----------|----------------|-----------|-------------|-----------|----------------|--------------|-----------------------------------------------------------|
 | default   | nginx-deploy   | nginx     | 250m        | -         | 512Mi          | 512Mi        | AI: Based on p95 CPU at 0.18 cores with... (conf: 85%)  |
-```
+```

-```
+```text
 Error: No AI provider API key found. Set OPENAI_API_KEY, GEMINI_API_KEY, or ANTHROPIC_API_KEY
-```
+```

-```
+```text
 Error: Rate limit exceeded for API calls
-```
+```

-```
+```text
 AI: Insufficient data for reliable... (conf: 30%)
-```
+```

-```
+```text
 Error: Failed to connect to Ollama at http://localhost:11434
-```
+```

Also applies to: 219-221, 230-232, 241-243, 252-254

docs/ai-assisted-strategy.md-234-237 (1)

234-237: Hyphenate the compound modifier.

Use “higher‑tier API plan” (per LanguageTool hint).

✏️ Suggested edit
-- Switch to a higher tier API plan
+- Switch to a higher-tier API plan
robusta_krr/core/integrations/ai/openai_provider.py-21-47 (1)

21-47: Add model validation for JSON mode support to prevent 4xx errors.

response_format={"type":"json_object"} is only supported by certain OpenAI models (gpt-3.5-turbo and gpt-4-* family). If a user selects an unsupported model, the request will fail with a 400 error. Consider either validating the model against a supported list before requesting or adding a fallback that omits JSON mode for unsupported models.

AI_STRATEGY_IMPLEMENTATION.md-275-289 (1)

275-289: File count mismatch in documentation.

The heading states "Created (11 files)" but the list contains 12 items (numbered 1-12). Update the count to match.

📝 Suggested fix
-### Created (11 files):
+### Created (12 files):
robusta_krr/core/integrations/ai/anthropic_provider.py-44-56 (1)

44-56: Handle edge case when all messages are system messages.

If all messages in the input list have role: "system", conversation_messages will be empty. The Anthropic API requires at least one non-system message, which would cause an API error.

🛡️ Suggested defensive check
         for msg in messages:
             if msg.get("role") == "system":
                 system_message = msg["content"]
             else:
                 conversation_messages.append(msg)
         
+        # Ensure at least one user message exists
+        if not conversation_messages:
+            conversation_messages = [{"role": "user", "content": "Analyze the provided data."}]
+        
         body = {
             "model": self.model,
robusta_krr/strategies/ai_assisted.py-166-191 (1)

166-191: Avoid mismatching provider and API key when ai_provider is explicit.
If ai_provider is set but ai_api_key is not, the loop still picks the first available env var (e.g., OPENAI_API_KEY) and then overrides the provider to the explicit setting, potentially pairing the wrong key with the chosen provider. Consider restricting env lookup to the explicit provider or raising if its key is missing.

🧹 Nitpick comments (14)
robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py (1)

188-194: Inconsistent log level: use logger.debug() for consistency.

Other loaders in this file use logger.debug() for query logging, but this one uses logger.info(). Unless there's a specific reason to log OOM queries at a higher level, consider using debug for consistency.

♻️ Suggested fix
-        logger.info(
+        logger.debug(
             "GCP OOM detection query (inference-based using restart_count + memory limit) for %s/%s/%s:",
             object.namespace,
             object.name,
             object.container,
         )
robusta_krr/core/integrations/prometheus/metrics_service/gcp_metrics_service.py (1)

95-116: Prefer logger.exception to capture stack traces

Since this is inside an exception handler, logger.exception(...) will capture the traceback automatically and usually removes the need for manual “Cause” formatting.

♻️ Suggested tweak
-        except MetricsNotFound as e:
-            logger.error(
-                "Failed to connect to GCP Managed Prometheus at %s. Verify the URL, "
-                "authentication token, and that Managed Service for Prometheus is enabled."
-                " Cause: %s: %s",
-                self.url,
-                e.__class__.__name__,
-                e,
-            )
+        except MetricsNotFound as e:
+            logger.exception(
+                "Failed to connect to GCP Managed Prometheus at %s. Verify the URL, "
+                "authentication token, and that Managed Service for Prometheus is enabled.",
+                self.url,
+            )
             raise MetricsNotFound(
                 f"Couldn't connect to GCP Managed Prometheus at {self.url}. See logs for details."
             ) from e
robusta_krr/core/integrations/ai/ollama_provider.py (1)

20-25: Normalize OLLAMA_HOST to avoid double slashes.

If users set OLLAMA_HOST with a trailing /, the endpoint becomes //api/generate. It’s usually accepted, but easy to normalize.

♻️ Proposed tweak
-        self.host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
+        self.host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
AI_STRATEGY_IMPLEMENTATION.md (1)

221-236: Add language specifier to fenced code block.

The code block at line 221 is missing a language specifier. This helps with syntax highlighting and markdown linting compliance.

📝 Suggested fix
-```
+```text
 ╭─ Strategy Settings ─────────────────────────────╮
robusta_krr/core/integrations/prometheus/metrics/gcp/__init__.py (1)

13-21: Consider sorting __all__ for consistency.

Static analysis suggests sorting __all__ alphabetically for easier maintenance.

♻️ Suggested fix
 __all__ = [
+    "GcpCPUAmountLoader",
     "GcpCPULoader",
+    "GcpMaxMemoryLoader",
+    "GcpMaxOOMKilledMemoryLoader",
+    "GcpMemoryAmountLoader",
+    "GcpMemoryLoader",
     "GcpPercentileCPULoader",
-    "GcpCPUAmountLoader",
-    "GcpMemoryLoader",
-    "GcpMaxMemoryLoader",
-    "GcpMemoryAmountLoader",
-    "GcpMaxOOMKilledMemoryLoader",
 ]
tests/test_ai_strategy.py (3)

3-6: Remove unused imports.

MagicMock and HPAData are imported but never used in this test file.

♻️ Suggested fix
 import json
-from unittest.mock import Mock, patch, MagicMock
+from unittest.mock import Mock, patch
 import numpy as np
 import pytest
 
 from robusta_krr.core.abstract.strategies import MetricsPodData, ResourceType
-from robusta_krr.core.models.objects import K8sObjectData, PodData, HPAData
+from robusta_krr.core.models.objects import K8sObjectData, PodData

233-248: Remove unused variables in test_compact_mode.

settings_full and settings_compact are created but never used. The test calls get_user_prompt directly with the compact parameter instead of using the settings objects.

♻️ Suggested fix
     def test_compact_mode(self, sample_history_data, sample_object_data):
         """Test compact mode reduces prompt length."""
         stats = ai_prompts.extract_comprehensive_stats(
             sample_history_data,
             sample_object_data
         )
         
-        settings_full = AiAssistedStrategySettings(ai_compact_mode=False)
-        settings_compact = AiAssistedStrategySettings(ai_compact_mode=True)
-        
         full_prompt = ai_prompts.get_user_prompt(stats, compact=False)
         compact_prompt = ai_prompts.get_user_prompt(stats, compact=True)

30-61: Test data uses random values which may cause flaky tests.

The fixture uses np.random.normal() without a fixed seed, which could theoretically cause test flakiness if assertions depend on specific statistical properties. Consider setting a random seed for reproducibility.

♻️ Suggested fix
 `@pytest.fixture`
 def sample_history_data() -> MetricsPodData:
     """Create sample Prometheus metrics data."""
+    np.random.seed(42)  # Fixed seed for reproducibility
     # CPU data: 100 time points with values around 0.2 cores
     cpu_timestamps = np.linspace(0, 3600, 100)
     cpu_values = np.random.normal(0.2, 0.05, 100)
robusta_krr/core/integrations/ai/base.py (1)

158-171: Improve exception logging and chaining.

Several exception handlers could be improved:

  1. Use logger.exception() instead of logger.error() to include stack traces
  2. Chain exceptions with raise ... from e for better debugging
  3. The timeout exception variable e is unused
♻️ Suggested improvements
         except requests.HTTPError as e:
-            logger.error(
+            logger.exception(
                 f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
             )
             raise
-        except requests.Timeout as e:
-            logger.error(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
+        except requests.Timeout:
+            logger.exception(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
             raise
         except requests.RequestException as e:
-            logger.error(f"Request error calling {self.__class__.__name__}: {e}")
+            logger.exception(f"Request error calling {self.__class__.__name__}: {e}")
             raise
         except (KeyError, IndexError) as e:
-            logger.error(f"Failed to parse response from {self.__class__.__name__}: {e}")
-            raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}")
+            logger.exception(f"Failed to parse response from {self.__class__.__name__}: {e}")
+            raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}") from e
robusta_krr/strategies/ai_prompts.py (4)

91-106: Use next(iter()) instead of list()[0] for efficiency.

Converting to a list just to get the first element is inefficient. Use next(iter()) instead.

♻️ Suggested fix
                 try:
                     # Use timestamps and values from first pod for trend
-                    first_pod_data = list(cpu_data.values())[0]
+                    first_pod_data = next(iter(cpu_data.values()))
                     if len(first_pod_data) > 1:

347-355: Remove extraneous f-string prefixes.

Lines 350 and 351 have f prefixes on strings without placeholders.

♻️ Suggested fix
     prompt_parts = [
         f"## Workload: {workload['kind']} {workload['namespace']}/{workload['name']}",
         f"Container: {workload['container']}",
-        f"",
-        f"## Pod Information",
+        "",
+        "## Pod Information",
         f"- Current pods: {pods['current_count']}",

214-236: Unused provider parameter in get_system_prompt.

The provider parameter is defined but never used in the function body. If it's intended for future provider-specific prompts, consider documenting this or removing it.

If this is intentional for future extensibility, add a comment or use it for provider-specific customization. Otherwise, remove the unused parameter.


491-523: Unused object_data parameter in format_messages.

The object_data parameter is passed but never used. The function signature accepts it, suggesting it may be intended for future enhancements (e.g., provider-specific formatting based on workload type).

Consider either:

  1. Using object_data for provider-specific customizations
  2. Adding a # noqa: ARG001 comment to suppress the warning if kept for API consistency
  3. Removing if not needed
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py (1)

123-183: Verify restart_count multiplication doesn’t inflate OOM “max memory.”
restart_count is cumulative; multiplying by memory/limit_bytes scales the value by the number of restarts, while downstream treats it as a memory amount. Consider gating with clamp_max(..., 1) or a boolean > 0 check (or increase over the window) to avoid exaggerated values.

Comment on lines +10 to +21
def _get_endpoint(self) -> str:
"""Get Gemini API endpoint with API key in URL."""
return (
f"https://generativelanguage.googleapis.com/v1beta/models/"
f"{self.model}:generateContent?key={self.api_key}"
)

def _get_headers(self) -> dict:
"""Get headers for Gemini API."""
return {
"Content-Type": "application/json"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Gemini API authentication methods x-goog-api-key header vs key query parameter

💡 Result:

Short answer: use the x-goog-api-key header. Gemini (Generative Language) documentation shows the API key in the X-goog-api-key header and Google Cloud docs say the header is preferred; the key= query parameter is supported for compatibility but is less secure (exposes the key in URLs, logs, browser history). Use header or a server-side proxy, and restrict/regenerate keys if leaked. (ai.google.dev)

Citations:


Move Gemini API key from URL query parameter to header.

The current implementation embeds the API key in the URL query parameter, which exposes it to logs, proxies, and browser history. Gemini API supports header‑based authentication via the x-goog-api-key header, which is the preferred approach.

Proposed fix
     def _get_endpoint(self) -> str:
-        """Get Gemini API endpoint with API key in URL."""
+        """Get Gemini API endpoint."""
         return (
             f"https://generativelanguage.googleapis.com/v1beta/models/"
-            f"{self.model}:generateContent?key={self.api_key}"
+            f"{self.model}:generateContent"
         )
 
     def _get_headers(self) -> dict:
         """Get headers for Gemini API."""
         return {
-            "Content-Type": "application/json"
+            "Content-Type": "application/json",
+            "x-goog-api-key": self.api_key,
         }
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/gemini_provider.py` around lines 10 - 21,
Update _get_endpoint to stop embedding the API key in the URL (remove the
?key={self.api_key} fragment) and modify _get_headers to include the header
"x-goog-api-key": self.api_key in addition to "Content-Type":
"application/json"; specifically change the methods _get_endpoint and
_get_headers so endpoints use f".../{self.model}:generateContent" and headers
return {"Content-Type": "application/json", "x-goog-api-key": self.api_key} to
move authentication into a header.

Comment on lines +156 to +187
query = f"""
label_replace(
label_replace(
max_over_time(
max(
max(
{{"__name__"="kubernetes.io/container/memory/limit_bytes",
"monitored_resource"="k8s_container",
"namespace_name"="{object.namespace}",
"pod_name"=~"{pods_selector}",
"container_name"="{object.container}"{cluster_label}
}}
) by (pod_name, container_name, job)
* on(pod_name, container_name, job) group_left()
max(
{{"__name__"="kubernetes.io/container/restart_count",
"monitored_resource"="k8s_container",
"namespace_name"="{object.namespace}",
"pod_name"=~"{pods_selector}",
"container_name"="{object.container}"{cluster_label}
}}
) by (pod_name, container_name, job)
) by (container_name, pod_name, job)
[{duration}:{step}]
),
"pod", "$1", "pod_name", "(.+)"
),
"container", "$1", "container_name", "(.+)"
)
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd robusta_krr && head -n 200 core/integrations/prometheus/metrics/gcp/memory.py | tail -n 100

Repository: robusta-dev/krr

Length of output: 4401


🏁 Script executed:

wc -l robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py

Repository: robusta-dev/krr

Length of output: 125


🏁 Script executed:

cat -n robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py | sed -n '130,200p'

Repository: robusta-dev/krr

Length of output: 3496


🏁 Script executed:

cat -n robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py | sed -n '1,50p'

Repository: robusta-dev/krr

Length of output: 2258


🌐 Web query:

GCP Managed Prometheus OOM detection metrics kubernetes

💡 Result:

Short answer — use Managed Service for Prometheus to query the usual Kubernetes / cAdvisor / kube-state-metrics signals (PromQL) and combine them for reliable OOM detection. Example queries and caveats below.

Useful metrics

  • kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} (kube-state-metrics) — shows last termination reason. [1][2]
  • kube_pod_container_status_restarts_total (kube-state-metrics) — restart counter to correlate with termination reason. [2][4]
  • container_oom_events_total or container_memory_oom_events_total (cAdvisor/cgroup) — monotonic OOM counters from the container runtime/cAdvisor. [3][4]
  • node_vmstat_oom_kill (node-level) — kernel/node OOM kill events (useful when container metrics are missing). [4]

Example PromQL (run in Cloud Monitoring / Managed Prometheus)

  • Alert when a pod has been OOMKilled multiple times recently:
    increase(kube_pod_container_status_restarts_total{container!=""}[15m])

    • on(namespace,pod) group_left(reason)
      kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0. [4]
  • Count OOM events per container over 1h (if container_oom_events_total is available):
    increase(container_oom_events_total[1h]). [3][4]

  • Node-level OOM spikes:
    increase(node_vmstat_oom_kill[5m]) > 0. [4]

Caveats / recommendations

  • kube_pod_container_status_last_terminated_reason is a gauge (shows last reason) and can miss events between scrapes; combine it with restart counters (or use monotonic counters) for reliable alerts. There are known gaps/edge-cases in kube-state-metrics. [2][5]
  • cAdvisor/container counters (container_oom_events_total) are most reliable for counts but may not be present in every collection setup — validate your collector (managed/self-deployed) scrapes cAdvisor metrics. [3]
  • Managed Service for Prometheus supports PromQL in Cloud Monitoring / Grafana — use it to run these queries and create alerts. [1][2]

Sources

  1. Managed Service for Prometheus / PromQL in Cloud Monitoring (overview + PromQL support). [Google Cloud Managed Prometheus / PromQL docs]. [1][2]
  2. GKE troubleshooting OOM events (how Kubernetes reports OOMKilled). [GKE Troubleshoot OOM events]. [4]
  3. cAdvisor/container OOM counters and usage in alerts. [cadvisor / forum & examples]. [3][4]
  4. Community examples and recommended query patterns (kube-state-metrics, container_oom_events_total, node_vmstat_oom_kill). [StackOverflow / blog posts / GH issue]. [4][5]

If you want, I can: (a) produce ready-to-add Cloud Monitoring alerting rules from one of the example queries, or (b) check which of those metrics your Managed Prometheus setup is actually ingesting and give a tailored query. Which would you like?


🌐 Web query:

Prometheus OOM detection without kube-state-metrics memory limit restart count

💡 Result:

Short answer: you can detect OOMs without kube‑state‑metrics by (a) parsing kernel OOM messages on the node (oom‑exporter / node logs), (b) using cAdvisor/cadvisored metrics (memory failcnt / RSS / start time) and (c) watching container start_time/restart from the runtime/cAdvisor metrics. Below are practical options + example PromQL.

Options

  • Deploy an oom‑kill exporter (parses kernel logs like /dev/kmsg) and expose a metric such as klog_pod_oomkill; alert on changes. [1]
  • Use node/cAdvisor metrics: watch container_memory_failcnt, container_memory_rss (or container_memory_working_set_bytes vs RSS) and detect sudden increases or failcnt > 0. Use these to infer hitting memory limits or OOM activity. [2][4]
  • Detect container restarts directly from cAdvisor/runtime metrics by observing container_start_time_seconds (or other per‑container start timestamp) increasing — a restart implies the container was killed (could be OOM). You can alert on increases to start time or on increase(container_start_time_seconds[5m]) > 0. (This method does not require kube‑state‑metrics). [2][3]
  • Collect node kernel/journal logs (journald/fluentd) and export OOM messages to Prometheus (custom exporter or existing projects). [1][2]

Example PromQL (adapt to your metric names/labels)

  • Alert on kernel‑detected OOMs (oom exporter):
    sum by(namespace,pod) (changes(klog_pod_oomkill[30m])) > 0. [1]
  • Alert on memory fail hits from cAdvisor:
    increase(container_memory_failcnt{job="kubelet"}[10m]) > 0. [4]
  • Alert on container restarts using start time (detect restart):
    increase(container_start_time_seconds{job="kubelet"}[10m]) > 0. [3]
  • Combine indicators (more robust):
    (increase(container_start_time_seconds[10m]) > 0) and (rate(container_memory_rss_bytes[5m]) > some_threshold)

Notes / tradeoffs

  • container_memory_failcnt indicates hits against memory limits, not necessarily kernel OOM kills; RSS is closer to actual resident memory that triggers OOM. [4]
  • Kernel OOM events are most definitive but require node‑level log access or an exporter; kubelet/cAdvisor pipeline surfaces some OOM info but may miss kernel messages unless cadvisor/kubelet capture them. [2][3]
  • If you can deploy kube‑state‑metrics later, kube_pod_container_status_restarts_total and last_terminated_reason (OOMKilled) give simpler pod‑level alerts.

References

  1. kubernetes-oomkill-exporter (parses kernel log, exposes klog_pod_oomkill). [1]
  2. LINE Engineering writeup on how OOM events are produced/watched (cadvisor, kernel logs, oomWatcher). [2]
  3. Kubernetes metrics reference (kubelet / cadvisor metrics like start times). [3]
  4. Discussion of container_memory_failcnt vs OOM behavior (StackOverflow). [4]

If you want, tell me which metric names you already scrape (cadvisor / node_exporter / journald) and I’ll give exact PromQL alerts for your setup.


The inference-based OOM detection approach is already documented as limited—investigate if better GCP metrics are available before accepting this trade-off.

The code's own docstring (lines 136–146) already acknowledges that this inference-based approach "may produce false positives if containers restart for reasons other than OOM while memory usage is high." However, the concern about multiplying limit_bytes × restart_count remains valid: this produces an inflated value proportional to restart count regardless of actual memory pressure, which could mislead the recommendation algorithm.

Before accepting this limitation, verify whether GCP Managed Prometheus exposes better metrics for OOM detection such as:

  • container_oom_events_total (cAdvisor—direct OOM event counter)
  • container_memory_failcnt (cAdvisor—memory limit hits)
  • container_start_time_seconds (runtime—detect restarts from timestamp changes)

If any of these are available in your GCP Prometheus setup, they would provide more reliable OOM signals than the current inference approach. If GCP truly lacks all alternatives, document this constraint explicitly in the class docstring and consider a simpler fallback (e.g., returning the memory limit when restarts exist) rather than multiplication.

🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/memory.py` around lines
156 - 187, The current query in variable query multiplies memory limit by
restart_count to infer OOMs, which can inflate values; instead, check for and
prefer direct OOM-related metrics from GCP Managed Prometheus (e.g.,
container_oom_events_total, container_memory_failcnt,
container_start_time_seconds) and update the metric selection logic in memory.py
to use those when present (fall back to the existing limit-based approach only
if none are available). Concretely: detect availability of each metric via the
Prometheus API or by attempting a short query, then modify the query
construction around query / the code that uses object, pods_selector,
cluster_label, duration, step to use the direct OOM metrics; if none exist,
simplify the fallback (e.g., treat any restart_count>0 as a boolean signal or
return the raw memory limit instead of multiplying by restart_count) and update
the class/function docstring to explicitly document this limitation and the
chosen fallback.

Comment on lines +154 to +157
logger.debug(f"Creating strategy '{self.strategy}' with other_args: {self.other_args}")
settings = StrategySettingsType(**self.other_args)
logger.debug(f"Strategy settings created with use_oomkill_data={getattr(settings, 'use_oomkill_data', 'NOT_FOUND')}")
return StrategyType(settings) # type: ignore
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging raw strategy args (may contain secrets)

other_args can include API keys or tokens. Logging the dict can leak secrets to logs. Log only keys or mask sensitive fields.

🛠️ Proposed fix
-        logger.debug(f"Creating strategy '{self.strategy}' with other_args: {self.other_args}")
+        logger.debug(
+            "Creating strategy '%s' with other_args keys: %s",
+            self.strategy,
+            list(self.other_args.keys()),
+        )
🤖 Prompt for AI Agents
In `@robusta_krr/core/models/config.py` around lines 154 - 157, The current debug
call logs raw other_args which may contain secrets; change the logging in the
method that creates strategy settings (the logger.debug before calling
StrategySettingsType and the subsequent debug) to avoid printing raw values from
other_args—either log only other_args.keys() or construct a masked copy that
replaces values for sensitive keys (eg. "api_key", "token", "password",
"secret") before passing to logger.debug; keep creation via
StrategySettingsType(**self.other_args) and continue returning
StrategyType(settings) but ensure only the safe representation (keys or masked
dict) is logged, not the original other_args.

Comment on lines +302 to +323
# Check HPA if not allowed
if object_data.hpa is not None and not self.settings.allow_hpa:
logger.info(
f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
)
if object_data.hpa.target_cpu_utilization_percentage is not None:
cpu_rec = ResourceRecommendation.undefined(info="HPA detected")
else:
cpu_rec = None

if object_data.hpa.target_memory_utilization_percentage is not None:
memory_rec = ResourceRecommendation.undefined(info="HPA detected")
else:
memory_rec = None

if cpu_rec and memory_rec:
return {
ResourceType.CPU: cpu_rec,
ResourceType.Memory: memory_rec,
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

HPA skip logic doesn’t actually prevent AI recommendations.
When HPA targets only one resource, execution continues and AI recommendations are still produced for both resources, even though allow_hpa is false and the log says “skipping.” This violates the intended behavior of allow_hpa.

🛠️ Suggested fix (skip AI when HPA is present and allow_hpa is false)
-            if object_data.hpa is not None and not self.settings.allow_hpa:
-                logger.info(
-                    f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
-                    f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
-                )
-                if object_data.hpa.target_cpu_utilization_percentage is not None:
-                    cpu_rec = ResourceRecommendation.undefined(info="HPA detected")
-                else:
-                    cpu_rec = None
-                
-                if object_data.hpa.target_memory_utilization_percentage is not None:
-                    memory_rec = ResourceRecommendation.undefined(info="HPA detected")
-                else:
-                    memory_rec = None
-                
-                if cpu_rec and memory_rec:
-                    return {
-                        ResourceType.CPU: cpu_rec,
-                        ResourceType.Memory: memory_rec,
-                    }
+            if object_data.hpa is not None and not self.settings.allow_hpa:
+                logger.info(
+                    f"{object_data.kind} {object_data.namespace}/{object_data.name}: "
+                    f"HPA detected, skipping AI recommendations (use --allow-hpa to override)"
+                )
+                return {
+                    ResourceType.CPU: ResourceRecommendation.undefined(info="HPA detected"),
+                    ResourceType.Memory: ResourceRecommendation.undefined(info="HPA detected"),
+                }
🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_assisted.py` around lines 302 - 323, The HPA skip
logic in the block that checks object_data.hpa and self.settings.allow_hpa is
incorrect because it only returns when both cpu_rec and memory_rec are set,
allowing AI recommendations to continue when only one HPA target exists; change
the logic to immediately skip AI by returning a map for both ResourceType.CPU
and ResourceType.Memory using ResourceRecommendation.undefined(info="HPA
detected") (or None only if your callers expect None for absent recs) whenever
object_data.hpa is not None and not self.settings.allow_hpa so that
object_data.hpa, self.settings.allow_hpa, ResourceRecommendation, and
ResourceType are used to build and return the skip result.

Comment on lines +17 to +97
def extract_comprehensive_stats(
history_data: MetricsPodData,
object_data: K8sObjectData
) -> dict:
"""Extract comprehensive statistics from Prometheus metrics data.

This function analyzes the historical data and extracts:
- CPU statistics (percentiles, mean, std, trend, spikes)
- Memory statistics (max, mean, std, OOMKills)
- Pod information (count, names, health)
- Workload context (HPA, allocations, labels)
- Temporal context (duration, data points)

Args:
history_data: Dictionary of metric loaders -> pod data
object_data: Kubernetes object metadata
])

if memory.get('oomkill_detected'):
oomkill_value = memory.get('oomkill_max_value', 0)
prompt_parts.append(
f"- OOM Kill max memory: {oomkill_value:.0f} bytes ({oomkill_value / 1024**2:.1f} Mi)"
)

# Per-pod stats (first 3 pods only in full mode)
per_pod = memory.get("per_pod", {})
if per_pod:
prompt_parts.append("\nPer-pod Memory (sample):")
for pod_name, pod_stats in list(per_pod.items())[:3]:
prompt_parts.append(
f"- {pod_name}: max={pod_stats['max']:.0f} bytes "
f"({pod_stats['max'] / 1024**2:.1f} Mi)"
)

# Current Allocations
if allocations:
prompt_parts.append("\n## Current Resource Allocations")
cpu_alloc = allocations.get("cpu", {})
mem_alloc = allocations.get("memory", {})

cpu_req = cpu_alloc.get("request")
cpu_lim = cpu_alloc.get("limit")
mem_req = mem_alloc.get("request")
mem_lim = mem_alloc.get("limit")

prompt_parts.extend([
f"CPU Request: {cpu_req if cpu_req else 'unset'}",
f"CPU Limit: {cpu_lim if cpu_lim else 'unset'}",
f"Memory Request: {mem_req if mem_req else 'unset'}",
f"Memory Limit: {mem_lim if mem_lim else 'unset'}",
])

# HPA Information
if hpa:
prompt_parts.append("\n## Horizontal Pod Autoscaler (HPA) Detected")
prompt_parts.extend([
f"- Min replicas: {hpa['min_replicas']}",
f"- Max replicas: {hpa['max_replicas']}",
f"- Current replicas: {hpa['current_replicas']}",
])
if hpa['target_cpu_utilization']:
prompt_parts.append(f"- Target CPU utilization: {hpa['target_cpu_utilization']}%")
if hpa['target_memory_utilization']:
prompt_parts.append(f"- Target memory utilization: {hpa['target_memory_utilization']}%")
prompt_parts.append(
"NOTE: With HPA, be conservative with limits to allow scaling to work properly."
)

# Warnings
warnings = stats.get("warnings", [])
if warnings:
prompt_parts.append("\n## Warnings")
for warning in warnings:
prompt_parts.append(f"- {warning}")

prompt_parts.append("\n## Your Task")
prompt_parts.append(
"Based on the above statistics, provide your resource recommendations in JSON format."
)

return "\n".join(prompt_parts)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

extract_comprehensive_stats appears interleaved with prompt-building and returns a string.
The function currently builds prompt_parts and returns a concatenated string, while a separate stats dict is constructed later. If this file is imported, AiAssistedStrategy will break when it expects a dict from extract_comprehensive_stats. This also reads like a merge/copy artifact, and the .py.backup filename suggests it may be unintended. Please clean this up or remove the backup file if it shouldn’t ship.

🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_prompts.py.backup` around lines 17 - 97, The file
contains a backup version where extract_comprehensive_stats builds prompt_parts
and returns a string instead of returning the statistics dict expected by
AiAssistedStrategy; remove or fix this artifact by restoring
extract_comprehensive_stats to return the stats dict (use the later-constructed
stats object) and move the prompt assembly (prompt_parts, joins, and
human-readable sections) into a separate function like build_prompt_from_stats
so AiAssistedStrategy can import extract_comprehensive_stats (which should
return dict) and call build_prompt_from_stats(stats) when a string prompt is
needed; alternatively delete the .py.backup file if it is unintended to ship.

Comment on lines +100 to +133
def format_messages(
provider: str,
stats: dict,
object_data: K8sObjectData,
settings: "AiAssistedStrategySettings"
) -> Union[list, str]:
"""Format messages for the specific AI provider.

Args:
provider: AI provider name (openai, gemini, anthropic, ollama)
stats: Statistics dictionary
object_data: Kubernetes object data
settings: Strategy settings

Returns:
Messages in provider-specific format (list of dicts or string)
"""
system_prompt = get_system_prompt(
provider,
include_simple_ref=settings.ai_include_simple_reference
)
user_prompt = get_user_prompt(stats, compact=settings.ai_compact_mode)

# OpenAI and Anthropic use message list format
if provider.lower() in ["openai", "anthropic"]:
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
]

# Gemini and Ollama use concatenated string format
else:
return f"{system_prompt}\n\n{user_prompt}"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix ai_include_simple_reference attribute mismatch.
AiAssistedStrategySettings defines ai_exclude_simple_reference, so accessing settings.ai_include_simple_reference will raise at runtime. Use the inverse flag.

🛠️ Proposed fix
-    system_prompt = get_system_prompt(
-        provider, 
-        include_simple_ref=settings.ai_include_simple_reference
-    )
+    system_prompt = get_system_prompt(
+        provider,
+        include_simple_ref=not settings.ai_exclude_simple_reference
+    )
🤖 Prompt for AI Agents
In `@robusta_krr/strategies/ai_prompts.py.backup` around lines 100 - 133, The code
in format_messages uses settings.ai_include_simple_reference which doesn't exist
on AiAssistedStrategySettings; replace this with the inverse of the existing
flag by passing include_simple_ref=not settings.ai_exclude_simple_reference into
get_system_prompt (or compute a local include_simple_ref variable from
settings.ai_exclude_simple_reference and use that) so the system prompt receives
the correct boolean; update references in format_messages (and any related
callers) to use the corrected flag logic.

else
echo "Mode: GKE Cloud"
fi
echo "CPU Percentile: ${CPU_PERCENTILE}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

CPU_PERCENTILE is referenced but not defined.

Line 74 echoes ${CPU_PERCENTILE}, and lines 117/133 use it in the krr.py command, but the variable definition is commented out (lines 25-28). This will result in an empty value being passed to --cpu-percentile.

🐛 Proposed fix: uncomment or provide default
-# CPU_PERCENTILE="${CPU_PERCENTILE:-95}"
-# if [ -n "${4:-}" ]; then
-#     CPU_PERCENTILE="${4}"
-# fi
+CPU_PERCENTILE="${CPU_PERCENTILE:-95}"
+if [ -n "${4:-}" ]; then
+    CPU_PERCENTILE="${4}"
+fi
🤖 Prompt for AI Agents
In `@test_gcp_quick.sh` at line 74, The script echoes and passes CPU_PERCENTILE
but its definition is commented out; restore or set a default for CPU_PERCENTILE
so it is never empty when used by krr.py and in the echo. Re-enable the original
CPU_PERCENTILE export (the commented block that sets CPU_PERCENTILE) or add a
fallback assignment (e.g., set CPU_PERCENTILE to a sensible default using
parameter expansion) so references to CPU_PERCENTILE in the echo and the krr.py
--cpu-percentile flags receive a valid value.

Comment on lines +205 to +225
def test_loader_mapping(self):
"""Test that all expected loaders are mapped."""
from robusta_krr.core.integrations.prometheus.metrics_service.gcp_metrics_service import (
GcpManagedPrometheusMetricsService
)

mapping = GcpManagedPrometheusMetricsService.LOADER_MAPPING

# Verify CPU loaders are mapped
assert "CPULoader" in mapping
assert "PercentileCPULoader" in mapping
assert "CPUAmountLoader" in mapping

# Verify Memory loaders are mapped
assert "MemoryLoader" in mapping
assert "MaxMemoryLoader" in mapping
assert "MemoryAmountLoader" in mapping

# Verify unsupported loader is marked as None
assert "MaxOOMKilledMemoryLoader" in mapping
assert mapping["MaxOOMKilledMemoryLoader"] is None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Mismatch with implementation: MaxOOMKilledMemoryLoader is not None.

GcpManagedPrometheusMetricsService.LOADER_MAPPING maps MaxOOMKilledMemoryLoader to GcpMaxOOMKilledMemoryLoader, so this test currently asserts the opposite and will fail (or encode the wrong contract).

✅ Proposed fix
 from robusta_krr.core.integrations.prometheus.metrics.gcp.memory import (
     GcpMemoryLoader,
     GcpMaxMemoryLoader,
     GcpMemoryAmountLoader,
+    GcpMaxOOMKilledMemoryLoader,
 )
 ...
-        assert "MaxOOMKilledMemoryLoader" in mapping
-        assert mapping["MaxOOMKilledMemoryLoader"] is None
+        assert "MaxOOMKilledMemoryLoader" in mapping
+        assert mapping["MaxOOMKilledMemoryLoader"] is GcpMaxOOMKilledMemoryLoader
🤖 Prompt for AI Agents
In `@tests/test_gcp_loaders.py` around lines 205 - 225, The test
test_loader_mapping incorrectly expects
GcpManagedPrometheusMetricsService.LOADER_MAPPING["MaxOOMKilledMemoryLoader"] to
be None while the implementation maps it to GcpMaxOOMKilledMemoryLoader; update
the assertion in test_loader_mapping to assert that "MaxOOMKilledMemoryLoader"
maps to GcpMaxOOMKilledMemoryLoader (or otherwise reflect the intended contract)
by referencing GcpManagedPrometheusMetricsService.LOADER_MAPPING and the
GcpMaxOOMKilledMemoryLoader symbol.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@robusta_krr/core/integrations/ai/base.py`:
- Around line 169-172: The HTTP error handler is currently logging the full
upstream response body which may leak sensitive data; update the except
requests.HTTPError block in robusta_krr/core/integrations/ai/base.py to first
guard that e.response exists, then log the status code and a redacted or
truncated version of the body (e.g., show only the first N characters and append
"[TRUNCATED]" or replace body with "[REDACTED]"), and avoid including
e.response.text verbatim; keep the log message contextual (include
self.__class__.__name__) and ensure no raw prompt/output content is printed.
- Around line 86-90: The retry decorator currently retries on all
requests.RequestException (including requests.HTTPError raised by
response.raise_for_status on line 131), which causes retries for non-transient
4xx errors; replace the retry condition to only retry transient errors by: keep
explicit exception types like requests.Timeout and requests.ConnectionError, and
replace retry_if_exception_type((requests.RequestException, ...)) with a custom
predicate (retry_if_exception) that returns True for Timeout/ConnectionError and
for HTTPError only when exc.response.status_code is 5xx or 429; update the
decorator arguments (stop_after_attempt, wait_exponential remain) and ensure the
predicate references requests.HTTPError and inspects exc.response.status_code to
avoid retrying 4xx.

In `@robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py`:
- Around line 86-89: The docstring for class AnthosMemoryAmountLoader
incorrectly states it returns a "count of containers" while the implementation
uses Prometheus' count_over_time which counts samples/data points; update the
class docstring (AnthosMemoryAmountLoader) to clarify that it measures the
number of metric samples (data points) in the time window produced by
count_over_time rather than counting distinct containers or pods, and mention
how that maps to container activity if relevant (e.g., more samples indicate
more active containers or higher scrape frequency).
🧹 Nitpick comments (2)
robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py (1)

176-182: Consider using logger.debug for consistency with other loaders.

Other loaders in this file use logger.debug for query logging. Using logger.info here will produce output at default log levels, which may be noisy in production.

📝 Suggested change
-        logger.info(
+        logger.debug(
             "Anthos OOM detection query (inference-based using restart_count + memory limit) for %s/%s/%s",
             object.namespace,
             object.name,
             object.container,
         )
robusta_krr/core/integrations/ai/base.py (1)

174-182: Preserve exception context; drop unused variable.
Use logger.exception and chain the raised ValueError; remove unused e in timeout block.

♻️ Proposed refactor
-        except requests.Timeout as e:
-            logger.error(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
+        except requests.Timeout:
+            logger.exception(f"Timeout calling {self.__class__.__name__} API after {self.timeout}s")
             raise
...
-        except (KeyError, IndexError) as e:
-            logger.error(f"Failed to parse response from {self.__class__.__name__}: {e}")
-            raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}")
+        except (KeyError, IndexError) as e:
+            logger.exception(f"Failed to parse response from {self.__class__.__name__}: {e}")
+            raise ValueError(f"Invalid response format from {self.__class__.__name__}: {e}") from e

Comment on lines +86 to +90
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10),
retry=retry_if_exception_type((requests.RequestException, requests.Timeout))
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the target file
find . -name "base.py" -path "*/ai/*" | head -20

Repository: robusta-dev/krr

Length of output: 101


🏁 Script executed:

# Get the specific lines from the file
if [ -f "robusta_krr/core/integrations/ai/base.py" ]; then
  sed -n '80,95p' robusta_krr/core/integrations/ai/base.py
fi

Repository: robusta-dev/krr

Length of output: 503


🏁 Script executed:

# Check imports and broader context
if [ -f "robusta_krr/core/integrations/ai/base.py" ]; then
  head -30 robusta_krr/core/integrations/ai/base.py
fi

Repository: robusta-dev/krr

Length of output: 923


🏁 Script executed:

# Let me check the requests library exception hierarchy
python3 << 'EOF'
import requests

# Print the exception hierarchy
print("requests.RequestException MRO:")
print(requests.RequestException.__mro__)

print("\nrequests.HTTPError MRO:")
print(requests.HTTPError.__mro__)

print("\nIs HTTPError a RequestException?")
print(issubclass(requests.HTTPError, requests.RequestException))

print("\nIs Timeout a RequestException?")
print(issubclass(requests.Timeout, requests.RequestException))
EOF

Repository: robusta-dev/krr

Length of output: 495


🏁 Script executed:

# Let me also check if the analyze_metrics method actually gets HTTPErrors
# by looking at how requests.post or similar methods are called in this file
grep -n "requests\." robusta_krr/core/integrations/ai/base.py | head -20

Repository: robusta-dev/krr

Length of output: 468


🏁 Script executed:

# Check if there's a raise_for_status() call or similar that would convert responses to HTTPError
sed -n '86,150p' robusta_krr/core/integrations/ai/base.py

Repository: robusta-dev/krr

Length of output: 2692


Avoid retrying non-transient 4xx responses.

The retry policy catches requests.RequestException, which includes requests.HTTPError. Since response.raise_for_status() on line 131 raises HTTPError for both 4xx and 5xx responses, the current retry decorator will unnecessarily retry client errors (401/403/404). These are non-recoverable and retrying wastes resources and can worsen rate limits. Restrict retries to transient errors only: timeouts, connection errors, 5xx responses, and 429 (rate limit).

🔧 Proposed fix (narrow retry policy)
-import requests
-from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
+import requests
+from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception
+
+def _is_retryable_exception(exc: Exception) -> bool:
+    if isinstance(exc, requests.HTTPError) and exc.response is not None:
+        return exc.response.status_code >= 500 or exc.response.status_code == 429
+    return isinstance(exc, (requests.Timeout, requests.ConnectionError))
...
-    `@retry`(
-        stop=stop_after_attempt(3),
-        wait=wait_exponential(multiplier=1, min=2, max=10),
-        retry=retry_if_exception_type((requests.RequestException, requests.Timeout))
-    )
+    `@retry`(
+        stop=stop_after_attempt(3),
+        wait=wait_exponential(multiplier=1, min=2, max=10),
+        retry=retry_if_exception(_is_retryable_exception),
+    )
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/base.py` around lines 86 - 90, The retry
decorator currently retries on all requests.RequestException (including
requests.HTTPError raised by response.raise_for_status on line 131), which
causes retries for non-transient 4xx errors; replace the retry condition to only
retry transient errors by: keep explicit exception types like requests.Timeout
and requests.ConnectionError, and replace
retry_if_exception_type((requests.RequestException, ...)) with a custom
predicate (retry_if_exception) that returns True for Timeout/ConnectionError and
for HTTPError only when exc.response.status_code is 5xx or 429; update the
decorator arguments (stop_after_attempt, wait_exponential remain) and ensure the
predicate references requests.HTTPError and inspects exc.response.status_code to
avoid retrying 4xx.

Comment on lines +169 to +172
except requests.HTTPError as e:
logger.error(
f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging full upstream response bodies.
HTTP error logs can leak sensitive prompt/output data. Truncate or redact the body and guard missing responses.

🔒 Proposed fix (truncate/redact body)
-        except requests.HTTPError as e:
-            logger.error(
-                f"HTTP error from {self.__class__.__name__}: {e.response.status_code} - {e.response.text}"
-            )
+        except requests.HTTPError as e:
+            status = getattr(e.response, "status_code", "unknown")
+            body_preview = (e.response.text or "")[:500] if e.response is not None else ""
+            logger.error(
+                f"HTTP error from {self.__class__.__name__}: {status} - {body_preview}"
+            )
             raise
🧰 Tools
🪛 Ruff (0.14.13)

170-172: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/ai/base.py` around lines 169 - 172, The HTTP
error handler is currently logging the full upstream response body which may
leak sensitive data; update the except requests.HTTPError block in
robusta_krr/core/integrations/ai/base.py to first guard that e.response exists,
then log the status code and a redacted or truncated version of the body (e.g.,
show only the first N characters and append "[TRUNCATED]" or replace body with
"[REDACTED]"), and avoid including e.response.text verbatim; keep the log
message contextual (include self.__class__.__name__) and ensure no raw
prompt/output content is printed.

Comment on lines +86 to +89
class AnthosMemoryAmountLoader(PrometheusMetric):
"""
Loads memory amount (count of containers) for Anthos.
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify the docstring - count_over_time counts samples, not containers.

The docstring states "count of containers" but count_over_time counts the number of data points/samples in the time range, not the number of containers. Consider clarifying the purpose:

📝 Suggested docstring fix
 class AnthosMemoryAmountLoader(PrometheusMetric):
     """
-    Loads memory amount (count of containers) for Anthos.
+    Loads memory sample count over time for Anthos containers.
+    Used to determine data availability and sample density.
     """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class AnthosMemoryAmountLoader(PrometheusMetric):
"""
Loads memory amount (count of containers) for Anthos.
"""
class AnthosMemoryAmountLoader(PrometheusMetric):
"""
Loads memory sample count over time for Anthos containers.
Used to determine data availability and sample density.
"""
🤖 Prompt for AI Agents
In `@robusta_krr/core/integrations/prometheus/metrics/gcp/anthos/memory.py` around
lines 86 - 89, The docstring for class AnthosMemoryAmountLoader incorrectly
states it returns a "count of containers" while the implementation uses
Prometheus' count_over_time which counts samples/data points; update the class
docstring (AnthosMemoryAmountLoader) to clarify that it measures the number of
metric samples (data points) in the time window produced by count_over_time
rather than counting distinct containers or pods, and mention how that maps to
container activity if relevant (e.g., more samples indicate more active
containers or higher scrape frequency).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants