Skip to content

Conversation

@fredricz-20070104
Copy link
Collaborator

@fredricz-20070104 fredricz-20070104 commented Dec 30, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added stress test support for disaggregated benchmarks with combined performance and accuracy validation.
    • Introduced DEFAULT backend normalization and filtering logic for backend comparison analysis.
    • Expanded test configuration suite with new stress and performance test variants.
  • Documentation

    • Added comprehensive README documenting stress test configurations, workflows, and best practices.
  • Improvements

    • Enhanced job submission infrastructure with improved SLURM job ID parsing and result tracking.
    • Extended accuracy testing with optional custom metrics configuration support.

✏️ Tip: You can customize this high-level summary in your review settings.

a) Unified slurm extra args management
b) Unified session collection logic
c) Add deepseek v32 models

@fredricz-20070104 fredricz-20070104 changed the title [None][test] Unified slurm extra args management and session collection logic [None][chore] Unified slurm extra args management and session collection logic Dec 30, 2025
@fredricz-20070104 fredricz-20070104 changed the title [None][chore] Unified slurm extra args management and session collection logic [None][test] Unified slurm extra args management and session collection logic Dec 30, 2025
@fredricz-20070104 fredricz-20070104 enabled auto-merge (squash) December 30, 2025 02:01
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 30, 2025

📝 Walkthrough

Walkthrough

This PR extends the disaggregated test framework to support stress testing, refactors job submission via a new JobManager class, introduces shell-based session collection, and expands test configurations for stress and performance benchmarking scenarios across multiple model variants.

Changes

Cohort / File(s) Summary
Job Submission Refactoring
tests/integration/defs/perf/disagg/execution/executor.py, tests/integration/defs/perf/disagg/utils/trackers.py
Replaced SlurmRunCommandBuilder with generalized JobManager class. Added submit_shell_job() and submit_session_collect_job() static methods. Renamed submit_job() to submit_test_job(). Enhanced result path resolution via get_result_dir(). Introduced stress-test routing in _check_job_result() with combined perf/accuracy validation. Session collection now uses non-blocking sbatch submission instead of direct command execution.
Configuration Structure & Loading
tests/integration/defs/perf/disagg/utils/config_loader.py, tests/integration/defs/perf/disagg/utils/common.py, tests/integration/defs/perf/disagg/compare_backends.py
Added optional metrics field to AccuracyConfig with custom metrics support and get_metrics_config() method. Updated DEFAULT_METRICS_CONFIG with stress test metrics. Replaced per-GPU gres_gpu entries in GPU_RESOURCE_CONFIG with slurm_extra_args and set_segment keys. Removed derived fields (dep_flag, log_base, context_dir) from extract_config_fields(). Extended backend parsing to recognize and normalize ccb-DEFAULT entries; added filtering logic for valid comparison pairs.
Test Suite Expansion
tests/integration/defs/perf/disagg/test_disagg.py, tests/integration/defs/perf/disagg/testlist/{all,disagg,wideep}.txt
Added STRESS_TEST_CONFIGS, STRESS_TEST_CASES, and new test_stress() method to TestDisaggBenchmark for combined perf/accuracy validation. Updated submit_job() calls to submit_test_job(). Added stress test entry to testlist/disagg.txt and five new wideep v32 perf test entries across all.txt and wideep.txt.
Session Collection & Artifact Handling
tests/integration/defs/perf/disagg/session_collect.sh, tests/integration/defs/perf/disagg/simple_collect.py
New shell script for orchestrating session collection with configurable install modes (none, wheel, source), system info collection, and TensorRT-LLM version capture. Removed write_trtllm_version() method from TextWriter class; version capture moved to session_collect.sh.
Stress Test Configurations
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md, tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
New comprehensive README documenting stress test structure, configuration templates, field reference, execution flow, output formats, and best practices. New stress test YAML for deepseek-r1 with FP4 precision, defining slurm, benchmark, hardware, environment, profiling, accuracy, and worker configurations.
Perf Test Configurations
tests/integration/defs/perf/disagg/test_configs/wideep/perf/{deepseek-v32-fp4_*.yaml}, tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml, tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/{deepseek-r1,kimi-k2}_*.yaml
Added six new wideep perf YAML configurations for deepseek-v32-fp4 across multiple context/batch/parallelism variants with ccb-DEFAULT and ccb-NIXL backends. Updated deepseek-r1 perf config (benchmark_type and dataset file changes). Increased job_time in two wideep accuracy configs from 00:45:00–02:00:00 to 03:00:00.

Sequence Diagrams

sequenceDiagram
    participant Test as test_stress()
    participant TM as TestCaseTracker
    participant JM as JobManager
    participant SLURM as sbatch
    participant PC as Performance Checker
    participant AC as Accuracy Checker
    
    Test->>TM: track_test(test_config)
    Test->>JM: submit_test_job(test_config)
    JM->>SLURM: sbatch submit.py
    SLURM-->>JM: job_id
    JM->>SLURM: wait_for_completion(job_id)
    SLURM-->>JM: complete
    Test->>JM: check_result(job_id, test_config,<br/>test_category="stress")
    Note over JM: Routes to _check_job_result<br/>with "stress" handling
    JM->>PC: _check_perf_result()
    PC-->>JM: perf_results
    alt perf_success
        JM->>AC: _check_accuracy_result()
        AC-->>JM: accuracy_results
        JM-->>Test: combined perf+accuracy
    else perf_failure
        JM-->>Test: combined results<br/>with error details
    end
    Test->>JM: backup_logs(final_dir)
    JM-->>Test: complete
    TM->>TM: record result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description is minimal and uses an alternative template format (@coderabbitai summary). While it mentions the three main objectives (unified SLURM args, unified session collection, DeepSeek v32 models), it lacks the standard template structure including detailed explanations, test coverage, and PR checklist items. Consider providing a more comprehensive description following the standard template: explain the issue/rationale, detail test coverage, and complete the PR checklist to ensure full context for reviewers.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed Title is partially related to the changeset, specifically capturing unified SLURM and session collection logic but omitting the addition of DeepSeek v32 models, which is a significant change.
Docstring Coverage ✅ Passed Docstring coverage is 95.65% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tests/integration/defs/perf/disagg/utils/common.py (1)

54-59: Typos in default placeholder values.

The default values contain typos: "You" should be "Your".

🔎 Proposed fix for typos
     @staticmethod
     def get_slurm_partition() -> str:
-        return os.getenv("SLURM_PARTITION", "<You slurm partition>")
+        return os.getenv("SLURM_PARTITION", "<Your slurm partition>")

     @staticmethod
     def get_slurm_account() -> str:
-        return os.getenv("SLURM_ACCOUNT", "<You slurm account>")
+        return os.getenv("SLURM_ACCOUNT", "<Your slurm account>")
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)

67-81: Duplicate batch size in CUDA graph configuration.

The batch_sizes list contains 1024 twice (lines 79 and 81). This duplicate may cause unnecessary CUDA graph compilation overhead.

🔎 Proposed fix to remove duplicate
       batch_sizes:
       - 1
       - 2
       - 4
       - 8
       - 16
       - 32
       - 64
       - 128
       - 256
       - 512
       - 768
       - 1024
       - 2048
-      - 1024
tests/integration/defs/perf/disagg/utils/trackers.py (1)

1-11: Missing NVIDIA copyright header.

Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.

🔎 Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 from datetime import datetime
tests/integration/defs/perf/disagg/compare_backends.py (1)

1-11: Missing NVIDIA copyright header.

Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.

🧹 Nitpick comments (11)
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md (1)

218-220: Add language specifiers to fenced code blocks.

Several code blocks are missing language specifiers, which affects rendering and syntax highlighting. Per static analysis hints:

  • Line 218: model_args_extra example
  • Line 238: Test execution flow diagram
  • Line 265: Log directory structure
  • Line 276: CSV output path
  • Line 284: Error directory example
🔎 Proposed fix for code block language specifiers
 **Common `model_args_extra` parameters:**
-```
+```text
 num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096

```diff
 ## Test Execution Flow
 
-```
+```text
 1. Configuration Validation
    ↓
 ...
 ### Log Directory
-```
+```text
 {OUTPUT_PATH}/slurm_logs/disagg_stress_{test_id}/
 ...
 ### CSV Output
-```
+```text
 {OUTPUT_PATH}/perf_script_test_results.csv

```diff
 ### Failed Test Directories
 Failed tests are automatically renamed with `_ERROR` suffix:
-```
+```text
 disagg_stress_{test_id}_ERROR/
</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)</summary><blockquote>

`1-11`: **Consider adding the nvbugs reference for consistency.**

Other similar configuration files (e.g., `deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml`) include a `# nvbugs: 5422621` comment at the top. Adding this reference would maintain consistency across the v32 configuration files.

<details>
<summary>🔎 Suggested fix</summary>

```diff
+# nvbugs: 5422621
 metadata:
   model_name: deepseek-v32-fp4
   precision: fp4
tests/integration/defs/perf/disagg/utils/trackers.py (1)

92-99: Consider capturing and logging the wait_for_completion result.

The return value of wait_for_completion is currently ignored. If the job times out or fails early, this information could be valuable for debugging. Consider capturing and logging the result even if the log file check remains the definitive success indicator.

🔎 Suggested improvement
         # Wait for job completion (reuses wait_for_completion method)
         logger.info(f"Waiting for session collect job {job_id} to complete...")
-        JobManager.wait_for_completion(
+        wait_result = JobManager.wait_for_completion(
             job_id=job_id,
             timeout=7200,  # 2 hours
             test_config=None,  # No test config for session collect
             check_early_failure=False  # Don't check early failures
         )
+        if not wait_result:
+            logger.warning(f"Session collect job {job_id} may have timed out or failed")
tests/integration/defs/perf/disagg/session_collect.sh (1)

35-43: Consider adding a warning when no wheel files are found.

The loop will silently complete even if no wheel files match the pattern, which could mask configuration errors. Adding feedback when no wheels are found would improve debuggability.

🔎 Suggested improvement
     # Expand wildcard and install
+    wheel_found=false
     for wheel_file in $WHEEL_PATH; do
         if [ -f "$wheel_file" ]; then
             echo "Found wheel: $wheel_file"
             pip3 install "$wheel_file" 2>&1 || echo "Wheel install failed, continuing..."
+            wheel_found=true
             break
         fi
     done
+    if [ "$wheel_found" = false ]; then
+        echo "WARNING: No wheel files found matching pattern: $WHEEL_PATH"
+    fi
     echo "Wheel installation completed"
tests/integration/defs/perf/disagg/compare_backends.py (3)

90-90: Rename unused loop variable to indicate intentional disuse.

The base_case variable is extracted from the group but not used within the loop body. Rename it to _base_case to indicate it's intentionally unused.

🔎 Suggested fix
-    for (base_case, metric_type), group in grouped:
+    for (_base_case, metric_type), group in grouped:

160-171: Remove extraneous f-prefixes from strings without placeholders.

These strings don't contain any format placeholders, so the f prefix is unnecessary.

🔎 Suggested fix
     # Print statistics
-    print(f"\n=== Backend Comparison Statistics ===")
+    print("\n=== Backend Comparison Statistics ===")
     print(f"Default backend: {default_backend}")
     print(f"Comparison pairs: {comparison_pairs}")
     print(f"Single-backend cases (skipped): {single_backend_skipped}")
-    print("=" * 37)
+    print("=" * 37)

     # If no comparison pairs found, exit with success
     if comparison_pairs == 0:
-        print(f"\nInfo: No backend comparison pairs found in disagg_perf tests")
-        print(f"All cases are single-backend only, no comparison needed")
+        print("\nInfo: No backend comparison pairs found in disagg_perf tests")
+        print("All cases are single-backend only, no comparison needed")
         sys.exit(0)

108-115: Redundant length checks after early-exit logic.

Since lines 101-103 already ensure both default_data and ucx_data have data (the single-backend cases are skipped), the length checks at lines 109 and 112 will always evaluate to True. The conditional expressions can be simplified.

🔎 Suggested fix
         # Extract values and original test names
-        default_value = default_data["perf_metric"].values[0] if len(default_data) > 0 else None
-        default_original_name = (
-            default_data["network_name"].values[0] if len(default_data) > 0 else None
-        )
+        default_value = default_data["perf_metric"].values[0]
+        default_original_name = default_data["network_name"].values[0]

-        ucx_value = ucx_data["perf_metric"].values[0] if len(ucx_data) > 0 else None
-        ucx_original_name = ucx_data["network_name"].values[0] if len(ucx_data) > 0 else None
+        ucx_value = ucx_data["perf_metric"].values[0]
+        ucx_original_name = ucx_data["network_name"].values[0]

Note: If you keep the redundant checks for defensive coding, you can also remove the None check at line 123 since both values are guaranteed to be non-None at that point (unless the CSV contains actual None/NaN values in the perf_metric column, which is a different concern).

tests/integration/defs/perf/disagg/utils/config_loader.py (1)

411-411: Remove unnecessary f-string prefix.

The f-string has no placeholders. Remove the f prefix.

🔎 Proposed fix
-                    logger.info(f"Using custom accuracy metrics config from YAML")
+                    logger.info("Using custom accuracy metrics config from YAML")
tests/integration/defs/perf/disagg/test_disagg.py (1)

290-292: Consider using bare raise instead of raise e.

Using raise without the exception name preserves the original traceback, which is preferred for re-raising exceptions. However, this pattern is consistent with test_benchmark (line 126) and test_accuracy (line 206), so this is optional to change.

🔎 Proposed fix
         except Exception as e:
             test_tracker.end_test_case()
-            raise e
+            raise
tests/integration/defs/perf/disagg/execution/executor.py (2)

32-40: Use explicit Optional[] for Python 3.8+ compatibility.

Per coding guidelines, code should conform to Python 3.8+. The = None default without explicit Optional type annotation is valid in Python 3.10+ but can cause issues with type checkers in earlier versions.

🔎 Proposed fix
     @staticmethod
     def submit_shell_job(
         job_name: str,
         script_path: str,
-        script_args: list[str] = None,
-        output_log_file: str = None,
+        script_args: Optional[list[str]] = None,
+        output_log_file: Optional[str] = None,
         timeout: int = 7200,
-        container_name: str = None
+        container_name: Optional[str] = None
     ) -> tuple[bool, str]:

Based on coding guidelines: "Code developed for TensorRT-LLM should conform to Python 3.8+"


235-238: Whitespace inconsistency at line 235.

There's trailing whitespace and inconsistent indentation at line 235.

🔎 Proposed fix
-            submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py")
-           
-            case_log_dir = JobManager.get_result_dir(test_config)
-
-            cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir]
+            submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py")
+            case_log_dir = JobManager.get_result_dir(test_config)
+            cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4944192 and 8dcb8dc.

📒 Files selected for processing (22)
  • tests/integration/defs/perf/disagg/compare_backends.py
  • tests/integration/defs/perf/disagg/execution/executor.py
  • tests/integration/defs/perf/disagg/session_collect.sh
  • tests/integration/defs/perf/disagg/simple_collect.py
  • tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md
  • tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/testlist/all.txt
  • tests/integration/defs/perf/disagg/testlist/disagg.txt
  • tests/integration/defs/perf/disagg/testlist/wideep.txt
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/utils/config_loader.py
  • tests/integration/defs/perf/disagg/utils/trackers.py
💤 Files with no reviewable changes (1)
  • tests/integration/defs/perf/disagg/simple_collect.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tests/integration/defs/perf/disagg/utils/config_loader.py
  • tests/integration/defs/perf/disagg/compare_backends.py
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/utils/trackers.py
  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/execution/executor.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/integration/defs/perf/disagg/utils/config_loader.py
  • tests/integration/defs/perf/disagg/compare_backends.py
  • tests/integration/defs/perf/disagg/utils/common.py
  • tests/integration/defs/perf/disagg/utils/trackers.py
  • tests/integration/defs/perf/disagg/test_disagg.py
  • tests/integration/defs/perf/disagg/execution/executor.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
  • tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
🧬 Code graph analysis (3)
tests/integration/defs/perf/disagg/utils/trackers.py (3)
tests/integration/defs/perf/disagg/execution/executor.py (3)
  • JobManager (25-860)
  • submit_session_collect_job (151-193)
  • wait_for_completion (476-535)
tests/integration/defs/perf/disagg/utils/common.py (2)
  • EnvManager (46-175)
  • get_output_path (125-132)
tests/integration/defs/perf/disagg/utils/logger.py (3)
  • info (81-83)
  • success (97-99)
  • error (89-91)
tests/integration/defs/perf/disagg/test_disagg.py (4)
tests/integration/defs/perf/disagg/execution/executor.py (5)
  • submit_test_job (200-270)
  • wait_for_completion (476-535)
  • check_result (353-400)
  • get_result_dir (338-350)
  • backup_logs (273-335)
tests/integration/defs/perf/disagg/utils/config_loader.py (2)
  • TestConfig (173-191)
  • display_name (189-191)
tests/integration/defs/perf/disagg/utils/trackers.py (4)
  • TestCaseTracker (14-61)
  • start_test_case (22-28)
  • end_test_case (30-38)
  • get_timestamps (40-55)
tests/integration/defs/perf/disagg/utils/common.py (3)
  • EnvManager (46-175)
  • get_debug_mode (170-171)
  • get_debug_job_id (174-175)
tests/integration/defs/perf/disagg/execution/executor.py (3)
tests/integration/defs/perf/disagg/reporting/report.py (1)
  • LogParser (31-238)
tests/integration/defs/perf/disagg/utils/common.py (11)
  • EnvManager (46-175)
  • get_container_image (97-98)
  • get_container_mount (139-167)
  • get_output_path (125-132)
  • get_slurm_partition (54-55)
  • get_slurm_account (58-59)
  • get_slurm_extra_args (77-94)
  • get_work_dir (105-106)
  • get_repo_dir (109-110)
  • get_install_mode (135-136)
  • get_trtllm_wheel_path (113-114)
tests/integration/defs/perf/disagg/execution/subprocess_utils.py (2)
  • exec_cmd (18-33)
  • exec_cmd_with_output (36-63)
🪛 markdownlint-cli2 (0.18.1)
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md

218-218: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


238-238: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


265-265: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


276-276: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


284-284: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
tests/integration/defs/perf/disagg/utils/config_loader.py

411-411: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/perf/disagg/compare_backends.py

90-90: Loop control variable base_case not used within loop body

Rename unused base_case to _base_case

(B007)


161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)


170-170: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/perf/disagg/test_disagg.py

230-230: Do not catch blind exception: Exception

(BLE001)


292-292: Use raise without specifying exception name

Remove exception name

(TRY201)


300-300: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/disagg/execution/executor.py

36-36: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


37-37: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


39-39: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


138-138: Consider moving this statement to an else block

(TRY300)


140-140: Do not catch blind exception: Exception

(BLE001)


189-189: Do not catch blind exception: Exception

(BLE001)


323-323: Consider moving this statement to an else block

(TRY300)


325-325: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)
tests/integration/defs/perf/disagg/session_collect.sh

[warning] 21-21: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[warning] 47-47: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)

22-22: LGTM!

The job time increase to 3 hours is reasonable for accuracy tests that include GSM8K evaluation alongside benchmarking.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml (1)

1-110: LGTM!

New performance configuration for DeepSeek-V3.2-FP4 with 8k input/1k output benchmark. Configuration is well-structured with appropriate resource allocation for 6 context servers and proper NIXL cache transceiver backend.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml (1)

1-110: LGTM!

New performance configuration for DeepSeek-V3.2-FP4 with 1k1k benchmark. Configuration is consistent with naming conventions and properly configured for NIXL cache transceiver backend.

tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

1-2: LGTM on stress test structure.

The configuration correctly includes both metadata.accuracy and top-level accuracy sections as required for stress tests. The nvbugs reference (5422621) provides traceability.

tests/integration/defs/perf/disagg/utils/common.py (2)

6-43: Good refactor: Centralized GPU resource configuration.

The consolidation of GPU-specific SLURM parameters into a single GPU_RESOURCE_CONFIG dictionary improves maintainability and makes it easier to add new GPU types or modify existing configurations.


65-94: Well-documented methods with Google-style docstrings.

The updated get_slurm_set_segment() and get_slurm_extra_args() methods include comprehensive docstrings following Google style conventions, with clear descriptions, return types, and examples.

tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)

22-22: Significant job time increase.

Job time increased from 45 minutes to 3 hours. This is reasonable for accuracy tests that require more time for GSM8K evaluation, but ensure the cluster scheduling policy accommodates longer jobs.

tests/integration/defs/perf/disagg/testlist/disagg.txt (1)

27-27: LGTM!

The new stress test entry follows the established naming convention and correctly references the test_stress method with the DEFAULT backend configuration.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

10-12: LGTM!

The benchmark type and dataset file changes are internally consistent—the 1k1k benchmark type correctly maps to the deepseek-r1-1024-1024-100000-ratio-1_for_serve.json dataset file, matching the input_length: 1024 and output_length: 1024 settings.

tests/integration/defs/perf/disagg/testlist/wideep.txt (1)

11-16: LGTM!

The new DeepSeek V3.2 FP4 test entries follow the established naming convention and provide appropriate coverage across different benchmark configurations (8k1k, 1k1k) and backends (DEFAULT, NIXL).

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

1-110: LGTM!

The new DeepSeek V3.2 FP4 configuration is well-structured and consistent with existing configs. The DEFAULT backend setting in both gen and ctx cache_transceiver_config sections correctly aligns with the filename convention.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml (1)

1-116: Configuration looks complete and well-structured.

The YAML configuration follows the established pattern for wideep perf tests. A few observations:

  1. The cuda_graph_config for ctx (line 105) is set to null, which disables CUDA graphs for context servers - this is intentional for context-only workloads.
  2. The num_ctx_servers: 8 with num_gen_servers: 1 ratio matches the filename convention (ctx8_gen1).
  3. MTP3 speculative decoding is correctly configured for both gen and ctx workers.
tests/integration/defs/perf/disagg/testlist/all.txt (2)

29-31: New stress test entry correctly references the test_stress method.

The stress test case at line 30 correctly uses test_stress[...] which aligns with the new test_stress method added to TestDisaggBenchmark in test_disagg.py. The naming convention follows the established pattern.


44-49: New wideep v32 test entries are well-formed.

The new deepseek-v32-fp4 test cases follow the established naming convention and correctly use test_benchmark[...] for performance tests. The variety of configurations (ctx2/ctx6/ctx8, NIXL/DEFAULT backends, different batch sizes) provides good coverage.

tests/integration/defs/perf/disagg/utils/config_loader.py (3)

48-48: Good addition of optional metrics configuration.

The optional metrics field with None default allows per-test custom metrics configuration while maintaining backward compatibility with existing accuracy tests.


71-80: Clean fallback pattern for metrics configuration.

The get_metrics_config() method provides a sensible default fallback to _COMMON_ACCURACY_METRICS when no custom metrics are provided, enabling both flexibility and convenience.


363-414: Unified handling for accuracy and stress test categories is well-implemented.

The logic correctly treats "stress" test category similar to "accuracy" for loading accuracy configurations, while also supporting optional custom metrics overrides. The conditional at line 364 cleanly extends the existing pattern.

tests/integration/defs/perf/disagg/test_disagg.py (3)

17-25: Stress test configuration filtering follows established patterns.

The filtering logic for STRESS_TEST_CONFIGS and conversion to STRESS_TEST_CASES follows the same clean pattern used for performance and accuracy tests.


105-105: Method rename from submit_job to submit_test_job is consistent.

The rename aligns with the updated JobManager API in executor.py.


217-301: New stress test method is well-structured.

The test_stress method correctly combines performance benchmarking with accuracy validation. The implementation follows the established patterns from test_benchmark and test_accuracy, including proper tracker usage, configuration validation, and log backup in the finally block.

tests/integration/defs/perf/disagg/execution/executor.py (4)

82-88: The srun command construction looks correct.

The command properly sets up container execution with the necessary mounts and image configuration. The bash command is correctly quoted to handle arguments with spaces.


150-193: Clean delegation pattern for session collection job.

The submit_session_collect_job method appropriately delegates to the generic submit_shell_job, passing the required environment-derived arguments. This promotes code reuse and consistency.


293-323: Good improvement to backup_logs for failed test identification.

The _ERROR suffix for failed test directories is a practical improvement for quickly identifying failed tests when browsing log directories. The cleanup of existing error directories before renaming is appropriate to avoid conflicts.


801-849: Stress test result checking correctly combines perf and accuracy validation.

The implementation properly:

  1. Runs perf validation first and fails fast if it doesn't pass
  2. Only proceeds to accuracy validation if perf passes
  3. Uses accuracy_config.get_metrics_config() to get the appropriate metrics configuration
  4. Merges results appropriately for both success and failure cases

One consideration: if accuracy_config is None for a stress test, only a warning is logged. This may be intentional for flexibility, but you might want to make accuracy config required for stress tests.

Please verify that all stress test configurations include an accuracy section in their YAML metadata. If accuracy validation is always expected for stress tests, consider making accuracy_config required instead of optional with a warning.

@fredricz-20070104
Copy link
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30109 [ run ] triggered by Bot. Commit: 53c5e9a

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30109 [ run ] completed with state FAILURE. Commit: 53c5e9a
/LLM/main/L0_MergeRequest_PR pipeline #23170 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@fredricz-20070104
Copy link
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30137 [ run ] triggered by Bot. Commit: f7e3db6

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30137 [ run ] completed with state FAILURE. Commit: f7e3db6
/LLM/main/L0_MergeRequest_PR pipeline #23191 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@fredricz-20070104
Copy link
Collaborator Author

/bot run --skip-test

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
… backend result for comparison

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants