[None][test] Unified slurm extra args management and session collection logic #10332

fredricz-20070104 · 2025-12-30T01:56:58Z

Summary by CodeRabbit

Release Notes

New Features
- Added stress test support for disaggregated benchmarks with combined performance and accuracy validation.
- Introduced DEFAULT backend normalization and filtering logic for backend comparison analysis.
- Expanded test configuration suite with new stress and performance test variants.
Documentation
- Added comprehensive README documenting stress test configurations, workflows, and best practices.
Improvements
- Enhanced job submission infrastructure with improved SLURM job ID parsing and result tracking.
- Extended accuracy testing with optional custom metrics configuration support.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

a) Unified slurm extra args management
b) Unified session collection logic
c) Add deepseek v32 models

coderabbitai · 2025-12-30T02:02:46Z

📝 Walkthrough

Walkthrough

This PR extends the disaggregated test framework to support stress testing, refactors job submission via a new JobManager class, introduces shell-based session collection, and expands test configurations for stress and performance benchmarking scenarios across multiple model variants.

Changes

Cohort / File(s)	Summary
Job Submission Refactoring `tests/integration/defs/perf/disagg/execution/executor.py`, `tests/integration/defs/perf/disagg/utils/trackers.py`	Replaced `SlurmRunCommandBuilder` with generalized `JobManager` class. Added `submit_shell_job()` and `submit_session_collect_job()` static methods. Renamed `submit_job()` to `submit_test_job()`. Enhanced result path resolution via `get_result_dir()`. Introduced stress-test routing in `_check_job_result()` with combined perf/accuracy validation. Session collection now uses non-blocking sbatch submission instead of direct command execution.
Configuration Structure & Loading `tests/integration/defs/perf/disagg/utils/config_loader.py`, `tests/integration/defs/perf/disagg/utils/common.py`, `tests/integration/defs/perf/disagg/compare_backends.py`	Added optional `metrics` field to `AccuracyConfig` with custom metrics support and `get_metrics_config()` method. Updated `DEFAULT_METRICS_CONFIG` with stress test metrics. Replaced per-GPU `gres_gpu` entries in `GPU_RESOURCE_CONFIG` with `slurm_extra_args` and `set_segment` keys. Removed derived fields (`dep_flag`, `log_base`, `context_dir`) from `extract_config_fields()`. Extended backend parsing to recognize and normalize `ccb-DEFAULT` entries; added filtering logic for valid comparison pairs.
Test Suite Expansion `tests/integration/defs/perf/disagg/test_disagg.py`, `tests/integration/defs/perf/disagg/testlist/{all,disagg,wideep}.txt`	Added `STRESS_TEST_CONFIGS`, `STRESS_TEST_CASES`, and new `test_stress()` method to `TestDisaggBenchmark` for combined perf/accuracy validation. Updated `submit_job()` calls to `submit_test_job()`. Added stress test entry to testlist/disagg.txt and five new wideep v32 perf test entries across all.txt and wideep.txt.
Session Collection & Artifact Handling `tests/integration/defs/perf/disagg/session_collect.sh`, `tests/integration/defs/perf/disagg/simple_collect.py`	New shell script for orchestrating session collection with configurable install modes (none, wheel, source), system info collection, and TensorRT-LLM version capture. Removed `write_trtllm_version()` method from `TextWriter` class; version capture moved to session_collect.sh.
Stress Test Configurations `tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md`, `tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml`	New comprehensive README documenting stress test structure, configuration templates, field reference, execution flow, output formats, and best practices. New stress test YAML for deepseek-r1 with FP4 precision, defining slurm, benchmark, hardware, environment, profiling, accuracy, and worker configurations.
Perf Test Configurations `tests/integration/defs/perf/disagg/test_configs/wideep/perf/{deepseek-v32-fp4_.yaml}`, `tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml`, `tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/{deepseek-r1,kimi-k2}_.yaml`	Added six new wideep perf YAML configurations for deepseek-v32-fp4 across multiple context/batch/parallelism variants with ccb-DEFAULT and ccb-NIXL backends. Updated deepseek-r1 perf config (benchmark_type and dataset file changes). Increased job_time in two wideep accuracy configs from 00:45:00–02:00:00 to 03:00:00.

Sequence Diagrams

sequenceDiagram
    participant Test as test_stress()
    participant TM as TestCaseTracker
    participant JM as JobManager
    participant SLURM as sbatch
    participant PC as Performance Checker
    participant AC as Accuracy Checker
    
    Test->>TM: track_test(test_config)
    Test->>JM: submit_test_job(test_config)
    JM->>SLURM: sbatch submit.py
    SLURM-->>JM: job_id
    JM->>SLURM: wait_for_completion(job_id)
    SLURM-->>JM: complete
    Test->>JM: check_result(job_id, test_config,<br/>test_category="stress")
    Note over JM: Routes to _check_job_result<br/>with "stress" handling
    JM->>PC: _check_perf_result()
    PC-->>JM: perf_results
    alt perf_success
        JM->>AC: _check_accuracy_result()
        AC-->>JM: accuracy_results
        JM-->>Test: combined perf+accuracy
    else perf_failure
        JM-->>Test: combined results<br/>with error details
    end
    Test->>JM: backup_logs(final_dir)
    JM-->>Test: complete
    TM->>TM: record result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description is minimal and uses an alternative template format (@coderabbitai summary). While it mentions the three main objectives (unified SLURM args, unified session collection, DeepSeek v32 models), it lacks the standard template structure including detailed explanations, test coverage, and PR checklist items.	Consider providing a more comprehensive description following the standard template: explain the issue/rationale, detail test coverage, and complete the PR checklist to ensure full context for reviewers.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title is partially related to the changeset, specifically capturing unified SLURM and session collection logic but omitting the addition of DeepSeek v32 models, which is a significant change.
Docstring Coverage	✅ Passed	Docstring coverage is 95.65% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tests/integration/defs/perf/disagg/utils/common.py (1)
54-59: Typos in default placeholder values.

The default values contain typos: "You" should be "Your".
🔎 Proposed fix for typos
     @staticmethod
     def get_slurm_partition() -> str:
-        return os.getenv("SLURM_PARTITION", "<You slurm partition>")
+        return os.getenv("SLURM_PARTITION", "<Your slurm partition>")

     @staticmethod
     def get_slurm_account() -> str:
-        return os.getenv("SLURM_ACCOUNT", "<You slurm account>")
+        return os.getenv("SLURM_ACCOUNT", "<Your slurm account>")
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)
67-81: Duplicate batch size in CUDA graph configuration.

The batch_sizes list contains 1024 twice (lines 79 and 81). This duplicate may cause unnecessary CUDA graph compilation overhead.
🔎 Proposed fix to remove duplicate
       batch_sizes:
       - 1
       - 2
       - 4
       - 8
       - 16
       - 32
       - 64
       - 128
       - 256
       - 512
       - 768
       - 1024
       - 2048
-      - 1024
tests/integration/defs/perf/disagg/utils/trackers.py (1)
1-11: Missing NVIDIA copyright header.

Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.
🔎 Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 from datetime import datetime
tests/integration/defs/perf/disagg/compare_backends.py (1)

1-11: Missing NVIDIA copyright header.

Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.

🧹 Nitpick comments (11)

tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md (1)
218-220: Add language specifiers to fenced code blocks.

Several code blocks are missing language specifiers, which affects rendering and syntax highlighting. Per static analysis hints:

Line 218: model_args_extra example

Line 238: Test execution flow diagram

Line 265: Log directory structure

Line 276: CSV output path

Line 284: Error directory example
🔎 Proposed fix for code block language specifiers
 **Common `model_args_extra` parameters:**
-```
+```text
 num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
```diff
 ## Test Execution Flow
 
-```
+```text
 1. Configuration Validation
    ↓
 ...
 ### Log Directory
-```
+```text
 {OUTPUT_PATH}/slurm_logs/disagg_stress_{test_id}/
 ...
 ### CSV Output
-```
+```text
 {OUTPUT_PATH}/perf_script_test_results.csv
```diff
 ### Failed Test Directories
 Failed tests are automatically renamed with `_ERROR` suffix:
-```
+```text
 disagg_stress_{test_id}_ERROR/
</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)</summary><blockquote>

`1-11`: **Consider adding the nvbugs reference for consistency.**

Other similar configuration files (e.g., `deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml`) include a `# nvbugs: 5422621` comment at the top. Adding this reference would maintain consistency across the v32 configuration files.

<details>
<summary>🔎 Suggested fix</summary>

```diff
+# nvbugs: 5422621
 metadata:
   model_name: deepseek-v32-fp4
   precision: fp4
tests/integration/defs/perf/disagg/utils/trackers.py (1)
92-99: Consider capturing and logging the wait_for_completion result.

The return value of wait_for_completion is currently ignored. If the job times out or fails early, this information could be valuable for debugging. Consider capturing and logging the result even if the log file check remains the definitive success indicator.
🔎 Suggested improvement
         # Wait for job completion (reuses wait_for_completion method)
         logger.info(f"Waiting for session collect job {job_id} to complete...")
-        JobManager.wait_for_completion(
+        wait_result = JobManager.wait_for_completion(
             job_id=job_id,
             timeout=7200,  # 2 hours
             test_config=None,  # No test config for session collect
             check_early_failure=False  # Don't check early failures
         )
+        if not wait_result:
+            logger.warning(f"Session collect job {job_id} may have timed out or failed")
tests/integration/defs/perf/disagg/session_collect.sh (1)
35-43: Consider adding a warning when no wheel files are found.

The loop will silently complete even if no wheel files match the pattern, which could mask configuration errors. Adding feedback when no wheels are found would improve debuggability.
🔎 Suggested improvement
     # Expand wildcard and install
+    wheel_found=false
     for wheel_file in $WHEEL_PATH; do
         if [ -f "$wheel_file" ]; then
             echo "Found wheel: $wheel_file"
             pip3 install "$wheel_file" 2>&1 || echo "Wheel install failed, continuing..."
+            wheel_found=true
             break
         fi
     done
+    if [ "$wheel_found" = false ]; then
+        echo "WARNING: No wheel files found matching pattern: $WHEEL_PATH"
+    fi
     echo "Wheel installation completed"
tests/integration/defs/perf/disagg/compare_backends.py (3)
90-90: Rename unused loop variable to indicate intentional disuse.

The base_case variable is extracted from the group but not used within the loop body. Rename it to _base_case to indicate it's intentionally unused.
🔎 Suggested fix
-    for (base_case, metric_type), group in grouped:
+    for (_base_case, metric_type), group in grouped:
160-171: Remove extraneous f-prefixes from strings without placeholders.

These strings don't contain any format placeholders, so the f prefix is unnecessary.
🔎 Suggested fix
     # Print statistics
-    print(f"\n=== Backend Comparison Statistics ===")
+    print("\n=== Backend Comparison Statistics ===")
     print(f"Default backend: {default_backend}")
     print(f"Comparison pairs: {comparison_pairs}")
     print(f"Single-backend cases (skipped): {single_backend_skipped}")
-    print("=" * 37)
+    print("=" * 37)

     # If no comparison pairs found, exit with success
     if comparison_pairs == 0:
-        print(f"\nInfo: No backend comparison pairs found in disagg_perf tests")
-        print(f"All cases are single-backend only, no comparison needed")
+        print("\nInfo: No backend comparison pairs found in disagg_perf tests")
+        print("All cases are single-backend only, no comparison needed")
         sys.exit(0)
108-115: Redundant length checks after early-exit logic.

Since lines 101-103 already ensure both default_data and ucx_data have data (the single-backend cases are skipped), the length checks at lines 109 and 112 will always evaluate to True. The conditional expressions can be simplified.
🔎 Suggested fix
         # Extract values and original test names
-        default_value = default_data["perf_metric"].values[0] if len(default_data) > 0 else None
-        default_original_name = (
-            default_data["network_name"].values[0] if len(default_data) > 0 else None
-        )
+        default_value = default_data["perf_metric"].values[0]
+        default_original_name = default_data["network_name"].values[0]

-        ucx_value = ucx_data["perf_metric"].values[0] if len(ucx_data) > 0 else None
-        ucx_original_name = ucx_data["network_name"].values[0] if len(ucx_data) > 0 else None
+        ucx_value = ucx_data["perf_metric"].values[0]
+        ucx_original_name = ucx_data["network_name"].values[0]
Note: If you keep the redundant checks for defensive coding, you can also remove the None check at line 123 since both values are guaranteed to be non-None at that point (unless the CSV contains actual None/NaN values in the perf_metric column, which is a different concern).
tests/integration/defs/perf/disagg/utils/config_loader.py (1)
411-411: Remove unnecessary f-string prefix.

The f-string has no placeholders. Remove the f prefix.
🔎 Proposed fix
-                    logger.info(f"Using custom accuracy metrics config from YAML")
+                    logger.info("Using custom accuracy metrics config from YAML")
tests/integration/defs/perf/disagg/test_disagg.py (1)
290-292: Consider using bare raise instead of raise e.

Using raise without the exception name preserves the original traceback, which is preferred for re-raising exceptions. However, this pattern is consistent with test_benchmark (line 126) and test_accuracy (line 206), so this is optional to change.
🔎 Proposed fix
         except Exception as e:
             test_tracker.end_test_case()
-            raise e
+            raise
tests/integration/defs/perf/disagg/execution/executor.py (2)
32-40: Use explicit Optional[] for Python 3.8+ compatibility.

Per coding guidelines, code should conform to Python 3.8+. The = None default without explicit Optional type annotation is valid in Python 3.10+ but can cause issues with type checkers in earlier versions.
🔎 Proposed fix
     @staticmethod
     def submit_shell_job(
         job_name: str,
         script_path: str,
-        script_args: list[str] = None,
-        output_log_file: str = None,
+        script_args: Optional[list[str]] = None,
+        output_log_file: Optional[str] = None,
         timeout: int = 7200,
-        container_name: str = None
+        container_name: Optional[str] = None
     ) -> tuple[bool, str]:
Based on coding guidelines: "Code developed for TensorRT-LLM should conform to Python 3.8+"

235-238: Whitespace inconsistency at line 235.

There's trailing whitespace and inconsistent indentation at line 235.
🔎 Proposed fix
-            submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py")
-           
-            case_log_dir = JobManager.get_result_dir(test_config)
-
-            cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir]
+            submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py")
+            case_log_dir = JobManager.get_result_dir(test_config)
+            cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir]

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4944192 and 8dcb8dc.

📒 Files selected for processing (22)

tests/integration/defs/perf/disagg/compare_backends.py
tests/integration/defs/perf/disagg/execution/executor.py
tests/integration/defs/perf/disagg/session_collect.sh
tests/integration/defs/perf/disagg/simple_collect.py
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md
tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/testlist/all.txt
tests/integration/defs/perf/disagg/testlist/disagg.txt
tests/integration/defs/perf/disagg/testlist/wideep.txt
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/utils/config_loader.py
tests/integration/defs/perf/disagg/utils/trackers.py

💤 Files with no reviewable changes (1)

tests/integration/defs/perf/disagg/simple_collect.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

tests/integration/defs/perf/disagg/utils/config_loader.py
tests/integration/defs/perf/disagg/compare_backends.py
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/utils/trackers.py
tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/execution/executor.py

**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

tests/integration/defs/perf/disagg/utils/config_loader.py
tests/integration/defs/perf/disagg/compare_backends.py
tests/integration/defs/perf/disagg/utils/common.py
tests/integration/defs/perf/disagg/utils/trackers.py
tests/integration/defs/perf/disagg/test_disagg.py
tests/integration/defs/perf/disagg/execution/executor.py

🧠 Learnings (2)

📚 Learning: 2025-08-26T09:49:04.956Z

Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml

📚 Learning: 2025-08-18T08:42:02.640Z

Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml

🧬 Code graph analysis (3)

tests/integration/defs/perf/disagg/utils/trackers.py (3)

tests/integration/defs/perf/disagg/execution/executor.py (3)

JobManager (25-860)

submit_session_collect_job (151-193)

wait_for_completion (476-535)

tests/integration/defs/perf/disagg/utils/common.py (2)

EnvManager (46-175)

get_output_path (125-132)

tests/integration/defs/perf/disagg/utils/logger.py (3)

info (81-83)

success (97-99)

error (89-91)

tests/integration/defs/perf/disagg/test_disagg.py (4)

tests/integration/defs/perf/disagg/execution/executor.py (5)

submit_test_job (200-270)

wait_for_completion (476-535)

check_result (353-400)

get_result_dir (338-350)

backup_logs (273-335)

tests/integration/defs/perf/disagg/utils/config_loader.py (2)

TestConfig (173-191)

display_name (189-191)

tests/integration/defs/perf/disagg/utils/trackers.py (4)

TestCaseTracker (14-61)

start_test_case (22-28)

end_test_case (30-38)

get_timestamps (40-55)

tests/integration/defs/perf/disagg/utils/common.py (3)

EnvManager (46-175)

get_debug_mode (170-171)

get_debug_job_id (174-175)

tests/integration/defs/perf/disagg/execution/executor.py (3)

tests/integration/defs/perf/disagg/reporting/report.py (1)

LogParser (31-238)

tests/integration/defs/perf/disagg/utils/common.py (11)

EnvManager (46-175)

get_container_image (97-98)

get_container_mount (139-167)

get_output_path (125-132)

get_slurm_partition (54-55)

get_slurm_account (58-59)

get_slurm_extra_args (77-94)

get_work_dir (105-106)

get_repo_dir (109-110)

get_install_mode (135-136)

get_trtllm_wheel_path (113-114)

tests/integration/defs/perf/disagg/execution/subprocess_utils.py (2)

exec_cmd (18-33)

exec_cmd_with_output (36-63)

🪛 markdownlint-cli2 (0.18.1)

tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md

218-218: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

238-238: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

265-265: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

276-276: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

284-284: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)

tests/integration/defs/perf/disagg/utils/config_loader.py

411-411: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/perf/disagg/compare_backends.py

90-90: Loop control variable base_case not used within loop body

Rename unused base_case to _base_case

(B007)

161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)

169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)

170-170: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/perf/disagg/test_disagg.py

230-230: Do not catch blind exception: Exception

(BLE001)

292-292: Use raise without specifying exception name

Remove exception name

(TRY201)

300-300: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/disagg/execution/executor.py

36-36: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

37-37: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

39-39: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

138-138: Consider moving this statement to an else block

(TRY300)

140-140: Do not catch blind exception: Exception

(BLE001)

189-189: Do not catch blind exception: Exception

(BLE001)

323-323: Consider moving this statement to an else block

(TRY300)

325-325: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)

tests/integration/defs/perf/disagg/session_collect.sh

[warning] 21-21: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

[warning] 47-47: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (24)

tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)

22-22: LGTM!

The job time increase to 3 hours is reasonable for accuracy tests that include GSM8K evaluation alongside benchmarking.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml (1)

1-110: LGTM!

New performance configuration for DeepSeek-V3.2-FP4 with 8k input/1k output benchmark. Configuration is well-structured with appropriate resource allocation for 6 context servers and proper NIXL cache transceiver backend.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml (1)

1-110: LGTM!

New performance configuration for DeepSeek-V3.2-FP4 with 1k1k benchmark. Configuration is consistent with naming conventions and properly configured for NIXL cache transceiver backend.

tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

1-2: LGTM on stress test structure.

The configuration correctly includes both metadata.accuracy and top-level accuracy sections as required for stress tests. The nvbugs reference (5422621) provides traceability.

tests/integration/defs/perf/disagg/utils/common.py (2)

6-43: Good refactor: Centralized GPU resource configuration.

The consolidation of GPU-specific SLURM parameters into a single GPU_RESOURCE_CONFIG dictionary improves maintainability and makes it easier to add new GPU types or modify existing configurations.

65-94: Well-documented methods with Google-style docstrings.

The updated get_slurm_set_segment() and get_slurm_extra_args() methods include comprehensive docstrings following Google style conventions, with clear descriptions, return types, and examples.

tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)

22-22: Significant job time increase.

Job time increased from 45 minutes to 3 hours. This is reasonable for accuracy tests that require more time for GSM8K evaluation, but ensure the cluster scheduling policy accommodates longer jobs.

tests/integration/defs/perf/disagg/testlist/disagg.txt (1)

27-27: LGTM!

The new stress test entry follows the established naming convention and correctly references the test_stress method with the DEFAULT backend configuration.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

10-12: LGTM!

The benchmark type and dataset file changes are internally consistent—the 1k1k benchmark type correctly maps to the deepseek-r1-1024-1024-100000-ratio-1_for_serve.json dataset file, matching the input_length: 1024 and output_length: 1024 settings.

tests/integration/defs/perf/disagg/testlist/wideep.txt (1)

11-16: LGTM!

The new DeepSeek V3.2 FP4 test entries follow the established naming convention and provide appropriate coverage across different benchmark configurations (8k1k, 1k1k) and backends (DEFAULT, NIXL).

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)

1-110: LGTM!

The new DeepSeek V3.2 FP4 configuration is well-structured and consistent with existing configs. The DEFAULT backend setting in both gen and ctx cache_transceiver_config sections correctly aligns with the filename convention.

tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml (1)

1-116: Configuration looks complete and well-structured.

The YAML configuration follows the established pattern for wideep perf tests. A few observations:

The cuda_graph_config for ctx (line 105) is set to null, which disables CUDA graphs for context servers - this is intentional for context-only workloads.

The num_ctx_servers: 8 with num_gen_servers: 1 ratio matches the filename convention (ctx8_gen1).

MTP3 speculative decoding is correctly configured for both gen and ctx workers.

tests/integration/defs/perf/disagg/testlist/all.txt (2)

29-31: New stress test entry correctly references the test_stress method.

The stress test case at line 30 correctly uses test_stress[...] which aligns with the new test_stress method added to TestDisaggBenchmark in test_disagg.py. The naming convention follows the established pattern.

44-49: New wideep v32 test entries are well-formed.

The new deepseek-v32-fp4 test cases follow the established naming convention and correctly use test_benchmark[...] for performance tests. The variety of configurations (ctx2/ctx6/ctx8, NIXL/DEFAULT backends, different batch sizes) provides good coverage.

tests/integration/defs/perf/disagg/utils/config_loader.py (3)

48-48: Good addition of optional metrics configuration.

The optional metrics field with None default allows per-test custom metrics configuration while maintaining backward compatibility with existing accuracy tests.

71-80: Clean fallback pattern for metrics configuration.

The get_metrics_config() method provides a sensible default fallback to _COMMON_ACCURACY_METRICS when no custom metrics are provided, enabling both flexibility and convenience.

363-414: Unified handling for accuracy and stress test categories is well-implemented.

The logic correctly treats "stress" test category similar to "accuracy" for loading accuracy configurations, while also supporting optional custom metrics overrides. The conditional at line 364 cleanly extends the existing pattern.

tests/integration/defs/perf/disagg/test_disagg.py (3)

17-25: Stress test configuration filtering follows established patterns.

The filtering logic for STRESS_TEST_CONFIGS and conversion to STRESS_TEST_CASES follows the same clean pattern used for performance and accuracy tests.

105-105: Method rename from submit_job to submit_test_job is consistent.

The rename aligns with the updated JobManager API in executor.py.

217-301: New stress test method is well-structured.

The test_stress method correctly combines performance benchmarking with accuracy validation. The implementation follows the established patterns from test_benchmark and test_accuracy, including proper tracker usage, configuration validation, and log backup in the finally block.

tests/integration/defs/perf/disagg/execution/executor.py (4)

82-88: The srun command construction looks correct.

The command properly sets up container execution with the necessary mounts and image configuration. The bash command is correctly quoted to handle arguments with spaces.

150-193: Clean delegation pattern for session collection job.

The submit_session_collect_job method appropriately delegates to the generic submit_shell_job, passing the required environment-derived arguments. This promotes code reuse and consistency.

293-323: Good improvement to backup_logs for failed test identification.

The _ERROR suffix for failed test directories is a practical improvement for quickly identifying failed tests when browsing log directories. The cleanup of existing error directories before renaming is appropriate to avoid conflicts.

801-849: Stress test result checking correctly combines perf and accuracy validation.

The implementation properly:

Runs perf validation first and fails fast if it doesn't pass

Only proceeds to accuracy validation if perf passes

Uses accuracy_config.get_metrics_config() to get the appropriate metrics configuration

Merges results appropriately for both success and failure cases

One consideration: if accuracy_config is None for a stress test, only a warning is logged. This may be intentional for flexibility, but you might want to make accuracy config required for stress tests.

Please verify that all stress test configurations include an accuracy section in their YAML metadata. If accuracy validation is always expected for stress tests, consider making accuracy_config required instead of optional with a warning.

tests/integration/defs/perf/disagg/session_collect.sh

...onfigs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml

tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md

...onfigs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml

tests/integration/defs/perf/disagg/test_disagg.py

tests/integration/defs/perf/disagg/utils/common.py

fredricz-20070104 · 2025-12-30T02:09:16Z

/bot run --skip-test

tensorrt-cicd · 2025-12-30T02:15:19Z

PR_Github #30109 [ run ] triggered by Bot. Commit: 53c5e9a

tensorrt-cicd · 2025-12-30T02:55:31Z

PR_Github #30109 [ run ] completed with state FAILURE. Commit: 53c5e9a
/LLM/main/L0_MergeRequest_PR pipeline #23170 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

fredricz-20070104 · 2025-12-30T06:23:53Z

/bot run --skip-test

tensorrt-cicd · 2025-12-30T06:30:03Z

PR_Github #30137 [ run ] triggered by Bot. Commit: f7e3db6

tensorrt-cicd · 2025-12-30T07:12:32Z

PR_Github #30137 [ run ] completed with state FAILURE. Commit: f7e3db6
/LLM/main/L0_MergeRequest_PR pipeline #23191 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

fredricz-20070104 · 2025-12-31T03:19:53Z

/bot run --skip-test

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

… backend result for comparison Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 changed the title ~~[None][test] Unified slurm extra args management and session collection logic~~ [None][chore] Unified slurm extra args management and session collection logic Dec 30, 2025

fredricz-20070104 changed the title ~~[None][chore] Unified slurm extra args management and session collection logic~~ [None][test] Unified slurm extra args management and session collection logic Dec 30, 2025

fredricz-20070104 force-pushed the feature/fz_sbatch branch from 8dcb8dc to 53c5e9a Compare December 30, 2025 02:01

fredricz-20070104 enabled auto-merge (squash) December 30, 2025 02:01

fredricz-20070104 assigned yingguo-trt Dec 30, 2025

coderabbitai bot reviewed Dec 30, 2025

View reviewed changes

yingguo-trt approved these changes Dec 30, 2025

View reviewed changes

fredricz-20070104 force-pushed the feature/fz_sbatch branch from 53c5e9a to 2b03e2a Compare December 30, 2025 06:22

fredricz-20070104 force-pushed the feature/fz_sbatch branch from f7e3db6 to 01617ba Compare December 31, 2025 05:37

fredricz-20070104 added 10 commits December 31, 2025 13:38

fix slurm log path error

cf6326f

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fix default backend issue - adapt for the current logic - skip single…

eabfb49

… backend result for comparison Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx

2a119f6

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

add stress test cases here

6aa580f

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fix conflict

cb50e9b

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fix srun logic to sbatch

dd7f984

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx

a0c3167

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

try to fix sbatch isssue

0f36e28

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

expand the logic of slurm extra args

775b205

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fx

d5f3f35

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 force-pushed the feature/fz_sbatch branch from 01617ba to d5f3f35 Compare December 31, 2025 05:38

[None][test] Unified slurm extra args management and session collection logic #10332

Are you sure you want to change the base?

[None][test] Unified slurm extra args management and session collection logic #10332

Conversation

fredricz-20070104 commented Dec 30, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 30, 2025

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fredricz-20070104 commented Dec 30, 2025

Uh oh!

tensorrt-cicd commented Dec 30, 2025

Uh oh!

tensorrt-cicd commented Dec 30, 2025

Uh oh!

fredricz-20070104 commented Dec 30, 2025

Uh oh!

tensorrt-cicd commented Dec 30, 2025

Uh oh!

tensorrt-cicd commented Dec 30, 2025

Uh oh!

fredricz-20070104 commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fredricz-20070104 commented Dec 30, 2025 •

edited by coderabbitai bot

Loading