-
Notifications
You must be signed in to change notification settings - Fork 2k
[None][test] Unified slurm extra args management and session collection logic #10332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[None][test] Unified slurm extra args management and session collection logic #10332
Conversation
8dcb8dc to
53c5e9a
Compare
📝 WalkthroughWalkthroughThis PR extends the disaggregated test framework to support stress testing, refactors job submission via a new JobManager class, introduces shell-based session collection, and expands test configurations for stress and performance benchmarking scenarios across multiple model variants. Changes
Sequence DiagramssequenceDiagram
participant Test as test_stress()
participant TM as TestCaseTracker
participant JM as JobManager
participant SLURM as sbatch
participant PC as Performance Checker
participant AC as Accuracy Checker
Test->>TM: track_test(test_config)
Test->>JM: submit_test_job(test_config)
JM->>SLURM: sbatch submit.py
SLURM-->>JM: job_id
JM->>SLURM: wait_for_completion(job_id)
SLURM-->>JM: complete
Test->>JM: check_result(job_id, test_config,<br/>test_category="stress")
Note over JM: Routes to _check_job_result<br/>with "stress" handling
JM->>PC: _check_perf_result()
PC-->>JM: perf_results
alt perf_success
JM->>AC: _check_accuracy_result()
AC-->>JM: accuracy_results
JM-->>Test: combined perf+accuracy
else perf_failure
JM-->>Test: combined results<br/>with error details
end
Test->>JM: backup_logs(final_dir)
JM-->>Test: complete
TM->>TM: record result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/integration/defs/perf/disagg/utils/common.py (1)
54-59: Typos in default placeholder values.The default values contain typos: "You" should be "Your".
🔎 Proposed fix for typos
@staticmethod def get_slurm_partition() -> str: - return os.getenv("SLURM_PARTITION", "<You slurm partition>") + return os.getenv("SLURM_PARTITION", "<Your slurm partition>") @staticmethod def get_slurm_account() -> str: - return os.getenv("SLURM_ACCOUNT", "<You slurm account>") + return os.getenv("SLURM_ACCOUNT", "<Your slurm account>")tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)
67-81: Duplicate batch size in CUDA graph configuration.The
batch_sizeslist contains1024twice (lines 79 and 81). This duplicate may cause unnecessary CUDA graph compilation overhead.🔎 Proposed fix to remove duplicate
batch_sizes: - 1 - 2 - 4 - 8 - 16 - 32 - 64 - 128 - 256 - 512 - 768 - 1024 - 2048 - - 1024tests/integration/defs/perf/disagg/utils/trackers.py (1)
1-11: Missing NVIDIA copyright header.Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.
🔎 Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os from datetime import datetimetests/integration/defs/perf/disagg/compare_backends.py (1)
1-11: Missing NVIDIA copyright header.Per the coding guidelines, all Python files should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.
🧹 Nitpick comments (11)
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md (1)
218-220: Add language specifiers to fenced code blocks.Several code blocks are missing language specifiers, which affects rendering and syntax highlighting. Per static analysis hints:
- Line 218:
model_args_extraexample- Line 238: Test execution flow diagram
- Line 265: Log directory structure
- Line 276: CSV output path
- Line 284: Error directory example
🔎 Proposed fix for code block language specifiers
**Common `model_args_extra` parameters:** -``` +```text num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096```diff ## Test Execution Flow -``` +```text 1. Configuration Validation ↓ ...### Log Directory -``` +```text {OUTPUT_PATH}/slurm_logs/disagg_stress_{test_id}/ ...### CSV Output -``` +```text {OUTPUT_PATH}/perf_script_test_results.csv```diff ### Failed Test Directories Failed tests are automatically renamed with `_ERROR` suffix: -``` +```text disagg_stress_{test_id}_ERROR/</details> </blockquote></details> <details> <summary>tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)</summary><blockquote> `1-11`: **Consider adding the nvbugs reference for consistency.** Other similar configuration files (e.g., `deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml`) include a `# nvbugs: 5422621` comment at the top. Adding this reference would maintain consistency across the v32 configuration files. <details> <summary>🔎 Suggested fix</summary> ```diff +# nvbugs: 5422621 metadata: model_name: deepseek-v32-fp4 precision: fp4tests/integration/defs/perf/disagg/utils/trackers.py (1)
92-99: Consider capturing and logging thewait_for_completionresult.The return value of
wait_for_completionis currently ignored. If the job times out or fails early, this information could be valuable for debugging. Consider capturing and logging the result even if the log file check remains the definitive success indicator.🔎 Suggested improvement
# Wait for job completion (reuses wait_for_completion method) logger.info(f"Waiting for session collect job {job_id} to complete...") - JobManager.wait_for_completion( + wait_result = JobManager.wait_for_completion( job_id=job_id, timeout=7200, # 2 hours test_config=None, # No test config for session collect check_early_failure=False # Don't check early failures ) + if not wait_result: + logger.warning(f"Session collect job {job_id} may have timed out or failed")tests/integration/defs/perf/disagg/session_collect.sh (1)
35-43: Consider adding a warning when no wheel files are found.The loop will silently complete even if no wheel files match the pattern, which could mask configuration errors. Adding feedback when no wheels are found would improve debuggability.
🔎 Suggested improvement
# Expand wildcard and install + wheel_found=false for wheel_file in $WHEEL_PATH; do if [ -f "$wheel_file" ]; then echo "Found wheel: $wheel_file" pip3 install "$wheel_file" 2>&1 || echo "Wheel install failed, continuing..." + wheel_found=true break fi done + if [ "$wheel_found" = false ]; then + echo "WARNING: No wheel files found matching pattern: $WHEEL_PATH" + fi echo "Wheel installation completed"tests/integration/defs/perf/disagg/compare_backends.py (3)
90-90: Rename unused loop variable to indicate intentional disuse.The
base_casevariable is extracted from the group but not used within the loop body. Rename it to_base_caseto indicate it's intentionally unused.🔎 Suggested fix
- for (base_case, metric_type), group in grouped: + for (_base_case, metric_type), group in grouped:
160-171: Remove extraneous f-prefixes from strings without placeholders.These strings don't contain any format placeholders, so the
fprefix is unnecessary.🔎 Suggested fix
# Print statistics - print(f"\n=== Backend Comparison Statistics ===") + print("\n=== Backend Comparison Statistics ===") print(f"Default backend: {default_backend}") print(f"Comparison pairs: {comparison_pairs}") print(f"Single-backend cases (skipped): {single_backend_skipped}") - print("=" * 37) + print("=" * 37) # If no comparison pairs found, exit with success if comparison_pairs == 0: - print(f"\nInfo: No backend comparison pairs found in disagg_perf tests") - print(f"All cases are single-backend only, no comparison needed") + print("\nInfo: No backend comparison pairs found in disagg_perf tests") + print("All cases are single-backend only, no comparison needed") sys.exit(0)
108-115: Redundant length checks after early-exit logic.Since lines 101-103 already ensure both
default_dataanducx_datahave data (the single-backend cases are skipped), the length checks at lines 109 and 112 will always evaluate toTrue. The conditional expressions can be simplified.🔎 Suggested fix
# Extract values and original test names - default_value = default_data["perf_metric"].values[0] if len(default_data) > 0 else None - default_original_name = ( - default_data["network_name"].values[0] if len(default_data) > 0 else None - ) + default_value = default_data["perf_metric"].values[0] + default_original_name = default_data["network_name"].values[0] - ucx_value = ucx_data["perf_metric"].values[0] if len(ucx_data) > 0 else None - ucx_original_name = ucx_data["network_name"].values[0] if len(ucx_data) > 0 else None + ucx_value = ucx_data["perf_metric"].values[0] + ucx_original_name = ucx_data["network_name"].values[0]Note: If you keep the redundant checks for defensive coding, you can also remove the
Nonecheck at line 123 since both values are guaranteed to be non-None at that point (unless the CSV contains actualNone/NaNvalues in theperf_metriccolumn, which is a different concern).tests/integration/defs/perf/disagg/utils/config_loader.py (1)
411-411: Remove unnecessary f-string prefix.The f-string has no placeholders. Remove the
fprefix.🔎 Proposed fix
- logger.info(f"Using custom accuracy metrics config from YAML") + logger.info("Using custom accuracy metrics config from YAML")tests/integration/defs/perf/disagg/test_disagg.py (1)
290-292: Consider using bareraiseinstead ofraise e.Using
raisewithout the exception name preserves the original traceback, which is preferred for re-raising exceptions. However, this pattern is consistent withtest_benchmark(line 126) andtest_accuracy(line 206), so this is optional to change.🔎 Proposed fix
except Exception as e: test_tracker.end_test_case() - raise e + raisetests/integration/defs/perf/disagg/execution/executor.py (2)
32-40: Use explicitOptional[]for Python 3.8+ compatibility.Per coding guidelines, code should conform to Python 3.8+. The
= Nonedefault without explicitOptionaltype annotation is valid in Python 3.10+ but can cause issues with type checkers in earlier versions.🔎 Proposed fix
@staticmethod def submit_shell_job( job_name: str, script_path: str, - script_args: list[str] = None, - output_log_file: str = None, + script_args: Optional[list[str]] = None, + output_log_file: Optional[str] = None, timeout: int = 7200, - container_name: str = None + container_name: Optional[str] = None ) -> tuple[bool, str]:Based on coding guidelines: "Code developed for TensorRT-LLM should conform to Python 3.8+"
235-238: Whitespace inconsistency at line 235.There's trailing whitespace and inconsistent indentation at line 235.
🔎 Proposed fix
- submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py") - - case_log_dir = JobManager.get_result_dir(test_config) - - cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir] + submit_script = os.path.join(EnvManager.get_script_dir(), "submit.py") + case_log_dir = JobManager.get_result_dir(test_config) + cmd = ["python3", submit_script, "-c", temp_config_path, "--log-dir", case_log_dir]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
tests/integration/defs/perf/disagg/compare_backends.pytests/integration/defs/perf/disagg/execution/executor.pytests/integration/defs/perf/disagg/session_collect.shtests/integration/defs/perf/disagg/simple_collect.pytests/integration/defs/perf/disagg/test_configs/disagg/stress/README.mdtests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/testlist/all.txttests/integration/defs/perf/disagg/testlist/disagg.txttests/integration/defs/perf/disagg/testlist/wideep.txttests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/utils/config_loader.pytests/integration/defs/perf/disagg/utils/trackers.py
💤 Files with no reviewable changes (1)
- tests/integration/defs/perf/disagg/simple_collect.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming:some_file.py
Python classes should use PascalCase naming:class SomeClass
Python functions and methods should use snake_case naming:def my_awesome_function():
Python local variables should use snake_case naming:my_variable = ...
Python variable names that start with a number should be prefixed with 'k':k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G':G_MY_GLOBAL = ...
Python constants should use upper snake_case naming:MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tests/integration/defs/perf/disagg/utils/config_loader.pytests/integration/defs/perf/disagg/compare_backends.pytests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/utils/trackers.pytests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/execution/executor.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tests/integration/defs/perf/disagg/utils/config_loader.pytests/integration/defs/perf/disagg/compare_backends.pytests/integration/defs/perf/disagg/utils/common.pytests/integration/defs/perf/disagg/utils/trackers.pytests/integration/defs/perf/disagg/test_disagg.pytests/integration/defs/perf/disagg/execution/executor.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.
Applied to files:
tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yamltests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
🧬 Code graph analysis (3)
tests/integration/defs/perf/disagg/utils/trackers.py (3)
tests/integration/defs/perf/disagg/execution/executor.py (3)
JobManager(25-860)submit_session_collect_job(151-193)wait_for_completion(476-535)tests/integration/defs/perf/disagg/utils/common.py (2)
EnvManager(46-175)get_output_path(125-132)tests/integration/defs/perf/disagg/utils/logger.py (3)
info(81-83)success(97-99)error(89-91)
tests/integration/defs/perf/disagg/test_disagg.py (4)
tests/integration/defs/perf/disagg/execution/executor.py (5)
submit_test_job(200-270)wait_for_completion(476-535)check_result(353-400)get_result_dir(338-350)backup_logs(273-335)tests/integration/defs/perf/disagg/utils/config_loader.py (2)
TestConfig(173-191)display_name(189-191)tests/integration/defs/perf/disagg/utils/trackers.py (4)
TestCaseTracker(14-61)start_test_case(22-28)end_test_case(30-38)get_timestamps(40-55)tests/integration/defs/perf/disagg/utils/common.py (3)
EnvManager(46-175)get_debug_mode(170-171)get_debug_job_id(174-175)
tests/integration/defs/perf/disagg/execution/executor.py (3)
tests/integration/defs/perf/disagg/reporting/report.py (1)
LogParser(31-238)tests/integration/defs/perf/disagg/utils/common.py (11)
EnvManager(46-175)get_container_image(97-98)get_container_mount(139-167)get_output_path(125-132)get_slurm_partition(54-55)get_slurm_account(58-59)get_slurm_extra_args(77-94)get_work_dir(105-106)get_repo_dir(109-110)get_install_mode(135-136)get_trtllm_wheel_path(113-114)tests/integration/defs/perf/disagg/execution/subprocess_utils.py (2)
exec_cmd(18-33)exec_cmd_with_output(36-63)
🪛 markdownlint-cli2 (0.18.1)
tests/integration/defs/perf/disagg/test_configs/disagg/stress/README.md
218-218: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
238-238: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
265-265: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
276-276: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
284-284: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.10)
tests/integration/defs/perf/disagg/utils/config_loader.py
411-411: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/integration/defs/perf/disagg/compare_backends.py
90-90: Loop control variable base_case not used within loop body
Rename unused base_case to _base_case
(B007)
161-161: f-string without any placeholders
Remove extraneous f prefix
(F541)
169-169: f-string without any placeholders
Remove extraneous f prefix
(F541)
170-170: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/integration/defs/perf/disagg/test_disagg.py
230-230: Do not catch blind exception: Exception
(BLE001)
292-292: Use raise without specifying exception name
Remove exception name
(TRY201)
300-300: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/perf/disagg/execution/executor.py
36-36: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
37-37: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
39-39: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
138-138: Consider moving this statement to an else block
(TRY300)
140-140: Do not catch blind exception: Exception
(BLE001)
189-189: Do not catch blind exception: Exception
(BLE001)
323-323: Consider moving this statement to an else block
(TRY300)
325-325: Do not catch blind exception: Exception
(BLE001)
🪛 Shellcheck (0.11.0)
tests/integration/defs/perf/disagg/session_collect.sh
[warning] 21-21: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 47-47: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/deepseek-r1-fp4_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml (1)
22-22: LGTM!The job time increase to 3 hours is reasonable for accuracy tests that include GSM8K evaluation alongside benchmarking.
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx6_gen1_dep16_bs64_eplb288_mtp0_ccb-NIXL.yaml (1)
1-110: LGTM!New performance configuration for DeepSeek-V3.2-FP4 with 8k input/1k output benchmark. Configuration is well-structured with appropriate resource allocation for 6 context servers and proper NIXL cache transceiver backend.
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb288_mtp0_ccb-NIXL.yaml (1)
1-110: LGTM!New performance configuration for DeepSeek-V3.2-FP4 with 1k1k benchmark. Configuration is consistent with naming conventions and properly configured for NIXL cache transceiver backend.
tests/integration/defs/perf/disagg/test_configs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)
1-2: LGTM on stress test structure.The configuration correctly includes both
metadata.accuracyand top-levelaccuracysections as required for stress tests. The nvbugs reference (5422621) provides traceability.tests/integration/defs/perf/disagg/utils/common.py (2)
6-43: Good refactor: Centralized GPU resource configuration.The consolidation of GPU-specific SLURM parameters into a single
GPU_RESOURCE_CONFIGdictionary improves maintainability and makes it easier to add new GPU types or modify existing configurations.
65-94: Well-documented methods with Google-style docstrings.The updated
get_slurm_set_segment()andget_slurm_extra_args()methods include comprehensive docstrings following Google style conventions, with clear descriptions, return types, and examples.tests/integration/defs/perf/disagg/test_configs/wideep/accuracy/kimi-k2-thinking-fp4_1k1k_ctx3_gen1_dep32_bs1024_eplb384_mtp0_ccb-NIXL.yaml (1)
22-22: Significant job time increase.Job time increased from 45 minutes to 3 hours. This is reasonable for accuracy tests that require more time for GSM8K evaluation, but ensure the cluster scheduling policy accommodates longer jobs.
tests/integration/defs/perf/disagg/testlist/disagg.txt (1)
27-27: LGTM!The new stress test entry follows the established naming convention and correctly references the
test_stressmethod with the DEFAULT backend configuration.tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)
10-12: LGTM!The benchmark type and dataset file changes are internally consistent—the 1k1k benchmark type correctly maps to the
deepseek-r1-1024-1024-100000-ratio-1_for_serve.jsondataset file, matching theinput_length: 1024andoutput_length: 1024settings.tests/integration/defs/perf/disagg/testlist/wideep.txt (1)
11-16: LGTM!The new DeepSeek V3.2 FP4 test entries follow the established naming convention and provide appropriate coverage across different benchmark configurations (8k1k, 1k1k) and backends (DEFAULT, NIXL).
tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml (1)
1-110: LGTM!The new DeepSeek V3.2 FP4 configuration is well-structured and consistent with existing configs. The DEFAULT backend setting in both
genandctxcache_transceiver_configsections correctly aligns with the filename convention.tests/integration/defs/perf/disagg/test_configs/wideep/perf/deepseek-v32-fp4_8k1k_ctx8_gen1_dep32_bs16_eplb288_mtp3_ccb-NIXL.yaml (1)
1-116: Configuration looks complete and well-structured.The YAML configuration follows the established pattern for wideep perf tests. A few observations:
- The
cuda_graph_configfor ctx (line 105) is set tonull, which disables CUDA graphs for context servers - this is intentional for context-only workloads.- The
num_ctx_servers: 8withnum_gen_servers: 1ratio matches the filename convention (ctx8_gen1).- MTP3 speculative decoding is correctly configured for both gen and ctx workers.
tests/integration/defs/perf/disagg/testlist/all.txt (2)
29-31: New stress test entry correctly references thetest_stressmethod.The stress test case at line 30 correctly uses
test_stress[...]which aligns with the newtest_stressmethod added toTestDisaggBenchmarkintest_disagg.py. The naming convention follows the established pattern.
44-49: New wideep v32 test entries are well-formed.The new deepseek-v32-fp4 test cases follow the established naming convention and correctly use
test_benchmark[...]for performance tests. The variety of configurations (ctx2/ctx6/ctx8, NIXL/DEFAULT backends, different batch sizes) provides good coverage.tests/integration/defs/perf/disagg/utils/config_loader.py (3)
48-48: Good addition of optional metrics configuration.The optional
metricsfield withNonedefault allows per-test custom metrics configuration while maintaining backward compatibility with existing accuracy tests.
71-80: Clean fallback pattern for metrics configuration.The
get_metrics_config()method provides a sensible default fallback to_COMMON_ACCURACY_METRICSwhen no custom metrics are provided, enabling both flexibility and convenience.
363-414: Unified handling for accuracy and stress test categories is well-implemented.The logic correctly treats "stress" test category similar to "accuracy" for loading accuracy configurations, while also supporting optional custom metrics overrides. The conditional at line 364 cleanly extends the existing pattern.
tests/integration/defs/perf/disagg/test_disagg.py (3)
17-25: Stress test configuration filtering follows established patterns.The filtering logic for
STRESS_TEST_CONFIGSand conversion toSTRESS_TEST_CASESfollows the same clean pattern used for performance and accuracy tests.
105-105: Method rename fromsubmit_jobtosubmit_test_jobis consistent.The rename aligns with the updated
JobManagerAPI inexecutor.py.
217-301: New stress test method is well-structured.The
test_stressmethod correctly combines performance benchmarking with accuracy validation. The implementation follows the established patterns fromtest_benchmarkandtest_accuracy, including proper tracker usage, configuration validation, and log backup in the finally block.tests/integration/defs/perf/disagg/execution/executor.py (4)
82-88: The srun command construction looks correct.The command properly sets up container execution with the necessary mounts and image configuration. The bash command is correctly quoted to handle arguments with spaces.
150-193: Clean delegation pattern for session collection job.The
submit_session_collect_jobmethod appropriately delegates to the genericsubmit_shell_job, passing the required environment-derived arguments. This promotes code reuse and consistency.
293-323: Good improvement to backup_logs for failed test identification.The
_ERRORsuffix for failed test directories is a practical improvement for quickly identifying failed tests when browsing log directories. The cleanup of existing error directories before renaming is appropriate to avoid conflicts.
801-849: Stress test result checking correctly combines perf and accuracy validation.The implementation properly:
- Runs perf validation first and fails fast if it doesn't pass
- Only proceeds to accuracy validation if perf passes
- Uses
accuracy_config.get_metrics_config()to get the appropriate metrics configuration- Merges results appropriately for both success and failure cases
One consideration: if
accuracy_configisNonefor a stress test, only a warning is logged. This may be intentional for flexibility, but you might want to make accuracy config required for stress tests.Please verify that all stress test configurations include an
accuracysection in their YAML metadata. If accuracy validation is always expected for stress tests, consider makingaccuracy_configrequired instead of optional with a warning.
...onfigs/disagg/stress/deepseek-r1-fp4_1k1k_ctx2_gen1_dep48_bs16_eplb288_mtp3_ccb-DEFAULT.yaml
Show resolved
Hide resolved
...onfigs/wideep/perf/deepseek-v32-fp4_8k1k_ctx2_gen1_dep32_bs128_eplb288_mtp3_ccb-DEFAULT.yaml
Show resolved
Hide resolved
|
/bot run --skip-test |
|
PR_Github #30109 [ run ] triggered by Bot. Commit: |
|
PR_Github #30109 [ run ] completed with state
|
53c5e9a to
2b03e2a
Compare
|
/bot run --skip-test |
|
PR_Github #30137 [ run ] triggered by Bot. Commit: |
|
PR_Github #30137 [ run ] completed with state
|
|
/bot run --skip-test |
f7e3db6 to
01617ba
Compare
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
… backend result for comparison Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
01617ba to
d5f3f35
Compare
Summary by CodeRabbit
Release Notes
New Features
Documentation
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.
a) Unified slurm extra args management
b) Unified session collection logic
c) Add deepseek v32 models