Ko3n1g/pranav/ray k8s issue debug #562

ko3n1g · 2026-01-19T17:34:21Z

We shouldn't merge this as-is: I think it will have negative side effects on Slurm deployments. On Kubernetes with the KubeRay operator, however, it solves the init hang.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

Signed-off-by: oliver könig <okoenig@nvidia.com>
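
One way to reconcile the two concerns in the description above is to apply the loopback override only when the process is running on Kubernetes and leave every other launcher untouched. The sketch below is not part of this PR. The function names are hypothetical, and it assumes that detecting the platform through the `KUBERNETES_SERVICE_HOST` and `SLURM_JOB_ID` environment variables is good enough for this deployment.

```python
import os


def running_on_kubernetes(environ=os.environ) -> bool:
    # Kubernetes normally injects KUBERNETES_SERVICE_HOST into a pod's environment.
    return "KUBERNETES_SERVICE_HOST" in environ


def running_under_slurm(environ=os.environ) -> bool:
    # Slurm sets SLURM_JOB_ID for processes started inside a job allocation.
    return "SLURM_JOB_ID" in environ


def resolve_master_addr(default_addr: str, environ=os.environ) -> str:
    """Hypothetical helper: pin loopback on Kubernetes only, else keep the default."""
    if running_on_kubernetes(environ) and not running_under_slurm(environ):
        return "127.0.0.1"
    return default_addr
```

With a guard like this, a Slurm job would keep whatever address its launcher already provides, while a KubeRay pod would get the pinned address.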

copy-pr-bot · 2026-01-19T17:34:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


pthombre · 2026-01-20T19:04:47Z

nemo_deploy/llm/megatronllm_deployable_ray.py

        **model_config_kwargs,
    ):
        # Use replica-specific environment variables to avoid conflicts
+        master_addr = "127.0.0.1"


Why do the master address and port need to be hard-coded here?
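
For context on the question above, only the `master_addr = "127.0.0.1"` line is visible in this excerpt. The sketch below shows one common way to avoid a fixed port while still pinning the address: ask the OS for a free port through `socket` and export the pair as `MASTER_ADDR` / `MASTER_PORT`, the variables read by `torch.distributed` environment-variable initialization. The helper names are hypothetical, and this is an assumption about the general approach, not the code in this PR.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Hypothetical helper: ask the OS for an unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the kernel choose a free port
        return sock.getsockname()[1]


def replica_rendezvous_env(master_addr: str = "127.0.0.1") -> dict:
    """Build per-replica rendezvous variables for a single-node process group."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(find_free_port(master_addr)),
    }


env = replica_rendezvous_env()
os.environ.update(env)  # later env:// initialization would read these values
```

The port is released before the process group binds it, so another process could claim it in between. Loopback also only works when all ranks of a replica share one host or pod, which may be why the description flags a risk for Slurm deployments.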


pthombre and others added 2 commits January 14, 2026 10:24

Temp fix for k8s issue

bfc9c0c

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

fix: Pin master addr

e3b0018

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g requested review from athitten, oyilmaz-nvidia and pthombre as code owners January 19, 2026 17:34

github-actions bot added deploy LLM labels Jan 19, 2026

ko3n1g added 2 commits January 19, 2026 17:42

add delay between workers

51af26a

Signed-off-by: oliver könig <okoenig@nvidia.com>

socket

4b86332

Signed-off-by: oliver könig <okoenig@nvidia.com>

pthombre reviewed Jan 20, 2026


cleanup

7ed9663

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g force-pushed the ko3n1g/pranav/ray_k8s_issue_debug branch from a6c94ee to 7ed9663 Compare January 22, 2026 18:49
