
Conversation

@jacksonjacobs1 (Collaborator Author)

Description

The current `start_dlproc` method in quickannotator does not correctly manage GPU resources when multiple GPUs are available for a single Ray actor:

def start_dlproc(self, allow_pred=True):
    if self.getProcRunningSince() is not None:
        self.logger.warning("Already running, not starting again")
        return

    self.logger.info(f"Starting up {build_actor_name(annotation_class_id=self.annotation_class_id)}")
    self.setProcRunningSince()

    total_gpus = ray.cluster_resources().get("GPU", 0)
    self.logger.info(f"Total GPUs available: {total_gpus}")
    scaling_config = ray.train.ScalingConfig(
        num_workers=int(total_gpus),
        use_gpu=True,
        resources_per_worker={"GPU": .01},
        placement_strategy="STRICT_SPREAD"
    )

    trainer = ray.train.torch.TorchTrainer(
        train_pred_loop,
        scaling_config=scaling_config,
        train_loop_config={
            'annotation_class_id': self.annotation_class_id,
            'tile_size': self.tile_size,
            'magnification': self.magnification
        }
    )
    self.hexid = trainer.fit().hex()
    self.logger.info(f"DLActor started with hexid: {self.hexid}")
    return self.hexid

Proposed solution

Configure the Ray cluster as follows:

# causes ray to not modify the CUDA_VISIBLE_DEVICES — essentially allowing us to manage them ourselves
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0

# Explicitly manage the CUDA_VISIBLE_DEVICES. This isn't ideal but prevents an error "ValueError: '0' is not in list"
export CUDA_VISIBLE_DEVICES=0,1
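
As a quick sanity check (my own sketch, not part of the PR), a trivial Ray task can confirm that Ray leaves CUDA_VISIBLE_DEVICES untouched after these exports:

import os

import ray

ray.init()

@ray.remote(num_gpus=0.1)
def show_visible_devices():
    # With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES set, Ray should not
    # rewrite this value, so the worker inherits the exported "0,1".
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(show_visible_devices.remote()))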

Then in `start_dlproc` we do not need to set the placement strategy. The `train_pred_loop` function should be modified to look like this:

import os
import time

import ray.train
import ray.train.torch
import torch
from torchvision.models import resnet18


def trainpred_func2(config):
    # Confirm that Ray did not overwrite the CUDA device list we exported above.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")

    model = resnet18(num_classes=10)

    # Pin each worker to the GPU matching its local rank instead of letting Ray choose.
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)


scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
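
To tie this back to `start_dlproc`, here is a minimal sketch (my assumption, not the final PR code) of deriving the worker count from the visible GPUs rather than hardcoding it, with no placement strategy:

import os

import ray.train
import ray.train.torch

# Sketch: one Ray Train worker per GPU listed in CUDA_VISIBLE_DEVICES; no STRICT_SPREAD needed.
num_visible_gpus = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
scaling_config = ray.train.ScalingConfig(
    num_workers=num_visible_gpus,
    use_gpu=True,
    resources_per_worker={"GPU": .1},
)
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()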

@choosehappy (Owner) left a comment

not sure if this was actually requested for review ; )

"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0",

// We set CUDA_VISIBLE_DEVICES here, as each container will need to set visible GPUs independently.
"CUDA_VISIBLE_DEVICES": "0,1"
@choosehappy (Owner)

Is it possible to set this dynamically? What if someone has, e.g., 10 GPUs, or only 1 GPU?

@jacksonjacobs1 (Collaborator Author) commented on Sep 2, 2025

The hardcoded values were set for simplicity. This PR is not yet ready for review - I still need to test whether QA works with a multi-node, multi-GPU cluster.

That said, there are ways to set CUDA_VISIBLE_DEVICES dynamically:

  1. Run an export command after container setup (this can be added to the Dockerfile, devcontainer.json, or docker compose file):

    export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd "," -)
  2. Using nvidia-container-toolkit's API:
    In devcontainer.json

    {
      "name": "My GPU Dev Container",
      "runArgs": ["--gpus=all"],
      "workspaceFolder": "/workspace"
    }

    In docker compose yaml:

    services:
      app:
        image: your-image
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
    

But it's unclear to me whether option 2 avoids the bug that you noticed with ray requiring CUDA_VISIBLE_DEVICES to be explicitly set:
ray-project/ray#49985 (comment)

If not, we could run the following command within the container to ensure CUDA_VISIBLE_DEVICES is set:

export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES
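
If a shell-level export turns out to be awkward, a small Python guard at bootup could do the same thing (sketch only; `ensure_cuda_visible_devices` is a hypothetical helper, not existing QA code):

import os

def ensure_cuda_visible_devices():
    # Hypothetical bootup helper: make sure CUDA_VISIBLE_DEVICES is set before Ray
    # starts, mirroring `export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES`.
    if os.environ.get("CUDA_VISIBLE_DEVICES"):
        return
    nvidia_devs = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    if nvidia_devs and nvidia_devs != "all":
        os.environ["CUDA_VISIBLE_DEVICES"] = nvidia_devs
    else:
        # Fall back to enumerating whatever torch can see.
        import torch
        count = torch.cuda.device_count()
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(count))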

@choosehappy (Owner)

coo coo coo


2. Modify `devcontainer.json` to suit your use case. In particular, change the value of `CUDA_VISIBLE_DEVICES` to your desired GPU IDs.

@choosehappy (Owner)

I see - are folks likely to read the README in detail, though? Or perhaps we should have some explicit messages appear on the screen/log during bootup to draw their attention to these components?
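
Something like this at bootup might be enough to surface it in the log (a rough sketch; the logger name and message wording are placeholders):

import logging
import os

logger = logging.getLogger("quickannotator.startup")  # placeholder logger name

if not os.environ.get("CUDA_VISIBLE_DEVICES"):
    logger.warning(
        "CUDA_VISIBLE_DEVICES is not set; Ray may not see the intended GPUs. "
        "See the GPU setup notes in the README / devcontainer.json."
    )
else:
    logger.info("Using GPUs: CUDA_VISIBLE_DEVICES=%s", os.environ["CUDA_VISIBLE_DEVICES"])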
