
Conversation

@jacksonjacobs1 (Collaborator Author)

Description

The current `start_dlproc` method in quickannotator does not correctly manage GPU resources when multiple GPUs are available for a single Ray actor:

def start_dlproc(self, allow_pred=True):
    if self.getProcRunningSince() is not None:
        self.logger.warning("Already running, not starting again")
        return

    self.logger.info(f"Starting up {build_actor_name(annotation_class_id=self.annotation_class_id)}")
    self.setProcRunningSince()

    total_gpus = ray.cluster_resources().get("GPU", 0)
    self.logger.info(f"Total GPUs available: {total_gpus}")
    scaling_config = ray.train.ScalingConfig(
        num_workers=int(total_gpus),
        use_gpu=True,
        resources_per_worker={"GPU": .01},
        placement_strategy="STRICT_SPREAD"
    )

    trainer = ray.train.torch.TorchTrainer(
        train_pred_loop,
        scaling_config=scaling_config,
        train_loop_config={
            'annotation_class_id': self.annotation_class_id,
            'tile_size': self.tile_size,
            'magnification': self.magnification
        }
    )
    self.hexid = trainer.fit().hex()
    self.logger.info(f"DLActor started with hexid: {self.hexid}")
    return self.hexid

Proposed solution

Configure the Ray cluster as follows:

# causes ray to not modify the CUDA_VISIBLE_DEVICES — essentially allowing us to manage them ourselves
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0

# Explicitly manage the CUDA_VISIBLE_DEVICES. This isn't ideal but prevents an error "ValueError: '0' is not in list"
export CUDA_VISIBLE_DEVICES=0,1
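
As a quick sanity check (my own sketch, not part of the PR), a trivial Ray task can confirm that Ray leaves CUDA_VISIBLE_DEVICES untouched after these exports:

import os

import ray

ray.init()

@ray.remote(num_gpus=0.1)
def show_visible_devices():
    # With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES set, Ray should not
    # rewrite this value, so the worker inherits the exported "0,1".
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(show_visible_devices.remote()))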

Then in `start_dlproc` we do not need to set the placement strategy. The `train_pred_loop` function should be modified to look like this:

import os
import time

import ray.train
import ray.train.torch
import torch
from torchvision.models import resnet18


def trainpred_func2(config):
    # Confirm that Ray did not overwrite the CUDA device list we exported above.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")

    model = resnet18(num_classes=10)

    # Pin each worker to the GPU matching its local rank instead of letting Ray choose.
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)


scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
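
To tie this back to `start_dlproc`, here is a minimal sketch (my assumption, not the final PR code) of deriving the worker count from the visible GPUs rather than hardcoding it, with no placement strategy:

import os

import ray.train
import ray.train.torch

# Sketch: one Ray Train worker per GPU listed in CUDA_VISIBLE_DEVICES; no STRICT_SPREAD needed.
num_visible_gpus = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
scaling_config = ray.train.ScalingConfig(
    num_workers=num_visible_gpus,
    use_gpu=True,
    resources_per_worker={"GPU": .1},
)
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()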

@choosehappy (Owner) left a comment

not sure if this was actually requested for review ; )

"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0",

// We set CUDA_VISIBLE_DEVICES here, as each container will need to set visible GPUs independently.
"CUDA_VISIBLE_DEVICES": "0,1"
@choosehappy (Owner)

Is it possible to set this dynamically? What if someone has, e.g., 10 GPUs, or only 1 GPU?

@jacksonjacobs1 (Collaborator Author) commented on Sep 2, 2025

The hardcoded values were set for simplicity. This PR is not yet ready for review - I still need to test whether QA works with a multi-node, multi-GPU cluster.

That said, there are ways to set CUDA_VISIBLE_DEVICES dynamically:

  1. Run an export command after container setup (this can be added to the Dockerfile, devcontainer.json, or docker compose file):

    export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd "," -)
  2. Using nvidia-container-toolkit's API:
    In devcontainer.json

    {
      "name": "My GPU Dev Container",
      "runArgs": ["--gpus=all"],
      "workspaceFolder": "/workspace"
    }

    In docker compose yaml:

    services:
      app:
        image: your-image
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
    

But it's unclear to me whether option 2 avoids the bug that you noticed with ray requiring CUDA_VISIBLE_DEVICES to be explicitly set:
ray-project/ray#49985 (comment)

If not, we could run the following command within the container to ensure CUDA_VISIBLE_DEVICES is set:

export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES
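
If a shell-level export turns out to be awkward, a small Python guard at bootup could do the same thing (sketch only; `ensure_cuda_visible_devices` is a hypothetical helper, not existing QA code):

import os

def ensure_cuda_visible_devices():
    # Hypothetical bootup helper: make sure CUDA_VISIBLE_DEVICES is set before Ray
    # starts, mirroring `export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES`.
    if os.environ.get("CUDA_VISIBLE_DEVICES"):
        return
    nvidia_devs = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    if nvidia_devs and nvidia_devs != "all":
        os.environ["CUDA_VISIBLE_DEVICES"] = nvidia_devs
    else:
        # Fall back to enumerating whatever torch can see.
        import torch
        count = torch.cuda.device_count()
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(count))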

@choosehappy (Owner)

coo coo coo


2. Modify `devcontainer.json` to suit your use case. In particular, change the value of `CUDA_VISIBLE_DEVICES` to your desired GPU IDs.

@choosehappy (Owner)

I see - are folks likely to read the README in detail, though? Or perhaps we should have some explicit messages appear on the screen/log during bootup to draw their attention to these components?
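
Something like this at bootup might be enough to surface it in the log (a rough sketch; the logger name and message wording are placeholders):

import logging
import os

logger = logging.getLogger("quickannotator.startup")  # placeholder logger name

if not os.environ.get("CUDA_VISIBLE_DEVICES"):
    logger.warning(
        "CUDA_VISIBLE_DEVICES is not set; Ray may not see the intended GPUs. "
        "See the GPU setup notes in the README / devcontainer.json."
    )
else:
    logger.info("Using GPUs: CUDA_VISIBLE_DEVICES=%s", os.environ["CUDA_VISIBLE_DEVICES"])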
