[#124] FEATURE: add support for multiple GPU nodes #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v2.0
Conversation
Description

The current `start_dlproc` method in quickannotator does not correctly manage GPU resources when multiple GPUs are available for a single ray actor:

```python
def start_dlproc(self, allow_pred=True):
    if self.getProcRunningSince() is not None:
        self.logger.warning("Already running, not starting again")
        return
    self.logger.info(f"Starting up {build_actor_name(annotation_class_id=self.annotation_class_id)}")
    self.setProcRunningSince()

    total_gpus = ray.cluster_resources().get("GPU", 0)
    self.logger.info(f"Total GPUs available: {total_gpus}")

    scaling_config = ray.train.ScalingConfig(
        num_workers=int(total_gpus),
        use_gpu=True,
        resources_per_worker={"GPU": .01},
        placement_strategy="STRICT_SPREAD"
    )

    trainer = ray.train.torch.TorchTrainer(
        train_pred_loop,
        scaling_config=scaling_config,
        train_loop_config={
            'annotation_class_id': self.annotation_class_id,
            'tile_size': self.tile_size,
            'magnification': self.magnification
        }
    )

    self.hexid = trainer.fit().hex()
    self.logger.info(f"DLActor started with hexid: {self.hexid}")
    return self.hexid
```

Proposed solution

Configure the ray cluster as follows:

```bash
# Causes ray to not modify CUDA_VISIBLE_DEVICES, essentially allowing us to manage it ourselves
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
# Explicitly manage CUDA_VISIBLE_DEVICES. This isn't ideal, but prevents the error "ValueError: '0' is not in list"
export CUDA_VISIBLE_DEVICES=0,1
```

Then in `start_dlproc` we do not need to set the placement strategy. The `train_pred_loop` function should be modified to look like this:

```python
import os
import time

import ray.train.torch  # also makes ray.train available
import torch
from torchvision.models import resnet18


def trainpred_func2(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    model = resnet18(num_classes=10)
    # Pick the device from the worker's local rank rather than letting ray set it
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)


scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
```
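
For illustration only, a minimal sketch of how the scaling configuration inside `start_dlproc` could then look once `STRICT_SPREAD` is dropped; this is an assumption about the final shape, not code from this PR:

```python
import ray
import ray.train

# Sketch: with RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES exported as
# above, each worker selects its device from its local rank inside the train loop, so no
# placement_strategy is needed here.
total_gpus = int(ray.cluster_resources().get("GPU", 0))
scaling_config = ray.train.ScalingConfig(
    num_workers=total_gpus,
    use_gpu=True,
    resources_per_worker={"GPU": 0.01},
)
```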
choosehappy left a comment:
not sure if this was actually requested for review ; )
| "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0", | ||
|
|
||
| // We set CUDA_VISIBLE_DEVICES here, as each container will need to set visible GPUs independently. | ||
| "CUDA_VISIBLE_DEVICES": "0,1" |
is it possible to set this dynamically? what if someone has like, e.g., 10 GPUs, or only 1 GPU?
The hardcoded values were set for simplicity. This PR is not yet ready for review - I still need to test whether QA works with a multi-node, multi-GPU cluster.
That said, there are ways to set CUDA_VISIBLE_DEVICES dynamically:
1. Run export command after container setup (can be added to either Dockerfile, devcontainer.json, or docker compose file):

   ```bash
   export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd "," -)
   ```
2. Using nvidia-container-toolkit's API.

   In devcontainer.json:

   ```json
   {
     "name": "My GPU Dev Container",
     "runArgs": ["--gpus=all"],
     "workspaceFolder": "/workspace"
   }
   ```

   In docker compose yaml:

   ```yaml
   services:
     app:
       image: your-image
       runtime: nvidia
       environment:
         - NVIDIA_VISIBLE_DEVICES=all
   ```
But it's unclear to me whether option 2 avoids the bug that you noticed with ray requiring CUDA_VISIBLE_DEVICES to be explicitly set:
ray-project/ray#49985 (comment)
If not, we could run the following command within the container to ensure CUDA_VISIBLE_DEVICES is set:
```bash
export CUDA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES
```
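
A Python variant of the same idea is sketched below; `ensure_cuda_visible_devices` is a hypothetical helper (not in this PR) and uses GPU indices rather than UUIDs, both of which CUDA_VISIBLE_DEVICES accepts:

```python
import os
import subprocess


def ensure_cuda_visible_devices():
    """Hypothetical helper: set CUDA_VISIBLE_DEVICES from the GPUs nvidia-smi reports, if unset."""
    if os.environ.get("CUDA_VISIBLE_DEVICES"):
        return
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"], text=True
        )
    except (OSError, subprocess.CalledProcessError):
        return  # no nvidia-smi / no visible GPUs: leave the variable alone
    indices = ",".join(line.strip() for line in out.splitlines() if line.strip())
    if indices:
        os.environ["CUDA_VISIBLE_DEVICES"] = indices
```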
coo coo coo
On this line in the README:

> 2. Modify `devcontainer.json` to suit your use case. Particularly, change the value of `CUDA_VISIBLE_DEVICES` to your desired GPU ids.
i see - are folks likely to read the readme in detail though? or perhaps we should have some explicit messages appear on the screen/log during bootup to draw their attention to these components?
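
One possible shape for such a bootup message, as a hypothetical sketch (the function and logger name are assumptions, not part of this PR):

```python
import logging
import os

logger = logging.getLogger("quickannotator.startup")


def warn_if_gpu_env_missing():
    """Log a prominent warning at startup when the GPU-related env vars are absent."""
    for var in ("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        if not os.environ.get(var):
            logger.warning(
                "%s is not set; multi-GPU scheduling may not work as expected. "
                "See the devcontainer.json notes in the README.",
                var,
            )
```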
https://jacksonjjacobs.com/openproject/work_packages/124