
@dmonakhov

CoMMA panics when NCCL communicators are created and destroyed in cycles
(e.g., pytest parametrized tests, training frameworks that recreate process
groups). This affects NCCL 2.27+, which changed the profiler API behavior.

Environment

  • NCCL version: 2.28.9+cuda12.9 (also affects 2.27.x)
  • CoMMA commit: 1b2b374 (latest as of 2025-12-05)

reproducer_minimal.cpp

// Reproducer: CoMMA panic on init/destroy cycle (NCCL 2.27+)
// Build: nvcc -o reproducer reproducer_minimal.cpp -lnccl -lcudart
// Run: NCCL_PROFILER_PLUGIN_PATH=libnccl_profiler.so ./reproducer
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdio.h>

int main() {
    cudaSetDevice(0);
    for (int i = 0; i < 3; i++) {
        ncclUniqueId id;
        ncclComm_t comm;
        ncclGetUniqueId(&id);
        printf("Cycle %d: init...", i + 1);
        ncclCommInitRank(&comm, 1, id, 0);  // Panics on cycle 2 with buggy CoMMA
        printf(" destroy...");
        ncclCommDestroy(comm);
        printf(" OK\n");
    }
    printf("SUCCESS\n");
    return 0;
}

reproducer_pytorch.py

"""
Reproducer: CoMMA panic on init/destroy cycle (NCCL 2.27+)
Run: NCCL_PROFILER_PLUGIN_PATH=libnccl_profiler.so python reproducer_pytorch.py
"""
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

for i in range(3):
    print(f"Cycle {i + 1}: init...", end="", flush=True)
    dist.init_process_group(backend="nccl", rank=0, world_size=1)
    # Force NCCL init with a collective op (PyTorch uses lazy init)
    t = torch.zeros(1, device="cuda")
    dist.all_reduce(t)
    print(" destroy...", end="", flush=True)
    dist.destroy_process_group()
    print(" OK")

print("SUCCESS")

dmitry-monakhov added a commit to poolsideai/CoMMA that referenced this pull request Dec 5, 2025
PROFILER.set().unwrap() panics when OnceLock is already initialized.
This occurs in multi-communicator scenarios where:

1. PyTorch ProcessGroupNCCL creates multiple communicators
2. NCCL calls profiler_init_v4 for each communicator
3. First call: INIT_FLAG=0, PROFILER.set() succeeds
4. Race condition or library double-load causes PROFILER to be
   already set when INIT_FLAG still shows 0

Root cause: OnceLock::set() returns Err(value) if the lock is already
initialized, but the code called .unwrap(), which panics instead of handling
the error gracefully.

Fix: Check is_ok() before proceeding with daemon spawn and phase_api init.
If PROFILER is already set, skip initialization (already done).

Affects both init_handler (V1-V3) and init_handler_v4 (V4).

Mainstream issue: google#1
Use get_or_init to create the Profiler only on the first init and reuse it on
subsequent cycles. Always respawn the daemon when INIT_FLAG=0 (the daemon
stops on finalize).
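
To make the before/after concrete, here is a minimal Rust sketch of the pattern described in the commit message. It is not CoMMA's actual code: the Profiler type, PROFILER, INIT_FLAG, and spawn_daemon are stand-ins assumed from the message above.

use std::sync::OnceLock;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-ins for the plugin's globals (names assumed from the commit message).
#[derive(Debug)]
struct Profiler;
static PROFILER: OnceLock<Profiler> = OnceLock::new();
static INIT_FLAG: AtomicBool = AtomicBool::new(false);

fn spawn_daemon() { /* placeholder for restarting the background worker */ }

// Buggy pattern: panics on a later cycle if PROFILER was already set.
#[allow(dead_code)]
fn init_handler_buggy() {
    if !INIT_FLAG.swap(true, Ordering::SeqCst) {
        PROFILER.set(Profiler).unwrap(); // set() returns Err if already initialized -> panic
        spawn_daemon();
    }
}

// Fixed pattern: get_or_init creates the Profiler once and reuses it on subsequent
// cycles; the daemon is respawned whenever INIT_FLAG was 0 (it stops on finalize).
fn init_handler_fixed() {
    if !INIT_FLAG.swap(true, Ordering::SeqCst) {
        let _profiler = PROFILER.get_or_init(|| Profiler);
        spawn_daemon();
    }
}

fn main() {
    init_handler_fixed();
    INIT_FLAG.store(false, Ordering::SeqCst); // simulate finalize clearing the flag
    init_handler_fixed();                     // second cycle: no panic, Profiler is reused
    println!("no panic across init/destroy cycles");
}

Under the old pattern, the second cycle would hit the unwrap() and abort the process inside the NCCL init path, which is what the reproducers above trigger on cycle 2.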
@dmonakhov dmonakhov force-pushed the mainstream-init-fix branch from bcf9365 to e5edee1 Compare December 6, 2025 17:23
@dmitry-monakhov

Gentle ping. Could you please review this PR? It fixes a real issue that prevents CoMMA from being used in fault-tolerant scenarios where communicators are recreated.
