
@dmonakhov

CoMMA panics when NCCL communicators are created and destroyed in cycles
(e.g., pytest parametrized tests, training frameworks that recreate process
groups). This affects NCCL 2.27+, which changed the profiler API behavior.

Environment

  • NCCL version: 2.28.9+cuda12.9 (also affects 2.27.x)
  • CoMMA commit: 1b2b374 (latest as of 2025-12-05)

reproducer_minimal.cpp

// Reproducer: CoMMA panic on init/destroy cycle (NCCL 2.27+)
// Build: nvcc -o reproducer reproducer_minimal.cpp -lnccl -lcudart
// Run: NCCL_PROFILER_PLUGIN_PATH=libnccl_profiler.so ./reproducer
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdio.h>

int main() {
    cudaSetDevice(0);
    for (int i = 0; i < 3; i++) {
        ncclUniqueId id;
        ncclComm_t comm;
        ncclGetUniqueId(&id);
        printf("Cycle %d: init...", i + 1);
        ncclCommInitRank(&comm, 1, id, 0);  // Panics on cycle 2 with buggy CoMMA
        printf(" destroy...");
        ncclCommDestroy(comm);
        printf(" OK\n");
    }
    printf("SUCCESS\n");
    return 0;
}

reproducer_pytorch.py

"""
Reproducer: CoMMA panic on init/destroy cycle (NCCL 2.27+)
Run: NCCL_PROFILER_PLUGIN_PATH=libnccl_profiler.so python reproducer_pytorch.py
"""
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

for i in range(3):
    print(f"Cycle {i + 1}: init...", end="", flush=True)
    dist.init_process_group(backend="nccl", rank=0, world_size=1)
    # Force NCCL init with a collective op (PyTorch uses lazy init)
    t = torch.zeros(1, device="cuda")
    dist.all_reduce(t)
    print(" destroy...", end="", flush=True)
    dist.destroy_process_group()
    print(" OK")

print("SUCCESS")

dmitry-monakhov added a commit to poolsideai/CoMMA that referenced this pull request Dec 5, 2025
PROFILER.set().unwrap() panics when OnceLock is already initialized.
This occurs in multi-communicator scenarios where:

1. PyTorch ProcessGroupNCCL creates multiple communicators
2. NCCL calls profiler_init_v4 for each communicator
3. First call: INIT_FLAG=0, PROFILER.set() succeeds
4. Race condition or library double-load causes PROFILER to be
   already set when INIT_FLAG still shows 0

Root cause: OnceLock::set() returns Err(value) if the lock is already
initialized, but the code called .unwrap(), which panics instead of handling
the error gracefully.

Fix: Check is_ok() before proceeding with daemon spawn and phase_api init.
If PROFILER is already set, skip initialization (already done).

Affects both init_handler (V1-V3) and init_handler_v4 (V4).

Mainstream issue: google#1
Use get_or_init to create the Profiler only on the first init and reuse it on
subsequent cycles. Always respawn the daemon when INIT_FLAG=0 (the daemon
stops on finalize).
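
To make the before/after concrete, here is a minimal Rust sketch of the pattern described in the commit message. It is not CoMMA's actual code: the Profiler type, PROFILER, INIT_FLAG, and spawn_daemon are stand-ins assumed from the message above.

use std::sync::OnceLock;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-ins for the plugin's globals (names assumed from the commit message).
#[derive(Debug)]
struct Profiler;
static PROFILER: OnceLock<Profiler> = OnceLock::new();
static INIT_FLAG: AtomicBool = AtomicBool::new(false);

fn spawn_daemon() { /* placeholder for restarting the background worker */ }

// Buggy pattern: panics on a later cycle if PROFILER was already set.
#[allow(dead_code)]
fn init_handler_buggy() {
    if !INIT_FLAG.swap(true, Ordering::SeqCst) {
        PROFILER.set(Profiler).unwrap(); // set() returns Err if already initialized -> panic
        spawn_daemon();
    }
}

// Fixed pattern: get_or_init creates the Profiler once and reuses it on subsequent
// cycles; the daemon is respawned whenever INIT_FLAG was 0 (it stops on finalize).
fn init_handler_fixed() {
    if !INIT_FLAG.swap(true, Ordering::SeqCst) {
        let _profiler = PROFILER.get_or_init(|| Profiler);
        spawn_daemon();
    }
}

fn main() {
    init_handler_fixed();
    INIT_FLAG.store(false, Ordering::SeqCst); // simulate finalize clearing the flag
    init_handler_fixed();                     // second cycle: no panic, Profiler is reused
    println!("no panic across init/destroy cycles");
}

Under the old pattern, the second cycle would hit the unwrap() and abort the process inside the NCCL init path, which is what the reproducers above trigger on cycle 2.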
@dmonakhov dmonakhov force-pushed the mainstream-init-fix branch from bcf9365 to e5edee1 Compare December 6, 2025 17:23
@dmitry-monakhov

Gentle ping. Could you please review this PR? It fixes a real issue that prevents CoMMA from being used in fault-tolerant scenarios where communicators are recreated.
