
AssertionError: failed to get register_center_actor due to Ray worker crash (code 2) during evaluation #71

@Griseo-Kaslana

Description

I encountered a persistent crash when running the evaluation script bash examples/run_openvla_oft_rl_twin2.sh for OpenVLA-OFT on the Robotwin platform. The trainer fails during worker initialization: a Ray worker process exits unexpectedly, so register_center_actor is never registered and the lookup in trainer.init_workers() fails with an AssertionError.
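For reference, here is a minimal probe I can run from a separate Python shell to see whether the register center actor ever gets created. This is just a debugging sketch of my own, not part of the repo: it assumes the Ray session started by the trainer is still reachable (the address argument may need adjusting), and the J3cCBZ prefix is the random name_prefix from my log below, so it changes on every run.

# Debugging sketch only (not part of SimpleVLA-RL): list the named actors on
# the running Ray cluster and try to fetch the register center directly.
import ray
from ray.util import list_named_actors

# Attach to the existing cluster; adjust the address if "auto" cannot find it.
ray.init(address="auto", ignore_reinit_error=True)

# All named actors across namespaces -- in my failing runs this list is empty.
print("named actors:", list_named_actors(all_namespaces=True))

# Try to fetch the register center directly (raises ValueError if the actor
# was never registered; pass namespace=... if it lives in a non-default one).
try:
    handle = ray.get_actor("J3cCBZ_register_center")
    print("register center found:", handle)
except ValueError:
    print("register center actor was never registered")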

Environment

  • Platform: Robotwin
  • Script: bash examples/run_openvla_oft_rl_twin2.sh
  • Versions:
    • torch: 2.4.0
    • cuda: 12.2
    • tensorflow: 2.15.0
    • verl: 0.2.0.post2
    • ray: 2.52.1
  • Hardware: Single node, 8x H100

Reproduction Script

# Core evaluation command from run_openvla_oft_rl_twin2.sh
set -x

export NCCL_DEBUG=WARN 
export WANDB_API_KEY='mykey'
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=true
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1
export ROBOT_PLATFORM=ALOHA  # Use ROBOT_PLATFORM=LIBERO for LIBERO; use ROBOT_PLATFORM=ALOHA for Robotwin
PROJECT_NAME='SimpleVLA-RL'
# EXPERIMENT_NAME='MODIFY YOURSELF, e.g. twin2_lift_pot_sft1k_rl_tmp16_clip08-128_batch64'
EXPERIMENT_NAME='robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval'
# For openvla-oft, the Libero-Long traj1-SFT and traj-all-SFT models can be found at https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86
SFT_MODEL_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/checkpoints/robotwin2_model"
CKPT_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/results"
# Currently supported DATASET_NAME tasks for RoboTwin 2.0 are listed in examples/robotwin2_tasks_info.txt
DATASET_NAME=place_container_plate
TRAJ_MINI_BATCH_SIZE=6  # NEED TO CHECK! Task-specific values are in examples/robotwin2_tasks_info.txt
VLA_NAME="openvla-oft"
NUM_GPUS=8
# To run RL on 2x8 GPUs, set NUM_NODES=2
NUM_NODES=1
ALIGN_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/align.json"
bash examples/overwrite_vla_ckpt_utils.sh $SFT_MODEL_PATH 

HYDRA_FULL_ERROR=1 python -u -m verl.trainer.main_ppo \
    data.task_suite_name=robotwin2_$DATASET_NAME \
    data.num_trials_per_task=100 \
    data.n_samples=8 \
    data.filter_accuracy=True \
    data.accuracy_lower_bound=0.1 \
    data.accuracy_upper_bound=0.9 \
    data.oversample_factor=1 \
    data.train_batch_size=64 \
    data.val_batch_size=256 \
    data.max_prompt_length=256 \
    data.max_response_length=128 \
    actor_rollout_ref.model.path=$SFT_MODEL_PATH \
    actor_rollout_ref.model.vla=$VLA_NAME \
    actor_rollout_ref.model.action_token_len=14 \
    actor_rollout_ref.model.action_chunks_len=25 \
    actor_rollout_ref.model.resume=False \
    actor_rollout_ref.actor.optim.lr=5e-6 \
    actor_rollout_ref.actor.optim.warmup_style=constant \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=$NUM_GPUS \
    actor_rollout_ref.actor.use_dynamic_bsz=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.grad_clip=1 \
    actor_rollout_ref.actor.clip_ratio_high=0.28 \
    actor_rollout_ref.actor.clip_ratio_low=0.2 \
    actor_rollout_ref.actor.num_images_in_input=1 \
    actor_rollout_ref.actor.traj_mini_batch_size=$TRAJ_MINI_BATCH_SIZE \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.entropy_coeff=0. \
    actor_rollout_ref.rollout.twin2_task_config=demo_randomized \
    actor_rollout_ref.rollout.twin2_instruction_type=seen \
    actor_rollout_ref.rollout.num_images_in_input=1 \
    actor_rollout_ref.rollout.use_proprio=True \
    actor_rollout_ref.rollout.val_micro_batch_size=8 \
    actor_rollout_ref.rollout.temperature=1.6 \
    actor_rollout_ref.rollout.experiment_name=$EXPERIMENT_NAME \
    actor_rollout_ref.rollout.micro_batch_size=1 \
    actor_rollout_ref.rollout.unnorm_key=robotwin2_${DATASET_NAME}_1k \
    actor_rollout_ref.rollout.model_family=openvla \
    actor_rollout_ref.rollout.task_suite_name=robotwin2_$DATASET_NAME \
    actor_rollout_ref.rollout.num_steps_wait=10 \
    actor_rollout_ref.rollout.pretrained_checkpoint=$SFT_MODEL_PATH \
    actor_rollout_ref.rollout.center_crop=True \
    actor_rollout_ref.rollout.max_prompt_length=512 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=hf \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.00 \
    trainer.logger=['console','wandb'] \
    trainer.project_name=$PROJECT_NAME \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.default_local_dir=$CKPT_PATH/$PROJECT_NAME/$EXPERIMENT_NAME \
    trainer.n_gpus_per_node=$NUM_GPUS \
    trainer.nnodes=$NUM_NODES \
    trainer.save_freq=20 \
    trainer.test_freq=4 \
    trainer.total_epochs=100 \
    trainer.val_only=True \
    algorithm.adv_estimator=grpo \
    algorithm.adv_params.verifier_gamma=1.0 \
    algorithm.adv_params.reward_model_gamma=1.0 \
    trainer.runtime_env=$ALIGN_PATH \
    trainer.wandb_mode=online \
    trainer.val_before_train=True

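One thing I noticed in the logs below is the line "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'" from the main_task process, so as an extra sanity check I also probe GPU visibility from inside a Ray task. This is only a diagnostic sketch of mine (gpu_check is not part of the repo), using the torch and ray versions listed above:

# Diagnostic sketch only: check that torch sees the GPUs both in the driver
# process and inside a Ray worker, since main_task reports
# "No CUDA runtime is found".
import os

import ray
import torch


@ray.remote(num_gpus=1)
def gpu_check():
    # Runs inside a Ray worker process, so it reflects what trainer workers see.
    return {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    print("driver cuda_available:", torch.cuda.is_available())
    print("worker view:", ray.get(gpu_check.remote()))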

Full Error Logs

(pid=gcs_server) [2025-12-30 16:28:52,683 E 88920 88920] (gcs_server) gcs_server.cc:303: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(raylet) [2025-12-30 16:28:55,156 E 89285 89285] (raylet) main.cc:1032: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(pid=89391) [2025-12-30 16:28:56,819 E 89391 89782] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
[2025-12-30 16:28:57,010 E 88542 89388] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(main_task pid=90685) Detected robot platform from environment: ALOHA
(main_task pid=90685) Using ALOHA constants:
(main_task pid=90685)   NUM_ACTIONS_CHUNK = 25
(main_task pid=90685) No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
(main_task pid=90685) dataset len: 1000
(main_task pid=90685) dataset len: 256
(main_task pid=90685) Size of train dataloader: 15
(main_task pid=90685) Size of val dataloader: 32
(pid=92031) Detected robot platform from environment: ALOHA
(pid=92031) Using ALOHA constants:
(pid=92031)   NUM_ACTIONS_CHUNK = 25
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID:  Worker ID:  Node ID:  Worker IP address: <my ip> Worker port: <my port> Worker PID: <my pid> Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(bundle_reservation_check_func pid=91825) [2025-12-30 16:29:33,885 E 91825 91969] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 20x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Error executing job with overrides: ['data.task_suite_name=robotwin2_place_container_plate', 'data.num_trials_per_task=1000', 'data.n_samples=8', 'data.filter_accuracy=True', 'data.accuracy_lower_bound=0.1', 'data.accuracy_upper_bound=0.9', 'data.oversample_factor=1', 'data.train_batch_size=64', 'data.val_batch_size=8', 'data.max_prompt_length=256', 'data.max_response_length=128', 'actor_rollout_ref.model.path=/home/myname/xlk/SimpleVLA-RL/checkpoints/robotwin2_model', 'actor_rollout_ref.model.vla=openvla-oft', 'actor_rollout_ref.actor.ppo_micro_batch_size=2', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.val_micro_batch_size=2', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'trainer.logger=[console,wandb]', 'trainer.project_name=SimpleVLA-RL', 'trainer.experiment_name=robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.default_local_dir=/home/myname/xlk/SimpleVLA-RL/results/SimpleVLA-RL/robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.val_only=True', 'algorithm.adv_estimator=grpo', 'trainer.runtime_env=/home/myname/xlk/SimpleVLA-RL/align.json', 'trainer.wandb_mode=online', 'trainer.val_before_train=True']
Traceback (most recent call last):
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 212, in <module>
    main()
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 116, in main
    ray.get(main_task.remote(config))
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 2967, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 1015, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=90685, ip=10.244.43.63)
  File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 207, in main_task
    trainer.init_workers()
  File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
    wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
  File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 197, in __init__
    self._init_with_resource_pool(resource_pool=resource_pool,
  File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
    assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: J3cCBZ_register_center in []

Error Summary

The execution fails during trainer.init_workers(). The root cause is that a Ray worker dies unexpectedly, which prevents the register_center_actor from being found; a small polling sketch I used while debugging is included after the traceback excerpt below.

1. Ray Worker Exit:

(raylet) A worker died or was killed while executing a task by an unexpected system error. 
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. 
Possible causes: (1) OOM killer, (2) ray stop --force, (3) SIGSEGV.

2. Resulting Traceback:

ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=5904, ip=10.244.82.123)
  File "verl/trainer/main_ppo.py", line 207, in main_task
    trainer.init_workers()
  File "verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
    wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
  ...
  File "verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
    assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: GfZJZT_register_center in []
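
To separate a registration race (actor created slightly later than the lookup) from the worker actually dying, I also tried polling for the actor instead of failing immediately. The helper below is purely my own debugging sketch, not verl code; wait_for_named_actor and the timeout values are made up, and it assumes ray.init() has already been called in the same process.

# Debugging sketch only (not verl code): poll for a named actor for a while
# instead of asserting immediately, to tell "registered late" apart from
# "never registered because the worker process died".
import time

import ray
from ray.util import list_named_actors


def wait_for_named_actor(name, timeout_s=60.0, poll_s=1.0):
    # Returns the actor handle, or None if it never shows up within timeout_s.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return ray.get_actor(name)
        except ValueError:  # actor not (yet) registered under this name
            time.sleep(poll_s)
    print("actors visible at timeout:", list_named_actors(all_namespaces=True))
    return None

Given that the assertion message already shows an empty actor list ([]) and the raylet reports the worker exiting with a system error, this looks to me like the worker dying before it can create the actor, rather than a lookup race.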

Request for Assistance

I have spent considerable time trying to debug this issue by adjusting GPU memory utilization, batch sizes, and FSDP offloading settings, but the worker crash during initialization remains persistent.

Since this happens with the provided evaluation script on a standard single-node 8-GPU setup, any guidance or suggestions on how to resolve it would be greatly appreciated.

Thank you very much for your time and for this project!
