Description
I encountered a persistent crash when running the evaluation script bash examples/run_openvla_oft_rl_twin2.sh for OpenVLA-OFT on the Robotwin platform. The trainer fails during worker initialization because a Ray worker process exits unexpectedly, causing the register_center_actor to be missing from the registry.
Environment
- Platform: Robotwin
- Script: bash examples/run_openvla_oft_rl_twin2.sh
- Versions: torch 2.4.0, CUDA 12.2, tensorflow 2.15.0, verl 0.2.0.post2, ray 2.52.1
- Hardware: Single node, 8x H100
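For reference, the versions above were collected with commands along these lines (illustrative, not from the repo; pip show assumes verl is installed as a package):
python -c "import torch, tensorflow as tf, ray; print(torch.__version__, torch.version.cuda, tf.__version__, ray.__version__)"
pip show verl | head -n 2
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv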
Reproduction Script
# Core evaluation command from run_openvla_oft_rl_twin2.sh
set -x
export NCCL_DEBUG=WARN
export WANDB_API_KEY='mykey'
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=true
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1
export ROBOT_PLATFORM=ALOHA # Use ROBOT_PLATFORM=LIBERO for LIBERO, ROBOT_PLATFORM=ALOHA for Robotwin
PROJECT_NAME='SimpleVLA-RL'
# EXPERIMENT_NAME='MODIFY YOURSELF, e.g. twin2_lift_pot_sft1k_rl_tmp16_clip08-128_batch64'
EXPERIMENT_NAME='robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval'
# For openvla-oft, Libero-Long traj1 SFT or traj-all SFT models can be found at https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86
SFT_MODEL_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/checkpoints/robotwin2_model"
CKPT_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/results"
# Currently Supported DATASET_NAME tasks for robotwin2.0 can be found at examples/robotwin2_tasks_info.txt
DATASET_NAME=place_container_plate
TRAJ_MINI_BATCH_SIZE=6 # NEED TO CHECK! The specific values are in examples/robotwin2_tasks_info.txt
VLA_NAME="openvla-oft"
NUM_GPUS=8
# If you want to use 2*8 GPUs for RL, set NUM_NODES=2
NUM_NODES=1
ALIGN_PATH="/home/myname/Desktop/xlk/SimpleVLA-RL/align.json"
bash examples/overwrite_vla_ckpt_utils.sh $SFT_MODEL_PATH
HYDRA_FULL_ERROR=1 python -u -m verl.trainer.main_ppo \
data.task_suite_name=robotwin2_$DATASET_NAME \
data.num_trials_per_task=100 \
data.n_samples=8 \
data.filter_accuracy=True \
data.accuracy_lower_bound=0.1 \
data.accuracy_upper_bound=0.9 \
data.oversample_factor=1 \
data.train_batch_size=64 \
data.val_batch_size=256 \
data.max_prompt_length=256 \
data.max_response_length=128 \
actor_rollout_ref.model.path=$SFT_MODEL_PATH \
actor_rollout_ref.model.vla=$VLA_NAME \
actor_rollout_ref.model.action_token_len=14 \
actor_rollout_ref.model.action_chunks_len=25 \
actor_rollout_ref.model.resume=False \
actor_rollout_ref.actor.optim.lr=5e-6 \
actor_rollout_ref.actor.optim.warmup_style=constant \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size=$NUM_GPUS \
actor_rollout_ref.actor.use_dynamic_bsz=False \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.grad_clip=1 \
actor_rollout_ref.actor.clip_ratio_high=0.28 \
actor_rollout_ref.actor.clip_ratio_low=0.2 \
actor_rollout_ref.actor.num_images_in_input=1 \
actor_rollout_ref.actor.traj_mini_batch_size=$TRAJ_MINI_BATCH_SIZE \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.entropy_coeff=0. \
actor_rollout_ref.rollout.twin2_task_config=demo_randomized \
actor_rollout_ref.rollout.twin2_instruction_type=seen \
actor_rollout_ref.rollout.num_images_in_input=1 \
actor_rollout_ref.rollout.use_proprio=True \
actor_rollout_ref.rollout.val_micro_batch_size=8 \
actor_rollout_ref.rollout.temperature=1.6 \
actor_rollout_ref.rollout.experiment_name=$EXPERIMENT_NAME \
actor_rollout_ref.rollout.micro_batch_size=1 \
actor_rollout_ref.rollout.unnorm_key=robotwin2_${DATASET_NAME}_1k \
actor_rollout_ref.rollout.model_family=openvla \
actor_rollout_ref.rollout.task_suite_name=robotwin2_$DATASET_NAME \
actor_rollout_ref.rollout.num_steps_wait=10 \
actor_rollout_ref.rollout.pretrained_checkpoint=$SFT_MODEL_PATH \
actor_rollout_ref.rollout.center_crop=True \
actor_rollout_ref.rollout.max_prompt_length=512 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=32 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=hf \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.00 \
trainer.logger=['console','wandb'] \
trainer.project_name=$PROJECT_NAME \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.default_local_dir=$CKPT_PATH/$PROJECT_NAME/$EXPERIMENT_NAME \
trainer.n_gpus_per_node=$NUM_GPUS \
trainer.nnodes=$NUM_NODES \
trainer.save_freq=20 \
trainer.test_freq=4 \
trainer.total_epochs=100 \
trainer.val_only=True \
algorithm.adv_estimator=grpo \
algorithm.adv_params.verifier_gamma=1.0 \
algorithm.adv_params.reward_model_gamma=1.0 \
trainer.runtime_env=$ALIGN_PATH \
trainer.wandb_mode=online \
trainer.val_before_train=True \
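If useful for triage, a standalone Ray sanity check along these lines can be run on the same node before the full script (illustrative, not part of the script; --num-gpus matches NUM_GPUS above):
ray stop --force
ray start --head --num-gpus=8
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"
ray stop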
Full Error Logs
(pid=gcs_server) [2025-12-30 16:28:52,683 E 88920 88920] (gcs_server) gcs_server.cc:303: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(raylet) [2025-12-30 16:28:55,156 E 89285 89285] (raylet) main.cc:1032: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(pid=89391) [2025-12-30 16:28:56,819 E 89391 89782] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
[2025-12-30 16:28:57,010 E 88542 89388] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(main_task pid=90685) Detected robot platform from environment: ALOHA
(main_task pid=90685) Using ALOHA constants:
(main_task pid=90685) NUM_ACTIONS_CHUNK = 25
(main_task pid=90685) No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
(main_task pid=90685) dataset len: 1000
(main_task pid=90685) dataset len: 256
(main_task pid=90685) Size of train dataloader: 15
(main_task pid=90685) Size of val dataloader: 32
(pid=92031) Detected robot platform from environment: ALOHA
(pid=92031) Using ALOHA constants:
(pid=92031) NUM_ACTIONS_CHUNK = 25
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: Worker ID: Node ID: Worker IP address: <my ip> Worker port: <my port> Worker PID: <my pid> Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(bundle_reservation_check_func pid=91825) [2025-12-30 16:29:33,885 E 91825 91969] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 20x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Error executing job with overrides: ['data.task_suite_name=robotwin2_place_container_plate', 'data.num_trials_per_task=1000', 'data.n_samples=8', 'data.filter_accuracy=True', 'data.accuracy_lower_bound=0.1', 'data.accuracy_upper_bound=0.9', 'data.oversample_factor=1', 'data.train_batch_size=64', 'data.val_batch_size=8', 'data.max_prompt_length=256', 'data.max_response_length=128', 'actor_rollout_ref.model.path=/home/myname/xlk/SimpleVLA-RL/checkpoints/robotwin2_model', 'actor_rollout_ref.model.vla=openvla-oft', 'actor_rollout_ref.actor.ppo_micro_batch_size=2', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.val_micro_batch_size=2', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'trainer.logger=[console,wandb]', 'trainer.project_name=SimpleVLA-RL', 'trainer.experiment_name=robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.default_local_dir=/home/myname/xlk/SimpleVLA-RL/results/SimpleVLA-RL/robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.val_only=True', 'algorithm.adv_estimator=grpo', 'trainer.runtime_env=/home/myname/xlk/SimpleVLA-RL/align.json', 'trainer.wandb_mode=online', 'trainer.val_before_train=True']
Traceback (most recent call last):
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 212, in <module>
main()
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 116, in main
ray.get(main_task.remote(config))
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 2967, in get
values, debugger_breakpoint = worker.get_objects(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 1015, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=90685, ip=10.244.43.63)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 207, in main_task
trainer.init_workers()
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 197, in __init__
self._init_with_resource_pool(resource_pool=resource_pool,
File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: J3cCBZ_register_center in []
Error Summary
The execution fails during trainer.init_workers(). The root cause is that a Ray worker dies unexpectedly, which prevents the register_center_actor from being found.
1. Ray Worker Exit:
(raylet) A worker died or was killed while executing a task by an unexpected system error.
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
Possible causes: (1) OOM killer, (2) ray stop --force, (3) SIGSEGV.
2. Resulting Traceback:
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=5904, ip=10.244.82.123)
File "verl/trainer/main_ppo.py", line 207, in main_task
trainer.init_workers()
File "verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
...
File "verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: GfZJZT_register_center in []
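To narrow down causes (1) vs (3) and to see whether the register_center actor is ever created, I can run checks along these lines (illustrative; the log path assumes Ray's default /tmp/ray session directory, and ray list requires the Ray state CLI):
# Evidence of the OOM killer or a segfault around the crash time
dmesg -T | grep -iE 'out of memory|killed process|segfault'
# Logs of the dead worker referenced by the raylet message
ls -lt /tmp/ray/session_latest/logs/ | head
# While init_workers() is running, check whether <prefix>_register_center shows up
ray list actors --limit 200
ray status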
Request for Assistance
I have spent considerable time trying to debug this issue by adjusting GPU memory utilization, batch sizes, and FSDP offloading settings, but the worker crash during initialization remains persistent.
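Concretely, the run captured in the log above already used reduced settings relative to the script, for example (only the changed overrides shown):
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.actor.ppo_micro_batch_size=2
actor_rollout_ref.actor.fsdp_config.grad_offload=False
actor_rollout_ref.rollout.val_micro_batch_size=2
actor_rollout_ref.rollout.log_prob_micro_batch_size=16
trainer.n_gpus_per_node=2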
Since this is happening during the provided evaluation script on a standard 8-GPU setup, I would greatly appreciate it if you could provide some guidance or suggestions on how to resolve this.
Thank you very much for your time and for this project!