Training ends at a particular step #57

@sivakishoresk

Description

Hello,
I'm trying to replicate the entire training process, but it crashes every time at the 5th iteration. Could you please help me fix this?

Here is the error trace, including the logs of step 3:

step:3 - timing/gen:305.090 - timing/verify:8.531 - train_verify_score/math:0.381 - train_verify_score/all:0.381 - reward_model/prm_loss:0.720 - reward_model/grad_norm:5.092 - reward_model/dpo_acc_before:0.581 - timing/reward_model:107.155 - train_reward/reward_model:0.023 - train_reward/verifier:0.492 - train_reward/reward_all:2.576 - critic/kl:0.000 - critic/kl_coeff:0.000 - timing/adv:1.017 - timing/update_actor:84.257 - actor/pg_loss:-0.650 - actor/pg_clipfrac:0.002 - actor/ppo_kl:0.000 - actor/grad_norm:0.158 - actor/lr(1e-4):0.005 - critic/score/mean:2.576 - critic/score/max:9.550 - critic/score/min:-4.858 - critic/rewards/mean:2.576 - critic/rewards/max:9.550 - critic/rewards/min:-4.858 - critic/advantages/mean:-0.000 - critic/advantages/max:3.169 - critic/advantages/min:-2.927 - critic/returns/mean:-0.048 - critic/returns/max:2.113 - critic/returns/min:-2.044 - response_length/mean:1435.491 - response_length/max:3072.000 - response_length/min:73.000 - prompt_length/mean:193.648 - prompt_length/max:445.000 - prompt_length/min:141.000
(main_task pid=1367807) logging3: 0.0 seconds
(main_task pid=1367807) 264
(main_task pid=1367807) gen: 97.8 seconds
(main_task pid=1367807) WARNING:2025-02-20 07:19:44,376:WARNING: Error in configuration: macro '\frac' failed its substitution!
(main_task pid=1367807) WARNING:2025-02-20 07:19:44,567:WARNING: Error in configuration: macro '\frac' failed its substitution!
(main_task pid=1367807) Accuracy distribution: 0.00:136 0.25:31 0.50:24 0.75:18 1.00:55
(main_task pid=1367807) Filtered batch size: 292 (from original size: 1056)
(main_task pid=1367807) verify: 2.4 seconds
(main_task pid=1367807) collected 292 / 1024 rollouts and each prompt has 4 responses
(main_task pid=1367807) gen: 97.7 seconds
(main_task pid=1367807) WARNING:2025-02-20 07:21:24,810:WARNING: Error in configuration: macro '\frac' failed its substitution!
(main_task pid=1367807) Accuracy distribution: 0.00:110 0.25:29 0.50:22 0.75:22 1.00:73
(main_task pid=1367807) Filtered batch size: 292 (from original size: 1024)
(main_task pid=1367807) verify: 3.7 seconds
(main_task pid=1367807) collected 584 / 1024 rollouts and each prompt has 4 responses
Error executing job with overrides: ['data.train_files=[/data/siva.gollapalli/AIMO_experiments/PRIME/datasets/RL_data/subset_data/train.parquet]', 'data.val_files=[/data/siva.gollapalli/AIMO_experiments/PRIME/datasets/RL_data/subset_data/validation.parquet]', 'data.train_batch_size=256', 'data.val_batch_size=2', 'data.max_prompt_length=1024', 'data.max_response_length=3072', 'actor_rollout_ref.model.path=/data/siva.gollapalli/AIMO_experiments/PRIME/models/models--PRIME-RL--Eurus-2-7B-SFT/snapshots/3c18b631d89f80876bffb4db2ef0e8a989717a7c', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=True', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.model.use_remove_padding=False', 'actor_rollout_ref.actor.entropy_coeff=0.', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=64', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.7', 'actor_rollout_ref.ref.log_prob_micro_batch_size=64', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.00', 'trainer.logger=[console,wandb]', 'trainer.project_name=testing', 'trainer.experiment_name=online-after-solvable-0.2-0.8-policy-self-ref', 'trainer.default_local_dir=/data/siva.gollapalli/AIMO_experiments/PRIME/RL_training/testing/online-after-solvable-0.2-0.8-policy-self-ref', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=16', 'trainer.test_freq=16', 'trainer.total_epochs=1', 'data.n_samples=4', 'data.filter_accuracy=True', 'data.accuracy_lower_bound=0.2', 'data.accuracy_upper_bound=0.8', 'algorithm.adv_estimator=rloo', 'algorithm.adv_params.verifier_gamma=1.0', 'algorithm.adv_params.reward_model_gamma=1.0', 'reward_model.rm_type=prime', 
'reward_model.rm_coef=5', 'reward_model.prime_model.path=/data/siva.gollapalli/AIMO_experiments/PRIME/models/models--PRIME-RL--Eurus-2-7B-SFT/snapshots/3c18b631d89f80876bffb4db2ef0e8a989717a7c', 'reward_model.prime_model.ref_path=/data/siva.gollapalli/AIMO_experiments/PRIME/models/models--PRIME-RL--Eurus-2-7B-SFT/snapshots/3c18b631d89f80876bffb4db2ef0e8a989717a7c', 'reward_model.model.input_tokenizer=null', 'reward_model.prime_model.use_remove_padding=False', 'reward_model.prime_granularity=token', 'reward_model.micro_batch_size=8', 'reward_model.prime_model.update=before', 'reward_model.prime_model.beta_train=0.05', 'reward_model.prime_model.optim.lr=1e-6', 'reward_model.prime_model.optim.grad_clip=10.0', 'reward_model.prime_model.input_tokenizer=null', 'trainer.default_local_dir=/data/siva.gollapalli/AIMO_experiments/PRIME/RL_training/testing/online-after-solvable-0.2-0.8-policy-self-ref']
Traceback (most recent call last):
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/trainer/main_ppo.py", line 131, in main
ray.get(main_task.remote(config))
File "/data/siva.gollapalli/AIMO_experiments/PRIME/prime_train_env/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/prime_train_env/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/prime_train_env/lib/python3.10/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/prime_train_env/lib/python3.10/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::main_task() (pid=1367807, ip=10.67.28.2)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/trainer/main_ppo.py", line 223, in main_task
trainer.fit()
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/trainer/ppo/ray_trainer.py", line 626, in fit
val_metrics = self._validate()
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/trainer/ppo/ray_trainer.py", line 321, in _validate
test_output_gen_batch = self.actor_rollout_wg.generate_sequences(test_gen_batch)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/single_controller/ray/base.py", line 39, in func
args, kwargs = dispatch_fn(self, *args, **kwargs)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/single_controller/base/decorator.py", line 275, in dispatch_dp_compute_data_proto
splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
splitted_args.append(arg.chunk(chunks=chunks))
File "/data/siva.gollapalli/AIMO_experiments/PRIME/PRIME/training/verl/protocol.py", line 415, in chunk
DataProto(batch=batch_lst[i], non_tensor_batch=non_tensor_batch_lst[i], meta_info=self.meta_info))
IndexError: tuple index out of range
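For context on the traceback: `_split_args_kwargs_data_proto` chunks the validation batch into `world_size` pieces (here `trainer.n_gpus_per_node=8`), but a chunk-style split returns *at most* that many pieces. With `data.val_batch_size=2`, only 2 chunks exist, so indexing chunk `i >= 2` raises exactly this `IndexError`. A minimal sketch of the failure mode, assuming the batch splits like `torch.Tensor.chunk` (variable names are illustrative, not from the repo):

```python
import torch

# Tensor.chunk(n) returns AT MOST n pieces; a batch of 2 rows split
# into 8 requested chunks yields only 2 pieces.
batch = torch.arange(2)          # stands in for a validation batch of size 2
pieces = batch.chunk(chunks=8)   # world_size = 8 GPUs
print(len(pieces))               # 2, not 8

# Indexing pieces[i] for i in range(8), as the dispatch loop
# effectively does, fails as soon as i == 2:
try:
    [pieces[i] for i in range(8)]
except IndexError as e:
    print(e)                     # tuple index out of range
```

If this is the cause, raising `data.val_batch_size` to a multiple of the GPU count (e.g. 8) should get validation past this split.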
