Hi, may I know why these two reward_fn usages behave differently, even though they appear to be the same function passed to the PPO trainer as input? In my understanding of PPO, the reward function should output a single reward for each sample rather than a sequence of rewards of shape (sequence_length,).
trlx/trlx/trainer/accelerate_ppo_trainer.py
Lines 309 to 310 in 3340c2f
trlx/trlx/trlx.py
Lines 38 to 40 in 3340c2f
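To make the distinction I am asking about concrete, here is a minimal sketch of the two output shapes. It is not the trlx code at the lines referenced above: the names `per_sample_reward` and `per_token_reward` are hypothetical, the word-count scorer is a placeholder, and placing the scalar score on the final token is only my assumption about how a per-sample reward could be turned into a (sequence_length,) vector.

```python
from typing import List


def per_sample_reward(samples: List[str]) -> List[float]:
    """Hypothetical scorer: returns one scalar reward per sample.

    The word count used here is a placeholder for a real reward model.
    """
    return [float(len(s.split())) for s in samples]


def per_token_reward(score: float, num_tokens: int) -> List[float]:
    """Expand a scalar score into a (sequence_length,) reward vector.

    Assumption: every token gets 0.0 except the last one, which carries
    the whole sample-level score.
    """
    if num_tokens <= 0:
        return []
    rewards = [0.0] * num_tokens
    rewards[-1] = score
    return rewards


samples = ["a b c", "d e"]
token_counts = [3, 2]  # assumed token counts, one per sample

scores = per_sample_reward(samples)  # shape: (batch_size,)
dense = [per_token_reward(s, n) for s, n in zip(scores, token_counts)]
```

Here `scores` holds one value per sample, which is what I expected the reward function to return, while each entry of `dense` has one value per token.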