Discussion about this post

Sebastian Raschka, PhD:

Thanks for sharing! Sorry, I don't want to give unsolicited feedback, but since you reached out, I did a quick read and hope you don't mind if I share a few thoughts that may or may not be helpful :)

- I think you were training/fine-tuning an already strong model? That's always trickier in practice. Just to debug your setup, I would start with a base model to see whether you get any gains at all first.

- Ah yes, custom tags are always tricky. Did you check whether they (e.g., <solution>) are supported by the tokenizer and not broken down into subword tokens? Also, even if they are supported, the model may not be very familiar with them / has rarely seen them, so some additional pre-training or fine-tuning on data containing these tags might help (see the tokenizer sketch below this list).

- Since you are using a 0.5B model anyway, maybe you'd have fun trying my pure PyTorch implementation of GRPO on your dataset: https://github.com/rasbt/reasoning-from-scratch/blob/main/ch06/01_main-chapter-code/ch06_main.ipynb (the accuracy improves from 15% to 48% there)

- "one more point here about the loss heuristic" I wouldn't focus so much on the loss when doing RL, just make sure there are no crazy spike etc. What's more interesting is the advantage stats, reward averages, entropy and so on
