Months of (Failed) RL Experiments
A Beginner's Post-Mortem
Hi,
The purpose of this blog post is to share all my dumb mistakes and insights from my recent project. We were trying to finetune an 8B model by first performing SFT on a general reasoning dataset, followed by RLFT on our custom dataset.
By no means should I be considered an expert in coding or in finetuning. This was the first end-to-end technical project I have ever done. So this blog post is primarily intended for beginners like me, driven by a keen desire to share what I found useful.
This blog does not explain the when, where, why, and how of RLFT and GRPO; that will be a different post. It assumes you already have a basic understanding of both before getting into this.
Resources:
Synthetic Dataset Generation: GitHub Link
Finetuning Pipeline with Eval (Focus of this Blog): Colab Link
This work would never have been possible without the help and support of Sriram Hebbale, who experienced these frustrations and challenges along with me, as well as Dr. Dhruv Kumar, who guided us through this project. I’ve also borrowed a good deal of insight from Gaurish Trivedi.
Let’s Dig into Finetuning
Firstly, finetuning is not some game changer. It will not magically solve your problems. Making domain-specific models sounds very exciting, but finetuning at the scale we can manage will not cause massive changes. There is a big difference between small experiments (the kind we can do as students) and production-ready work. In practice, the easiest thing you can do is pay for the APIs of the best models, iterate on your prompts until they are good, and leave it at that.
Anyways, let’s talk about how to finetune.
We spent a lot of time and energy finetuning a small model on a small dataset, thinking that pulling it off would be a breakthrough, only to observe unimpressive results. The obvious way to improve performance was to pick a bigger dataset and a better model. As students doing projects, keeping sight of this bigger picture is crucial, and it is exactly where consulting your professor comes in very handy.
So there are a few components we have to understand here:
Model
Training Recipe
Dataset
Compute Resources
Technical Ability of the Engineer.
I kept trying to single out one or the other, only to realise they are all deeply interrelated.
A Fundamental Difference in SFT & RLFT Datasets
Next, I want to underline a very important point I think needs to be understood properly. There is a fundamental difference between the kind of datasets you need for Supervised Finetuning and RL Finetuning of LLMs.
In the simplest sense, SFT is imitation. You give the model a bunch of answers written by smarter LLMs; if you give it chunks of a book, it will learn to speak like that book. The interesting thing is that when you finetune a small LLM on answers from smarter LLMs, its reasoning capability improves by a lot. A lot. This is simple, easy and tangible. Just as you shouldn’t consider finetuning before first thinking about RAG and prompt engineering for your problem, I strongly suggest trying SFT before RLFT: it’s much simpler to implement and will give you better baseline improvements.
I’ll also give a short intro to RLFT before coming back to our fundamental dataset difference. Firstly, be clear about what your loss function is. In SFT it is cross-entropy loss: in simpler terms, a measure of how much probability the model assigns to each correct next token. In RLFT, on the other hand, you define “reward functions”; the idea is simple, you are rewarding a particular behaviour.
SFT vs RLFT: Think of SFT as providing the Demonstrations (the path), while RLFT requires Verifiable Outcomes (the destination).
Example: if your problem is to solve Sudoku, you can have a reward where the model gets points for each small box it solves correctly. We then total these points, normalise them, clip them (apply a min/max), and add a penalty so that no single update becomes too big. [The core idea here is that we want RLFT updates to be very small, because training goes wild when you allow too much movement; again, it’s about finding balance.]
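To make “normalise, clip, penalise” slightly more concrete, here is a rough sketch, not our training code, of the idea behind GRPO’s group-relative advantages (TRL’s GRPOTrainer does all of this internally; the numbers are made up):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: one total reward per generation sampled for the same prompt."""
    # How much better or worse each generation is than its siblings,
    # measured in standard deviations within the group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Example: 4 generations for the same Sudoku prompt, scored by our reward functions.
rewards = torch.tensor([3.0, 1.0, 0.0, 2.0])
print(group_relative_advantages(rewards))

# The clipping (min/max) and the penalty live in the GRPO loss itself:
# the probability ratio is clipped and a KL term keeps the policy close
# to the reference model, so no single update can be too big.
```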
The main thing I want to underline here is that the loss functions are fundamentally different. So, in SFT our job is:
Learn (imitate) how to speak in a particular way
Copy how smart people think out loud => your thinking quality (reasoning traces) will improve => you’ll give better answers.
This is not the case in RLFT. Take a dataset like JEEBench. It has objective questions: a column with the MCQ problem statement and a column with the final correct option. To perform RLFT in the simplest way, I give the problem statement in the prompt, the LLM produces a long answer with the final answer at the end, and I extract that final answer, check whether it is correct, and award marks.
So JEEBench is a dataset on which I can do RLFT. In fact, if you think about it, you can do RLFT on almost any verifiable dataset, no matter how old. This is not the case with SFT. If I have to do SFT on JEEBench, the simplest way is to give the problems one by one to the smartest LLM available (e.g. Gemini 3), copy its answers into a new dataset, and then perform imitation learning (a.k.a. SFT) on that.
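As a hypothetical sketch of how simple this “extract, then verify” step can be for an MCQ dataset (the “Final Answer:” pattern and the argument names are my assumptions; TRL’s GRPOTrainer passes extra dataset columns to reward functions as keyword arguments, so check its docs for the exact signature your data format needs):

```python
import re

def mcq_accuracy_reward(completions, answer, **kwargs):
    """Toy accuracy reward: extract the final MCQ option and compare to the gold column.
    Assumes the prompt told the model to end with 'Final Answer: <A/B/C/D>'."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"Final Answer:\s*([A-D])", completion)
        predicted = match.group(1) if match else None
        rewards.append(1.0 if predicted == gold else 0.0)  # full marks or nothing
    return rewards
```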
This dataset thing is the first thing you have to get into your head.
The Nuance of Format & Reward Functions
Now let me enter into the second half of the dataset nuance. Just like when someone is doing any Neural Network training, the loss function matters A LOT. Similarly, when doing RL Finetuning, the things which will matter the most are your reward functions.
Deepseek people went for these two reward functions:
Accuracy -> Checking the Answer is correct
Format -> The answer of the LLM is in the proper format “<start_thinking>....abcxyz…<end_thinking><start_solution>...abcxyz…<end_solution>”
This is very simple and easy to do.
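The format reward is then just a regex over the completion. A hypothetical sketch using the tags from above (adapt the tag names to whatever you actually enforce; more on why that matters in point 3 below):

```python
import re

FORMAT_PATTERN = re.compile(
    r"<start_thinking>.*?<end_thinking>\s*<start_solution>.*?<end_solution>",
    re.DOTALL,
)

def format_reward(completions, **kwargs):
    # 1.0 only if the whole thinking/solution structure is present and in order.
    return [1.0 if FORMAT_PATTERN.search(c) else 0.0 for c in completions]
```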
There are three things I want to highlight here:
1] How is finetuning improving the answer quality?
We are assuming that when the reasoning process is “better” in some way or other, it will lead to a correct final answer. In SFT, we do this by teaching the small LLM to speak in the way of smarter LLMs. In RLFT, we do it with a carrot-and-stick approach on the format and the final-answer accuracy. That’s it. You could ask: shouldn’t we do step-marking or partial marking? Wouldn’t that be a better way to improve the reasoning? One, DeepSeek’s January 2025 experiments suggest otherwise (read up on PRMs, process reward models). Two, it’s not easy to do. It can obviously be done and sounds like a great idea, but I won’t comment on it now. Even in our own simple setup, text extraction was very tricky to get right, and if the extraction is wrong, the whole RLFT process collapses.
2] What exactly are reward functions?
They are basically text extractors followed by some verification. I don’t know why, but I didn’t think about them that way at first, and coded them the wrong way.
3] The Format Trap (Very Important)
Coming to format rewards: because of DeepSeek’s success, keeping them feels like a no-brainer. But there is one extremely important nuance here. Remember that RLFT is often preceded by an SFT step, and in SFT the model learns to speak in a particular way. So if you are going that route, you need to make sure that the format you are rewarding in RLFT is the same format you enforced during SFT.
The whole build-up was for you to understand point number 3 properly. I can’t tell you how exasperated I was when, after months, I realised this was where I had been going wrong.
This write-up should also help you understand what a “reasoning dataset” is. It is what you need for SFT: the reasoning traces. A good old question-and-answer dataset, on the other hand, is enough for RLFT, because there we extract from generated answers and verify against the dataset to award points.
Small note: just as an experiment, instead of doing SFT on standard reasoning traces with the “thinking out loud” steps, I did it directly on text that contained the code solution for each problem followed by a number produced by actually running that code. As you know, an LLM cannot run code. So it learnt the format I forced on it, gave me some code and a number at the end, but the eval results were very bad, which tells you the number at the end was fully hallucinated.
Chat Templates and Tokenisers
I’m wrapping up the Dataset Section to now come to Chat Templates. For some reason, this concept took me the longest to wrap my head around.
First, think of SFT. Say you take a normal dataset like JEEBench, with a question and a solution, and you also copy in the answers from Gemini 3 (these are the reasoning traces) to make your own dataset. One thing you have to remember is that LLMs are trained to speak to humans in a systematic way. Whatever base model you have chosen has already been taught a particular format of speaking, and you need to reshape your dataset to match it. [It sounds simple, but it was a pain to get right, there’s no shortcut for it, and it changes for every base model.] In one sense, this is the step that makes your reasoning traces “machine readable”.
Look at these: Alpaca Style, ShareGPT Style, ChatML.
These three formats are for the same task but for different types of models. Finetuning itself, though, comes in different flavours: you need raw text for continued pretraining (early on, when the model is learning grammar), a single-turn format for one prompt and one answer (like a single LLM call through the Gemini API), and a multi-turn format for conversations where the model sees the history (like a ChatGPT chat).
Note: Handling memory in multi-turn conversations is an extremely important area in itself, but we won’t digress into that because I don’t know anything about it as my experiment was on single turn stuff.
The next thing after chat templates is understanding tokenisers.
Everyone here surely knows that we break text into tokens, each of which is mapped to a number (and later to an embedding vector). We need to do the same with our dataset: the text fields are not just split into tokens, each token is also assigned a specific ID. You may have studied that LLMs have a vocabulary of (practically) all tokens, and that in their final step (next-token prediction) they classify over this vocabulary. In the same way, at the input step every token is tagged with its ID according to its position in that same vocabulary. Lastly, model-specific markers (special tokens) are also added at the token level during this step.
Clarification: Think of a Chat Template as a function that transforms a list of messages (dictionaries) into a single string. This string is then tokenized.
The confusion usually lies between Special Tokens (like <|im_start|> which structure the conversation) and Reasoning Tags (like <think> which structure the content).
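A minimal sketch of that distinction (the model name is just an example; the exact special tokens depend on each tokenizer’s chat template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are given a problem. Think it through, then answer."},
    {"role": "user", "content": "What is 2 + 2?"},
    # Reasoning tags like <think> live INSIDE the assistant content...
    {"role": "assistant", "content": "<think>2 + 2 = 4</think> The answer is 4."},
]

# ...while the chat template wraps each turn in the model's own special tokens
# (ChatML-style <|im_start|> / <|im_end|> for this model family).
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

# Only after this flattening does tokenisation happen.
token_ids = tokenizer(text).input_ids
```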
There is an important point to clarify here. In our experiment, our base dataset “OpenR1-Math-220k” used specific tags like <think> and \boxed{}. However, we decided to shift to custom tags: <start_working_out> and <SOLUTION>. We also add our system prompt at this step.
We needed to do this rewrapping manually. What I want to clarify is that the tokeniser also does a kind of rewrapping, but that comes afterwards: this is the chat template. The chat template deals with the markers around the text; it has nothing to do with what is inside the text, and it is applied down at the token level. Our tag change, on the other hand, is inside the text: we wanted the tags in the reasoning traces to match what we would later use in the RLFT reward functions, and that part we had to do by hand. Lastly, we put everything (the tokenised text) into a single column on which training is performed.
Context Limits
Before we get to training with SFTTrainer or GRPOTrainer, there’s one more piece to take care of: context limits.
We encounter length variables in four different places (a small sketch of how they fit together follows this list):
max_seq_length in Data Prep before SFT
In RLFT, we encounter max_prompt_length = maximum_length + 1 (where maximum_length is the longest tokenised prompt you keep in the dataset)
max_completion_length = max_seq_length - max_prompt_length
In Eval, we have max_new_tokens as a generation parameter
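Roughly how they fit together in our pipeline (variable names follow the Unsloth GRPO notebooks; the numbers are placeholders, and `tokenizer` / `prompts` are assumed to exist from earlier steps):

```python
max_seq_length = 6144  # total budget per sample: prompt + completion

# RLFT: measure the longest tokenised prompt you keep, then give whatever
# budget is left to the completion (this mirrors the Unsloth GRPO notebooks).
prompt_lengths = [len(tokenizer(p).input_ids) for p in prompts]
maximum_length = max(prompt_lengths)
max_prompt_length = maximum_length + 1
max_completion_length = max_seq_length - max_prompt_length

# Eval: max_new_tokens only caps the generated part.
max_new_tokens = max_completion_length
```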
Sequence length matters in many ways. Firstly, depending on the task you might want large lengths (Claude now offers a 1M-token context for giant codebases). There are also training recipes that finetune on shorter context lengths first and longer ones later (see Fathom-R1). Secondly, truncation was the most common issue I encountered while diagnosing evals: it makes extraction impossible, and you lose useful information in the SFT phase.
When I initially ran the RLFT training pipeline, the estimated time was around 500 hours. A good friend pointed out that my context length was 16k, which only makes sense for a top-tier LLM (I was on a 0.5B model). So sequence length is closely tied to the complexity of your task, the training recipe you’re following, the model’s capabilities, your expected training time, and most importantly your dataset.
Constraint: when small models are pushed to generate very long sequences, the answer quality actually deteriorates.
When I inspected my dataset more closely, I found that a good enough number of samples fit within 6k tokens.
The Library Stack (Unsloth & Friends)
So, when you want to do the fine tuning, there are a number of libraries at play.
For interacting with GPUs, we have CUDA.
PyTorch is the framework in which the models themselves are implemented.
We also have NumPy around for general array and matrix work.
The layer between CUDA and PyTorch is Triton. The OpenAI folks built it; it lets you write Python-style code instead of raw CUDA and does a lot of optimisations for you.
The Transformers library by Hugging Face is what the name implies: the standard way to load and work with transformer models.
HF also has TRL, which stands for Transformer Reinforcement Learning. It contains the SFTTrainer and GRPOTrainer classes that we use for training.
On top of this you have PEFT (Parameter-Efficient Fine-Tuning), which lets you finetune small adapter layers instead of the full model (look up LoRA and QLoRA).
Now, two things. Firstly, this setup by itself is quite slow, which is to be expected when you are dealing with billions of parameters. This is why a lot of people use Unsloth (which also leverages vLLM).
Coming back to DL basics: even during training you perform full forward passes, and in GRPO and eval the model also has to generate completions, so inference shows up at every stage from SFT to GRPO to eval. vLLM makes that inference much faster. It’s very cool, read up on it: it does something similar to OS-style memory paging, but for the KV cache (the memory used by the attention mechanism).
Unsloth neatly wraps everything for you. Secondly, they replace some standard operations with hand-written Triton kernels for different models, which makes things faster. It works well, I’ve used it a lot and definitely recommend it. I’ve also been told about Axolotl and Accelerate with DeepSpeed, but never tried them.
So the first point is that Unsloth is nice and fast, compared with wiring up TRL and PEFT yourself.
Secondly, you might have noticed that I’ve just listed a lot of libraries, and there are more I haven’t mentioned. Even more unfortunate is that they keep updating and keep producing dependency errors.
What worked for me was finding a combination that worked and pinning those versions, rather than always taking the latest version of each. General coding guidelines say you shouldn’t do this, but it’s what worked for me. I have spent hours of my life wading through dependency errors, and I do not want to relive that experience.
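In Colab, the low-effort version of this is to freeze the combination that finally worked and reinstall exactly that next time (the Drive path is just an example):

```python
# In the notebook where everything finally worked:
!pip freeze > /content/drive/MyDrive/working_requirements.txt

# In a fresh runtime later, restore the exact same versions:
!pip install -r /content/drive/MyDrive/working_requirements.txt
```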
PS: The pip dependency resolver is a lifesaver.
Model Loading
While we were initially using a single combined notebook, we had this great epiphany of dividing our workflow into separate Colab notebooks. However, when we tried to do that, everything crashed.
Let’s now talk about the first step of your finetuning pipeline: loading the model you will finetune. These models are hosted on the Hugging Face Hub, and we generally use the HF libraries to pull them from there. However, because we want Unsloth’s optimised kernels, we load through Unsloth’s FastLanguageModel instead. It fetches the pretrained model and also adds the adapter layers through PEFT.
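For reference, the loading step looks roughly like this (the model name, rank, and numbers are placeholders; check the Unsloth docs for the current arguments):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-0.5B-Instruct",  # placeholder base model
    max_seq_length=6144,
    load_in_4bit=True,           # 4-bit weights: the easiest OOM fix
    fast_inference=True,         # enables the vLLM backend for generation
    gpu_memory_utilization=0.6,  # fraction of VRAM handed to vLLM's KV cache
)

# Attach LoRA adapters (PEFT): only these parameters are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank, one of the biggest levers on training time
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```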
What we tried to do was load the base model, finetune it with SFT, save it to Google Drive, and then load it again for RLFT. But at that point we found we couldn’t leverage the Unsloth optimisations. I tried every combination I could think of, from saving the full model to merged weights to only the adapters, and different loading functions, but it wasn’t coming together. So we ended up running the whole pipeline in a single notebook.
I’m sure there’s something I’m missing and this cannot be the way to do it. If you know how, please please let me know.
Dataset Size
When I started working on this project, the most common question I had was: how big a dataset do I need? I asked every “expert” I could find.
Thumb rule: garbage in, garbage out. A small, high-quality dataset beats a large, useless one.
Your base model already understands grammar and has some reasoning capability; through finetuning you are purposefully overfitting it to your task. Quality of a dataset means [1] accuracy and [2] diversity; in tasks like JEEBench it also covers the complexity of the problems. Once you have that sorted, go as big as you can get. To get a base pipeline started, around 10k samples for SFT and 5k for RLFT can get you going.
But let’s ask the larger question: on what basis do we decide the dataset size? Take a step back. What is the dataset for? We want to finetune our model, so whatever size gets us to a good finetune is the right size. And what is the measure of a good finetune? Think traditional ML: we know training has gone well when the loss converges. In exactly the same way, we want the loss here to converge to a low enough value, generally around 0.3-0.7 (normalised).
That broadly answers the dataset-size question. I want to make one more point about the loss heuristic. Someone might ask: why a lower bound of 0.3, why not let the loss converge to 0.1 or so? The difference from traditional ML is that this is a generative task. While we are teaching the LLM to imitate a way of speaking, we obviously don’t want it to copy our dataset word for word; that might break its own reasoning process. Another way to think about it is that overfitting kicks in at a slightly higher loss value when doing SFT on LLMs.
Memory Management and OOM
Now I want to come to memory. The most common error we encounter is OOM, which means Out of Memory. We generally deal with three types of memory:
Disk (slow i/o speed)
System RAM (faster)
GPU RAM or VRAM (fastest)
OOM usually refers to GPU RAM (VRAM). Because of these different speeds, different parts of the pipeline interact with different memory spaces to keep things fast.
Disk refers to /content/drive/ in Colab, which is our Google Drive where we can store our finetuned models.
System RAM stores the OS, helps us run the python scripts and contains the datasets before we send them to the GPU.
VRAM is where the LLM’s weights and gradients live, so this is the most important one.
When you have a bigger model, you not only have bigger weights, but also more gradients and “optimiser states” (which track how to update the model).
We usually run into OOM in VRAM when:
Too large of a Model is loaded (Model Loading)
If the sequences are too long (SFT/RLFT)
If a single batch contains too many sequences (SFT)
If num of generations in GRPO is too high (as they are processed simultaneously)
When we save the model with merged weights, the system tries to hold both the base model and the merged model in System RAM at the same time. That can also lead to an OOM.
In our case, we were initially trying to load OpenThoughts3-1.2M and then take a small portion of it for training, but the sheer size of the dataset caused an OOM in System RAM. The easiest general fixes are to use “4-bit” models (smaller weight representations), shorten the context length, and reduce the batch size.
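On the dataset side specifically, you don’t have to materialise all 1.2M rows just to train on a slice. A sketch with HF `datasets` (the dataset ID is how I remember it on the Hub, so double check it):

```python
from datasets import load_dataset

# Option 1: only keep the first N rows.
small = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train[:10000]")

# Option 2: stream rows lazily instead of loading everything into System RAM.
streamed = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train", streaming=True)
subset = streamed.take(10_000)  # an IterableDataset; rows are fetched as needed
```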
Why we chose OpenThoughts: The idea was to take the state-of-the-art general reasoning dataset for the maximum reasoning gains and then do RLFT on our specific task.
We initially chose a 0.5B model for the simplest possible check. Then we wanted to see how big a model we could take given our resource constraints, because we were working on a complex task that a 0.5B model could not handle.
Interesting observation: there wasn’t a huge difference in training time when we switched from 4B to 8B, keeping the LoRA rank the same.
Another extremely important point to remember is that LLM parameters are typically stored in bfloat16, which has the same range as float32 but takes 2 bytes instead of 4, and modern GPUs have components specialised to do math in exactly this format. For some reason I changed this default in my config and experienced a ton of crashes and crazy high training times.
Lastly, in the model-loading step there’s a variable called gpu_memory_utilization. It’s generally kept between 0.4 and 0.6, and tuning it also helps resolve OOMs to some extent.
Hyperparameters and GPU Selection
Also, please read up on effective batch size in the Unsloth Hyperparameters Guide. The guide covers not just effective batch size but hyperparameters in general; it’s very well written and you should definitely go through it.
I’d also like to mention some of my own experiences with hyperparameters.
Batch size was definitely the most effective knob for handling OOM.
You can get good GRPO results with only 2 generations (https://arxiv.org/html/2510.00977v1). Choose either 2 or 4; if you take 3, it upsets the divisibility of our effective batch size of 16 and increases the number of steps.
Thirdly, and as expected, in my LoRA finetuning I found the lora_rank to be one of the biggest movers of the training time.
Fourthly, you can shave a little off the training time by saving checkpoints and logging outputs less often.
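Putting those knobs together, the GRPO config ends up looking something like this (values are illustrative, not a recommendation; the two length variables come from the context-limits section above):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    per_device_train_batch_size=4,   # first knob to turn down when you hit OOM
    gradient_accumulation_steps=4,   # keeps the effective batch size up
    num_generations=4,               # 2 or 4; 3 breaks the divisibility and adds steps
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    learning_rate=5e-6,
    logging_steps=10,
    save_steps=250,                  # saving less often shaves a bit off wall-clock time
    output_dir="outputs",
)
```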
PS: There was one strange OOM which we were not able to explain, until Sriram somehow found out about this issue in vLLM: https://github.com/vllm-project/vllm/issues/32324
Which GPU do you need?
The first thing here is that a GPU is always specified together with its RAM, so work out your requirement accordingly. You might get by with a 7-8B model on 15GB of VRAM and up to roughly 30B on 40GB (assuming 4-bit quantisation as well).
Secondly, there are costs associated with GPUs. On Colab we got 100 compute units for a Rs. 1200 purchase (an international transaction in dollars, so a credit card is required). The runtime depends on the intensity of your task; in our case it was roughly 15 A100-hours, and we observed that a full run (SFT + RLFT + Eval) could take up to 30 A100-hours.
Reward Hacking
When you perform RLFT, you get a whole table of rewards, losses, and so on; that is a gold mine for your own analysis. Another thing I want to mention here is reward hacking: when you set certain rewards, the LLM tries to game the system by chasing the rewards themselves in ways that don’t lead to the performance improvement you actually wanted.
In our case, in one of the runs the model started dumping all the tags (<think>, <solution>) at the end of its output instead of organising the content between them, because we were giving partial rewards just for mentioning the tags and negative points for omitting them. So it simply listed all of them at the end to avoid the penalty. The core message is that reward hacking will happen; you need to inspect the outputs, see where it is happening, and adjust your reward functions accordingly.
Think About the Whole Process (Especially Eval)
When I started, I realised I was not thinking about the whole process. Obviously, if you’re finetuning a model you’ll have to know how to deploy and evaluate it, but that realisation came only after a long time. Some might say it’s fine not to think about it at the beginning, but I think eval is too important to ignore.
In many senses, eval is the main experiment; that’s why the Experimental Setup section of research papers describes the evaluation. We need a handle on both the methodology and the evaluation setup.
Of course, we finish with eval. Eval is a job in itself, and as I mentioned at the very beginning, it is central to how you design your work. Initially I was quite afraid of it, not knowing how it would work, until someone pointed out that it’s just inference plus verifiers. The go-to library for a quick eval is lm-evaluation-harness. It’s quite easy to use, but you can only evaluate on the datasets it already supports; for a custom dataset it’s better to just write the eval yourself.
I also learnt that saving things matters. This is obvious, but I ran the whole pipeline without remembering that I needed to save the finetuned model somewhere so I could run eval on it.
Another dumb thing I did: while rerunning the script, I overwrote my old saved models with some experimental stuff. The lesson is that timestamps and script/experiment names should be automatically baked into the names under which you save files.
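A tiny helper that would have saved me that pain (the Drive path and experiment name are just examples):

```python
import os
import time

experiment_name = "qwen_sft_then_grpo"  # hypothetical run name
run_id = f"{experiment_name}_{time.strftime('%Y%m%d_%H%M%S')}"
save_dir = os.path.join("/content/drive/MyDrive/finetuning_runs", run_id)
os.makedirs(save_dir, exist_ok=True)  # every rerun gets its own folder; nothing is overwritten
```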
One more dumb thing was burning our own GPU hours for eval. There are services like OpenRouter you can use instead. This means that if your project is purely about LLM eval, you don’t even need direct GPU access; you can use these API services (model provider servers).
Also, when saving things there’s a difference between saving to Google Drive and saving to the Colab runtime. If you have Gemini Pro with 2TB of Drive space, you can save everything directly with no issue. If you don’t, you can still save the LoRA adapters straight to Drive, as they only take around half a GB. For the full model, save it to the Colab runtime first and then download it locally (do this immediately, before the runtime disconnects). A full 0.5B model can take around 7GB of space.
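With Unsloth, the two options look roughly like this (I’m going from memory of their save methods, so double check the current docs):

```python
# Small: just the LoRA adapters (a few hundred MB), fine to write straight to Drive.
model.save_pretrained("/content/drive/MyDrive/finetuning_runs/lora_adapters")
tokenizer.save_pretrained("/content/drive/MyDrive/finetuning_runs/lora_adapters")

# Big: base weights merged with the adapters. Save to the Colab runtime first,
# then download it before the runtime disconnects.
model.save_pretrained_merged("/content/merged_model", tokenizer, save_method="merged_16bit")
```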
Trick for Colab: open a PPT, put it in slideshow mode, and keep your charger connected; I’ve used this a lot to keep runs going overnight. Another trick: to access a shared Google Drive folder when you mount Drive in Colab, add that folder as a shortcut inside your “My Drive”.
Conclusion
It’s 13 pages in my Google Doc now. If you’re still reading this: firstly, thanks a lot for considering it worth your time. I know it has been very long.
I’ve always felt uncomfortable writing ending sections and concluding paragraphs. Heheh.
What I would like to say is that this project is very close to me because it helped me break a lot of psychological barriers. I used to get nervous about small things like uploading code to GitHub, resolving dependency issues, or just coding in general, because I hadn’t done these things before. It also led to my first paper submission to a conference.
I’m grateful that I was forced into doing many of these things out of necessity (majboori) when my partner wasn’t available, which helped me learn a lot and gain confidence.
It took me six months to get this finetuning project going. I had never accessed a GPU remotely or used a VPN. When I sat down to write this five hours ago, I connected to my university VPN, accessed the GPU my professor granted me through the VS Code remote server, and started an eval run (evaluating my synthetic dataset against standard datasets as a benchmark). This is the first time I’m running code like this. It took far longer to set up than I would have liked. A few minutes ago, when I checked, my runtime had disconnected, and I don’t know why. My project is still incomplete, with another failed experiment today.
I hope the people who read this blog find value in it, and that it enables good experiments and interesting ideas.
Please subscribe to support this blog, I have around 6-7 more technical ones planned for the next 6 months :)


