<em>Simpler is Better:</em> Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

Proposed Pipeline
Zichen "Charlie" Zhang

I’m passionate about transforming traditional software with AI.