<em>Simpler is Better:</em> Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

Proposed Pipeline
Zichen "Charlie" Zhang
Building AI wearables @ Halo | Ex AI @ Supercell

I’m passionate about transforming traditional software with AI.