<em>Simpler is Better:</em> Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

Apr 27, 2025

Figure: Proposed Pipeline

Tags: NLP, LLM, ML

Zichen "Charlie" Zhang
Building AI wearables @ Halo | Ex AI @ Supercell
I’m passionate about transforming traditional software with AI.