LLM & Agent Daily Paper Reading Plan - D3
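Ramblings
I've been busy lately and haven't posted for several days. The plan for day 3 was to read about GRPO, but that pulled in PPO, RLVR, RLAIF, RLHF, and a whole pile of other reinforcement learning material I don't know yet. So I'm going to start from the beginning with the reinforcement learning algorithms used for large language models, and go deeper in later updates.
Today's reading topic: reinforcement learning fine-tuning for large language models
- References: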
- Illustrating Reinforcement Learning from Human Feedback (RLHF) - HuggingFace Blog, 2022.12
- Post-Training Techniques 2026 - LLM Stats Blog, 2026.03
- GRPO: the RL Algorithm Behind DeepSeek-R1 - Cameron R. Wolfe, 2025
- The State of LLM Reasoning Model Training - Sebastian Raschka, 2025
- Understanding Reasoning LLMs - Sebastian Raschka, 2025
- Reward Hacking in Reinforcement Learning - Lilian Weng, 2024.11
- AI 101: The State of Reinforcement Learning in 2025 - Turing Post, 2025.12
TL;DR
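Reinforcement learning fine-tuning for large language models started from RLHF (Reinforcement Learning from Human Feedback) and has since gone through a paradigm shift: RLHF → RLAIF (Reinforcement Learning from AI Feedback) → RLVR (Reinforcement Learning with Verifiable Rewards).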
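To make the RLHF → RLAIF → RLVR shift a bit more concrete, here is a rough sketch of where the training signal comes from in each case. This is my own toy illustration, not code from any of the references: the function names `preference_pair` and `verifiable_reward`, the `judge` callback, and the "last number in the response" answer check are all hypothetical simplifications.

```python
import re
from typing import Callable, Dict

# A judge looks at (prompt, response_a, response_b) and returns True if it prefers A.
# In RLHF the judge would be a human annotator; in RLAIF it would be another model.
# Both are stand-ins here (assumption), since the real thing is not a Python callback.
Judge = Callable[[str, str, str], bool]


def preference_pair(prompt: str, response_a: str, response_b: str, judge: Judge) -> Dict[str, str]:
    """RLHF / RLAIF style: the signal is a preference between two responses.

    Pairs like this are typically used to train a reward model first.
    """
    if judge(prompt, response_a, response_b):
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    return {"prompt": prompt, "chosen": response_b, "rejected": response_a}


def verifiable_reward(response: str, reference_answer: str) -> float:
    """RLVR style: the signal is a programmatic check against a known answer.

    Toy rule (assumption): the final number in the response is the model's answer.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no numeric answer found, so no reward
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


if __name__ == "__main__":
    # Placeholder judge that just prefers the longer response (purely illustrative).
    toy_judge: Judge = lambda prompt, a, b: len(a) > len(b)
    print(preference_pair("What is 6 * 7?", "42", "It is 42 because 6 * 7 = 42.", toy_judge))
    print(verifiable_reward("6 * 7 = 42, so the answer is 42.", "42"))
```

The point of the sketch is only the difference in who produces the signal: a judge (human or AI) ranks responses in the first two, while RLVR replaces the judge with a rule that can be checked automatically.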

