Reinforcement Learning (RL)
An important branch of machine learning in which an agent learns an optimal behavior policy by trial and error in an environment, using feedback to maximize the cumulative reward.
Reinforcement learning is the process of interacting with the environment, accumulating experience, and optimizing a policy so as to maximize the cumulative reward.
http://incompleteideas.net/book/the-book-2nd.html
https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning
1. Basic Concepts of Reinforcement Learning
The reinforcement learning process can be described by a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$; its components are listed in the table below, and a toy code example follows the table.
The agent's goal is to learn a policy $\pi$ that maximizes the expected cumulative return: $\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t}\right]$.
| Symbol | Name | Meaning |
|---|---|---|
| $\mathcal{S}$ | State space | The set of possible observations of the environment's current state. |
| $\mathcal{A}$ | Action space | The set of actions the agent can take in the current state. |
| $P(s' \mid s, a)$ | State transition function | The probability of moving to the next state given the current state and action; it characterizes the dynamics of the environment. |
| $R(s, a)$ | Reward function | Measures how good the agent's action is, i.e., how much immediate reward it receives from the environment. |
| $\gamma$ | Discount factor | Measures the importance of future rewards; takes a value in $[0, 1]$. |
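To make the five-tuple concrete, here is a minimal Python sketch of a toy two-state MDP and the discounted return it defines; the state names, transition probabilities, rewards, and the example episode are all invented for illustration.

```python
# A toy MDP written out as the five-tuple (S, A, P, R, gamma).
# All numbers here are invented purely for illustration.

S = ["s0", "s1"]                 # state space
A = ["stay", "move"]             # action space

# P[(s, a)] -> {next_state: probability}
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] -> immediate reward
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): -1.0,
}

gamma = 0.9                      # discount factor in [0, 1]

def discounted_return(rewards, gamma):
    """Cumulative discounted return G = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards collected along one (made-up) episode.
print(discounted_return([1.0, 0.0, 2.0], gamma))  # 1.0 + 0 + 0.9^2 * 2 = 2.62
```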
2. How Reinforcement Learning Works
The core of reinforcement learning is to let the agent learn an optimal policy through interaction with the environment. The basic loop is as follows (a minimal code sketch of this loop appears after the list):
- Initialize: the agent starts in an initial state.
- Select an action: choose an action according to the current policy.
- Execute the action: the agent performs the action; the environment returns a new state and a reward.
- Update the policy: based on the reward and the new state, the agent updates its policy to increase the cumulative reward it can obtain in the future.
- Repeat: repeat the above steps until a goal state is reached or a termination condition is met.
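Below is a minimal, self-contained Python sketch of this interaction loop; the `ToyEnv` corridor environment and the random placeholder policy are invented for illustration, and the policy-update step is only marked with a comment.

```python
import random

class ToyEnv:
    """A made-up corridor environment: move right until reaching position 3."""
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state

    def step(self, action):                  # action: 0 = left, 1 = right
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 3                 # goal / termination condition
        reward = 1.0 if done else -0.1       # small step penalty, goal bonus
        return self.pos, reward, done

def random_policy(state):
    """Placeholder policy: in a real agent this would be learned."""
    return random.choice([0, 1])

env = ToyEnv()
state = env.reset()                          # 1. initialize
total_reward, done = 0.0, False
while not done:                              # 5. repeat until termination
    action = random_policy(state)            # 2. select an action
    next_state, reward, done = env.step(action)  # 3. execute, observe outcome
    # 4. a learning agent would update its policy / value estimates here
    total_reward += reward
    state = next_state
print("episode return:", total_reward)
```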
3. Main Reinforcement Learning Algorithms
Reinforcement learning algorithms can be divided into two broad categories: value-based methods and policy-based methods; actor-critic methods combine the two.
Value-Based Methods
Policy-Based Methods
Actor-Critic Methods (combining value-based and policy-based learning)
| Algorithm category | Representative algorithms | Characteristics |
|---|---|---|
| Value-based | Q-Learning, DQN | Learn a Q-function; actions are chosen by maximizing the Q-value. |
| Policy-based | REINFORCE, PPO | Learn the policy distribution directly; well suited to high-dimensional action spaces. |
| Actor-Critic | A2C, A3C, DDPG, SAC | Learn the policy and a value function simultaneously; converge faster and more stably. |
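As a concrete value-based example, here is a small tabular Q-learning sketch in Python on an invented four-state chain; the environment, hyperparameters, and episode count are arbitrary choices for illustration, not a reference implementation.

```python
import random
from collections import defaultdict

# Tabular Q-learning on a tiny deterministic chain: states 0..3, goal at 3.
# Transition/reward rules and hyperparameters are invented for illustration.
GOAL = 3
ACTIONS = [0, 1]                              # 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1         # learning rate, discount, exploration

def step(state, action):
    next_state = min(GOAL, max(0, state + (1 if action == 1 else -1)))
    done = next_state == GOAL
    reward = 1.0 if done else -0.1
    return next_state, reward, done

Q = defaultdict(float)                        # Q[(state, action)] -> value

def epsilon_greedy(state):
    if random.random() < epsilon:
        return random.choice(ACTIONS)         # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit: argmax_a Q(s, a)

for episode in range(500):
    state, done = 0, False
    while not done:
        action = epsilon_greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL + 1)]
print("greedy action per state:", greedy)     # expected: mostly 1 (move right)
```

After training, the greedy policy read off the Q-table should mostly choose "move right", which is the shortest path to the goal in this toy chain.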
Part I: Tabular Solution Methods
The state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables.
Finite Markov Decision Processes
- Dynamic Programming: well developed mathematically, but requires a complete and accurate model of the environment.
- Monte Carlo Methods: don't require a model and are conceptually simple, but are not well suited for step-by-step incremental computation.
- Temporal-Difference Learning: requires no model and is fully incremental, but is more complex to analyze.
In this part of the book we describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions, that is, they can often find exactly the optimal value function and the optimal policy. This contrasts with the approximate methods described in the next part of the book, which only find approximate solutions, but which in return can be applied effectively to much larger problems.
The first chapter of this part of the book describes solution methods for the special case of the reinforcement learning problem in which there is only a single state, called bandit problems. The second chapter describes the general problem formulation that we treat throughout the rest of the book—finite Markov decision processes—and its main ideas including Bellman equations and value functions.
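To make the single-state (bandit) case concrete, here is a small sketch of epsilon-greedy action selection with incremental sample-average value estimates, assuming an invented 3-armed bandit with Gaussian rewards; the arm means, epsilon, and step count are made up for the example.

```python
import random

# Epsilon-greedy on a made-up 3-armed bandit with Gaussian rewards.
true_means = [0.2, 0.5, 0.8]                  # unknown to the agent
epsilon, steps = 0.1, 2000

Q = [0.0] * len(true_means)                   # estimated value of each arm
N = [0] * len(true_means)                     # pull counts

for t in range(steps):
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))                # explore
    else:
        arm = max(range(len(true_means)), key=lambda a: Q[a])  # exploit
    reward = random.gauss(true_means[arm], 1.0)
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]      # incremental sample-average update

print("estimated arm values:", [round(q, 2) for q in Q])
```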
The next three chapters describe three fundamental classes of methods for solving finite Markov decision problems: dynamic programming, Monte Carlo methods, and temporal-difference learning. Each class of methods has its strengths and weaknesses. Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment. Monte Carlo methods don't require a model and are conceptually simple, but are not well suited for step-by-step incremental computation. Finally, temporal-difference methods require no model and are fully incremental, but are more complex to analyze. The methods also differ in several ways with respect to their efficiency and speed of convergence.
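The following sketch illustrates that contrast for value prediction under a fixed random policy on an invented toy chain: the Monte Carlo update waits for the complete episode return, whereas the TD(0) update is applied online at every step using only the next reward and the current estimate of the next state. All constants here are arbitrary.

```python
import random

# Value prediction for a fixed random policy on a tiny chain (states 0..3, goal 3).
# Environment and constants are invented for illustration.
GOAL, gamma, alpha = 3, 0.9, 0.1

def step(state):
    action = random.choice([-1, 1])           # fixed random policy
    next_state = min(GOAL, max(0, state + action))
    return next_state, (1.0 if next_state == GOAL else -0.1), next_state == GOAL

V_mc, V_td = [0.0] * (GOAL + 1), [0.0] * (GOAL + 1)

for _ in range(2000):
    # Monte Carlo: record the whole episode, then update each state toward its full return G_t.
    state, trajectory, done = 0, [], False
    while not done:
        next_state, reward, done = step(state)
        trajectory.append((state, reward))
        state = next_state
    G = 0.0
    for s, r in reversed(trajectory):
        G = r + gamma * G
        V_mc[s] += alpha * (G - V_mc[s])

    # TD(0): update online at every step using only r + gamma * V(s').
    state, done = 0, False
    while not done:
        next_state, reward, done = step(state)
        target = reward + gamma * (0.0 if done else V_td[next_state])
        V_td[state] += alpha * (target - V_td[state])
        state = next_state

print("MC estimates   :", [round(v, 2) for v in V_mc])
print("TD(0) estimates:", [round(v, 2) for v in V_td])
```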
The remaining two chapters describe how these three classes of methods can be combined to obtain the best features of each of them. In one chapter we describe how the strengths of Monte Carlo methods can be combined with the strengths of temporal-difference methods via multi-step bootstrapping methods. In the final chapter of this part of the book we show how temporal-difference learning methods can be combined with model learning and planning methods (such as dynamic programming) for a complete and unified solution to the tabular reinforcement learning problem.
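As a rough illustration of combining direct temporal-difference learning with model learning and planning, here is a compact Dyna-Q-style sketch in Python; the deterministic toy chain, the learned table model, and the number of planning steps per real step are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Dyna-Q-style sketch: Q-learning from real experience plus planning updates
# replayed from a learned (here deterministic) model. Toy chain, states 0..3.
GOAL, ACTIONS = 3, [0, 1]
alpha, gamma, epsilon, planning_steps = 0.1, 0.9, 0.1, 10

def env_step(state, action):
    next_state = min(GOAL, max(0, state + (1 if action == 1 else -1)))
    return next_state, (1.0 if next_state == GOAL else -0.1), next_state == GOAL

Q = defaultdict(float)
model = {}                                    # (s, a) -> (r, s', done), learned from experience

def q_update(s, a, r, s2, done):
    best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

for episode in range(200):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = env_step(state, action)
        q_update(state, action, reward, next_state, done)       # direct RL (TD update)
        model[(state, action)] = (reward, next_state, done)     # model learning
        for _ in range(planning_steps):                         # planning from the model
            s, a = random.choice(list(model))
            r, s2, d = model[(s, a)]
            q_update(s, a, r, s2, d)
        state = next_state

print("greedy actions:", [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL + 1)])
```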