
Motivating Examples
return: The (discounted) sum of the rewards obtained along a trajectory.

The return can be used to evaluate policies. For the motivating example, compute the return obtained from the starting state under each policy:
- Based on policy 1:
- Based on policy 2:
- Based on policy 3:
In summary, starting from the same state, different policies yield different returns, and the policy with the larger return is better. Calculating the return is therefore important for evaluating a policy.
Calculating the Return

Method 1 (by definition): compute the return from each state directly as the discounted sum of the rewards collected along the trajectory, e.g.
$$v_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$$
Method 2 (bootstrapping): write the return from each state in terms of the returns from the states it transits to, e.g.
$$v_1 = r_1 + \gamma v_2, \qquad v_2 = r_2 + \gamma v_3, \qquad \dots$$
Collecting the equations of Method 2 for all states, they can be written in the following matrix-vector form:
$$\mathbf{v} = \mathbf{r} + \gamma P \mathbf{v},$$
which is the Bellman equation for this simple deterministic case: $\mathbf{v}$ stacks the returns of all states, $\mathbf{r}$ stacks the immediate rewards, and $P$ is the state transition matrix.
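As a concrete illustration, here is a minimal sketch of the two methods on a hypothetical example (the four-state cyclic policy and the reward numbers are made up for illustration, not taken from the notes): Method 1 truncates the infinite discounted sum, while the matrix-vector form is solved exactly as a linear system.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0, 1.0, 1.0])   # hypothetical reward obtained when leaving each state
P_cycle = np.array([[0, 1, 0, 0],          # hypothetical deterministic transitions: s1->s2->s3->s4->s1
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [1, 0, 0, 0]], dtype=float)

# Method 1: compute each return by definition (truncated discounted sum of rewards).
def return_by_definition(start, horizon=500):
    g, s, discount = 0.0, start, 1.0
    for _ in range(horizon):
        g += discount * rewards[s]
        s = int(np.argmax(P_cycle[s]))     # deterministic next state
        discount *= gamma
    return g

# Method 2 / matrix-vector form: solve v = rewards + gamma * P v, i.e. (I - gamma*P) v = rewards.
v = np.linalg.solve(np.eye(4) - gamma * P_cycle, rewards)

print([round(return_by_definition(i), 6) for i in range(4)])  # approximately equal to v
print(np.round(v, 6))
```

The advantage of Method 2 is that the returns of all states are obtained at once from one linear system, which is exactly the structure the Bellman equation generalizes.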
State Value
Single-step process
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$
- $t, t+1$: discrete time instances
- $S_t$: the state at time $t$
- $A_t$: the action taken at state $S_t$
- $R_{t+1}$: the reward obtained after taking $A_t$ (sometimes also written as $R_t$, but conventionally written as $R_{t+1}$)
- $S_{t+1}$: the state transited to after taking $A_t$
Note that $S_t$, $A_t$, $R_{t+1}$, $S_{t+1}$ are all random variables. (Being random variables means that operations such as the expectation can be applied to them.)
This step is governed by the following probability distributions:
- $S_t \to A_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$
- $S_t, A_t \to R_{t+1}$ is governed by the reward model $p(R_{t+1} = r \mid S_t = s, A_t = a)$
- $S_t, A_t \to S_{t+1}$ is governed by the state transition model $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
At this moment, we assume we know the model (i.e., the probability distributions)!
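As a minimal sketch of what "knowing the model" means computationally, the following represents a small finite MDP with hypothetical tabular arrays `pi`, `P`, and `R` (all names and numbers are made up for illustration) and samples one step from it; the reward is reduced to its expected value $\mathbb{E}[r \mid s, a]$ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular model of a tiny MDP.
n_states, n_actions = 3, 2
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # pi[s, a]    = pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'] = p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                         # R[s, a]     = E[r | s, a]

def sample_step(s):
    """Sample one step S_t --A_t--> R_{t+1}, S_{t+1} from the known model."""
    a = rng.choice(n_actions, p=pi[s])        # A_t     ~ pi(. | s)
    r = R[s, a]                               # reward taken as its mean for simplicity
    s_next = rng.choice(n_states, p=P[s, a])  # S_{t+1} ~ p(. | s, a)
    return a, r, s_next
```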
Multi-step trajectory
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The discounted return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
- $\gamma \in [0, 1)$ is a discount rate.
- $G_t$ is also a random variable since $R_{t+1}, R_{t+2}, \dots$ are random variables.
Definition
The expectation (also called the expected value or mean) of $G_t$ is defined as the state-value function or simply the state value:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$
Remarks:
- It is a function of $s$: it is a conditional expectation with the condition that the state starts from $s$.
- It is based on the policy $\pi$: for a different policy, the state value may be different.
- It represents the "value" of a state. A larger state value means the state is more valuable, because a greater return can be obtained starting from it; correspondingly, a policy that achieves larger state values is a better policy.
The relationship between return and state value
The return is defined for a single trajectory, whereas the state value is the mean of the returns over all possible trajectories. If, starting from a state $s$, multiple different returns can be obtained, then the return and the state value differ; when the policy and the model are both deterministic, they coincide.
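To make the distinction concrete, here is a minimal sketch (building on the hypothetical `sample_step` model above) that estimates the state value by averaging many sampled returns; a single call to `sample_return` gives one return, and the mean approximates $v_\pi(s)$.

```python
def sample_return(s, gamma=0.9, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s, following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):       # truncate the infinite sum; gamma**horizon is negligible
        _, r, s = sample_step(s)
        g += discount * r
        discount *= gamma
    return g

# One return is a random sample; the state value is their expectation.
v_hat = np.mean([sample_return(s=0) for _ in range(1000)])
```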
Bellman Equation: Derivation
The Bellman equation describes the relationship between the state values of different states.
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The return $G_t$ can be written as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) = R_{t+1} + \gamma G_{t+1}$$
Then, it follows from the definition of the state value that
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$$
First, calculate the first term, $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$
This is the mean of the immediate rewards.
Second, calculate the second term, $\mathbb{E}[G_{t+1} \mid S_t = s]$:
$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$$
- This is the mean of the future rewards.
- $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.
Bellman Equation
Substituting the two terms back gives the Bellman equation: for every $s \in \mathcal{S}$,
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$
Highlights:
- The Bellman equation describes the relationship between the state values of different states.
- It consists of two parts: the immediate-reward term and the future-reward term.
- It is a set of equations, one for every state, not a single equation.
- $v_\pi(s)$ and $v_\pi(s')$ are computed by bootstrapping, as sketched below.
- $\pi(a \mid s)$ is a given policy. The Bellman equation depends on the policy, so solving the Bellman equation is called policy evaluation.
- $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model. Here we assume this model is known; methods for the unknown-model case come later.
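As a sketch of what the elementwise equation computes, the following evaluates its right-hand side for every state, given a value estimate `v` and the hypothetical tabular arrays `pi`, `P`, `R` defined earlier; the true $v_\pi$ is exactly the fixed point of this map.

```python
def bellman_rhs(v, pi, P, R, gamma=0.9):
    """Right-hand side of the elementwise Bellman equation for every state."""
    n_states, n_actions = pi.shape
    rhs = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            immediate = R[s, a]            # sum_r p(r|s,a) r  (expected immediate reward)
            future = gamma * P[s, a] @ v   # gamma * sum_{s'} p(s'|s,a) v(s')
            rhs[s] += pi[s, a] * (immediate + future)
    return rhs

# v_pi satisfies bellman_rhs(v_pi, pi, P, R) == v_pi (up to numerical precision).
```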
Bellman equation: Matrix-vector form
For the Bellman equation
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right],$$
rewrite it as:
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s'),$$
where
$$r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s, a).$$
Denote the states as $s_1, \dots, s_n$. For state $s_i$ we have
$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{j} p_\pi(s_j \mid s_i)\, v_\pi(s_j).$$
Putting these $n$ equations together and rewriting them in matrix-vector form gives
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where
$$v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^T \in \mathbb{R}^n, \qquad r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^T \in \mathbb{R}^n,$$
and $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$, is the state transition matrix.
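A minimal sketch of building $r_\pi$ and $P_\pi$ from the hypothetical tabular arrays `pi`, `P`, `R` introduced earlier:

```python
def policy_quantities(pi, P, R):
    """Build r_pi and P_pi for the matrix-vector Bellman equation v = r_pi + gamma * P_pi @ v."""
    r_pi = np.sum(pi * R, axis=1)            # r_pi[s]     = sum_a pi(a|s) * E[r|s,a]
    P_pi = np.einsum('sa,sat->st', pi, P)    # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
    return r_pi, P_pi
```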
Bellman equation: Solve the state values
- Why solve the Bellman equation? For a given policy, only after finding the corresponding state values can we carry out policy evaluation, i.e., assess how good the policy is; the result can then be used to improve the policy.
The closed-form solution
Since $v_\pi = r_\pi + \gamma P_\pi v_\pi$, the closed-form solution is
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
In practice we do not use this method: when the state space is large, the matrix dimension is large and computing the inverse is expensive.
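For small problems the closed form can still be computed; here is a minimal sketch that solves the linear system rather than forming the inverse, which is cheaper and numerically more stable (but still impractical for very large state spaces).

```python
def solve_state_values_closed_form(r_pi, P_pi, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r_pi directly."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```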
An iterative solution
$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \qquad k = 0, 1, 2, \dots$$
Starting from an arbitrary initial guess $v_0$: substituting $v_0$ gives $v_1$, substituting $v_1$ gives $v_2$, and so on. It can be shown that
$$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi \quad \text{as } k \to \infty.$$
Proof: define the error $\delta_k = v_k - v_\pi$. Subtracting $v_\pi = r_\pi + \gamma P_\pi v_\pi$ from $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives
$$\delta_{k+1} = \gamma P_\pi \delta_k = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$
Since every entry of the stochastic matrix $P_\pi^{k+1}$ lies in $[0, 1]$ and $\gamma < 1$, we have $\gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$, and hence $v_k \to v_\pi$ as $k \to \infty$.
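A minimal sketch of the iteration, reusing the hypothetical `policy_quantities` helper above:

```python
def solve_state_values_iterative(r_pi, P_pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update is negligible."""
    v = np.zeros_like(r_pi)                  # arbitrary initial guess v_0
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

# Usage with the hypothetical model above:
# r_pi, P_pi = policy_quantities(pi, P, R)
# solve_state_values_iterative(r_pi, P_pi) agrees with solve_state_values_closed_form(r_pi, P_pi)
```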
Action Value
- State value: the average return the agent can get starting from a state
- Action value: the average return the agent can get starting from a state and taking an action
Definition
$$q_\pi(s, a) \triangleq \mathbb{E}[G_t \mid S_t = s, A_t = a]$$
- $q_\pi(s, a)$ is a function of the state-action pair $(s, a)$.
- It depends on the policy $\pi$.
Hence, by the law of total expectation,
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a), \tag{1}$$
and comparing (1) with the Bellman equation,
$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s'). \tag{2}$$
- (1) shows how to obtain state values from action values
- (2) shows how to obtain action values from state values
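A minimal sketch of the two directions, again using the hypothetical tabular arrays `pi`, `P`, `R` from earlier:

```python
def q_from_v(v, P, R, gamma=0.9):
    """Equation (2): q(s,a) = E[r|s,a] + gamma * sum_{s'} p(s'|s,a) v(s')."""
    return R + gamma * np.einsum('sat,t->sa', P, v)

def v_from_q(q, pi):
    """Equation (1): v(s) = sum_a pi(a|s) q(s,a)."""
    return np.sum(pi * q, axis=1)
```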
- We can first compute all the state values and then obtain the action values from them via (2);
- or we can estimate the action values directly, e.g., from data, without computing the state values first; in that case we no longer rely on the model (see the sketch below).
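As a rough sketch of the data-driven idea (reusing `sample_return` and the hypothetical model above as a stand-in for an environment), an action value can be estimated by averaging sampled returns that start with the chosen action; a genuinely model-free method would obtain these samples from interaction rather than from `P` and `R`.

```python
def sample_return_after(s, a, gamma=0.9, horizon=200):
    """Sample a discounted return that starts by taking action a in state s, then follows pi."""
    r = R[s, a]                                  # first reward
    s_next = rng.choice(n_states, p=P[s, a])     # first transition
    return r + gamma * sample_return(s_next, gamma=gamma, horizon=horizon)

# Monte Carlo estimate of the action value q_pi(s=0, a=1).
q_hat = np.mean([sample_return_after(s=0, a=1) for _ in range(1000)])
```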






