
Motivating Examples
return: The (discounted) sum of the rewards obtained along a trajectory.

The return can be used to evaluate policies. For the motivating example, compute the return obtained from the starting state under each policy:
- Based on policy 1:
- Based on policy 2:
- Based on policy 3:
In summary, starting from the same state, different policies yield different returns, and the policy with the larger return is better. Calculating the return is therefore important for evaluating a policy.
Calculating the Return

Method 1 (by definition): compute the return from each state directly as the discounted sum of the rewards collected along the trajectory, e.g.
$$v_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$$
Method 2 (bootstrapping): write the return from each state in terms of the returns from the states it transits to, e.g.
$$v_1 = r_1 + \gamma v_2, \qquad v_2 = r_2 + \gamma v_3, \qquad \dots$$
Collecting the equations of Method 2 for all states, they can be written in the following matrix-vector form:
$$\mathbf{v} = \mathbf{r} + \gamma P \mathbf{v},$$
which is the Bellman equation for this simple deterministic case: $\mathbf{v}$ stacks the returns of all states, $\mathbf{r}$ stacks the immediate rewards, and $P$ is the state transition matrix.
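As a concrete illustration, here is a minimal sketch of the two methods on a hypothetical example (the four-state cyclic policy and the reward numbers are made up for illustration, not taken from the notes): Method 1 truncates the infinite discounted sum, while the matrix-vector form is solved exactly as a linear system.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0, 1.0, 1.0])   # hypothetical reward obtained when leaving each state
P_cycle = np.array([[0, 1, 0, 0],          # hypothetical deterministic transitions: s1->s2->s3->s4->s1
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [1, 0, 0, 0]], dtype=float)

# Method 1: compute each return by definition (truncated discounted sum of rewards).
def return_by_definition(start, horizon=500):
    g, s, discount = 0.0, start, 1.0
    for _ in range(horizon):
        g += discount * rewards[s]
        s = int(np.argmax(P_cycle[s]))     # deterministic next state
        discount *= gamma
    return g

# Method 2 / matrix-vector form: solve v = rewards + gamma * P v, i.e. (I - gamma*P) v = rewards.
v = np.linalg.solve(np.eye(4) - gamma * P_cycle, rewards)

print([round(return_by_definition(i), 6) for i in range(4)])  # approximately equal to v
print(np.round(v, 6))
```

The advantage of Method 2 is that the returns of all states are obtained at once from one linear system, which is exactly the structure the Bellman equation generalizes.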
State Value
Single-step process
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$
- $t, t+1$: discrete time instances
- $S_t$: the state at time $t$
- $A_t$: the action taken at state $S_t$
- $R_{t+1}$: the reward obtained after taking $A_t$ (sometimes also written as $R_t$, but conventionally written as $R_{t+1}$)
- $S_{t+1}$: the state transited to after taking $A_t$
Note that $S_t$, $A_t$, $R_{t+1}$, $S_{t+1}$ are all random variables. (Being random variables means that operations such as the expectation can be applied to them.)
This step is governed by the following probability distributions:
- $S_t \to A_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$
- $S_t, A_t \to R_{t+1}$ is governed by the reward model $p(R_{t+1} = r \mid S_t = s, A_t = a)$
- $S_t, A_t \to S_{t+1}$ is governed by the state transition model $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
At this moment, we assume we know the model (i.e., the probability distributions)!
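As a minimal sketch of what "knowing the model" means computationally, the following represents a small finite MDP with hypothetical tabular arrays `pi`, `P`, and `R` (all names and numbers are made up for illustration) and samples one step from it; the reward is reduced to its expected value $\mathbb{E}[r \mid s, a]$ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular model of a tiny MDP.
n_states, n_actions = 3, 2
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # pi[s, a]    = pi(a | s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'] = p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                         # R[s, a]     = E[r | s, a]

def sample_step(s):
    """Sample one step S_t --A_t--> R_{t+1}, S_{t+1} from the known model."""
    a = rng.choice(n_actions, p=pi[s])        # A_t     ~ pi(. | s)
    r = R[s, a]                               # reward taken as its mean for simplicity
    s_next = rng.choice(n_states, p=P[s, a])  # S_{t+1} ~ p(. | s, a)
    return a, r, s_next
```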
Multi-step trajectory
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The discounted return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
- $\gamma \in [0, 1)$ is a discount rate.
- $G_t$ is also a random variable since $R_{t+1}, R_{t+2}, \dots$ are random variables.
Definition
The expectation (also called the expected value or mean) of $G_t$ is defined as the state-value function or simply the state value:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$
Remarks:
- It is a function of $s$: it is a conditional expectation with the condition that the state starts from $s$.
- It is based on the policy $\pi$: for a different policy, the state value may be different.
- It represents the "value" of a state. A larger state value means the state is more valuable, because a greater return can be obtained starting from it; correspondingly, a policy that achieves larger state values is a better policy.
The relationship between return and state value
The return is defined for a single trajectory, whereas the state value is the mean of the returns over all possible trajectories. If, starting from a state $s$, multiple different returns can be obtained, then the return and the state value differ; when the policy and the model are both deterministic, they coincide.
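To make the distinction concrete, here is a minimal sketch (building on the hypothetical `sample_step` model above) that estimates the state value by averaging many sampled returns; a single call to `sample_return` gives one return, and the mean approximates $v_\pi(s)$.

```python
def sample_return(s, gamma=0.9, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s, following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):       # truncate the infinite sum; gamma**horizon is negligible
        _, r, s = sample_step(s)
        g += discount * r
        discount *= gamma
    return g

# One return is a random sample; the state value is their expectation.
v_hat = np.mean([sample_return(s=0) for _ in range(1000)])
```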
Bellman Equation: Derivation
The Bellman equation describes the relationship between the state values of different states.
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The return $G_t$ can be written as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) = R_{t+1} + \gamma G_{t+1}$$
Then, it follows from the definition of the state value that
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$$
First, calculate the first term, $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$
This is the mean of the immediate rewards.
Second, calculate the second term, $\mathbb{E}[G_{t+1} \mid S_t = s]$:
$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$$
- This is the mean of the future rewards.
- $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.
Bellman Equation
Substituting the two terms back gives the Bellman equation: for every $s \in \mathcal{S}$,
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$
Highlights:
- The Bellman equation describes the relationship between the state values of different states.
- It consists of two parts: the immediate-reward term and the future-reward term.
- It is a set of equations, one for every state, not a single equation.
- $v_\pi(s)$ and $v_\pi(s')$ are computed by bootstrapping, as sketched below.
- $\pi(a \mid s)$ is a given policy. The Bellman equation depends on the policy, so solving the Bellman equation is called policy evaluation.
- $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model. Here we assume this model is known; methods for the unknown-model case come later.
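As a sketch of what the elementwise equation computes, the following evaluates its right-hand side for every state, given a value estimate `v` and the hypothetical tabular arrays `pi`, `P`, `R` defined earlier; the true $v_\pi$ is exactly the fixed point of this map.

```python
def bellman_rhs(v, pi, P, R, gamma=0.9):
    """Right-hand side of the elementwise Bellman equation for every state."""
    n_states, n_actions = pi.shape
    rhs = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            immediate = R[s, a]            # sum_r p(r|s,a) r  (expected immediate reward)
            future = gamma * P[s, a] @ v   # gamma * sum_{s'} p(s'|s,a) v(s')
            rhs[s] += pi[s, a] * (immediate + future)
    return rhs

# v_pi satisfies bellman_rhs(v_pi, pi, P, R) == v_pi (up to numerical precision).
```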
Bellman equation: Matrix-vector form
For the Bellman equation
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right],$$
rewrite it as:
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s'),$$
where
$$r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s, a).$$
Denote the states as $s_1, \dots, s_n$. For state $s_i$ we have
$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{j} p_\pi(s_j \mid s_i)\, v_\pi(s_j).$$
Putting these $n$ equations together and rewriting them in matrix-vector form gives
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where
$$v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^T \in \mathbb{R}^n, \qquad r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^T \in \mathbb{R}^n,$$
and $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$, is the state transition matrix.
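A minimal sketch of building $r_\pi$ and $P_\pi$ from the hypothetical tabular arrays `pi`, `P`, `R` introduced earlier:

```python
def policy_quantities(pi, P, R):
    """Build r_pi and P_pi for the matrix-vector Bellman equation v = r_pi + gamma * P_pi @ v."""
    r_pi = np.sum(pi * R, axis=1)            # r_pi[s]     = sum_a pi(a|s) * E[r|s,a]
    P_pi = np.einsum('sa,sat->st', pi, P)    # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
    return r_pi, P_pi
```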
Bellman equation: Solve the state values
- Why solve the Bellman equation? For a given policy, only after finding the corresponding state values can we carry out policy evaluation, i.e., assess how good the policy is; the result can then be used to improve the policy.
The closed-form solution
Since $v_\pi = r_\pi + \gamma P_\pi v_\pi$, the closed-form solution is
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
In practice we do not use this method: when the state space is large, the matrix dimension is large and computing the inverse is expensive.
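For small problems the closed form can still be computed; here is a minimal sketch that solves the linear system rather than forming the inverse, which is cheaper and numerically more stable (but still impractical for very large state spaces).

```python
def solve_state_values_closed_form(r_pi, P_pi, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r_pi directly."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```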
An iterative solution
$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \qquad k = 0, 1, 2, \dots$$
Starting from an arbitrary initial guess $v_0$: substituting $v_0$ gives $v_1$, substituting $v_1$ gives $v_2$, and so on. It can be shown that
$$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi \quad \text{as } k \to \infty.$$
Proof: define the error $\delta_k = v_k - v_\pi$. Subtracting $v_\pi = r_\pi + \gamma P_\pi v_\pi$ from $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives
$$\delta_{k+1} = \gamma P_\pi \delta_k = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$
Since every entry of the stochastic matrix $P_\pi^{k+1}$ lies in $[0, 1]$ and $\gamma < 1$, we have $\gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$, and hence $v_k \to v_\pi$ as $k \to \infty$.
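A minimal sketch of the iteration, reusing the hypothetical `policy_quantities` helper above:

```python
def solve_state_values_iterative(r_pi, P_pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update is negligible."""
    v = np.zeros_like(r_pi)                  # arbitrary initial guess v_0
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

# Usage with the hypothetical model above:
# r_pi, P_pi = policy_quantities(pi, P, R)
# solve_state_values_iterative(r_pi, P_pi) agrees with solve_state_values_closed_form(r_pi, P_pi)
```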
Action Value
- State value: the average return the agent can get starting from a state
- Action value: the average return the agent can get starting from a state and taking an action
Definition
$$q_\pi(s, a) \triangleq \mathbb{E}[G_t \mid S_t = s, A_t = a]$$
- $q_\pi(s, a)$ is a function of the state-action pair $(s, a)$.
- It depends on the policy $\pi$.
Hence, by the law of total expectation,
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a), \tag{1}$$
and comparing (1) with the Bellman equation,
$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s'). \tag{2}$$
- (1) shows how to obtain state values from action values
- (2) shows how to obtain action values from state values
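A minimal sketch of the two directions, again using the hypothetical tabular arrays `pi`, `P`, `R` from earlier:

```python
def q_from_v(v, P, R, gamma=0.9):
    """Equation (2): q(s,a) = E[r|s,a] + gamma * sum_{s'} p(s'|s,a) v(s')."""
    return R + gamma * np.einsum('sat,t->sa', P, v)

def v_from_q(q, pi):
    """Equation (1): v(s) = sum_a pi(a|s) q(s,a)."""
    return np.sum(pi * q, axis=1)
```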
- We can first compute all the state values and then obtain the action values from them via (2);
- or we can estimate the action values directly, e.g., from data, without computing the state values first; in that case we no longer rely on the model (see the sketch below).
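As a rough sketch of the data-driven idea (reusing `sample_return` and the hypothetical model above as a stand-in for an environment), an action value can be estimated by averaging sampled returns that start with the chosen action; a genuinely model-free method would obtain these samples from interaction rather than from `P` and `R`.

```python
def sample_return_after(s, a, gamma=0.9, horizon=200):
    """Sample a discounted return that starts by taking action a in state s, then follows pi."""
    r = R[s, a]                                  # first reward
    s_next = rng.choice(n_states, p=P[s, a])     # first transition
    return r + gamma * sample_return(s_next, gamma=gamma, horizon=horizon)

# Monte Carlo estimate of the action value q_pi(s=0, a=1).
q_hat = np.mean([sample_return_after(s=0, a=1) for _ in range(1000)])
```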






