
Motivating Examples

return: The (discounted) sum of the rewards obtained along a trajectory.
Return can be used to evaluate policies. Consider an example in which three different policies start from the same state:
• Based on policy 1: compute the return of the trajectory it generates.
• Based on policy 2: likewise, compute the return of its trajectory.
• Based on policy 3: likewise.
In summary, starting from the same state, different policies give different returns, and comparing these returns tells us which policy is better.
Calculating the return is therefore important for evaluating a policy.

Calculating the Return

Method 1 (by definition): compute each return directly as the discounted sum of the rewards collected along its trajectory, e.g. $v_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$.
Method 2 (bootstrapping): the return obtained starting from one state relies on the returns obtained starting from the other states, e.g. $v_1 = r_1 + \gamma v_2$, $v_2 = r_2 + \gamma v_3$, and so on.
Write the equations of Method 2 in the following matrix-vector form, collecting the returns into $\mathbf{v}$, the immediate rewards into $\mathbf{r}$, and the state transitions into $P$:
$\mathbf{v} = \mathbf{r} + \gamma P \mathbf{v}$
which can be rewritten as the Bellman equation for this example and solved as $\mathbf{v} = (I - \gamma P)^{-1}\mathbf{r}$ (a numerical sketch follows).
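A minimal numerical sketch of the two methods, assuming a hypothetical four-state loop $s_1 \to s_2 \to s_3 \to s_4 \to s_1$ with made-up rewards and $\gamma = 0.9$ (the numbers in the original figure may differ):

```python
import numpy as np

# Hypothetical example: four states in a loop s1 -> s2 -> s3 -> s4 -> s1,
# with assumed immediate rewards and discount rate gamma.
gamma = 0.9
r = np.array([0.0, 1.0, 1.0, 1.0])        # reward collected when leaving each state
P = np.roll(np.eye(4), shift=1, axis=1)   # deterministic cyclic transitions s_i -> s_{i+1}

# Method 1: return by definition, v_i = r_i + gamma*r_{i+1} + gamma^2*r_{i+2} + ...
def return_by_definition(start, horizon=1000):
    g, s = 0.0, start
    for k in range(horizon):              # truncate the infinite sum
        g += gamma**k * r[s]
        s = (s + 1) % 4                   # next state in the loop
    return g

# Method 2: bootstrapping, v = r + gamma * P v  =>  v = (I - gamma*P)^{-1} r
v_bootstrap = np.linalg.solve(np.eye(4) - gamma * P, r)

print([return_by_definition(s) for s in range(4)])
print(v_bootstrap)                        # the two methods agree
```

Both print the same values, which is the point of the bootstrapping view: the returns can be obtained by solving a linear system instead of summing an infinite series.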

State Value

single-step process

• $t, t+1$: discrete time instances
• $S_t$: the state at time $t$
• $A_t$: the action taken at state $S_t$
• $R_{t+1}$: the reward obtained after taking $A_t$ (sometimes also written as $R_t$, but by convention it is usually written as $R_{t+1}$)
• $S_{t+1}$: the state transited to after taking $A_t$
Note that $S_t$, $A_t$, $R_{t+1}$ are all random variables (so operations such as taking the expectation can be applied to them).
This step is governed by the following probability distributions:
• $S_t \to A_t$ is governed by $\pi(A_t = a \mid S_t = s)$
• $S_t, A_t \to R_{t+1}$ is governed by $p(R_{t+1} = r \mid S_t = s, A_t = a)$
• $S_t, A_t \to S_{t+1}$ is governed by $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
At this moment, we assume we know the model (i.e., the probability distributions)!

multi-step trajectory

Following a policy, the agent generates a multi-step trajectory $S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$
The discounted return is
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
• $\gamma \in (0, 1)$ is a discount rate.
• $G_t$ is also a random variable since $R_{t+1}, R_{t+2}, \dots$ are random variables.

Definition

The expectation (also called the expected value or mean) of $G_t$ is defined as the state-value function, or simply the state value:
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
Remarks:
• It is a function of $s$: it is a conditional expectation with the condition that the state starts from $s$.
• It is based on the policy $\pi$. For a different policy, the state value may be different.
• It represents the "value" of a state, not merely a number: when the state value is larger, the state is more valuable, because a greater return can be obtained starting from it; accordingly, a policy that achieves greater state values (greater cumulative rewards) is a better policy.

The relationship between return and state value

The return is defined for a single trajectory, whereas the state value is the average over the returns of many trajectories. If multiple different returns can be obtained starting from a state $s$ (because the policy or the model is stochastic), then the return and the state value differ; if everything is deterministic, they coincide.
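A minimal Monte Carlo sketch of this relationship, using a hypothetical two-state model (all names and numbers are made up): the state value of $s_1$ is estimated by averaging the returns of many sampled trajectories.

```python
import random

gamma = 0.9

# Hypothetical model for illustration only.
pi = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 1.0, "a2": 0.0}}          # pi(a|s)
reward = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0, ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}
next_state = {("s1", "a1"): "s1", ("s1", "a2"): "s2", ("s2", "a1"): "s2", ("s2", "a2"): "s1"}

def sample_return(s, horizon=200):
    """Return of ONE trajectory starting from s (truncated discounted sum)."""
    g = 0.0
    for k in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        g += gamma**k * reward[(s, a)]
        s = next_state[(s, a)]
    return g

# The state value is the expectation of the return: estimate it by averaging
# the returns of many sampled trajectories (they differ because pi is stochastic).
returns = [sample_return("s1") for _ in range(5000)]
print("one return:         ", returns[0])
print("estimated v_pi(s1): ", sum(returns) / len(returns))
```

Each individual return varies from run to run; their average approaches $v_\pi(s_1)$.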

Bellman Equation: Derivation

The Bellman equation describes the relationship between the state values of different states. Consider a random trajectory
$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$
The return can be written as
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$
Then, it follows from the definition of the state value that
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$
First, calculate the first term:
$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$
This is the mean of the immediate rewards.
Second, calculate the second term:
$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$
• This is the mean of the future rewards.
• $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.

Bellman Equation

Putting the two terms together yields the Bellman equation:
$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}$

Highlights:

1. The Bellman equation describes the relationship between the state values of different states.
2. It consists of two parts: the immediate-reward term and the future-reward term.
3. The Bellman equation is a set of equations, one for every state, not a single equation.
4. The state values $v_\pi(s)$ and $v_\pi(s')$ are computed by bootstrapping: the value of one state relies on the values of other states.
5. $\pi(a \mid s)$ is a given policy. The Bellman equation depends on the policy, so solving the Bellman equation for the state values is called policy evaluation.
6. $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model. Here we assume this model is known; methods for the case where it is unknown come later.

Bellman equation: Matrix-vector form

Start from the Bellman equation
$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$
and rewrite it as
$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s')$
where
$r_\pi(s) = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$
Denote the states as $s_1, \dots, s_n$.
For state $s_i$, the Bellman equation is $v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{j} p_\pi(s_j \mid s_i)\, v_\pi(s_j)$.
Putting these equations together for all states and rewriting them in matrix-vector form gives
$v_\pi = r_\pi + \gamma P_\pi v_\pi$
where
$v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^T$, $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^T$, and $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$, is the state transition matrix.
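Written out for $n = 4$ states, the matrix-vector form looks as follows (a generic illustration of the structure of $P_\pi$, consistent with $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$):
$$
\begin{bmatrix} v_\pi(s_1)\\ v_\pi(s_2)\\ v_\pi(s_3)\\ v_\pi(s_4) \end{bmatrix}
=
\begin{bmatrix} r_\pi(s_1)\\ r_\pi(s_2)\\ r_\pi(s_3)\\ r_\pi(s_4) \end{bmatrix}
+ \gamma
\begin{bmatrix}
p_\pi(s_1\mid s_1) & p_\pi(s_2\mid s_1) & p_\pi(s_3\mid s_1) & p_\pi(s_4\mid s_1)\\
p_\pi(s_1\mid s_2) & p_\pi(s_2\mid s_2) & p_\pi(s_3\mid s_2) & p_\pi(s_4\mid s_2)\\
p_\pi(s_1\mid s_3) & p_\pi(s_2\mid s_3) & p_\pi(s_3\mid s_3) & p_\pi(s_4\mid s_3)\\
p_\pi(s_1\mid s_4) & p_\pi(s_2\mid s_4) & p_\pi(s_3\mid s_4) & p_\pi(s_4\mid s_4)
\end{bmatrix}
\begin{bmatrix} v_\pi(s_1)\\ v_\pi(s_2)\\ v_\pi(s_3)\\ v_\pi(s_4) \end{bmatrix}
$$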

Example

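As a small stand-in example (hypothetical numbers): consider two states with a deterministic policy that moves from $s_1$ to $s_2$ with reward $0$ and stays at $s_2$ with reward $1$, and take $\gamma = 0.9$. The Bellman equations are
$$v_\pi(s_1) = 0 + \gamma\, v_\pi(s_2), \qquad v_\pi(s_2) = 1 + \gamma\, v_\pi(s_2)$$
Solving the second equation gives $v_\pi(s_2) = \frac{1}{1-\gamma} = 10$, and substituting into the first gives $v_\pi(s_1) = \gamma\, v_\pi(s_2) = 9$.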

Bellman equation: Solve the state values

• Why solve the Bellman equation? For a given policy, only after finding the corresponding state values can we perform policy evaluation, i.e., assess how good the policy is; only based on that result can the policy then be improved.

The Closed-form solution

Since the matrix-vector form $v_\pi = r_\pi + \gamma P_\pi v_\pi$ is linear in $v_\pi$, the closed-form solution is
$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
In practice we do not actually use this method: when the state space is large, the matrix is high-dimensional and computing its inverse is expensive.

An iterative solution

Instead, consider the iteration
$v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \dots$
starting from an arbitrary initial guess $v_0$. Substituting $v_0$ gives $v_1$, substituting $v_1$ gives $v_2$, and so on; it can be shown that
$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ as $k \to \infty$.
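A minimal numerical sketch comparing the two solutions on a hypothetical randomly generated model (the values are illustrative only):

```python
import numpy as np

n, gamma = 5, 0.9
rng = np.random.default_rng(0)

# Hypothetical model under a fixed policy: reward vector r_pi and row-stochastic P_pi.
r_pi = rng.uniform(-1, 1, size=n)
P_pi = rng.uniform(size=(n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)      # each row sums to 1

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k, from an arbitrary v_0
v = np.zeros(n)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(np.max(np.abs(v - v_closed)))          # ~0: the two methods agree
```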

proof

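A sketch of the standard convergence argument: define the error $\delta_k = v_k - v_\pi$. Subtracting $v_\pi = r_\pi + \gamma P_\pi v_\pi$ from $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives $\delta_{k+1} = \gamma P_\pi \delta_k$, and hence $\delta_k = \gamma^k P_\pi^k \delta_0$. Every entry of $P_\pi^k$ lies in $[0, 1]$ because $P_\pi^k$ is again a stochastic matrix, so each component of $P_\pi^k \delta_0$ is bounded by $\max_j |[\delta_0]_j|$; since $0 < \gamma < 1$, it follows that $\gamma^k P_\pi^k \delta_0 \to 0$, i.e., $v_k \to v_\pi$ as $k \to \infty$.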

Action Value

• State value: the average return the agent can get starting from a state.
• Action value: the average return the agent can get starting from a state and taking an action.

Definition

The action value of a state-action pair $(s, a)$ is defined as
$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
• It is a function of the state-action pair $(s, a)$.
• It depends on the policy $\pi$.
Hence, by the law of total expectation,
$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) \quad (1)$
and, comparing (1) with the Bellman equation,
$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \quad (2)$
• (1) shows how to obtain state values from action values.
• (2) shows how to obtain action values from state values.
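As a quick consistency check, substituting (2) into (1) recovers the Bellman equation for $v_\pi$:
$$v_\pi(s) = \sum_a \pi(a \mid s) \Big[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big]$$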

Example

• One can first compute all the state values and then compute the action values from them (see the sketch below).
• Alternatively, one can compute the action values directly without computing the state values, for example from data; in that case the method no longer relies on the model.
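A minimal sketch of the first approach, assuming a hypothetical two-state model (all names, rewards, and probabilities are made up): first solve the Bellman equation for $v_\pi$, then obtain $q_\pi$ from equation (2).

```python
import numpy as np

gamma = 0.9
states, actions = ["s1", "s2"], ["a1", "a2"]

# Hypothetical model: expected reward r(s,a) and transition probabilities p(s'|s,a).
r = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0, ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}
p = {("s1", "a1"): {"s1": 1.0}, ("s1", "a2"): {"s2": 1.0},
     ("s2", "a1"): {"s2": 1.0}, ("s2", "a2"): {"s1": 1.0}}
pi = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 1.0, "a2": 0.0}}   # given policy

# Step 1: solve v_pi = r_pi + gamma * P_pi v_pi for the state values.
idx = {s: i for i, s in enumerate(states)}
r_pi = np.array([sum(pi[s][a] * r[(s, a)] for a in actions) for s in states])
P_pi = np.zeros((2, 2))
for s in states:
    for a in actions:
        for s2, prob in p[(s, a)].items():
            P_pi[idx[s], idx[s2]] += pi[s][a] * prob
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Step 2: action values from state values via (2):
# q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) v(s').
q = {(s, a): r[(s, a)] + gamma * sum(prob * v[idx[s2]] for s2, prob in p[(s, a)].items())
     for s in states for a in actions}
print(dict(zip(states, v)))
print(q)
```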
         