03. Bellman Equation and Optimality

See also: Markov Decision Process (MDP)

Bellman Equation for MRP

For an MRP $\langle S, P, R, γ \rangle$, the Bellman equation for the value function $V (s_{t})$ can be written as follows.

$$
\begin{aligned}
V (s) &= \mathbb{E} [G_{t} \mid s_{t} = s] \\
&= \mathbb{E} [R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots \mid s_{t} = s] \\
&= \mathbb{E} \left[ R_{t + 1} + γ \sum_{k = 0}^{\infty} γ^{k} R_{t + k + 2} \mid s_{t} = s \right] \\
&= \mathbb{E} [R_{t + 1} + γ V (S_{t + 1}) \mid s_{t} = s] \\
&= \sum_{s_{t + 1}} p (s_{t + 1} \mid s_{t}) \left[ R (s_{t}, s_{t + 1}) + γ V (s_{t + 1}) \right]
\end{aligned}
$$

That is, the value function at the current state $s_{t}$ can be expressed in terms of the value function at the next state $s_{t + 1}$ (one-step look-ahead).

$$
V (s) = R_{s} + γ \sum_{s^{'} \in S} P_{s s^{'}} V (s^{'})
$$

Bellman Equation in a Matrix Form

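$$
V = R + γ P V
$$

$$
\begin{bmatrix} V (s_{1}) \\ \vdots \\ V (s_{n}) \end{bmatrix}
=
\begin{bmatrix} R_{1} \\ \vdots \\ R_{n} \end{bmatrix}
+ γ
\begin{bmatrix} P_{11} & \dots & P_{1 n} \\ \vdots & \ddots & \vdots \\ P_{n 1} & \dots & P_{nn} \end{bmatrix}
\begin{bmatrix} V (s_{1}) \\ \vdots \\ V (s_{n}) \end{bmatrix}
$$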

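The equation above is linear in $V$. Rearranging it as $(I - γ P) V = R$ gives the solution directly (for $γ < 1$ the matrix $I - γ P$ is invertible).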

$$
V = (I - γ P)^{- 1} R
$$

In general, computing the inverse matrix explicitly costs $O (n^{3})$ for $n$ states, which is too expensive for large state spaces, so other methods such as dynamic programming (DP) are used instead.
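To make the trade-off concrete, here is a minimal sketch that solves a small MRP both ways: directly from $(I - γ P) V = R$, and by repeatedly applying the Bellman backup $V \leftarrow R + γ P V$. The 3-state transition matrix, the rewards, and the function names are assumptions made up for illustration. Note that `np.linalg.solve` avoids forming the inverse but is still cubic in the number of states.

```python
import numpy as np

# Hypothetical 3-state MRP; every number here is made up for illustration.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
])
R = np.array([1.0, 2.0, 0.0])  # expected immediate reward R_s for each state
gamma = 0.9


def solve_mrp_direct(P, R, gamma):
    """Solve the linear system (I - gamma * P) V = R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)


def solve_mrp_iterative(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Apply V <- R + gamma * P V until the update is smaller than tol."""
    V = np.zeros_like(R)
    for _ in range(max_iters):
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_direct = solve_mrp_direct(P, R, gamma)
V_iter = solve_mrp_iterative(P, R, gamma)
print(V_direct)
print(V_iter)
```

Each backup costs one matrix-vector product, which is why iterative schemes scale better than a direct solve when $n$ is large.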

Bellman Expectation Equation for MDP

In the same way, for an MDP we obtain the Bellman expectation equations for the state-value function $V_{π} (s)$ and the action-value function $Q_{π} (s, a)$.

$$
V_{π} (s) = \mathbb{E}_{π} [R_{t + 1} + γ V_{π} (S_{t + 1}) \mid S_{t} = s]
$$

$$
Q_{π} (s, a) = \mathbb{E}_{π} [R_{t + 1} + γ Q_{π} (S_{t + 1}, A_{t + 1}) \mid S_{t} = s, A_{t} = a]
$$

Bellman Equation for $V_{π}$ and $Q_{π}$

The Bellman expectation equation can also be written as a relation between $V_{π}$ and $Q_{π}$.

$$
V_{π} (s) = \sum_{a \in A} π (a \mid s) Q_{π} (s, a)
$$

$$
Q_{π} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} V_{π} (s^{'})
$$

Substituting each of these into the other gives the following equations.

$$
V_{π} (s) = \sum_{a \in A} π (a \mid s) \left( R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} V_{π} (s^{'}) \right)
$$

$$
Q_{π} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} \left( \sum_{a^{'} \in A} π (a^{'} \mid s^{'}) Q_{π} (s^{'}, a^{'}) \right)
$$
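The two relations between $V_{π}$ and $Q_{π}$ translate almost line for line into code. The sketch below assumes a hypothetical 2-state, 2-action MDP stored as arrays `P[a, s, s']`, `R[s, a]`, and `pi[s, a]`; the array layout, the numbers, and the helper names are illustrative choices, not something the equations prescribe.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s2] = P^a_{s s'},  R[s, a] = R^a_s,  pi[s, a] = pi(a | s).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],  # transitions under action 0
    [[0.3, 0.7], [0.6, 0.4]],  # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
gamma = 0.9


def q_from_v(V, P, R, gamma):
    """Q(s, a) = R^a_s + gamma * sum_s' P^a_{s s'} V(s')."""
    # (P @ V) is indexed [a, s]; transpose so the result is indexed [s, a].
    return R + gamma * (P @ V).T


def v_from_q(Q, pi):
    """V(s) = sum_a pi(a | s) Q(s, a)."""
    return np.sum(pi * Q, axis=1)


def evaluate_policy(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Alternate the two relations until V stops changing."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        V_new = v_from_q(q_from_v(V, P, R, gamma), pi)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_pi = evaluate_policy(P, R, pi, gamma)
Q_pi = q_from_v(V_pi, P, R, gamma)
print(V_pi)
print(Q_pi)
```

Composing `v_from_q` with `q_from_v` is exactly the substituted form of the equation for $V_{π}$, so the fixed point of that composition is $V_{π}$.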

Bellman Expectation Equation in a Matrix Form

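$$
V_{π} = R^{π} + γ P^{π} V_{π}
$$

$$
\begin{bmatrix} V_{π} (s_{1}) \\ \vdots \\ V_{π} (s_{n}) \end{bmatrix}
=
\begin{bmatrix} R_{1}^{π} \\ \vdots \\ R_{n}^{π} \end{bmatrix}
+ γ
\begin{bmatrix} P_{11}^{π} & \dots & P_{1 n}^{π} \\ \vdots & \ddots & \vdots \\ P_{n 1}^{π} & \dots & P_{nn}^{π} \end{bmatrix}
\begin{bmatrix} V_{π} (s_{1}) \\ \vdots \\ V_{π} (s_{n}) \end{bmatrix}
$$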

As before, this equation can be solved directly.

$$
V_{π} = (I - γ P^{π})^{- 1} R^{π}
$$
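The matrix form can be sketched as follows, assuming the usual policy-averaged quantities $P_{s s^{'}}^{π} = \sum_{a} π (a \mid s) P_{s s^{'}}^{a}$ and $R_{s}^{π} = \sum_{a} π (a \mid s) R_{s}^{a}$, which are not spelled out above. The MDP, the array layout, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s2] = P^a_{s s'},  R[s, a] = R^a_s,  pi[s, a] = pi(a | s).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.3, 0.7], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
gamma = 0.9


def policy_matrices(P, R, pi):
    """Average the dynamics and rewards over the policy's action choice."""
    # P_pi[s, t] = sum_a pi[s, a] * P[a, s, t]
    P_pi = np.einsum("sa,ast->st", pi, P)
    # R_pi[s] = sum_a pi[s, a] * R[s, a]
    R_pi = np.sum(pi * R, axis=1)
    return P_pi, R_pi


def solve_policy_value(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for V_pi."""
    P_pi, R_pi = policy_matrices(P, R, pi)
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)


P_pi, R_pi = policy_matrices(P, R, pi)
V_pi = solve_policy_value(P, R, pi, gamma)
print(V_pi)
```

Once the policy is fixed, the MDP collapses to an MRP with transition matrix `P_pi` and reward vector `R_pi`, which is why the MRP solution carries over unchanged.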

Bellman Optimality Equation

Optimal Value Function

The optimal state-value function $V_{*} (s)$ is the highest state value attainable over all policies.

$$
V_{*} (s) = \max_{π} V_{π} (s)
$$

The optimal action-value function $Q_{*} (s, a)$ is defined in the same way.

$$
Q_{*} (s, a) = \max_{π} Q_{π} (s, a)
$$

If $V_{π_{1}} (s) \geq V_{π_{2}} (s)$ for every state $s \in S$, we write $π_{1} \geq π_{2}$ and say that $π_{1}$ is a better policy than $π_{2}$.

A policy $π_{*}$ that is better than or equal to every other policy, that is, $π_{*} \geq π$ for all $π$, is called an optimal policy.

Theorem

An optimal policy $π_{*}$ always exists.

The state values $V_{π_{*}} (s)$ and action values $Q_{π_{*}} (s, a)$ computed under an optimal policy $π_{*}$ are the optimal state-value and action-value functions, respectively. That is, $V_{π_{*}} (s) = V_{*} (s)$ and $Q_{π_{*}} (s, a) = Q_{*} (s, a)$.
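The policy ordering and the existence claim can be checked by brute force on a tiny example. The sketch below enumerates every deterministic policy of a hypothetical 2-state, 2-action MDP, evaluates each one exactly, and keeps the policies that are at least as good as all the others in every state. Restricting the search to deterministic policies is an assumption of this sketch; it relies on the standard result that a finite MDP has a deterministic optimal policy.

```python
import itertools

import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s2] = P^a_{s s'},  R[s, a] = R^a_s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.3, 0.7], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_actions, n_states = P.shape[0], P.shape[1]


def policy_value(actions):
    """Exact V_pi for a deterministic policy given as one action per state."""
    states = np.arange(n_states)
    chosen = np.array(actions)
    P_pi = P[chosen, states, :]  # row s is the transition row for action chosen[s]
    R_pi = R[states, chosen]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)


policies = list(itertools.product(range(n_actions), repeat=n_states))
values = {p: policy_value(p) for p in policies}

# pi_1 >= pi_2 means V_{pi_1}(s) >= V_{pi_2}(s) in every state s.
optimal = [
    p for p in policies
    if all(np.all(values[p] >= values[q] - 1e-12) for q in policies)
]
print(optimal)
```

Enumeration grows as $|A|^{|S|}$, so this is only a sanity check for toy problems, not a practical way to find $π_{*}$.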

Finding an Optimal Policy

If the optimal action-value function $Q_{*} (s, a)$ is known, an optimal policy can be read off directly:

$$
π_{*} (a \mid s) =
\begin{cases}
1 & \text{if } a = \arg\max_{a \in A} Q_{*} (s, a) \\
0 & \text{otherwise}
\end{cases}
$$

In other words, choosing the action with the largest $Q_{*} (s, a)$ in each state is an optimal policy.
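As a minimal sketch of this rule, the function below turns a table of action values into a one-hot policy. The `Q_star` numbers are hypothetical, and breaking ties by taking the first maximising action is a choice made here, since the rule itself leaves ties open.

```python
import numpy as np


def greedy_policy(Q):
    """Return pi[s, a] = 1 if a is the argmax of Q[s, :], else 0.

    Ties are broken by taking the first maximising action.
    """
    n_states, n_actions = Q.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), np.argmax(Q, axis=1)] = 1.0
    return pi


# Hypothetical optimal action values for 3 states and 2 actions.
Q_star = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
    [4.0, 4.0],  # tie: the first action is selected
])
pi_star = greedy_policy(Q_star)
print(pi_star)
```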

Bellman Optimality Equation for $V_{*} (s)$ and $Q_{*} (s, a)$

The Bellman optimality equation gives recursive equations for $V_{*} (s)$ and $Q_{*} (s, a)$; solving them yields the optimal value functions and an optimal policy.

As with the Bellman expectation equation, the relation between $V_{*} (s)$ and $Q_{*} (s, a)$ can be written as follows.

$$
V_{*} (s) = \max_{a \in A} Q_{*} (s, a)
$$

$$
Q_{*} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} V_{*} (s^{'})
$$

Substituting each of these into the other gives the Bellman optimality equations.

$$
V_{*} (s) = \max_{a \in A} \left( R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} V_{*} (s^{'}) \right)
$$

$$
Q_{*} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} \max_{a^{'} \in A} Q_{*} (s^{'}, a^{'})
$$

Because of the max operator, the Bellman optimality equation is non-linear, so it cannot be solved with a simple matrix operation. Instead, the solution is found with an iterative algorithm. The details will be covered in the next post.
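To give a flavour of what such an iterative solution can look like, here is a minimal sketch of value iteration, which repeatedly applies the right-hand side of the equation for $V_{*}$ as an update. This is one standard choice and not necessarily the algorithm the next post uses; the MDP, the array layout, and the function name are hypothetical.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s2] = P^a_{s s'},  R[s, a] = R^a_s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.3, 0.7], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V <- max_a (R^a_s + gamma * sum_s' P^a_{s s'} V(s'))."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V).T  # Q[s, a] from the current V
        V_new = Q.max(axis=1)      # the max over actions is the non-linear step
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V).T      # action values consistent with the final V
    return V, Q


V_star, Q_star = value_iteration(P, R, gamma)
greedy_actions = Q_star.argmax(axis=1)
print(V_star)
print(greedy_actions)
```

The only difference from the policy-evaluation backup is that the policy-weighted sum over actions is replaced by a max, which is exactly what makes the equation non-linear.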

Roh Donghyun
