[CS285] 7. Value Function Methods

Date: 2024.04.23 Updated: 2024.04.23

카테고리: CS285

태그: RL

💡 이 글은 『2024 PseudoLab 전반기 강화학습팀』으로 진행되었으며, CS285 Fall 2023를 따라 정리했습니다.

1. Value Iteration

1.1. Policy Iteration

6장에서 배운 actor-critic algorithm은 다음과 같이 정리할 수 있다.

Sample $\lbrace s_i, a_i \rbrace$ from $\pi _ \theta (a \vert s)$
Fit $\hat{V}_ \phi ^ \pi (s_i)$ to sampled reward sums
Evaluate $\hat{A} ^ \pi (s_i, a_i) = r(s_ i, a_ i) + \hat{V}_ \phi ^ \pi (s_ i ^ \prime) - \hat{V}_ \phi ^ \pi (s_ i)$
$\nabla_ \theta J(\theta) \approx \sum_ i \nabla_ \theta \log \pi_ \theta (a_ i \vert s_ i) \hat{A} ^ \pi (s_ i, a_ i)$
$\theta \leftarrow \theta + \alpha \nabla_ \theta J(\theta)$

이번 장에서는 value function methods에 대해 다룬다. 이는 policy gradient를 생략하고, value function만 학습하는 방법이다. 즉, 여기서는 $A ^ \pi (s_t, a_t)$가 가장 큰 action $a_t$를 선택하는 argmax policy를 사용한다.

\[\pi ^ \prime (a_t \vert s_t) = \begin{cases} 1 & \text{if } a_t = \text{argmax}_ {a_t} A^ \pi (s_t, a_t) \\ 0 & \text{otherwise} \end{cases}\]

이러한 방법을 policy iteration algorithm이라고 한다. 알고리즘을 정리하면 다음과 같다.

Evaluate $A^ \pi (s, a)$
Set $\pi \leftarrow \pi ^ \prime$ (Policy Iteration)

문제는 모든 $s, a$에 대해 $A^ \pi (s, a)$를 계산(evaluate)하는 것이 쉽지 않다는 것인데, $A^ \pi (s, a)$는 actor-critic에서와 같이 다음과 같이 나타낼 수 있다.

\[A^ \pi (s, a) = r(s, a) + \gamma \mathbb{E} \left[ V^ \pi (s ^ \prime) \right] - V^ \pi (s)\]

따라서 actor-critic에서와 같이 $V^ \pi (s)$를 평가(evaluate)하면 이를 해결할 수 있다. 이때 가장 간단한 방법은 dynamic programming을 사용하는 것이다.

1.2. Dynamic Programming

Dynamic programming에서는 value function $V^ \pi (s)$의 모든 값을 가상의 table에 모두 저장한다. 이를 위해서 필요한 가정은 모든 $p(s^ \prime \vert s, a)$를 알고 있으며, $s, a$가 모두 discrete하고, 그 가짓수가 적다는 것이다. 이 가정은 일반적으로는 불가능하므로, 나중에 이를 확장할 것이다.

이 가정 하에서는 $V^ \pi (s)$를 다음과 같이 업데이트할 수 있다.

\[V^ \pi (s) \leftarrow \mathbb{E}_ {a \sim \pi (a \vert s)} \left[ r(s, a) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, a)} \left[ V^ \pi (s ^ \prime) \right] \right]\]

한편, 우리는 deterministic policy $\pi (s) = a$를 가지고 있으므로, 즉 $a \sim \pi (a \vert s)$에서 하나의 action만 얻게 되므로 위 식을 더 간단하게 다음과 같이 나타낼 수 있다.

\[V^ \pi (s) \leftarrow r(s, \pi (s)) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, \pi (s))} \left[ V^ \pi (s ^ \prime) \right]\]

1.3. Value Iteration

이를 더 간단하게 만든 것이 value iteration algorithm이다. 이는 policy를 아예 생략하는 것이다. 작동 원리를 이해하기 위해 먼저 $A^ \pi (s, a)$와 $Q^ \pi (s, a)$ 식을 다시 보자.

\[\begin{aligned} A^ \pi (s, a) &= r(s, a) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, a)} \left[ V^ \pi (s ^ \prime) \right] - V^ \pi (s) \\ Q^ \pi (s, a) &= r(s, a) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, a)} \left[ V^ \pi (s ^ \prime) \right] \\ \end{aligned}\]

$V ^ \pi (s)$ 외에는 값이 같으므로, 다음과 같이 쓸 수 있다.

\[\arg \max_ {a_t} A^ \pi (s_t, a_t) = \arg \max_ {a_t} Q^ \pi (s_t, a_t)\]

따라서 조금 더 간단한 $Q^ \pi (s, a)$를 사용해서도 동일한 policy iteration을 수행할 수 있다. 여기서 한 걸음 더 나아가서 $Q^ \pi (s, a)$를 일종의 $s, a$에 대한 matrix로 이해해보자.

지금까지의 argmax policy에 따르면 $s$가 주어졌을 때 $Q^ \pi (s, a)$의 최대값을 가지는 $a$를 선택하면 된다. 따라서 이러한 policy를 따른다면 value function을 다음과 같이 업데이트할 수 있다.

\[V^ \pi (s) \leftarrow \max_ {a} Q^ \pi (s, a)\]

이러한 방법을 value iteration algorithm이라고 한다. 이는 Q function을 사용하고, policy를 생략하는 방법이다.

Set $Q(s, a) \leftarrow r(s, a) + \gamma \mathbb{E} \left[ V(s ^ \prime) \right]$
Set $V(s) \leftarrow \max_ {a} Q(s, a)$ (Value Iteration)

명시적으로 policy가 포함되어 있지는 않으나, $\arg \max_ a Q(s, a)$를 통하여 policy를 얻을 수 있다. 즉, $Q(s, a)$ 내부에 policy 정보가 포함되어 있는 셈이다.

2. Fitted Value Iteration

2.1. Fitted Value Iteration

지금까지 dynamic programming의 두 가지 중요한 가정이 있었다. 첫째는 $s, a$가 small, discrete하다는 것이고, 둘째는 모든 transition dynamics $p(s^ \prime \vert s, a)$를 알고 있다는 것이다. 첫 번째 가정은 curse of dimensionality 때문에 문제가 되고, 두 번째 가정은 실제로 불가능하다. 먼저 첫 번째 가정의 문제를 해결하기 위하여 value function을 table로 저장하는 것이 아니라, neural network를 사용하여 $V(s)$를 학습하는 방법을 생각해보자. 이를 function approximation이라고 하며, “fitted”라는 단어를 붙여 fitted value iteration이라고 부른다.

\[\mathcal{L} (\phi) = \frac{1}{2} \left\Vert V_ \phi (s) - \max_ {a} Q^ \pi (s, a) \right\Vert ^ 2\]

위와 같은 objective를 사용하여 학습하는 알고리즘을 만들자.

Set $y_i \leftarrow \max_ {a_i} \left(r(s_i, a_i) + \gamma \mathbb{E} \left[ V_ \phi (s_i ^ \prime) \right] \right)$
Set $\phi \leftarrow \arg \min_ \phi \frac{1}{2} \sum_i \left\Vert V_ \phi (s_i) - y_i \right\Vert ^ 2$

2.2. Fitted Q-Iteration

위 과정을 통하여 첫 번째 문제를 해결했다. 두 번째 문제는 transition dynamics를 알지 못한다는 것인데, 이 문제는 value function과 Q-function의 식을 가만히 들여다보면 해결할 수 있다.

\[\begin{aligned} V^ \pi (s) &\leftarrow r(s, \pi(s)) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, \pi(s))} \left[ V^ \pi (s ^ \prime) \right] \\ Q^ \pi (s, a) &\leftarrow r(s, a) + \gamma \mathbb{E}_ {s ^ \prime \sim p(s ^ \prime \vert s, a)} \left[ Q^ \pi (s ^ \prime, \pi(s ^ \prime)) \right] \\ \end{aligned}\]

Policy evaluation에서 지금까지는 $V^ \pi (s)$를 계산하기 위하여 transition dynamics를 알아야 했지만, Q-function을 사용하면 선택한 행동 $a$에 대해서만 고려하면 된다. 즉, $s, a$를 얻으면 $s ^ \prime$은 transition dynamics 및 policy와 무관하게 결정되기 때문에 $(s, a, s^ \prime)$을 샘플링하여 value function을 업데이트할 수 있다. 알고리즘은 fitted value iteration에서의 $\mathbb{E} \left[V(s ^ \prime) \right]$를 $\max_ {a ^ \prime} Q_ \phi (s _ i ^ \prime, a _ i ^ \prime)$로 근사하면 된다.

이러한 fitted Q-iteration의 장점은 off-policy가 가능하다는 것이다. 또한 network도 하나뿐이고, variance가 큰 policy gradient도 사용하지 않는다. 그러나 수렴을 보장하지는 않는다. 자세한 이야기는 다음 절에서 다루겠다.

3. Q-Learning

3.1. Error $\mathcal{E}$

Fitted Q-iteration에서 error $\mathcal{E}$를 다음과 같이 정의하자.

\[\mathcal{E} = \frac{1}{2} \mathbb{E}_ {(s, a) \sim \beta} \left[ \left( Q_ \phi (s, a) - \left( r(s, a) + \gamma \max_ {a ^ \prime} Q_ \phi (s ^ \prime, a ^ \prime) \right) \right) ^ 2 \right]\]

만약 $\mathcal{E} = 0$이라면, $Q_ \phi (s, a)$는 optimal Q-function이고, 정확히 optimal policy $\pi^ \prime$이 내재되어 있다.

\[\pi^ \prime (a \vert s) = \begin{cases} 1 & \text{if } a = \arg \max_ {a} Q_ \phi (s, a) \\ 0 & \text{otherwise} \end{cases}\]

그러나 이는 Q-function을 table로 가지고 있을 때의 이야기이고, 지금은 Q-function을 neural network로 모델링하였으므로 $\mathcal{E} \gt 0$이다. 즉, optimal Q-function으로의 수렴을 보장할 수 없다.

3.2. Online Q-Learning

Online Q-learning algorithms는 fitted Q-iteration의 아주 특수한 형태로 하나의 transition을 관찰하여 업데이트를 진행한다. 그 알고리즘은 다음과 같다.

이 알고리즘은 off-policy이기 때문에 $(s_i, a_i, s_i ^ \prime, r_i)$로 가능한 데이터는 다양하다. 한편, 이러한 방식으로 업데이트를 진행할 경우 initial Q-function이 좋지 않다면 better action discovery에 실패할 수 있다. 이를 해결하기 위하여 epsilon greedy, 혹은 Boltzmann exploration을 사용할 수 있다. 두 방법 모두 exploration을 통하여 argmax가 아닌 다른 action을 활용할 수 있게 한다.

4. Value Functions in Theory

4.1. Convergence of Value Iteration

마지막으로 value iteration algorithm이 수렴하는지, 수렴한다면 어디로 수렴하는지 알아보자. Value iteration algorithm을 다시 보면 다음과 같다.

Set $Q(s, a) \leftarrow r(s, a) + \gamma \mathbb{E} \left[ V(s ^ \prime) \right]$
Set $V(s) \leftarrow \max_ {a} Q(s, a)$ (Value Iteration)

이를 수학적으로 정의하기 위해 Bellman operator $\mathcal{B}$를 정의하자. 결국 위 step을 합쳐 value iteration을 수행하는 것을 나타낸 것에 불과하다.

\[\mathcal{B} V = \max_ {a} \left[ r_a + \gamma \mathcal{T}_ a V \right]\]

당연히 optimal policy의 value function $V^ \star$은 $\mathcal{B}$의 fixed point, 즉 $\mathcal{B} V^ \star = V^ \star$이다. 이러한 $V^ \star$는 uniquely exists하다는 것이 알려져 있다.

$V$가 언젠가는 $V^ \star$로 수렴하는 것도 알려져 있는데, 이는 $\mathcal{B}$가 contraction이기 때문이다. Contraction은 임의의 $V, \bar{V}$에 대해 다음이 성립한다는 것이다.

\[\left\Vert \mathcal{B} V - \mathcal{B} \bar{V} \right\Vert _ \infty \leq \gamma \left\Vert V - \bar{V} \right\Vert _ \infty\]

$\bar{V} = V^ \star$로 두면 다음이 성립한다.

\[\left\Vert \mathcal{B} V - V^ \star \right\Vert _ \infty \leq \gamma \left\Vert V - V^ \star \right\Vert _ \infty\]

따라서 $V$가 $V^ \star$로 수렴한다는 것을 알 수 있다.

4.2. Convergence of Fitted Value Iteration

한편 fitted value iteration algorithm을 다시 보면 다음과 같다.

Set $y_i \leftarrow \max_ {a_i} \left(r(s_i, a_i) + \gamma \mathbb{E} \left[ V_ \phi (s_i ^ \prime) \right] \right)$
Set $\phi \leftarrow \arg \min_ \phi \frac{1}{2} \sum_i \left\Vert V_ \phi (s_i) - y_i \right\Vert ^ 2$

이때 $y_i = (\mathcal{B} V) (s_i)$로 볼 수 있고, 전체 과정을 다음과 같이 나타낼 수 있다.

\[V^ \prime \leftarrow \arg \min_ {V ^ \prime \in \Omega} \frac{1}{2} \left\Vert V^ \prime (s) - (\mathcal{B} V) (s) \right\Vert ^ 2\]

여기서 $\Omega$는 neural network가 나타낼 수 있는 value function space를 나타낸다. 이 과정을 operator $\Pi$로 나타내자.

\[\Pi V = \arg \min_ {V ^ \prime \in \Omega} \frac{1}{2} \left\Vert V^ \prime (s) - V(s) \right\Vert ^ 2\]

종합하면 fitted value iteration algorithm은 $V \leftarrow \Pi \mathcal{B} V$로 나타낼 수 있다. 수학적으로 $\Pi$는 $\Omega$로의 projection으로 이해할 수 있다. 아래 그림이 직관적이다.

$\mathcal{B}$와 $\Pi$ 모두 각각은 contraction임이 알려져 있다.

그러나 아래의 그림을 통해 이해할 수 있듯이 $\Pi \mathcal{B}$는 contraction이 아니다. 이는 fitted value iteration algorithm이 수렴하지 않을 수 있다는 것을 의미한다.

마찬가지 이유로 fitted Q iteration 및 actor-critic algorithm도 수렴이 보장되지 않는다. Q-learning의 경우에는 알고리즘 상 gradient descent가 진행되므로 수렴하는 것이 아니냐고 물어볼 수도 있다.

그러나 실제로는 $y_i$가 $\phi$에 의존하는데도 불구하고 gradient에 반영되지 않기 때문에, Q-learning은 gradient descent로 볼 수 없다. 따라서 Q-learning도 수렴이 보장되지 않는다. 이처럼 이론적으로는 수렴이 잘 보장되지 않더라도 실제로는 잘 작동할 수 있도록 할 수 있는데, 이에 대해서는 다음 시간에 다루겠다.

I'm rubatoyeong