(yet) Offline Actor-Critic (Perceiver Actor-Critic; PAC) Algorithm
12 Feb 2024
Introduction (Motivation)
DeepMind reports scaling laws for an offline actor-critic RL algorithm applied to large transformer-like models. Given a large-scale dataset of sub-optimal or expert behavior across 132 continuous control tasks (robotics and the like), their method clearly outperforms Behavior Cloning (BC). The proposed method is a Perceiver-based actor-critic; since the Perceiver is an architecture designed to take images, text, games, and other modalities as input, they appear to have trained a multi-task policy.
Hence the algorithm's name: Perceiver Actor-Critic (PAC).
Personally, though, I am less interested in the multimodality than in the learning algorithm itself, which manages to beat BC at every scale.
Since it is an actor-critic method, there are naturally a few options for designing the critic. The paper proposes both PAC-Q, which uses a state-action value function (Q-function), and PAC-V, which uses a state value function (V-function). What drew me to the paper is that, unusually, both variants use a distributional critic in the style of C51.
Background and Notation
- Multi-task MDP
- action at time \(t\): \(a_t \in A\)
- state at time \(t\): \(s_t \in S\)
- reward specific to task \(\tau \in T\): \(r_{t+1} = R(s_t, a_t, \tau) \in \mathbb{R}\)
- transition to the next state: \(s_{t+1} \sim p(\cdot \vert s_t, a_t)\)
- What the RL algorithm seeks
- policy: \(\pi(a_t \vert s_t, \tau)\)
- per-task discounted cumulative return: \(\mathbb{E}_{p_{\pi}} [ \sum_{t=0}^{\infty} \gamma^t R(s_t,a_t,\tau) ]\)
- we want to find a policy that maximizes the per-task discounted cumulative return under the trajectory distribution \(p_{\pi}\)
- offline dataset: \(D=\{ (s_t, a_t, s_{t+1}, \tau) \}\), generated by a behavior policy \(b(a_t \vert s_t, \tau)\)
- Q-function (state-action value function): \(Q^{\pi}(s_t,a_t,\tau) = \mathbb{E}_{p_{\pi}} [ \sum_{k=t}^{\infty} \gamma^{k-t} R(s_k, a_k, \tau) \,\vert\, s_t, a_t ]\)
- V-function (state value function): \(V^{\pi}(s_t,\tau) = \mathbb{E}_{a_t \sim \pi(\cdot \vert s_t, \tau)} [ Q^{\pi}(s_t,a_t,\tau)]\)
- A-function (advantage function): \(A^{\pi}(s_t,a_t,\tau) = Q^{\pi}(s_t,a_t,\tau)-V^{\pi}(s_t,\tau)\)
- objective function of the Behavior Cloning (BC) term: \(\mathbb{E}_{(s_t, \tau) \in D} D_{KL} [b, \pi \vert s_t, \tau] = - \mathbb{E}_D \log \pi(a_t \vert s_t, \tau) + K_{BC}\), where \(K_{BC}\) is a constant offset independent of \(\pi\)
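The BC term above can be made concrete with a small sketch. This is not the paper's implementation: it just computes the average negative log-likelihood of dataset actions under a hypothetical diagonal-Gaussian policy, which matches \(-\mathbb{E}_D \log \pi(a_t \vert s_t, \tau)\) up to the constant \(K_{BC}\). All array names and shapes are illustrative.

```python
import numpy as np

def bc_loss(policy_mean, policy_std, actions):
    """Average negative log-likelihood of dataset actions under a
    diagonal-Gaussian policy pi(a | s, tau). Minimizing this equals
    minimizing KL(b, pi) over (s_t, tau) in D, up to the constant K_BC."""
    # log N(a; mu, sigma^2), summed over action dimensions
    log_prob = -0.5 * (((actions - policy_mean) / policy_std) ** 2
                       + np.log(2 * np.pi * policy_std ** 2)).sum(axis=-1)
    return -log_prob.mean()

# Toy batch (hypothetical): policy outputs at 4 states, 2-dim actions,
# and the behavior policy's actions from the offline dataset D.
mu = np.zeros((4, 2))     # policy mean per state
sigma = np.ones((4, 2))   # policy std per state
a = np.zeros((4, 2))      # dataset actions
loss = bc_loss(mu, sigma, a)
```

The loss is smallest when the policy's mean matches the dataset actions, which is exactly the sense in which BC regresses the policy onto the behavior data.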
Offline KL-Regularized Actor-Critic
The objective function the authors target is the KL-regularized RL objective: the goal is to find a policy \(\pi_{imp}\) that improves upon a reference policy \(\tilde{\pi}\).
- our target is KL-regularized RL objective
- goal: \(\pi_{imp}\)
- reference policy: \(\tilde{\pi}\)
- objective: \(\pi_{imp} = \arg \max_{\pi} J(\pi)\)
- where \(J(\pi) = \mathbb{E}_{(s_t,\tau) \in D} [ \mathbb{E}_{a_t \sim \pi} [Q^{\pi} (s_t,a_t,\tau)] - \eta D_{KL}[\pi, \tilde{\pi} \vert s_t, \tau]]\)
- where \(\eta\) is a hyperparameter determining the strength of regularization towards the reference policy
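To see how \(\eta\) trades off value against closeness to \(\tilde{\pi}\), here is a minimal sketch of \(J(\pi)\) at a single state, using a two-action discrete policy purely for illustration (the paper's policies are continuous, and all names and numbers here are hypothetical):

```python
import numpy as np

def kl_regularized_objective(pi, pi_ref, q_values, eta):
    """J(pi) at one state: E_{a~pi}[Q(s,a,tau)] - eta * KL(pi, pi_ref).
    pi, pi_ref: action probabilities; q_values: Q(s,a,tau) per action."""
    expected_q = np.sum(pi * q_values)
    kl = np.sum(pi * np.log(pi / pi_ref))
    return expected_q - eta * kl

pi_ref = np.array([0.5, 0.5])    # reference policy (e.g. a BC policy)
q = np.array([1.0, 0.0])         # hypothetical Q-values for two actions
greedy = np.array([0.99, 0.01])  # policy that chases the highest Q

# Large eta: staying near pi_ref scores higher than going greedy.
j_greedy_strong = kl_regularized_objective(greedy, pi_ref, q, eta=10.0)
j_ref_strong = kl_regularized_objective(pi_ref, pi_ref, q, eta=10.0)
# eta = 0 removes the regularizer, recovering plain Q-maximization.
j_greedy_free = kl_regularized_objective(greedy, pi_ref, q, eta=0.0)
```

With a large \(\eta\) the KL penalty dominates and the improved policy stays close to \(\tilde{\pi}\); as \(\eta \to 0\) the objective reduces to maximizing expected Q-values, which is what makes \(\eta\) the knob between pure BC-like behavior and pure RL.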