(yet) Offline Actor-Critic (Perceiver Actor-Critic; PAC) Algorithm


< Table of Contents >


Introduction (Motivation)

DeepMind reports scaling laws for an offline actor-critic RL algorithm applied to large models such as transformers. Given large-scale datasets of sub-optimal or expert behavior on 132 continuous control tasks (robotics-like domains), the method substantially outperformed Behavior Cloning (BC). The proposed method is a Perceiver-based actor-critic; since the Perceiver is an architecture designed to take multiple modalities (images, text, games, and so on) as input, it appears to learn a multi-task policy.

Hence the algorithm's name, Perceiver Actor-Critic (PAC). Honestly, I am less interested in the multimodality than in the learning algorithm itself, which is what lets it beat BC at every scale. Being an actor-critic method, there are naturally several options for designing the critic: the paper proposes both PAC-Q, which uses a state-action value function (Q-function), and PAC-V, which uses a state value function (V-function). What drew me to the paper is that, somewhat unusually, they use a distributional critic in the style of C51.

Background and Notation

  • Multi-task MDP
    • action at time \(t\): \(a_t \in A\)
    • state at time \(t\): \(s_t \in S\)
    • reward specific to task \(\tau \in T\): \(r_{t+1} = R(s_t, a_t, \tau) \in \mathbb{R}\)
    • transition to the next state: \(s_{t+1} \sim p(\cdot \vert s_t,a_t)\)
  • What the RL algorithm seeks
    • policy: \(\pi(a_t \vert s_t, \tau)\)
    • per-task discounted cumulative return: \(\mathbb{E}_{p_{\pi}} [ \sum_{t=0}^{\infty} \gamma^t R(s_t,a_t,\tau) ]\)
    • we want to find a policy that maximizes the per-task discounted cumulative return under the trajectory distribution \(p_{\pi}\)
    • offline dataset: \(D=\{ (s_t, a_t, s_{t+1}, \tau) \}\), generated by a behavior policy \(b(a_t \vert s_t,\tau)\)
    • Q-function (state-action value function): \(Q^{\pi}(s_t,a_t,\tau) = \mathbb{E}_{p_{\pi}} [ \sum_{k=t}^{\infty} \gamma^{k-t} R(s_k, a_k, \tau) \,\vert\, s_t, a_t ]\)
    • V-function (state value function): \(V^{\pi}(s_t,\tau) = \mathbb{E}_{a_t \sim \pi(\cdot \vert s_t, \tau)} [ Q^{\pi}(s_t,a_t,\tau)]\)
    • A-function (advantage function): \(A^{\pi}(s_t,a_t,\tau) = Q^{\pi}(s_t,a_t,\tau)-V^{\pi}(s_t,\tau)\)
    • objective of the Behaviour Cloning (BC) term: \(\mathbb{E}_{(s_t, \tau) \in D} D_{KL} [b,\pi \vert s_t,\tau] = - \mathbb{E}_D \log \pi(a_t \vert s_t, \tau) + K_{BC}\), where \(K_{BC}\) is a constant offset (see the short code sketch after this list)
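To keep the notation concrete, here is a minimal PyTorch-style sketch of the BC term and the advantage. The policy, q_fn, and v_fn interfaces and the batch layout are assumptions of mine for illustration, not the paper's implementation.

```python
import torch

def bc_term(policy, batch):
    """Behaviour-cloning term -E_D[log pi(a_t | s_t, tau)] (up to the constant K_BC).

    Assumes `policy(s, tau)` returns a torch.distributions object and `batch`
    holds offline tensors 's', 'a', 'tau' sampled from the dataset D.
    """
    dist = policy(batch["s"], batch["tau"])
    return -dist.log_prob(batch["a"]).mean()

def advantage(q_fn, v_fn, batch):
    """A^pi(s_t, a_t, tau) = Q^pi(s_t, a_t, tau) - V^pi(s_t, tau)."""
    return q_fn(batch["s"], batch["a"], batch["tau"]) - v_fn(batch["s"], batch["tau"])
```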

Offline KL-Regularized Actor-Critic

The objective the authors target in the paper is the KL-regularized RL objective; the goal is to find a policy \(\pi_{imp}\) that improves on a reference policy \(\tilde{\pi}\).

  • our target is the KL-regularized RL objective
    • goal: an improved policy \(\pi_{imp}\)
    • reference policy: \(\tilde{\pi}\)
    • objective: \(\pi_{imp} = \arg \max_{\pi} J(\pi)\)
      • where \(J(\pi) = \mathbb{E}_{(s_t,\tau) \in D} [ \mathbb{E}_{a_t \sim \pi} [Q^{\pi} (s_t,a_t,\tau)] - \eta D_{KL}[\pi, \tilde{\pi} \vert s_t, \tau]]\)
        • where \(\eta\) is a hyperparameter determining the strength of regularization towards the reference policy (the closed-form maximizer is spelled out below)
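For a fixed Q-function, the KL-regularized objective above has the standard closed-form maximizer (the same form that shows up in MPO/AWR-style derivations); I write it out here because it is where the weights \(w(a', s_t, \tau)\) in the loss below come from:

\[ \pi_{imp}(a \vert s_t, \tau) = \frac{\tilde{\pi}(a \vert s_t, \tau) \exp ( Q(s_t, a, \tau) / \eta )}{Z(s_t, \tau)}, \quad Z(s_t, \tau) = \mathbb{E}_{a \sim \tilde{\pi}(\cdot \vert s_t, \tau)} \left[ \exp ( Q(s_t, a, \tau) / \eta ) \right] \]

A small \(\eta\) sharpens \(\pi_{imp}\) around high-value actions, while a large \(\eta\) keeps it close to the reference policy \(\tilde{\pi}\).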
The training loss then combines three KL terms: distillation of the improved policy \(\pi_{imp}\) (with the target network \(\pi_{\theta'}\) as the reference policy) into \(\pi_{\theta}\), a BC term towards the behavior policy \(b\) weighted by \(\alpha\), and a distributional critic term weighted by \(\beta\) that fits the predicted return distribution \(p_{\theta}(q \vert s_t, a_t, \tau)\) to the target return distribution \(\Gamma_{\theta'}(q \vert s_t, a_t, \tau)\). Expanding the KLs, with \(w(a', s_t, \tau) \propto \exp(Q(s_t, a', \tau)/\eta)\) from the closed form above and \(K_H\) collecting the \(\theta\)-independent entropy terms (analogous to \(K_{BC}\)):

\[\begin{aligned} L^{Q} &= \mathbb{E}_D \big[ (1-\alpha)\, D_{KL} [ \pi_{imp}, \pi_{\theta} \vert s_t, \tau, \tilde{\pi} = \pi_{\theta'} ] \\ &\quad + \alpha\, D_{KL} [ b, \pi_{\theta} \vert s_t, \tau ] \\ &\quad + \beta\, D_{KL} [ \Gamma_{\theta'} (q \vert s_t, a_t, \tau), p_{\theta} ( q \vert s_t, a_t, \tau) ] \big] \\ &= -\mathbb{E}_D \big[ (1-\alpha)\, \mathbb{E}_{a' \sim \pi_{\theta'}} [ w(a', s_t, \tau) \log \pi_{\theta} (a' \vert s_t, \tau) ] \\ &\quad + \alpha \log \pi_{\theta} (a_t \vert s_t, \tau) \\ &\quad + \beta\, \mathbb{E}_{q \sim \Gamma_{\theta'}} \log p_{\theta} ( q \vert s_t, a_t, \tau) \big] + K_H \end{aligned}\]
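To make the structure of this loss concrete, here is a minimal PyTorch-style sketch under assumptions of my own: the policy, target_policy, critic, and target_return_dist interfaces are hypothetical, and I self-normalize the weights w over a handful of actions sampled from the target policy. It illustrates the three terms, not the paper's actual implementation.

```python
import torch

def pac_style_loss(policy, target_policy, critic, target_return_dist, batch,
                   alpha=0.5, beta=1.0, eta=0.1, num_action_samples=8):
    """Sketch of L^Q: (1-alpha) policy improvement + alpha BC + beta distributional critic.

    Hypothetical interfaces (not the paper's code):
      policy(s, tau), target_policy(s, tau) -> torch.distributions object (pi_theta, pi_theta')
      critic(s, a, tau)             -> (q_value [B], log p_theta(q | s, a, tau) [B, num_bins])
      target_return_dist(s, a, tau) -> Gamma_theta'(q | s, a, tau) probabilities [B, num_bins]
    """
    s, a, tau = batch["s"], batch["a"], batch["tau"]
    dist = policy(s, tau)  # pi_theta(. | s_t, tau)

    # Policy-improvement term: distill pi_imp into pi_theta using sampled actions
    # a' ~ pi_theta' and self-normalized weights w(a', s, tau) proportional to exp(Q / eta).
    with torch.no_grad():
        a_primes = target_policy(s, tau).sample((num_action_samples,))        # [N, B, dA]
        q_primes = torch.stack([critic(s, a_p, tau)[0] for a_p in a_primes])  # [N, B]
        w = torch.softmax(q_primes / eta, dim=0)                              # normalize over the N samples
    log_pi = torch.stack([dist.log_prob(a_p) for a_p in a_primes])            # [N, B]
    improvement_loss = -(w * log_pi).sum(dim=0).mean()

    # BC term: -E_D[log pi_theta(a_t | s_t, tau)].
    bc_loss = -dist.log_prob(a).mean()

    # Distributional critic term: cross-entropy of the predicted return distribution
    # p_theta against the target return distribution Gamma_theta' (C51-style bins).
    _, log_p_q = critic(s, a, tau)
    with torch.no_grad():
        target_probs = target_return_dist(s, a, tau)
    critic_loss = -(target_probs * log_p_q).sum(dim=-1).mean()

    return (1 - alpha) * improvement_loss + alpha * bc_loss + beta * critic_loss
```

As the formula suggests, \(\alpha\) trades off policy improvement against staying close to the data, while \(\beta\) controls how strongly the network is trained as a critic.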

Reference