Thoughts on the Location of the Normalization Layer in Transformers (from the perspective of residual connections (identity mapping), representation collapse, and gradient stability)


< Table of Contents >


Motivation

Emergence of Pre-Norm

Representation Collapse and Why Deeper Is Better for Reasoning Capability

References