Thoughts on the Location of the Normalization Layer in Transformers (from the POV of the residual connection (identity mapping), representation collapse, and gradient stability)
23 Dec 2024

< Table of Contents >
Motivation
Emergence of Pre-Norm
Representation Collapse, and Why Deeper Is Better for Reasoning Capability
References
- Papers
- ResiDual: Transformer with Dual Residual Connections
- Hyper-Connections
- Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
- MoEUT: Mixture-of-Experts Universal Transformers
- 52B to 1T: Lessons Learned via Tele-FLM Series
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- Value Residual Learning For Alleviating Attention Concentration In Transformers
- Gemma 2: Improving Open Language Models at a Practical Size
- Others
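Since the whole post revolves around where LayerNorm sits relative to the residual (identity) path, here is a minimal sketch of the two orderings before diving in. This is my own PyTorch-style illustration, not code from any of the papers above; the class names (`PostLNBlock`, `PreLNBlock`) and the layer sizes are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer ordering: sublayer -> residual add -> LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm wraps the residual sum, so even the identity path gets
        # re-normalized at every layer: harder to train very deep without
        # warmup, but later layers' contributions are not drowned out.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-Norm ordering: LayerNorm -> sublayer -> residual add."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # The identity path is left untouched, which gives stable gradients in
        # deep stacks, but the residual stream keeps growing, so deeper layers
        # contribute relatively less (one framing of representation collapse).
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x

# Usage: both blocks map (batch, seq, d_model) -> (batch, seq, d_model).
x = torch.randn(2, 16, 512)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```

The two classes differ only in where `ln1`/`ln2` are applied relative to the residual addition; the sections below examine the consequences of that single design choice.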