© 2024. All rights reserved.
Notes
Categories
MachineLearning
(WIP) Expectation Maximization (EM) vs Variational Inference (VI)
(WIP) Regression (4/7) - Kernelization and Gaussian processes
Classification (1/4) - Logistic Regression and Optimization
Error BackPropagation
Generative vs Discriminative Models
Information Theory, Entropy and Kullback-Leibler Divergence (KLD)
Iterative Optimization Algorithms for ML (1/4) - Basics
L1 & L2 Regularization
MLE & Bayesian Series (1/3) - Maximum Likelihood Estimation (MLE)
MLE & Bayesian Series (2/3) - Maximum A Posteriori (MAP)
MLE & Bayesian Series (3/3) - Bayesian Approach
Neural Network (NN) and Representation
Precision, Recall and F1 Score
Principal Component Analysis (PCA) and AutoEncoder (AE)
Regression (1/7) - Linear Regression
Regression (2/7) - Bayesian Linear Regression
Regression (3/7) - Non-linear regression
Resources
Docker
Resources for Torch internal (Autograds +a), CUDA, Compiler and so on
Useless Commands for Code Editors
Useless Github Debugging History
Deep_Generative_Model
(WIP) A Long Way to Deep Generative Models - Variational AutoEncoders (VAEs)
Training_and_Inference_Optimization
(WIP) Async TP
(WIP) Communication Overlap and Gradient/Parameter Bucketing (and Some Profiling and Debugging Logs)
(WIP) Distributed Training for Large Scale Transformer (3/6) - Pipeline Parallelism (PP)
(WIP) Distributed Training for Large Scale Transformer (4/6) - Sequence Parallelism (SP) and Context Parallelism (CP)
(WIP) Distributed Training for Large Scale Transformer (5/6) - Expert Parallelism (EP)
(WIP) Distributed Training for Large Scale Transformer (6/6) - Advanced ZeRO (ZeRO++ and so on)
(WIP) GPU Programming (2/6) - GPU Programming (CPU vs GPU / Parallel Matmul)
(WIP) Theoretical Memory Usage of Large Language Model (LLM)
(almost) Efficient Scaled Dot Product Attention (SDPA)
(yet) (Paper) Context Parallelism for Scalable Million-Token Inference
(yet) CUDA Graph and Torch Compile
(yet) DSPy
(yet) Sparse Matrix
(yet) Understanding NVIDIA GPU Utilization
Distributed Training for Large Scale Transformer (1/6) - Overview of Parallelism, Data Parallelism (DP) and ZeRO
Distributed Training for Large Scale Transformer (2/6) - Tensor Parallelism (TP)
Dynamic Batching (Token Batching) for Sequence Dataset with Variable Lengths
GPU Programming (1/6) - The Graphics Processing Unit (GPU) Revolution
GPU Programming (3/6) - Writing High Performance GPU Kernel using Triton Overview (+Fused Softmax and Fused Xent Examples)
Model FLOPs Utilization (MFU)
Training DNN with Reduced Precision Floating-Point Format
Speech
(ASR) A Long Way To CTC BeamSearch (1)
DeepLearning
(Re) Your mu-Transferred LR Could Not Be Optimal
(WIP) (Paper) Scaling Exponents Across Parameterizations and Optimizers
(WIP) (Sparse) Mixture Of Experts (MoE)
(WIP) Activation Functions and Gated Linear Unit (GLU)
(WIP) Basic Benchmarks for Large Language Modeling (LLM)
(WIP) Deep Dive into Normalization Modules (Layer Normalization (LayerNorm; LN), Root Mean Squared (RMS) Normalization (RMSNorm), and Weight Normalization (WN))
(WIP) How to (Re-)Warmup Pre-trained Model
(WIP) How to measure feature learning? Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA)
(WIP) Iterative Optimization Algorithms for ML (2/4) - Deep Dive Into Adaptive Optimizers (AdaGrad, RMSProp, Adam). (Why It Works? / Importance of Beta 1,2 and epsilon / Adam Variants)
(WIP) Iterative Optimization Algorithms for ML (3/4) - Higher Order Methods For Deep Learning (K-FAC, Shampoo and so on)
(WIP) Neural Tangent Kernel (NTK) and Mean Field Theory (MFT)
(WIP) Relationship between Logit Growth Problems of Deep NN and LayerNorm
(WIP) Rethinking Conventional Wisdom from the LLM perspective
(WIP) Right Way To Scale Neural Networks. (Tensor Program 4 and 5; Maximal Update Parametrization (muP) and Hyperparameter Transfer (muTransfer))
(WIP) Scaling Law for Autoregressive Transformer based Language Model
(WIP) Some Thoughts on Synthetic Datasets and MS Phi Series
(yet) (Paper) Transformers Learn Higher-Order Optimization Methods for In-Context Learning, A Study with Linear Models
(yet) (Paper) Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
(yet) Exponential Moving Average (EMA) for Training Stability
(yet) Low Rank Gradient Updates
(yet) NTK-Aware Scaled RoPE for Long Context
(yet) NanoGPT Speedrun and MomentUm Orthogonalized by Newton-Schulz (Muon)
(yet) Neural Ordinary Differential Equations (NODEs) and Deep Implicit Layers
(yet) Param and Activation Norm Growth and Loss Spike
(yet) Scalable Optimization in the Modular Norm
Cheatsheet For "How To Scale"
Convolution Families
Course Overview of CSC2541 (Topics in ML - Neural Net Training Dynamics)
Critical Batch Size (Large Batch Training Difficulties)
Deep Dive into Low Rank Adaptation (LoRA)
Some Legendary Emails from OGs
Thoughts on Location of Normalization Layer in Transformer (POV residual connection (identity mapping), representation collapse and gradient stability)
What and Why Training Dynamics?
My_Thoughts_and_Somethings_To_Read
An Opinionated Guide to ML Research from John Schulman
Difference between Research Scientist vs Engineer
How To Be Successful? from Sam Altman
Interview List of DxxxMind
The Bitter Lesson from Richard Sutton
What happened on the way to getting a job as an ML researcher
Implementation_and_Debugging
(WIP) GPU Programming (4/6) - Triton Impl of LayerNorm
(WIP) GPU Programming (5/6) - Triton Impl of Fused Attention
(yet) GPU Programming (6/6) Triton Impl of Ring Attention
(yet) Implementation of Distributed Attention (DS-Ulysses)
(yet) PyTorch Impl of Distributed Shampoo
CrossEntropyLoss vs NLL (feat. REINFORCE)
Educational Implementation of Tensor Parallel (TP)
GPU/GRAD_ACCUM/BSZ (4/1/4) vs (2/2/4) Are Not the Same
Gradient Clipping
LLM-RLHF Series (6/6) - Implementation Details of PPO and RLHF
PyTorch Implementation of Variational AutoEncoders (VAEs)
REINFORCE and Actor-Critic
Tutorial on PyTorch Hook
Deep_Reinforcement_Learning
(CS285) Lecture 2 - Supervised Learning of Behaviors
(CS285) Lecture 4 - Introduction to Reinforcement Learning
(CS285) Lecture 5 - Policy Gradients
(CS285) Lecture 6 - Actor-Critic Algorithms
(CS285) Lecture 7 - Value Function Methods
(CS285) Lecture 8 - Deep RL with Q-Functions
(CS285) Lecture 9 - Advanced Policy Gradients
(WIP) (CS285) Lecture 10 - Optimal Control and Planning
(WIP) (CS285) Lecture 20 - Inverse Reinforcement Learning
(WIP) DDPG, TD3 and SAC
(WIP) Deep dive into TRPO and PPO
(WIP) Distributional RL (Categorical DQN (C51), Quantile Regression DQN (QR-DQN) and so on)
(WIP) Off-Policy RL
(WIP) Offline RL (corresponding to CS285 Lec 15 and 16)
(WIP) Thoughts on o1
(yet) (CS285) Lecture 18 - Variational Inference and Generative Models
(yet) (CS285) Lecture 19 - Reframing Control as an Inference Problem
(yet) From AlphaGo to MuZero
RLHF
(Paper) Distributional Preference Learning (DPL)
(yet) Study on Preference based Learning