Categories

MachineLearning

(WIP) Expectation Maximization (EM) vs Variational Inference (VI)

(WIP) Regression (4/7) - Kernelization and Gaussian processes

Classification (1/4) - Logistic Regression and Optimization

Error BackPropagation

Generative vs Discriminative Models

Information Theory, Entropy and Kullback-Leibler Divergence (KLD)

Iterative Optimization Algorithms for ML (1/4) - Basics

L1 & L2 Regularization

MLE & Bayesian Series (1/3) - Maximum Likelihood Estimation (MLE)

MLE & Bayesian Series (2/3) - Maximum A Posteriori (MAP)

MLE & Bayesian Series (3/3) - Bayesian Approach

Neural Network (NN) and Representation

Precision, Recall and F1 Score

Principal Component Analysis (PCA) and AutoEncoder (AE)

Regression (1/7) - Linear Regression

Regression (2/7) - Bayesian Linear Regression

Regression (3/7) - Non-linear regression

Resources

Docker

Resources for Torch internal (Autograds +a), CUDA, Compiler and so on

Useless Commands for Code Editors

Useless Github Debugging History

Deep_Generative_Model

(WIP) A Long Way to Deep Generative Models - Variational AutoEncoders (VAEs)

Training_and_Inference_Optimization

(WIP) Async TP

(WIP) Communication Overlap and Gradient/Parameter Bucketing (and Some Profiling and Debugging Logs)

(WIP) Distributed Training for Large Scale Transformer (3/6) - Pipeline Parallelism (PP)

(WIP) Distributed Training for Large Scale Transformer (4/6) - Sequence Parallelism (SP) and Context Parallelism (CP)

(WIP) Distributed Training for Large Scale Transformer (5/6) - Expert Parallelism (EP)

(WIP) Distributed Training for Large Scale Transformer (6/6) - Advanced ZeRO (ZeRO++ and so on)

(WIP) GPU Programming (2/6) - GPU Programming (CPU vs GPU / Parallel Matmul)

(WIP) Theoretical Memory Usage of Large Language Model (LLM)

(almost) Efficient Scaled Dot Product Attention (SDPA)

(yet) (Paper) Context Parallelism for Scalable Million-Token Inference

(yet) CUDA Graph and Torch Compile

(yet) DSPy

(yet) Sparse Matrix

(yet) Understanding NVIDIA GPU Utilization

Distributed Training for Large Scale Transformer (1/6) - Overview of Parallelism, Data Parallelism (DP) and ZeRO

Distributed Training for Large Scale Transformer (2/6) - Tensor Parallelism (TP)

Dynamic Batching (Token Batching) for Sequence Dataset with Variable Lengths

GPU Programming (1/6) - The Graphics Processing Unit (GPU) Revolution

GPU Programming (3/6) - Writing High Performance GPU Kernel using Triton Overview (+Fused Softmax and Fused Xent Examples)

Model FLOPs Utilization (MFU)

Training DNN with Reduced Precision Floating-Point Format

Speech

(ASR) A Long Way To CTC BeamSearch (1)

DeepLearning

(Re) Your mu-Transferred LR Could Not Be Optimal

(WIP) (Paper) Scaling Exponents Across Parameterizations and Optimizers

(WIP) (Sparse) Mixture Of Experts (MoE)

(WIP) Activation Functions and Gated Linear Unit (GLU)

(WIP) Basic Benchmarks for Large Language Modeling (LLM)

(WIP) Deep Dive into Normalization Modules (Layer Normalization (LayerNorm; LN), Root Mean Squared (RMS) Normalization (RMSNorm), and Weight Normalization (WN))

(WIP) How to (Re-)Warmup Pre-trained Model

(WIP) How to measure feature learning? Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA)

(WIP) Iterative Optimization Algorithms for ML (2/4) - Deep Dive Into Adaptive Optimizers (AdaGrad, RMSProp, Adam). (Why It Works? / Importance of Beta 1,2 and epsilon / Adam Variants)

(WIP) Iterative Optimization Algorithms for ML (3/4) - Higher Order Methods For Deep Learning (K-FAC, Shampoo and so on)

(WIP) Neural Tangent Kernel (NTK) and Mean Field Theory (MFT)

(WIP) Relationship between Logit Growth Problems of Deep NN and LayerNorm

(WIP) Rethinking Conventional Wisdom from the LLM perspective

(WIP) Right Way To Scale Neural Networks. (Tensor Programs 4 and 5; Maximal Update Parametrization (muP) and Hyperparameter Transfer (muTransfer))

(WIP) Scaling Law for Autoregressive Transformer based Language Model

(WIP) Some Thoughts on Synthetic Datasets and MS Phi Series

(yet) (Paper) Transformers Learn Higher-Order Optimization Methods for In-Context Learning, A Study with Linear Models

(yet) (Paper) Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning

(yet) Exponential Moving Average (EMA) for Training Stability

(yet) Low Rank Gradient Updates

(yet) NTK-Aware Scaled RoPE for Long Context

(yet) NanoGPT Speedrun and MomentUm Orthogonalized by Newton-Schulz (Muon)

(yet) Neural Ordinary Differential Equations (NODEs) and Deep Implicit Layers

(yet) Param and Activation Norm Growth and Loss Spike

(yet) Scalable Optimization in the Modular Norm

Cheatsheet For "How To Scale"

Convolution Families

Course Overview of CSC2541 (Topics in ML - Neural Net Training Dynamics)

Critical Batch Size (Large Batch Training Difficulties)

Deep Dive into Low Rank Adaptation (LoRA)

Some Legendary Emails from OGs

Thoughts on Location of Normalization Layer in Transformer (POV residual connection (identity mapping), representation collapse and gradient stability)

What and Why Training Dynamics?

My_Thoughts_and_Somethings_To_Read

An Opinionated Guide to ML Research from John Schulman

Difference between Research Scientist vs Engineer

How To Be Successful? from Sam Altman

Interview List of DxxxMind

The Bitter Lesson from Richard Sutton

What happened on the way to getting a job as an ML researcher

Implementation_and_Debugging

(WIP) GPU Programming (4/6) - Triton Impl of LayerNorm

(WIP) GPU Programming (5/6) - Triton Impl of Fused Attention

(yet) GPU Programming (6/6) Triton Impl of Ring Attention

(yet) Implementation of Distributed Attention (DS-Ulysses)

(yet) Pytorch Impl of Distributed Shampoo

CrossEntropyLoss vs NLL (feat. REINFORCE)

Educational Implementation of Tensor Parallel (TP)

GPU/GRAD_ACCUM/BSZ (4/1/4) vs (2/2/4) Are Not the Same

Gradient Clipping

LLM-RLHF Series (6/6) - Implementation Details of PPO and RLHF

Pytorch Implementation of Variational AutoEncoders (VAEs)

REINFORCE and Actor-Critic

Tutorial on PyTorch Hook

Deep_Reinforcement_Learning

(CS285) Lecture 2 - Supervised Learning of Behaviors

(CS285) Lecture 4 - Introduction to Reinforcement Learning

(CS285) Lecture 5 - Policy Gradients

(CS285) Lecture 6 - Actor-Critic Algorithms

(CS285) Lecture 7 - Value Function Methods

(CS285) Lecture 8 - Deep RL with Q-Functions

(CS285) Lecture 9 - Advanced Policy Gradients

(WIP) (CS285) Lecture 10 - Optimal Control and Planning

(WIP) (CS285) Lecture 20 - Inverse Reinforcement Learning

(WIP) DDPG, TD3 and SAC

(WIP) Deep dive into TRPO and PPO

(WIP) Distributional RL (Categorical DQN (C51), Quantile Regression DQN (QR-DQN) and so on)

(WIP) Off-Policy RL

(WIP) Offline RL (corresponding to CS285 Lec 15 and 16)

(WIP) Thoughts on o1

(yet) (CS285) Lecture 18 - Variational Inference and Generative Models

(yet) (CS285) Lecture 19 - Reframing Control as an Inference Problem

(yet) From AlphaGo to MuZero

RLHF

(Paper) Distributional Preference Learning (DPL)

(yet) Study on Preference based Learning