© 2024. All rights reserved.
Notes
Categories
MachineLearning
(WIP) Expectation Maximization (EM) vs Variational Inference (VI)
(WIP) Regression (4/7) - Kernelization and Gaussian processes
Classification (1/4) - Logistic Regression and Optimization
Error BackPropagation
Generative vs Discriminative Models
Information Theory, Entropy and Kullback-Leibler Divergence (KLD)
Iterative Optimization Algorithms for ML (1/4) - Basics
L1 & L2 Regularization
MLE & Bayesian Series (1/3) - Maximum Likelihood Estimation (MLE)
MLE & Bayesian Series (2/3) - Maximum A Posteriori (MAP)
MLE & Bayesian Series (3/3) - Bayesian Approach
Neural Network (NN) and Representation
Precision, Recall and F1 Score
Principal Component Analysis (PCA) and AutoEncoder (AE)
Regression (1/7) - Linear Regression
Regression (2/7) - Bayesian Linear Regression
Regression (3/7) - Non-linear regression
Resources
Docker
Resources for Torch internal (Autograds +a), CUDA, Compiler and so on
Useless Commands for Code Editors
Useless Github Debugging History
Deep_Generative_Model
(WIP) A Long Way to Deep Generative Models - Variational AutoEncoders (VAEs)
Training_and_Inference_Optimization
(WIP) Distributed Training for Large Scale Transformer (3/6) - Pipeline Parallelism (PP)
(WIP) Distributed Training for Large Scale Transformer (4/6) - Sequence Parallelism (SP) and Context Parallelism (CP)
(WIP) Distributed Training for Large Scale Transformer (5/6) - Expert Parallelism (EP)
(WIP) Distributed Training for Large Scale Transformer (6/6) - Advanced ZeRO (ZeRO++ and so on)
(WIP) GPU Programming (2/6) - GPU Programming (CPU vs GPU / Parallel Matmul)
(almost) Efficient Scaled Dot Product Attention (SDPA)
(yet) Parallel Blocks
(yet) Sparse Matrix
(yet) Understanding NVIDIA GPU Utilization
Distributed Training for Large Scale Transformer (1/6) - Overview of Parallelism, Data Parallelism (DP) and ZeRO
Distributed Training for Large Scale Transformer (2/6) - Tensor Parallelism (TP)
Dynamic Batching (Token Batching) for Sequence Dataset with Variable Lengths
GPU Programming (1/6) - The Graphics Processing Unit (GPU) Revolution
GPU Programming (3/6) - Writing High Performance GPU Kernel using Triton Overview (+Fused Softmax and Fused Xent Examples)
Model FLOPs Utilization (MFU)
Training DNN with Reduced Precision Floating-Point Format
Speech
(ASR) A Long Way To CTC BeamSearch (1)
DeepLearning
(Re) Your mu-Transferred LR Could Not Be Optimal
(WIP) (Paper) Scaling Exponents Across Parameterizations and Optimizers
(WIP) (Sparse) Mixture Of Experts (MoE)
(WIP) Activation Functions and Gated Linear Unit (GLU)
(WIP) Basic Benchmarks for Large Language Modeling (LLM)
(WIP) Deep Dive into Normalization Modules (Layer Normalization (LayerNorm; LN), Root Mean Squared (RMS) Normalization (RMSNorm), and Weight Normalization (WN))
(WIP) How to measure feature learning? Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA)
(WIP) Iterative Optimization Algorithms for ML (2/4) - Deep Dive Into Adaptive Optimizers (AdaGrad, RMSProp, Adam). (Why It Works? / Importance of Beta 1,2 and epsilon / Adam Variants)
(WIP) Neural Tangent Kernel (NTK) and Mean Field Theory (MFT)
(WIP) Recent Promising GPT Variants (Diff Transformer and nGPT)
(WIP) Relationship between Logit Growth Problems of Deep NN and LayerNorm
(WIP) Rethinking Conventional Wisdom from the LLM perspective
(WIP) Right Way To Scale Neural Networks. (Tensor Program 4 and 5; Maximal Update Parametrization (muP) and Hyperparameter Transfer (muTransfer))
(WIP) Scaling Law for Autoregressive Transformer based Language Model
(yet) (Paper) Multi Token Prediction (MTP)
(yet) (Paper) Transformers Learn Higher-Order Optimization Methods for In-Context Learning, A Study with Linear Models
(yet) (Paper) Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
(yet) Exponential Moving Average (EMA) for Training Stability
(yet) Low Rank Gradient Updates
(yet) NTK-Aware Scaled RoPE for Long Context
(yet) Neural Ordinary Differential Equations (NODEs) and Deep Implicit Layers
(yet) Param and Activation Norm Growth and Loss Spike
(yet) Physics of Language Models
(yet) Scalable Optimization in the Modular Norm
Cheatsheet For "How To Scale"
Convolution Families
Course Overview of CSC2541 (Topics in ML - Neural Net Training Dynamics)
Critical Batch Size (Large Batch Training Difficulties)
Deep Dive into Low Rank Adaptation (LoRA)
What and Why Training Dynamics?
My_Thoughts_and_Somethings_To_Read
(WIP) Is AGI near? (About Machines of Loving Grace from Dario Amodei and MLE-bench of OpenAI)
An Opinionated Guide to ML Research from John Schulman
Difference between Research Scientist vs Engineer
How To Be Successful? from Sam Altman
Interview List of DxxxMind
The Bitter Lesson from Richard Sutton
What happened on the way to getting a job as a ML researcher
Implementation_and_Debugging
(WIP) GPU Programming (4/6) - Triton Impl of LayerNorm
(WIP) GPU Programming (5/6) - Triton Impl of Fused Attention
(WIP) Learn from pytorch/lingua (Learning How To 3D Parallelize LLaMa-3 With Pytorch's Native Parallelism Modules and so on)
(yet) GPU Programming (6/6) Triton Impl of Ring Attention
(yet) Implementation of Distributed Attention (DS-Ulysses)
(yet) Pytorch Impl of Distributed Shampoo
CrossEntropyLoss vs NLL (feat. REINFORCE)
Educational Implementation of Tensor Parallel (TP)
GPU/GRAD_ACCUM/BSZ (4/1/4) vs (2/2/4) Is Not Same
Gradient Clipping
LLM-RLHF Series (6/6) - Implementation Details of PPO and RLHF
Pytorch Implementation of Variational AutoEncoders (VAEs)
REINFORCE and Actor-Critic
Tutorial on PyTorch Hook
Deep_Reinforcement_Learning
(CS285) Lecture 2 - Supervised Learning of Behaviors
(CS285) Lecture 4 - Introduction to Reinforcement Learning
(CS285) Lecture 5 - Policy Gradients
(CS285) Lecture 6 - Actor-Critic Algorithms
(CS285) Lecture 7 - Value Function Methods
(CS285) Lecture 8 - Deep RL with Q-Functions
(CS285) Lecture 9 - Advanced Policy Gradients
(WIP) (CS285) Lecture 10 - Optimal Control and Planning
(WIP) (CS285) Lecture 20 - Inverse Reinforcement Learning
(WIP) DDPG, TD3 and SAC
(WIP) Deep dive into TRPO and PPO
(WIP) Distributional RL (Categorical DQN (C51), Quantile Regression DQN (QR-DQN) and so on)
(WIP) Off-Policy RL
(WIP) Offline RL (corresponding to CS285 Lec 15 and 16)
(yet) (CS285) Lecture 18 - Variational Inference and Generative Models
(yet) (CS285) Lecture 19 - Reframing Control as an Inference Problem
(yet) From AlphaGo to MuZero
(yet) Learn From o1 (how to make LLM reason and plan?)
RLHF
(Paper) Distributional Preference Learning (DPL)
(yet) Study on Preference based Learning