Online Softmax by Hand
In today's video I want to walk you through three different ways of calculating Softmax, culminating in Online Softmax, and point out the pitfalls to watch out for.
The Softmax calculation is an important operation in neural networks, yet one that is rarely discussed in detail. In today's video I want to run you through different ways of calculating Softmax (Naive, Safe and Online Softmax) and, with that, also lay the foundation for understanding the difference between Self-Attention and FlashAttention, since swapping out Safe Softmax for Online Softmax is one of the important optimizations introduced by FlashAttention.
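If you want to follow along in code, here is a minimal sketch of the three variants in plain Python. The function names are my own and not taken from the video or the paper, and the code assumes a non-empty list of finite floats. The online version follows the general idea of keeping a running maximum and a running normalizer in a single pass, rescaling the normalizer whenever a new maximum shows up.

```python
import math


def naive_softmax(xs):
    # Exponentiate directly. math.exp raises OverflowError for large inputs,
    # which is the classic pitfall of the naive version.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def safe_softmax(xs):
    # Pass 1: find the maximum. Pass 2: sum the shifted exponentials.
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax(xs):
    # Single pass over the input for both the max and the normalizer.
    # Assumes finite inputs; an input of -inf would need special handling.
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer, expressed relative to m
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final pass to produce the normalized outputs.
    return [math.exp(x - m) / d for x in xs]
```

On small inputs such as [1.0, 2.0, 3.0] all three functions should agree up to floating-point rounding. On large inputs such as [1000.0, 1001.0, 1002.0] the naive version fails with an OverflowError, while the safe and online versions still return a valid distribution.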
Sources and References
https://arxiv.org/abs/1805.02867