*** BERT: Bidirectional Encoder Representations from Transformers ***
BERT is one of the most influential NLP models. It was introduced by Google researchers in 2018 (https://arxiv.org/abs/1810.04805).
This three-minute video illustrates how BERT works, with a step-by-step walkthrough of the math behind the algorithm.
Why BERT Is Still Popular Today
Generally speaking, BERT remains popular because it is lightweight compared to GPT models; by pairing a comparatively small architecture with large training data, it can even be fine-tuned on a single GPU. Architecturally, BERT uses a transformer encoder-only design (as Rahul mentioned) that processes the entire sequence of tokens at once. Multi-head attention lets the model learn different patterns with each head; more details on these concepts shortly.
Common use cases for BERT include sentiment classification, chat-bot question answering, next-sentence prediction for e-mails, document summarisation, and fine-grained sense disambiguation.
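To make the first use case concrete, here is a minimal sketch of sentiment classification with a BERT-family model. It assumes the Hugging Face `transformers` library (and a PyTorch backend) is installed and uses the pipeline's default English sentiment checkpoint; none of this tooling is prescribed by the BERT paper itself.

```python
# Minimal sketch: sentiment classification with a BERT-family model via the
# Hugging Face `transformers` pipeline (assumes `pip install transformers torch`).
from transformers import pipeline

# The default sentiment-analysis pipeline downloads a small pre-trained
# BERT-family checkpoint fine-tuned on English sentiment data.
classifier = pipeline("sentiment-analysis")

for text in ["BERT rocks for NLP", "This model is painfully slow"]:
    result = classifier(text)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    print(f"{text!r} -> {result['label']} ({result['score']:.2f})")
```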
BERT stands for Bi-directional Encoder Representations from Transformers. Its four key concepts are: bi-directional, encoder, position embedding, and self-attention.
1. Bi-directional Context
Unlike Word2Vec, which assigns each word a single static vector regardless of context, BERT is a bi-directional, context-based model. In the sentence
> “BERT rocks for NLP,”
BERT calculates attention weights between the token “BERT” and every other word in the sentence. The surrounding context (“rocks for NLP”) tells it that the algorithm sense of “BERT” is more relevant than the character sense, so it correctly interprets “BERT” as the algorithm. This is what helps chat-bots answer questions about BERT accurately.
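A quick way to see the “context-based” point in practice is the sketch below (assuming the Hugging Face `transformers` and `torch` packages and the public `bert-base-uncased` checkpoint): the same word gets different vectors in different sentences, which a static Word2Vec embedding cannot do. The helper `embedding_of` is an illustrative function written for this example, not part of any library.

```python
# Sketch: the same word ("bank") gets different contextual vectors from BERT,
# depending on the words around it.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

a = embedding_of("she sat on the bank of the river", "bank")
b = embedding_of("he deposited cash at the bank", "bank")
c = embedding_of("he deposited money at the bank", "bank")

# The two financial uses of "bank" typically end up closer together than the
# river use, because BERT reads the context on both sides of the word.
print(torch.cosine_similarity(a, b, dim=0))  # typically lower
print(torch.cosine_similarity(b, c, dim=0))  # typically higher
```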
2. Encoder Stack
BERT keeps only the transformer encoder. For each token we feed in a composite embedding (word + segment + position). BERT-Large stacks 24 encoders, 16 attention heads per layer, and 1,024-dimensional vectors, producing a context-rich representation for every word.
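A stripped-down sketch of that composite input embedding in plain PyTorch, using BERT-Large’s published dimensions (30,522 WordPiece tokens, 1,024-dimensional hidden size, 512 positions, 2 segments). The variable names and random token ids are illustrative only, not BERT’s actual parameter names or vocabulary.

```python
# Sketch of BERT's composite input embedding: word + segment + position,
# using BERT-Large dimensions. Names and inputs are illustrative only.
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_POS, SEGMENTS = 30_522, 1_024, 512, 2

word_emb = nn.Embedding(VOCAB, HIDDEN)        # one row per WordPiece token
segment_emb = nn.Embedding(SEGMENTS, HIDDEN)  # sentence A vs. sentence B
position_emb = nn.Embedding(MAX_POS, HIDDEN)  # learned absolute positions

token_ids = torch.randint(0, VOCAB, (1, 5))                # toy 5-token input
segment_ids = torch.zeros_like(token_ids)                  # all "sentence A"
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # 0 .. seq_len-1

# The three embeddings are simply summed before entering the encoder stack
# (BERT additionally applies LayerNorm and dropout to this sum).
x = word_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([1, 5, 1024])
```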
3. Position Embedding
One of the three input embeddings records absolute word positions, which is critical because location in the sequence affects meaning. Later models such as LLaMA replace this with rotary position embeddings (RoPE), but BERT’s learned absolute position embeddings still work well.
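One way to see that BERT’s positions are a learned lookup table rather than a fixed formula is to inspect a pre-trained checkpoint. A minimal sketch, assuming the Hugging Face `transformers` package and the public `bert-base-uncased` weights:

```python
# Sketch: inspect BERT's learned position-embedding table
# (assumes Hugging Face `transformers` and the "bert-base-uncased" weights).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# In the Hugging Face implementation the positions live in an nn.Embedding,
# i.e. a trained lookup table, one row per position up to 512.
pos_table = model.embeddings.position_embeddings.weight
print(pos_table.shape)  # torch.Size([512, 768]) -> 512 positions, 768 dims
```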
4. Self-Attention Mechanics
(Toy example: 4 tokens, 5-dim embeddings)
| Step | What Happens | Intuition / Math |
|---|---|---|
| Query | Q = X Wq | “Which words am I looking at?” |
| Key | K = X Wk | “Which words might be relevant?” |
| Value | V = X Wv | “What content does each word carry?” |
| Score | S = QK^T / √dk | Pair-wise compatibility |
| Weight | softmax(S) | Probability distribution over context |
| Output | Z = weights · V | Context-aware update sent to next sub-layer |
Higher attention weights mean the query word is more influenced by the key word. Because we use the same input for Q, K, and V, this is self-attention—the actual computation happening inside each encoder block.
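The table maps directly to a few lines of code. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention on the toy setup (4 tokens, 5-dimensional embeddings); the random inputs and projection sizes are illustrative stand-ins, not BERT’s real learned weights.

```python
# Minimal single-head self-attention on the toy example: 4 tokens, 5-dim embeddings.
# All matrices are random stand-ins; real BERT weights are learned in pre-training.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 5, 5

X = rng.normal(size=(seq_len, d_model))   # token embeddings (one row per token)
W_q = rng.normal(size=(d_model, d_k))     # query projection
W_k = rng.normal(size=(d_model, d_k))     # key projection
W_v = rng.normal(size=(d_model, d_k))     # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # "looking at" / "relevant" / "content"

S = Q @ K.T / np.sqrt(d_k)                # pair-wise compatibility scores
weights = np.exp(S - S.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax

Z = weights @ V                           # context-aware output, one row per token
print(weights.round(2))                   # each row sums to 1
print(Z.shape)                            # (4, 5)
```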
References
[1] Devlin, J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.