How LLMs Read: Tokenization & Embeddings

For a long time, I’ve kept my learning and building private. Today, I’m changing that.

I will be sharing my deep dives into 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀, 𝗟𝗟𝗠 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲, 𝗮𝗻𝗱 𝗥𝗔𝗚 𝘀𝘆𝘀𝘁𝗲𝗺𝘀. My goal is simple: to connect with fellow builders and give a clear intuition for these concepts to anyone who wants to learn these systems from the ground up.

Let's start at the absolute base layer: 𝙏𝙤𝙠𝙚𝙣𝙞𝙯𝙖𝙩𝙞𝙤𝙣 & 𝙀𝙢𝙗𝙚𝙙𝙙𝙞𝙣𝙜𝙨.

Before an LLM can do geometry, it has to chop up your words.

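𝗦𝘁𝗲𝗽 𝟭: The Chop (Tokenization)

Models don't process the word "unhappiness" as a whole. A tokenizer, typically a sub-word algorithm such as byte-pair encoding, chops it into pieces like [un], [happi], [ness]. The exact split depends on the tokenizer's vocabulary.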

Each piece gets a unique integer ID from a fixed vocabulary.
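
To make the chop concrete, here is a minimal sketch of a tokenizer. It uses greedy longest-match against a tiny hand-made vocabulary, which is a simplification: production LLM tokenizers typically learn their splits with byte-pair encoding or a similar algorithm. The function name `tokenize` and the vocabulary are my own assumptions for illustration.

```python
# Toy vocabulary (hypothetical; real vocabularies hold many thousands of entries).
vocab = {"the": 0, "cat": 1, "sat": 2, "un": 3, "happi": 4, "ness": 5}

def tokenize(word, vocab):
    """Greedy longest-match split of one word into known sub-words."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches here. A real tokenizer would
            # fall back to bytes or an unknown token instead of raising.
            raise ValueError(f"cannot tokenize {word!r} at position {start}")
    return pieces

pieces = tokenize("unhappiness", vocab)
ids = [vocab[p] for p in pieces]
print(pieces)  # ['un', 'happi', 'ness']
print(ids)     # [3, 4, 5]
```

The IDs are just positions in the vocabulary. Swap the vocabulary and the same word can split differently.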

𝗦𝘁𝗲𝗽 𝟮: The Map (Embeddings)

Integers alone aren't enough. Token IDs are arbitrary labels: the fact that ID 4 sits next to ID 5 says nothing about whether those two tokens are related.

The model passes those IDs through an Embedding Layer (a massive lookup table). This table stores one learned vector for every token: a coordinate in a space with hundreds or thousands of dimensions.
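
"Lookup table" is meant literally. The sketch below, with sizes I picked only for the demo (6 tokens, 4 dimensions), shows that calling `nn.Embedding` on a list of IDs returns the same rows you would get by indexing its weight matrix directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical demo sizes: 6 tokens in the vocabulary, 4 numbers per token.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

token_ids = torch.tensor([0, 3, 4, 5])

# Passing IDs through the layer...
looked_up = embedding(token_ids)
# ...returns the same rows as indexing the weight matrix by hand.
indexed = embedding.weight[token_ids]

print(looked_up.shape)                  # torch.Size([4, 4])
print(torch.equal(looked_up, indexed))  # True
```

No arithmetic happens in the lookup itself. The learning lives entirely in the values stored in those rows.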

On a trained map, the coordinate for "𝗞𝗶𝗻𝗴" sits close to "𝗤𝘂𝗲𝗲𝗻". Instead of counting words, the model works with geometric distances between concepts, often measured as cosine similarity.
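
Here is a rough sketch of what "close" means in practice, using cosine similarity. The three vectors are hand-picked by me for illustration and are not taken from any real model; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked 3-D vectors (hypothetical values, for illustration only).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])

print(f"king vs queen: {sim_king_queen:.2f}")  # about 0.99
print(f"king vs apple: {sim_king_apple:.2f}")  # about 0.16
```

A score near 1 means the two vectors point in almost the same direction. A score near 0 means they are unrelated on this toy map.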

To truly understand it, you have to build it. Here is the PyTorch code mapping text to vectors from scratch.

import torch
import torch.nn as nn

# In modern LLMs (like Llama 3), we use sub-word tokenization
# to handle complex or rare words efficiently.
vocab = {"the": 0, "cat": 1, "sat": 2, "un": 3, "happi": 4, "ness": 5}

# Raw input text
raw_text = "the unhappiness"

# A real tokenizer would chop 'unhappiness' into known sub-words: [un], [happi], [ness].
# For this demo the split is written out by hand.
tokens = ["the", "un", "happi", "ness"]

# Every sub-word is mapped to its unique integer ID from the vocabulary
token_ids = torch.tensor([vocab[token] for token in tokens])

print(f"Input Text: {raw_text}")
print(f"Token IDs:  {token_ids.tolist()}")
# Result: [0, 3, 4, 5]

# The Embedding Layer is essentially a massive lookup table (Weight Matrix).
# num_embeddings = Size of Vocabulary (6)
# embedding_dim = Number of dimensions per token (4 for this demo)
torch.manual_seed(42)
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

# We can view the raw weight matrix. It starts out random (seeded above);
# training is what adjusts these values.
print("Raw Embedding Weights (W_e):")
print(embedding_layer.weight.detach())

# Each row above represents the 'coordinate' of a specific token.

# We pass our Token IDs through the layer to retrieve their vectors.
# This is a pure row lookup: the vectors carry no meaning yet because
# the weights are still random and untrained.
dense_vectors = embedding_layer(token_ids)

print("Token Vectors (The 'Geometry', before training):")
# Each token now has a 4-dimensional representation
for i, token in enumerate(tokens):
    print(f"{token: <6} : {dense_vectors[i].detach().numpy()}")

# TAKEAWAY: After training, tokens with similar meanings tend to have
# coordinates that sit close to each other in this high-dimensional space.
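
How do coordinates end up close? Gradient descent moves them. The sketch below is a toy, not how LLMs are actually trained: I assume "cat" and "sat" should be related and use a made-up loss (1 minus cosine similarity) to pull their vectors together. In a real model the signal comes from next-token prediction over large amounts of text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
vocab = {"the": 0, "cat": 1, "sat": 2, "un": 3, "happi": 4, "ness": 5}
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)
optimizer = torch.optim.SGD(embedding_layer.parameters(), lr=0.1)

# Hypothetical training signal: treat "cat" and "sat" as a related pair.
pair = torch.tensor([vocab["cat"], vocab["sat"]])

def similarity():
    a, b = embedding_layer(pair)  # two rows of the lookup table
    return F.cosine_similarity(a, b, dim=0)

before = similarity().item()
for _ in range(100):
    optimizer.zero_grad()
    loss = 1.0 - similarity()  # toy loss: smaller when the vectors align
    loss.backward()
    optimizer.step()
after = similarity().item()

print(f"cosine similarity before: {before:.3f}")
print(f"cosine similarity after:  {after:.3f}")
```

Only the two rows in the pair receive gradients, so the other tokens stay where they started.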

