README.md (8945B)
# tiny-llm-demo

A very small language-model project meant to show the raw mechanics behind an
LLM-like system without hiding them behind frameworks.

These are not real LLMs. They are tiny plain-Python demos that progressively
move from a simple neural next-token model toward transformer-like structure.
The point is to make the moving parts visible:

- text becomes tokens
- tokens become vectors
- vectors go through a small neural network
- the network predicts the next token
- training nudges weights so the prediction gets better

## Files

- `tiny_lm.py`: model, training loop, and text generation
- `tiny_transformer_lm.py`: second version with a minimal self-attention block
- `tiny_modern_lm.py`: forward-only demo of a more modern transformer block
- `corpus.txt`: small training text

## Run

```bash
cd /home/dude/repositories/beep/tiny-llm-demo
python3 tiny_lm.py
```

Generate from a custom prompt:

```bash
python3 tiny_lm.py --prompt "language models " --sample-length 200
```

Run the attention-based version:

```bash
python3 tiny_transformer_lm.py
```

Show the learned attention weights over the prompt context:

```bash
python3 tiny_transformer_lm.py --show-attention
```

Inspect a more modern transformer-style block:

```bash
python3 tiny_modern_lm.py
```

## What the code is showing

The model has:

1. A character vocabulary
2. Token embeddings
3. A context window that keeps token positions separate
4. A hidden layer with `tanh`
5. An output layer that predicts the next character

At each training step:

1. Take a short slice of text as context
2. Ask the model for the next-character probabilities
3. Compare them to the real next character
4. Compute the loss
5. Backpropagate the error
6. Update the weights with SGD

This is the same high-level story as a real LLM, just drastically smaller and
with a much simpler architecture.

For the sake of a short demo, the script repeats the small corpus several times
by default so the model can learn visible patterns faster.

## What it is not showing

- a full transformer stack
- multi-head attention
- parallel GPU training
- distributed data loading
- instruction tuning
- RLHF / preference optimization
- inference optimization

Those are important in real systems, but this project is meant to answer the
more basic question: what does it look like before all that complexity is added?

## Why there are two versions

`tiny_lm.py` is the simpler baseline.

- It takes the whole context window and feeds it through a fixed learned network.
- Every position matters only because the code gives it a slot in the input.
- There is no dynamic decision about which earlier token to focus on.

`tiny_transformer_lm.py` is the more transformer-like version.

- Each position gets its own vector.
- The last position forms a query.
- Earlier positions produce keys and values.
- Attention scores decide which earlier positions matter for the next prediction.

That is the conceptual jump toward transformers: instead of using one fixed
computation over the context, the model learns to route information dynamically
based on token-to-token interactions.

In this toy version you can inspect that directly with `--show-attention`,
which prints the weight assigned to each character position in the current
prompt context.

`tiny_modern_lm.py` goes one step closer to a real LLM block.

- layer norm before sublayers
- causal multi-head self-attention
- residual connection after attention
- feed-forward MLP
- residual connection after the MLP
- output projection to next-token logits

This is much closer to the shape of a modern transformer block, but it is
forward-only. The earlier two scripts are the training demos.

## Modern LLM Concepts

If you want a compact mental model, these are the core ideas:

1. Text is tokenized.
2. Tokens become learned vectors called embeddings.
3. Position information is added so order is preserved.
4. The model applies many stacked transformer blocks.
5. Each block uses self-attention so tokens can dynamically weigh other tokens.
6. Each block also uses a feed-forward sublayer to transform each position.
7. Residual connections help information and gradients flow through deep stacks.
8. Layer normalization or RMSNorm helps stabilize training.
9. The final vectors are projected into logits over the vocabulary.
10. Training minimizes next-token prediction error over huge corpora.

## What Each Script Represents

`tiny_lm.py`

- Shows raw neural next-token training.
- The context is pushed through a fixed learned network.
- Good for understanding random initialization, loss, backprop, and SGD.

`tiny_transformer_lm.py`

- Shows the key jump toward transformer behavior.
- The model uses query/key/value attention to weight context positions dynamically.
- Good for understanding why attention is different from a fixed MLP over context.

`tiny_modern_lm.py`

- Shows the architectural shape of a more modern LLM block.
- Adds multi-head attention, layer norm, feed-forward layers, and residual paths.
- Good for understanding how modern transformer blocks are composed.
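
The query/key/value idea that separates `tiny_transformer_lm.py` from the fixed
baseline can be sketched in a few lines of plain Python. This is an
illustrative sketch, not code taken from the scripts: the names `attend`,
`softmax`, and `dot` are hypothetical, and the vectors are hand-written
stand-ins for what would normally come out of learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-query scaled dot-product attention over context positions."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)  # one weight per context position
    dim = len(values[0])
    # Weighted average of the value vectors, one component at a time.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed


# Hypothetical 2-dimensional vectors for a three-position context.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

weights, mixed = attend(query, keys, values)
print("attention weights:", weights)
print("mixed value vector:", mixed)
```

The weights are recomputed from the query and keys on every call, which is
the dynamic routing the fixed network lacks. Here the first and third
positions score equally against the query and both outweigh the second.
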

## What Is Still Missing From These Toys

Even with the third script, these demos are still missing important parts of a
real modern LLM:

- many stacked transformer blocks instead of one tiny block
- real training for the full modern block
- subword tokenization such as BPE instead of character-level tokens
- large-scale optimization such as AdamW, schedules, clipping, and mixed precision
- large curated datasets and large-scale data pipelines
- efficient inference features such as KV cache, batching, and quantization
- post-training such as instruction tuning and preference optimization
- deployment and serving systems

So the rough ladder is:

- `tiny_lm.py`: neural language model basics
- `tiny_transformer_lm.py`: attention basics
- `tiny_modern_lm.py`: modern transformer block shape

That gets you much closer to a modern LLM conceptually, even though the scale
and training sophistication are still far away.

## Mapping To Real LLM Terms

If you want to connect these demos to standard terminology, use this table:

- `character vocabulary` / `stoi` / `itos`
  Real term: tokenizer vocabulary
  In real systems this is usually a subword vocabulary such as BPE tokens, not characters.

- `token embeddings`
  Real term: embedding layer
  This is the learned lookup table that turns token IDs into dense vectors.

- `positional embeddings`
  Real term: positional encoding / positional embedding
  This gives the model information about token order.

- `next-character` or `next-token` prediction
  Real term: autoregressive language modeling objective
  This is the core pretraining task for a GPT-style model.

- `output projection to logits`
  Real term: LM head
  This maps the final hidden state into scores over the vocabulary.

- `fixed learned network over context` in `tiny_lm.py`
  Real term: not a transformer block
  This is just a baseline neural language model for learning the training loop.

- `query`, `key`, and `value` in `tiny_transformer_lm.py`
  Real term: self-attention mechanism
  This is the core transformer idea that lets tokens weigh other tokens dynamically.

- `multi-head attention` in `tiny_modern_lm.py`
  Real term: multi-head self-attention
  Real models use multiple heads so different relationships can be tracked in parallel.

- `feed-forward MLP`
  Real term: transformer feed-forward network, often called FFN or MLP block
  This is the per-position transformation that sits alongside attention.

- `layer norm`
  Real term: LayerNorm or RMSNorm
  Modern models rely on this for stable deep training.

- `residual connection`
  Real term: residual / skip connection
  This helps preserve information and makes deep optimization work better.

- `stacking many blocks`
  Real term: transformer depth
  A real LLM is many transformer blocks stacked on top of each other.

- `training on corpus.txt`
  Real term: pretraining corpus
  A real LLM uses a vastly larger and more varied dataset.

- `SGD in the toy scripts`
  Real term: optimizer
  Real models usually use AdamW or similar optimizers, not plain SGD.

- `sample generation`
  Real term: inference / decoding
  Real systems add practical features such as KV cache, batching, and quantization.

- `missing instruction tuning`
  Real term: post-training / supervised fine-tuning
  This is the stage where a pretrained model is taught to respond usefully to prompts.

- `missing preference optimization`
  Real term: RLHF, DPO, or related alignment methods
  This is where models are shaped to better match human preferences and policy goals.

The rough real-world sequence is:

1. Build a tokenizer and vocabulary.
2. Define a transformer architecture.
3. Pretrain it on next-token prediction.
4. Fine-tune or instruction-tune it for useful interaction.
5. Apply preference and safety training.
6. Optimize inference and deploy it.
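
Steps 1 and 3 of that sequence can be shown at toy scale. The sketch below is
an assumption-laden illustration, not code from this repository. It builds a
character vocabulary with `stoi` / `itos`, then trains on next-character
prediction with plain SGD. A bigram logits table (one row per previous
character) stands in for the transformer, and `logits_table`, `average_loss`,
and `train_epoch` are hypothetical names.

```python
import math

# Step 1: a character-level "tokenizer" and vocabulary.
text = "hello hello hello"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
ids = [stoi[ch] for ch in text]
vocab_size = len(chars)

# Stand-in model: one row of next-character logits per previous character.
logits_table = [[0.0] * vocab_size for _ in range(vocab_size)]


def softmax(scores):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean cross-entropy of the true next character over the text."""
    total = 0.0
    for prev, nxt in zip(ids, ids[1:]):
        probs = softmax(logits_table[prev])
        total -= math.log(probs[nxt])
    return total / (len(ids) - 1)


def train_epoch(lr=0.5):
    """One SGD pass: for softmax + cross-entropy, dL/dlogit = prob - one_hot."""
    for prev, nxt in zip(ids, ids[1:]):
        probs = softmax(logits_table[prev])
        for j in range(vocab_size):
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits_table[prev][j] -= lr * grad


# Step 3: "pretrain" on next-character prediction.
loss_before = average_loss()
for _ in range(20):
    train_epoch()
loss_after = average_loss()
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

With all logits at zero, every character is equally likely, so the starting
loss equals `log(vocab_size)`. Training pushes it down as the table learns
which character tends to follow which.
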