# tiny-llm-demo

A very small language-model project meant to show the raw mechanics behind an
LLM-like system without hiding them behind frameworks.

These are not real LLMs. They are tiny plain-Python demos that progressively
move from a simple neural next-token model toward transformer-like structure.
The point is to make the moving parts visible:

- text becomes tokens
- tokens become vectors
- vectors go through a small neural network
- the network predicts the next token
- training nudges weights so the prediction gets better

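
To make the first two bullets concrete, here is a minimal sketch of a
character-level vocabulary and an embedding lookup. The names `stoi` and `itos`
appear later in this README; `encode`, `decode`, the sample text, and the
embedding size are assumptions made for illustration, not code copied from
`tiny_lm.py`.

```python
import random

random.seed(0)
text = "hello world"

# Character-level vocabulary: every distinct character is one token.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> string


def encode(s):
    """Turn text into a list of token IDs."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn token IDs back into text."""
    return "".join(itos[i] for i in ids)


# Embedding table: one small random vector per token (size 4 is arbitrary).
embed_dim = 4
embedding = [[random.uniform(-0.1, 0.1) for _ in range(embed_dim)] for _ in chars]

ids = encode("hello")
vectors = [embedding[i] for i in ids]  # tokens become vectors

print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

The embedding values start out random; training is what makes them useful.
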
## Files

- `tiny_lm.py`: model, training loop, and text generation
- `tiny_transformer_lm.py`: second version with a minimal self-attention block
- `tiny_modern_lm.py`: forward-only demo of a more modern transformer block
- `corpus.txt`: small training text

## Run

```bash
cd /home/dude/repositories/beep/tiny-llm-demo
python3 tiny_lm.py
```

Generate from a custom prompt:

```bash
python3 tiny_lm.py --prompt "language models " --sample-length 200
```

Run the attention-based version:

```bash
python3 tiny_transformer_lm.py
```

Show the learned attention weights over the prompt context:

```bash
python3 tiny_transformer_lm.py --show-attention
```

Inspect a more modern transformer-style block:

```bash
python3 tiny_modern_lm.py
```

## What the code is showing

The model has:

1. A character vocabulary
2. Token embeddings
3. A context window that keeps token positions separate
4. A hidden layer with `tanh`
5. An output layer that predicts the next character

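
The five parts listed above can be sketched as a single forward pass. This is
an illustration under assumptions, not the code in `tiny_lm.py`: the sizes, the
weight names (`E`, `W1`, `W2`), and the choice to concatenate the context
embeddings so each position keeps its own input slot are all hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
vocab_size, context_size, embed_dim, hidden_dim = 5, 3, 4, 8


def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


E = rand_matrix(vocab_size, embed_dim)                  # token embeddings
W1 = rand_matrix(hidden_dim, context_size * embed_dim)  # hidden layer weights
b1 = [0.0] * hidden_dim
W2 = rand_matrix(vocab_size, hidden_dim)                # output layer weights
b2 = [0.0] * vocab_size


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_ids):
    # Concatenate embeddings so each context position keeps its own slot.
    x = [v for tok in context_ids for v in E[tok]]
    # Hidden layer with tanh.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output layer: one score (logit) per vocabulary entry.
    logits = [sum(w * hi for w, hi in zip(row, h)) + b
              for row, b in zip(W2, b2)]
    return softmax(logits)


probs = forward([0, 3, 1])
print(len(probs), sum(probs))  # one probability per token, summing to about 1
```
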
At each training step:

1. Take a short slice of text as context
2. Ask the model for the next-character probabilities
3. Compare them to the real next character
4. Compute the loss
5. Backpropagate the error
6. Update the weights with SGD

This is the same high-level story as a real LLM, just drastically smaller and
with a much simpler architecture.

For the sake of a short demo, the script repeats the small corpus several times
by default so the model can learn visible patterns faster.

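
The six training steps described in this section can be sketched as one
function. This is a hedged illustration, not the training loop from
`tiny_lm.py`: the sizes, the learning rate, the cross-entropy loss, and every
name here (`train_step`, `E`, `W1`, `W2`) are assumptions. It trains on a
single made-up example so the effect of the updates is easy to see.

```python
import math
import random

random.seed(0)
V, C, D, H = 4, 2, 3, 5  # vocab size, context length, embedding dim, hidden dim
LR = 0.1                 # hypothetical learning rate


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


E, W1, W2 = rand_matrix(V, D), rand_matrix(H, C * D), rand_matrix(V, H)
b1, b2 = [0.0] * H, [0.0] * V


def train_step(context, target):
    # Steps 1-2: forward pass from context to next-token probabilities.
    x = [v for tok in context for v in E[tok]]
    h = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(H)]
    logits = [sum(w * hj for w, hj in zip(W2[k], h)) + b2[k] for k in range(V)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Steps 3-4: cross-entropy loss against the real next token.
    loss = -math.log(probs[target])

    # Step 5: backpropagate (all gradients use the pre-update weights).
    dlogits = probs[:]
    dlogits[target] -= 1.0  # gradient of softmax + cross-entropy w.r.t. logits
    dh = [sum(dlogits[k] * W2[k][j] for k in range(V)) for j in range(H)]
    dpre = [dh[j] * (1.0 - h[j] ** 2) for j in range(H)]  # tanh'(a) = 1 - tanh(a)^2
    dx = [sum(dpre[j] * W1[j][i] for j in range(H)) for i in range(C * D)]

    # Step 6: plain SGD update.
    for k in range(V):
        for j in range(H):
            W2[k][j] -= LR * dlogits[k] * h[j]
        b2[k] -= LR * dlogits[k]
    for j in range(H):
        for i in range(C * D):
            W1[j][i] -= LR * dpre[j] * x[i]
        b1[j] -= LR * dpre[j]
    for pos, tok in enumerate(context):
        for d in range(D):
            E[tok][d] -= LR * dx[pos * D + d]
    return loss


losses = [train_step([0, 1], 2) for _ in range(50)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Repeating the step on one example should push the loss down, which is the
"training nudges weights" idea in miniature.
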
## What it is not showing

- a full transformer stack
- trained multi-head attention (`tiny_modern_lm.py` only runs it forward)
- parallel GPU training
- distributed data loading
- instruction tuning
- RLHF / preference optimization
- inference optimization

Those are important in real systems, but this project is meant to answer the
more basic question: what does it look like before all that complexity is added?

## Why there are three versions

`tiny_lm.py` is the simpler baseline.

- It takes the whole context window and feeds it through a fixed learned network.
- Every position matters only because the code gives it a slot in the input.
- There is no dynamic decision about which earlier token to focus on.

`tiny_transformer_lm.py` is the more transformer-like version.

- Each position gets its own vector.
- The last position forms a query.
- Earlier positions produce keys and values.
- Attention scores decide which earlier positions matter for the next prediction.

That is the conceptual jump toward transformers: instead of using one fixed
computation over the context, the model learns to route information dynamically
based on token-to-token interactions.

In this toy version you can inspect that directly with `--show-attention`,
which prints the weight assigned to each character position in the current
prompt context.

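
The query/key/value routing described above can be sketched in a few lines of
plain Python. This is an illustration, not the code in
`tiny_transformer_lm.py`. Three things are assumptions: the last position is
included among the keys and values, the scores are scaled by the square root of
the vector size (the usual scaled dot-product form), and all names (`attend`,
`Wq`, `Wk`, `Wv`) are hypothetical.

```python
import math
import random

random.seed(0)
D = 4  # vector size per position (arbitrary)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


Wq, Wk, Wv = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def attend(position_vectors):
    """Single-head attention in which only the last position forms a query."""
    query = matvec(Wq, position_vectors[-1])
    keys = [matvec(Wk, x) for x in position_vectors]
    values = [matvec(Wv, x) for x in position_vectors]
    # One score per context position: how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(D) for key in keys]
    weights = softmax(scores)
    # Weighted mix of the values: the information routed to the prediction.
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(D)]
    return weights, mixed


context = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]
weights, mixed = attend(context)
print([round(w, 3) for w in weights])  # one weight per context position
```
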
`tiny_modern_lm.py` goes one step closer to a real LLM block.

- layer norm before sublayers
- causal multi-head self-attention
- residual connection after attention
- feed-forward MLP
- residual connection after the MLP
- output projection to next-token logits

This is much closer to the shape of a modern transformer block, but it is
forward-only. The earlier two scripts are the training demos.

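
The six bullets above map onto a forward pass roughly like the following
sketch. It is not the code in `tiny_modern_lm.py`. The sizes, the weight names,
the ReLU activation, the 4x MLP width, and the layer norm without learned scale
and shift are all assumptions made to keep the example short.

```python
import math
import random

random.seed(0)
T, D, HEADS, V = 4, 8, 2, 10  # sequence length, model width, heads, vocab size
HD = D // HEADS               # width of each attention head


def rand_matrix(rows, cols):
    return [[random.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


def linear(X, W):
    """Apply weight matrix W (out x in) to every row of X."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in W] for vec in X]


def layer_norm(X, eps=1e-5):
    out = []
    for vec in X:
        mean = sum(vec) / len(vec)
        var = sum((v - mean) ** 2 for v in vec) / len(vec)
        out.append([(v - mean) / math.sqrt(var + eps) for v in vec])
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add(X, Y):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(X, Y)]


Wq, Wk, Wv, Wo = (rand_matrix(D, D) for _ in range(4))
W_up, W_down = rand_matrix(4 * D, D), rand_matrix(D, 4 * D)
W_lm = rand_matrix(V, D)


def causal_attention(X):
    Q, K, Vals = linear(X, Wq), linear(X, Wk), linear(X, Wv)
    out = [[] for _ in range(len(X))]
    for h in range(HEADS):
        s = slice(h * HD, (h + 1) * HD)  # this head's slice of the width
        for t in range(len(X)):
            # Causal mask: position t only sees positions 0..t.
            scores = [sum(q * k for q, k in zip(Q[t][s], K[u][s])) / math.sqrt(HD)
                      for u in range(t + 1)]
            w = softmax(scores)
            out[t] += [sum(w[u] * Vals[u][s][d] for u in range(t + 1))
                       for d in range(HD)]
    return linear(out, Wo)  # mix the concatenated heads


def block(X):
    X = add(X, causal_attention(layer_norm(X)))  # norm, attention, residual
    hidden = [[max(0.0, v) for v in row] for row in linear(layer_norm(X), W_up)]
    X = add(X, linear(hidden, W_down))           # norm, MLP, residual
    return linear(X, W_lm)                       # next-token logits per position


X = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
logits = block(X)
print(len(logits), len(logits[0]))  # 4 10
```

Because of the causal mask, changing a later position's input leaves the logits
at earlier positions unchanged.
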
## Modern LLM Concepts

If you want a compact mental model, these are the core ideas:

1. Text is tokenized.
2. Tokens become learned vectors called embeddings.
3. Position information is added so order is preserved.
4. The model applies many stacked transformer blocks.
5. Each block uses self-attention so tokens can dynamically weigh other tokens.
6. Each block also uses a feed-forward sublayer to transform each position.
7. Residual connections help information and gradients flow through deep stacks.
8. Layer normalization or RMSNorm helps stabilize training.
9. The final vectors are projected into logits over the vocabulary.
10. Training minimizes next-token prediction error over huge corpora.

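
Item 8 names two normalizers. As a rough sketch of how they differ, the
functions below normalize one vector each way. Real implementations also apply
a learned per-dimension scale (and, for layer norm, usually a learned shift);
both are left out here, and the function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def rms_norm(vec, eps=1e-5):
    """Divide by the root mean square; the mean is not subtracted."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in layer_norm(x)])  # centered around zero
print([round(v, 3) for v in rms_norm(x)])    # rescaled, signs and ratios kept
```

RMSNorm skips the mean subtraction, so it is slightly cheaper and keeps the
ratios between the entries of the vector.
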
## What Each Script Represents

`tiny_lm.py`

- Shows raw neural next-token training.
- The context is pushed through a fixed learned network.
- Good for understanding random initialization, loss, backprop, and SGD.

`tiny_transformer_lm.py`

- Shows the key jump toward transformer behavior.
- The model uses query/key/value attention to weight context positions dynamically.
- Good for understanding why attention is different from a fixed MLP over context.

`tiny_modern_lm.py`

- Shows the architectural shape of a more modern LLM block.
- Adds multi-head attention, layer norm, feed-forward layers, and residual paths.
- Good for understanding how modern transformer blocks are composed.

## What Is Still Missing From These Toys

Even with the third script, these demos are still missing important parts of a
real modern LLM:

- many stacked transformer blocks instead of one tiny block
- real training for the full modern block
- subword tokenization such as BPE instead of character-level tokens
- large-scale optimization such as AdamW, schedules, clipping, and mixed precision
- large curated datasets and large-scale data pipelines
- efficient inference features such as KV cache, batching, and quantization
- post-training such as instruction tuning and preference optimization
- deployment and serving systems

So the rough ladder is:

- `tiny_lm.py`: neural language model basics
- `tiny_transformer_lm.py`: attention basics
- `tiny_modern_lm.py`: modern transformer block shape

That gets you much closer to a modern LLM conceptually, even though the scale
and training sophistication are still far away.

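
Of the missing pieces listed in this section, the optimizer is the easiest to
sketch in plain Python. The example below puts a plain SGD step next to an
AdamW step on a toy quadratic. It is not part of this repository: the class,
the function names, and the toy objective are hypothetical, and the
hyperparameter defaults are just commonly used values.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain SGD: move each weight against its gradient."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


class AdamW:
    """Adam with decoupled weight decay, kept minimal for illustration."""

    def __init__(self, size, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 weight_decay=0.01):
        self.lr, self.b1, self.b2 = lr, beta1, beta2
        self.eps, self.wd = eps, weight_decay
        self.m = [0.0] * size  # running mean of gradients
        self.v = [0.0] * size  # running mean of squared gradients
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        out = []
        for i, (wi, gi) in enumerate(zip(w, grad)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * gi
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * gi * gi
            m_hat = self.m[i] / (1 - self.b1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            wi = wi - self.lr * self.wd * wi             # decoupled weight decay
            out.append(wi - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return out


# Toy objective f(w) = sum(w_i^2), whose gradient is 2 * w.
w_sgd, w_adam = [1.0, -2.0], [1.0, -2.0]
opt = AdamW(size=2, lr=0.1)
for _ in range(100):
    w_sgd = sgd_step(w_sgd, [2 * wi for wi in w_sgd])
    w_adam = opt.step(w_adam, [2 * wi for wi in w_adam])

print([round(x, 4) for x in w_sgd])
print([round(x, 4) for x in w_adam])
```

AdamW scales each weight's step by a running estimate of that weight's gradient
size, which matters far more in deep networks than on a toy problem like this
one.
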
## Mapping To Real LLM Terms

If you want to connect these demos to standard terminology, use this mapping:

- `character vocabulary` / `stoi` / `itos`
  Real term: tokenizer vocabulary
  In real systems this is usually a subword vocabulary such as BPE tokens, not characters.

- `token embeddings`
  Real term: embedding layer
  This is the learned lookup table that turns token IDs into dense vectors.

- `positional embeddings`
  Real term: positional encoding / positional embedding
  This gives the model information about token order.

- `next-character` or `next-token` prediction
  Real term: autoregressive language modeling objective
  This is the core pretraining task for a GPT-style model.

- `output projection to logits`
  Real term: LM head
  This maps the final hidden state into scores over the vocabulary.

- `fixed learned network over context` in `tiny_lm.py`
  Real term: fixed-window feed-forward neural language model (not a transformer block)
  This is just a baseline neural language model for learning the training loop.

- `query`, `key`, and `value` in `tiny_transformer_lm.py`
  Real term: self-attention mechanism
  This is the core transformer idea that lets tokens weigh other tokens dynamically.

- `multi-head attention` in `tiny_modern_lm.py`
  Real term: multi-head self-attention
  Real models use multiple heads so different relationships can be tracked in parallel.

- `feed-forward MLP`
  Real term: transformer feed-forward network, often called FFN or MLP block
  This is the per-position transformation that follows attention in each block.

- `layer norm`
  Real term: LayerNorm or RMSNorm
  Modern models rely on this for stable deep training.

- `residual connection`
  Real term: residual / skip connection
  This helps preserve information and makes deep optimization work better.

- `stacking many blocks`
  Real term: transformer depth
  A real LLM is many transformer blocks stacked on top of each other.

- `training on corpus.txt`
  Real term: pretraining corpus
  A real LLM uses a vastly larger and more varied dataset.

- `SGD in the toy scripts`
  Real term: optimizer
  Real models usually use AdamW or similar optimizers, not plain SGD.

- `sample generation`
  Real term: inference / decoding
  Real systems add practical features such as KV cache, batching, and quantization.

- `missing instruction tuning`
  Real term: post-training / supervised fine-tuning
  This is the stage where a pretrained model is taught to respond usefully to prompts.

- `missing preference optimization`
  Real term: RLHF, DPO, or related alignment methods
  This is where models are shaped to better match human preferences and policy goals.

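
The `sample generation` entry above refers to an autoregressive decoding loop.
Here is a minimal sketch of one with temperature sampling. The function names,
the `temperature` parameter, and the stand-in `toy_logits` model are
assumptions for illustration; the README does not say which sampling options
the scripts themselves expose.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token ID from logits after temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_logits, prompt_ids, steps, context_size, temperature=1.0,
             rng=random):
    """Autoregressive loop: predict, sample, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(steps):
        context = ids[-context_size:]  # the model only sees a fixed window
        ids.append(sample_next(next_logits(context), temperature, rng))
    return ids


def toy_logits(context):
    # Hypothetical stand-in model over a 3-token vocabulary: it strongly
    # prefers the token after the most recent one (0 -> 1 -> 2 -> 0).
    favorite = (context[-1] + 1) % 3
    return [5.0 if i == favorite else 0.0 for i in range(3)]


rng = random.Random(0)
print(generate(toy_logits, [0], steps=8, context_size=4, temperature=0.5, rng=rng))
```
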
The rough real-world sequence is:

1. Build a tokenizer and vocabulary.
2. Define a transformer architecture.
3. Pretrain it on next-token prediction.
4. Fine-tune or instruction-tune it for useful interaction.
5. Apply preference and safety training.
6. Optimize inference and deploy it.
