tiny-llm-demo

tiny-llm-demo - small plain-Python LLM learning demos.
git clone git://git.beep.wimdupont.com/tiny-llm-demo.git

# tiny-llm-demo

A very small language-model project meant to show the raw mechanics behind an
LLM-like system without hiding them behind frameworks.

These are not real LLMs. They are tiny plain-Python demos that progressively
move from a simple neural next-token model toward transformer-like structure.
The point is to make the moving parts visible:

- text becomes tokens
- tokens become vectors
- vectors go through a small neural network
- the network predicts the next token
- training nudges weights so the prediction gets better

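As a minimal sketch of the first step (text becomes tokens), a character-level
vocabulary can be built in a few lines. The sample string and the `encode` /
`decode` helper names are assumptions for illustration; the scripts themselves
read `corpus.txt` and may organize this differently.

```python
# Hypothetical sample text; the real demos read their text from corpus.txt.
text = "hello world"

# Build the vocabulary: one token per distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> string

def encode(s):
    """Turn a string into a list of token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Encoding and then decoding returns the original string, which is a quick
sanity check for any tokenizer.
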
## Files

- `tiny_lm.py`: model, training loop, and text generation
- `tiny_transformer_lm.py`: second version with a minimal self-attention block
- `tiny_modern_lm.py`: forward-only demo of a more modern transformer block
- `corpus.txt`: small training text

## Run

```bash
cd tiny-llm-demo
python3 tiny_lm.py
```

Generate from a custom prompt:

```bash
python3 tiny_lm.py --prompt "language models " --sample-length 200
```

Run the attention-based version:

```bash
python3 tiny_transformer_lm.py
```

Show the learned attention weights over the prompt context:

```bash
python3 tiny_transformer_lm.py --show-attention
```

Inspect a more modern transformer-style block:

```bash
python3 tiny_modern_lm.py
```

## What the code is showing

The model has:

1. A character vocabulary
2. Token embeddings
3. A context window that keeps token positions separate
4. A hidden layer with `tanh`
5. An output layer that predicts the next character

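The five parts listed above can be sketched as a single forward pass. This is
an illustration, not the code from `tiny_lm.py`: the sizes, the weight names,
and the choice to concatenate the embeddings are assumptions.

```python
import math
import random

random.seed(0)

# Illustrative sizes; the real script's values may differ.
vocab_size = 5     # number of distinct characters
context_len = 3    # how many previous characters the model sees
embed_dim = 4      # size of each token vector
hidden_dim = 8     # width of the tanh hidden layer

def rand_matrix(rows, cols):
    """Small random weights, the usual starting point before training."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(vocab_size, embed_dim)
w_hidden = rand_matrix(hidden_dim, context_len * embed_dim)
b_hidden = [0.0] * hidden_dim
w_out = rand_matrix(vocab_size, hidden_dim)
b_out = [0.0] * vocab_size

def matvec(matrix, vector, bias):
    """Multiply a weight matrix by a vector and add a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(matrix, bias)]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(context_ids):
    # Concatenate the embeddings so each position keeps its own slot.
    x = [value for token_id in context_ids for value in embedding[token_id]]
    hidden = [math.tanh(v) for v in matvec(w_hidden, x, b_hidden)]
    logits = matvec(w_out, hidden, b_out)
    return softmax(logits)

probs = forward([0, 3, 1])
print(len(probs), round(sum(probs), 6))
```

With untrained weights the probabilities are close to uniform; training is
what moves them toward the real next character.
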
At each training step:

1. Take a short slice of text as context
2. Ask the model for the next-character probabilities
3. Compare them to the real next character
4. Compute the loss
5. Backpropagate the error
6. Update the weights with SGD

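The steps above can be sketched with a model that is deliberately simpler than
the one in `tiny_lm.py`: a single weight table that maps the previous character
to next-character logits. With no embeddings or hidden layer, backpropagation
reduces to one gradient expression. The toy text, the learning rate, and the
`train_step` name are assumptions.

```python
import math

# Hypothetical toy data: predict each character from the one before it.
text = "abababab"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

# One weight row per previous character; each row holds next-character logits.
weights = [[0.0] * vocab_size for _ in range(vocab_size)]
learning_rate = 0.5

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(prev_id, target_id):
    """One SGD step; returns the loss measured before the update."""
    probs = softmax(weights[prev_id])        # forward pass
    loss = -math.log(probs[target_id])       # cross-entropy loss
    for j in range(vocab_size):
        # Gradient of cross-entropy w.r.t. logit j is probs[j] - one_hot[j].
        grad = probs[j] - (1.0 if j == target_id else 0.0)
        weights[prev_id][j] -= learning_rate * grad   # SGD update
    return loss

pairs = [(stoi[a], stoi[b]) for a, b in zip(text, text[1:])]
losses = []
for epoch in range(20):
    epoch_loss = sum(train_step(p, t) for p, t in pairs) / len(pairs)
    losses.append(epoch_loss)

print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The average loss falls from epoch to epoch because each update raises the
logit of the character that actually came next.
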
This is the same high-level story as a real LLM, just drastically smaller and
with a much simpler architecture.

For the sake of a short demo, the script repeats the small corpus several times
by default so the model can learn visible patterns faster.

## What it is not showing

- a full transformer stack
- multi-head attention
- parallel GPU training
- distributed data loading
- instruction tuning
- RLHF / preference optimization
- inference optimization

Those are important in real systems, but this project is meant to answer the
more basic question: what does it look like before all that complexity is added?

## Why there are three versions

`tiny_lm.py` is the simpler baseline.

- It takes the whole context window and feeds it through a fixed learned network.
- Every position matters only because the code gives it a slot in the input.
- There is no dynamic decision about which earlier token to focus on.

`tiny_transformer_lm.py` is the more transformer-like version.

- Each position gets its own vector.
- The last position forms a query.
- Earlier positions produce keys and values.
- Attention scores decide which earlier positions matter for the next prediction.

That is the conceptual jump toward transformers: instead of using one fixed
computation over the context, the model learns to route information dynamically
based on token-to-token interactions.

In this toy version you can inspect that directly with `--show-attention`,
which prints the weight assigned to each character position in the current
prompt context.

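A minimal sketch of that query/key/value step, not taken from
`tiny_transformer_lm.py`: the function name, the identity projection matrices,
the `sqrt` scaling, and letting every context position (including the last)
supply a key and value are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_from_last(position_vectors, w_query, w_key, w_value):
    """Return (attention weights, mixed value vector) for the last position."""
    query = matvec(w_query, position_vectors[-1])
    keys = [matvec(w_key, vec) for vec in position_vectors]
    values = [matvec(w_value, vec) for vec in position_vectors]
    scale = math.sqrt(len(query))  # scaled dot-product attention
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    mixed = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical 2-dimensional vectors for a 3-character context.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
weights, mixed = attend_from_last(context, identity, identity, identity)
print([round(w, 3) for w in weights])
```

The printed weights sum to 1, and the position whose key best matches the
query gets the largest share. These per-position weights are the kind of
values that `--show-attention` is described as printing.
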
`tiny_modern_lm.py` goes one step closer to a real LLM block.

- layer norm before sublayers
- causal multi-head self-attention
- residual connection after attention
- feed-forward MLP
- residual connection after the MLP
- output projection to next-token logits

This is much closer to the shape of a modern transformer block, but it is
forward-only. The earlier two scripts are the training demos.

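The block shape listed above can be sketched as a forward pass in plain Python.
This is an illustration rather than the contents of `tiny_modern_lm.py`: the
sizes, the parameter names, the ReLU activation, and the layer norm without
learned scale are assumptions, and the final projection to logits is left out
for brevity.

```python
import math
import random

random.seed(0)
d_model, n_heads, seq_len = 4, 2, 3   # illustrative sizes only
head_dim = d_model // n_heads

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(x, w):
    """Apply weight matrix w (out x in) to every position vector in x."""
    return [[sum(wi * xi for wi, xi in zip(row, vec)) for row in w] for vec in x]

def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance."""
    out = []
    for vec in x:
        mean = sum(vec) / len(vec)
        var = sum((v - mean) ** 2 for v in vec) / len(vec)
        out.append([(v - mean) / math.sqrt(var + eps) for v in vec])
    return out

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_multi_head_attention(x, w_q, w_k, w_v, w_o):
    q, k, v = linear(x, w_q), linear(x, w_k), linear(x, w_v)
    out = [[] for _ in x]
    for h in range(n_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim   # this head's slice
        for t in range(len(x)):
            # Causal mask: position t only looks at positions 0..t.
            scores = [
                sum(a * b for a, b in zip(q[t][lo:hi], k[s][lo:hi])) / math.sqrt(head_dim)
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            # Concatenate this head's output onto position t.
            out[t] += [
                sum(w * v[s][lo + i] for s, w in enumerate(weights))
                for i in range(head_dim)
            ]
    return linear(out, w_o)

def add(x, y):
    return [[a + b for a, b in zip(u, w)] for u, w in zip(x, y)]

def block(x, p):
    # Pre-norm attention sublayer with a residual connection.
    x = add(x, causal_multi_head_attention(layer_norm(x), p["q"], p["k"], p["v"], p["o"]))
    # Pre-norm feed-forward sublayer (ReLU here) with a residual connection.
    hidden = [[max(0.0, h) for h in vec] for vec in linear(layer_norm(x), p["up"])]
    return add(x, linear(hidden, p["down"]))

params = {name: rand_matrix(d_model, d_model) for name in ("q", "k", "v", "o")}
params["up"] = rand_matrix(4 * d_model, d_model)
params["down"] = rand_matrix(d_model, 4 * d_model)

x = rand_matrix(seq_len, d_model)   # stand-in for token + position embeddings
y = block(x, params)
print(len(y), len(y[0]))
```

Because of the causal mask, changing the last input position leaves the
outputs for earlier positions unchanged, which is a simple way to check that
no information leaks from the future.
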
## Modern LLM Concepts

If you want a compact mental model, these are the core ideas:

1. Text is tokenized.
2. Tokens become learned vectors called embeddings.
3. Position information is added so order is preserved.
4. The model applies many stacked transformer blocks.
5. Each block uses self-attention so tokens can dynamically weigh other tokens.
6. Each block also uses a feed-forward sublayer to transform each position.
7. Residual connections help information and gradients flow through deep stacks.
8. Layer normalization or RMSNorm helps stabilize training.
9. The final vectors are projected into logits over the vocabulary.
10. Training minimizes next-token prediction error over huge corpora.

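Ideas 2 and 3 can be sketched together. The example assumes learned additive
position embeddings, which is one common scheme among several, and the table
values and the `embed` name are made up for illustration.

```python
# Illustrative tables with made-up values; real models learn these numbers.
token_embedding = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
position_embedding = [[0.1, 0.1], [0.2, -0.2], [-0.3, 0.3]]

def embed(text):
    """Add each token's vector to the vector for its position."""
    return [
        [t + p for t, p in zip(token_embedding[ch], position_embedding[i])]
        for i, ch in enumerate(text)
    ]

vectors = embed("aba")
print(vectors[0])  # "a" at position 0
print(vectors[2])  # "a" at position 2
```

The same character ends up with a different vector at each position, which is
how the later layers can tell the first `a` from the last one.
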
## What Each Script Represents

`tiny_lm.py`

- Shows raw neural next-token training.
- The context is pushed through a fixed learned network.
- Good for understanding random initialization, loss, backprop, and SGD.

`tiny_transformer_lm.py`

- Shows the key jump toward transformer behavior.
- The model uses query/key/value attention to weight context positions dynamically.
- Good for understanding why attention is different from a fixed MLP over context.

`tiny_modern_lm.py`

- Shows the architectural shape of a more modern LLM block.
- Adds multi-head attention, layer norm, feed-forward layers, and residual paths.
- Good for understanding how modern transformer blocks are composed.

## What Is Still Missing From These Toys

Even with the third script, these demos are still missing important parts of a
real modern LLM:

- many stacked transformer blocks instead of one tiny block
- real training for the full modern block
- subword tokenization such as BPE instead of character-level tokens
- large-scale optimization such as AdamW, schedules, clipping, and mixed precision
- large curated datasets and large-scale data pipelines
- efficient inference features such as KV cache, batching, and quantization
- post-training such as instruction tuning and preference optimization
- deployment and serving systems

So the rough ladder is:

- `tiny_lm.py`: neural language model basics
- `tiny_transformer_lm.py`: attention basics
- `tiny_modern_lm.py`: modern transformer block shape

That gets you much closer to a modern LLM conceptually, even though the scale
and training sophistication are still far away.

## Mapping To Real LLM Terms

If you want to connect these demos to standard terminology, use this list:

- `character vocabulary` / `stoi` / `itos`
  Real term: tokenizer vocabulary
  In real systems this is usually a subword vocabulary such as BPE tokens, not characters.

- `token embeddings`
  Real term: embedding layer
  This is the learned lookup table that turns token IDs into dense vectors.

- `positional embeddings`
  Real term: positional encoding / positional embedding
  This gives the model information about token order.

- `next-character` or `next-token` prediction
  Real term: autoregressive language modeling objective
  This is the core pretraining task for a GPT-style model.

- `output projection to logits`
  Real term: LM head
  This maps the final hidden state into scores over the vocabulary.

- `fixed learned network over context` in `tiny_lm.py`
  Real term: not a transformer block
  This is just a baseline neural language model for learning the training loop.

- `query`, `key`, and `value` in `tiny_transformer_lm.py`
  Real term: self-attention mechanism
  This is the core transformer idea that lets tokens weigh other tokens dynamically.

- `multi-head attention` in `tiny_modern_lm.py`
  Real term: multi-head self-attention
  Real models use multiple heads so different relationships can be tracked in parallel.

- `feed-forward MLP`
  Real term: transformer feed-forward network, often called FFN or MLP block
  This is the per-position transformation that sits alongside attention.

- `layer norm`
  Real term: LayerNorm or RMSNorm
  Modern models rely on this for stable deep training.

- `residual connection`
  Real term: residual / skip connection
  This helps preserve information and makes deep optimization work better.

- `stacking many blocks`
  Real term: transformer depth
  A real LLM is many transformer blocks stacked on top of each other.

- `training on corpus.txt`
  Real term: pretraining corpus
  A real LLM uses a vastly larger and more varied dataset.

- `SGD in the toy scripts`
  Real term: optimizer
  Real models usually use AdamW or similar optimizers, not plain SGD.

- `sample generation`
  Real term: inference / decoding
  Real systems add practical features such as KV cache, batching, and quantization.

- `missing instruction tuning`
  Real term: post-training / supervised fine-tuning
  This is the stage where a pretrained model is taught to respond usefully to prompts.

- `missing preference optimization`
  Real term: RLHF, DPO, or related alignment methods
  This is where models are shaped to better match human preferences and policy goals.

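The `sample generation` entry above boils down to an autoregressive loop:
predict a distribution, sample one character, append it, and repeat. The sketch
below uses a hypothetical `toy_probs` function in place of a trained model, and
the `sample_text` name and its parameters are assumptions, not the scripts'
actual interface.

```python
import random

def sample_text(next_char_probs, prompt, sample_length, seed=0):
    """Generate sample_length characters, one at a time, after the prompt."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(sample_length):
        probs = next_char_probs(text)     # dict: character -> probability
        chars = list(probs)
        next_char = rng.choices(chars, weights=[probs[c] for c in chars])[0]
        text += next_char                 # feed the output back in as context
    return text

def toy_probs(text):
    """Hypothetical stand-in for a trained model: favors alternating a and b."""
    if text.endswith("a"):
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.9, "b": 0.1}

print(sample_text(toy_probs, "ab", 10))
```

Real systems keep the same loop but avoid recomputing the whole context at
every step, which is what a KV cache is for.
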
The rough real-world sequence is:

1. Build a tokenizer and vocabulary.
2. Define a transformer architecture.
3. Pretrain it on next-token prediction.
4. Fine-tune or instruction-tune it for useful interaction.
5. Apply preference and safety training.
6. Optimize inference and deploy it.