val_bpb: 1.1659
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,070,662 bytes
Training Techniques
Architecture
Memory Tokens
64 learnable embedding vectors overwrite the first K positions of each sequence, giving the model a shared global-context scratchpad that every later position can read via causal attention.
parameters: {"tokens":64}
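A minimal sketch of the overwrite variant described above; names and shapes are illustrative, not the submission's actual code:

```python
import numpy as np

def apply_memory_tokens(x, memory):
    # x: (seq, dim) token embeddings; memory: (K, dim) learned vectors.
    # The first K positions are overwritten with the shared scratchpad;
    # causal attention then lets every later position attend to it.
    out = x.copy()
    out[: memory.shape[0]] = memory
    return out
```

In training, `memory` would be a learnable parameter updated by the optimizer like any other embedding.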
MLP3x
The MLP hidden width is 3x the model dimension, rather than the conventional 4x.
parameters: {"multiplier":3}
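The block reduces to a standard two-matrix MLP with a narrower hidden layer; this sketch uses ReLU as a stand-in for whatever activation the submission actually uses:

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    # w_in: (d, 3*d), w_out: (3*d, d) -- hidden width is 3x d_model
    # instead of the conventional 4x.
    return np.maximum(x @ w_in, 0.0) @ w_out
```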
BigramHash
Hashes each consecutive token pair to an index into a BigramHashEmbedding table, injecting local bigram context at the embedding level.
parameters: {"vocab_size":10240}
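One plausible indexing scheme, assuming a simple multiplicative hash (the constant and the sentinel for position 0 are illustrative assumptions):

```python
def bigram_hash(prev_tok, tok, table_size=10240):
    # Illustrative multiplicative hash; the constant is arbitrary,
    # not necessarily what the submission uses.
    return ((prev_tok * 1000003) ^ tok) % table_size

def bigram_indices(tokens, table_size=10240):
    # Position 0 has no predecessor, so pair it with a sentinel id 0.
    return [bigram_hash(tokens[i - 1] if i > 0 else 0, tokens[i], table_size)
            for i in range(len(tokens))]
```

Each index then selects a row of the 10240-entry embedding table, which is added to the token's regular embedding.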
SmearGate
Blends each token's embedding with the previous token's embedding through a learned gate.
parameters: null
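A sketch of one plausible form of the gate; the exact blend (additive vs. convex, per-dimension vs. scalar) is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, g):
    # x: (seq, dim) embeddings; g: (dim,) learned gate logits.
    # Each position mixes in a gated fraction of the previous
    # token's embedding; position 0 has no predecessor.
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + sigmoid(g) * prev
```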
Partial RoPE
Applies rotary position encoding to only a subset of each head's dimensions (16 of 64); the remaining dimensions carry no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
tied embeddings
Input and output embeddings are tied.
parameters: null
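Tying means a single matrix serves as both the input lookup table and the output projection, halving embedding storage (which matters for the artifact size above). A minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64
W = rng.normal(size=(vocab, dim)) / np.sqrt(dim)  # one matrix, two roles

def embed(token_ids):
    return W[token_ids]          # lookup: ids -> (n, dim)

def unembed(h):
    return h @ W.T               # project: (n, dim) -> (n, vocab) logits
```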
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
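With 8 attention heads and 4 KV heads, consecutive pairs of query heads share one KV head, halving the KV projection weights and KV cache. The standard grouping rule:

```python
def kv_head_for_query(q_head, attention_heads=8, kv_heads=4):
    # Consecutive groups of attention_heads // kv_heads query heads
    # share a single KV head.
    group = attention_heads // kv_heads
    return q_head // group
```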
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"scope":"embed/scalar"}
Weight Averaging
EMA
parameters: {"decay":0.997,"every_steps":10}
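The averaged weights track the live parameters with an exponential moving average; with `every_steps: 10`, this update would run once every 10 optimizer steps rather than every step. A minimal sketch:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step over a dict of parameter tensors/scalars.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```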
Quantization
mixed int5/int6 QAT
bits: 5 (MLP weights), 6 (attention weights)
scope: MLP weights and attention weights
fp16
bits: 16
scope: tied embeddings and small tensors
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":128,"seq_len":1024,"batched_windows":256}
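A sketch of the window bookkeeping implied by these parameters: each 1024-token window slides by 128, scores only its last 128 positions (earlier positions are warm-up context already scored), and windows are then batched (e.g. 256 at a time) through the compiled forward pass. The exact scheduling is an assumption:

```python
def eval_plan(n_tokens, seq_len=1024, stride=128):
    # Window i feeds tokens[i*stride : i*stride + seq_len]. The first
    # window scores every predictable position; later windows score
    # only their last `stride` positions.
    plan, i = [], 0
    while True:
        start = i * stride
        end = min(start + seq_len, n_tokens)
        lo = 1 if i == 0 else max(plan[-1][2], end - stride)
        plan.append((start, lo, end))   # feed [start:end], score [lo:end)
        if end == n_tokens:
            return plan
        i += 1
```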
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
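A common form of warmdown is a constant learning rate followed by a linear decay to zero over the final `warmdown_steps`; this sketch assumes that shape (`total_steps` is illustrative):

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    # 1.0 for most of training, then linear decay to 0 over the
    # final warmdown_steps.
    steps_left = total_steps - step
    return max(0.0, min(1.0, steps_left / warmdown_steps))
```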
Regularization
weight decay
parameters: {"memory_tokens_exempt":true,"weight_decay":0.04}
Other
other
Late QAT: fake int6 quantization with a straight-through estimator (STE) is enabled once the LR schedule's scale drops below 0.1.
parameters: {"lr_scale_threshold":0.1,"quant_bits":6}
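Fake quantization snaps weights to an integer grid while keeping them in float, so the forward pass sees quantized values; with an STE, the backward pass treats the rounding as identity. A minimal symmetric per-tensor sketch (the scaling scheme is an assumption):

```python
import numpy as np

def fake_quant(w, bits=6):
    # Symmetric per-tensor fake quantization: round to a bits-wide
    # integer grid, then rescale back to float.
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```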
Novel Contributions
- Memory tokens: 64 learnable embedding vectors used as a global context scratchpad.
- A/B test: memory tokens reduce val_bpb by 0.014 versus an identical configuration without them.
- Mixed quantization scheme using int5 for MLP weights and int6 for attention weights.
- Batched sliding-window evaluation with compiled forward_logits.