PR #1623
openRecord submission: Distill+IntraLoop SP1024 9x512 (val_bpb=1.1942)
by divagr18
val_bpb: 1.1942
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6MB
Training Techniques
Architecture
depth recurrence
Partial intra-loop recurrence where layers 3-4 are executed twice, yielding 11 effective layers from 9 physical layers.
parameters: {"layers":[3,4],"repeats":2,"physical_layers":9,"effective_layers":11}
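A minimal sketch of the partial intra-loop recurrence described above: two middle layers are executed twice per forward pass, so 9 physical layers give 11 effective layers with no new parameters. Whether "layers 3-4" is 0- or 1-indexed in the submission is not stated; 0-indexing is assumed here, and the toy layers exist only to count executions.

```python
# Partial intra-loop depth recurrence (sketch): layers in RECURRENT are
# applied REPEATS times, so 9 physical layers -> 11 effective layers.
# Index base (0 vs 1) for "layers 3-4" is an assumption.

PHYSICAL_LAYERS = 9
RECURRENT = {3, 4}  # layers executed more than once (assumed 0-indexed)
REPEATS = 2

def forward(x, layers):
    """Apply each layer; recurrent layers are applied REPEATS times."""
    effective = 0
    for i, layer in enumerate(layers):
        passes = REPEATS if i in RECURRENT else 1
        for _ in range(passes):
            x = layer(x)
            effective += 1
    return x, effective

# Toy layers: layer i just adds i, which makes executions countable.
layers = [lambda x, i=i: x + i for i in range(PHYSICAL_LAYERS)]
out, eff = forward(0, layers)
print(eff)  # 11 effective layer applications from 9 physical layers
```

The extra depth costs only compute, not parameters, since the repeated layers reuse their existing weights.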
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
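The head layout above can be sketched as a simple query-to-KV mapping: with 8 query heads and 4 KV heads, each KV head serves a contiguous group of 2 query heads, halving the KV cache versus full multi-head attention. The grouping rule below is the standard contiguous assignment, assumed rather than confirmed by the submission.

```python
# Grouped Query Attention head mapping (sketch): each KV head is shared
# by GROUP consecutive query heads. Contiguous grouping is assumed.

QUERY_HEADS = 8
KV_HEADS = 4
GROUP = QUERY_HEADS // KV_HEADS  # query heads per KV head

def kv_head_for(q_head):
    return q_head // GROUP

mapping = {q: kv_head_for(q) for q in range(QUERY_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```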
weight tying
Input and output embeddings share weights.
parameters: null
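Weight tying amounts to the input embedding table and the output (unembedding) projection being one and the same tensor, so gradients from both positions flow into a single matrix and its parameter count is paid once. A toy illustration with plain lists:

```python
# Weight tying (sketch): input embedding and output head are the same
# object, so an update through either view is visible in the other.

VOCAB, DIM = 4, 3
embedding = [[0.0] * DIM for _ in range(VOCAB)]  # vocab x dim table
unembedding = embedding                          # tied: same object

embedding[2][1] = 5.0     # "train" the embedding row for token 2
print(unembedding[2][1])  # 5.0 -- the output head sees the update
```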
U-Net skip connections
Skip connections between the first and second half of the layer stack.
parameters: null
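A sketch of the U-Net-style wiring: activations from the first half of the stack are saved and added back to the mirrored layers in the second half. The exact pairing (here, last-saved pairs with the first second-half layer, and the middle layer of an odd stack gets no skip) is an assumption.

```python
# U-Net skip connections over a layer stack (sketch): first-half
# outputs are stashed and added to mirrored second-half inputs.
# Pairing order is an assumption; odd stacks leave the middle
# layer without a skip.

def forward(x, layers):
    n = len(layers)
    half = n // 2
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - half and saved:
            x = x + saved.pop()  # skip from the mirrored early layer
        x = layer(x)
        if i < half:
            saved.append(x)      # stash first-half activations
    return x

# Four toy layers, each adding 1; two skips re-add earlier activations.
out = forward(0, [lambda x: x + 1] * 4)
print(out)  # 7
```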
BigramHash
Residual bigram head mixed with model logits at inference time.
parameters: {"rank":32}
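A hedged sketch of the residual bigram head: a hashed table keyed by the previous token stores logit corrections that are added to the model's logits at inference time. The table size, hash scheme, and additive mixing below are assumptions; the record's `rank: 32` suggests the real table is low-rank factorized, which is omitted here for brevity.

```python
# Residual bigram head (sketch): logits from a hashed bigram table,
# keyed by the previous token, are added to the model's logits.
# Table size and hashing are illustrative assumptions.

VOCAB, TABLE_SIZE = 8, 16

bigram_table = [[0.0] * VOCAB for _ in range(TABLE_SIZE)]
bigram_table[hash(3) % TABLE_SIZE][5] = 1.5  # "learned" offset: 3 -> 5

def mix_logits(model_logits, prev_token):
    row = bigram_table[hash(prev_token) % TABLE_SIZE]
    return [m + b for m, b in zip(model_logits, row)]

logits = mix_logits([0.0] * VOCAB, prev_token=3)
print(logits[5])  # 1.5
```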
Weight Averaging
EMA
Exponential moving average of model weights; also used as a self-distillation teacher during the final portion of training.
parameters: {"decay":0.999,"weight":0.08,"temp":2,"start_frac":0.7}
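A minimal sketch of the EMA average and its assumed use for self-distillation. The parameter reading here is an assumption: `decay=0.999` is the EMA decay, `weight=0.08` scales a distillation loss against the EMA teacher, `temp=2` would soften the logits for that loss, and `start_frac=0.7` gates when distillation begins. The KL term itself is stubbed out.

```python
# EMA weight averaging with late-start self-distillation (sketch).
# Parameter semantics are assumptions from the record's JSON; TEMP is
# noted but unused here since the soft-target loss is precomputed.

DECAY, DISTILL_WEIGHT, TEMP, START_FRAC = 0.999, 0.08, 2.0, 0.7

def ema_update(ema, weights, decay=DECAY):
    """One EMA step over a flat parameter list."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

def total_loss(ce_loss, distill_loss, step, total_steps):
    """Mix in the distillation loss only after 70% of training."""
    if step < START_FRAC * total_steps:
        return ce_loss
    return ce_loss + DISTILL_WEIGHT * distill_loss

ema = ema_update([1.0, 2.0], [2.0, 0.0])
print(ema)                                             # ~ [1.001, 1.998]
print(total_loss(2.0, 1.0, step=80, total_steps=100))  # 2.08
```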
SWA
Stochastic Weight Averaging over periodic weight snapshots.
parameters: {"snapshots":282}
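The snapshot average can be maintained as a running mean, so none of the 282 snapshots needs to be stored individually. A small sketch (uniform averaging is assumed):

```python
# Stochastic Weight Averaging (sketch): running uniform mean over
# periodic weight snapshots, updated in place.

class SWA:
    def __init__(self, n_params):
        self.avg = [0.0] * n_params
        self.count = 0

    def update(self, weights):
        """Fold one snapshot into the running mean."""
        self.count += 1
        self.avg = [a + (w - a) / self.count
                    for a, w in zip(self.avg, weights)]

swa = SWA(2)
for snap in ([1.0, 4.0], [3.0, 0.0]):
    swa.update(snap)
print(swa.avg)  # [2.0, 2.0]
```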
Quantization
GPTQ
bits: 8
scope: all
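GPTQ proper quantizes weights column by column with Hessian-based error compensation, which is too long to sketch here; the snippet below shows only the int8 round trip that the `bits: 8` setting implies, using plain symmetric per-row round-to-nearest as a simplified stand-in.

```python
# Simplified int8 round trip (round-to-nearest stand-in, NOT full
# GPTQ): symmetric per-row scale, clamp to [-127, 127], dequantize.

def quantize_int8(row):
    scale = max(abs(w) for w in row) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.4, -1.0, 0.25]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
print(q)  # [51, -127, 32]
```

The "low roundtrip penalty" claimed under Novel Contributions refers to the small gap between `row` and `restored` after this kind of quantize/dequantize cycle.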
Compression
zstd
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for":["embeddings","scalar parameters"]}
Other
other
QK-Gain initialization with learnable per-head gain parameters for query and key projections.
parameters: {"init":5}
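A hedged sketch of the QK-Gain idea: each attention head carries a learnable scalar gain applied to its query and key projections. Following the record's `init: 5`, the gains start at 5.0; whether that value is the raw gain or feeds some transform is an assumption.

```python
# QK-Gain (sketch): one learnable scalar per head scales that head's
# query and key vectors. Initial value 5.0 follows the record's
# "init": 5; its exact semantics are an assumption.

N_HEADS = 8
qk_gain = [5.0] * N_HEADS  # learnable per-head gains

def scaled_qk(q, k, head):
    g = qk_gain[head]
    return [g * x for x in q], [g * x for x in k]

q, k = scaled_qk([1.0, 2.0], [0.5, 0.0], head=0)
print(q, k)  # [5.0, 10.0] [2.5, 0.0]
```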
other
SwiGLU activation in the MLP.
parameters: null
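SwiGLU is the standard gated MLP activation: the hidden projection is split into a value half and a gate half, and the gate passes through SiLU (`x * sigmoid(x)`) before the elementwise product. A toy-dimension sketch:

```python
# SwiGLU activation (sketch): value * SiLU(gate), elementwise.

import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(value, gate):
    return [v * silu(g) for v, g in zip(value, gate)]

out = swiglu([1.0, 2.0], [0.0, 10.0])
print(out)  # ~ [0.0, 19.999]
```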
Novel Contributions
- Partial depth recurrence applied only to middle layers at near-zero parameter cost
- EMA self-distillation during the final portion of training
- GPTQ int8 post-training quantization with low roundtrip penalty
- Combination of SWA, QK-Gain, GQA, SwiGLU, Muon, tied embeddings, and residual bigram head