PR #325

open

Add Looped Transformer Design non-record submission (non tuned)

by Aum08Desai
val_bpb: 1.1462
Architecture: Looped Transformer
Optimizer: Muon
Artifact Size: 15,589,099 bytes

Training Techniques

Architecture
depth recurrence / looped transformer
Transformer whose shared recurrent core is executed repeatedly (looped) to increase effective depth.
parameters: {"num_layers":6,"loop_core_layers":2,"loop_repeats":5,"loop_attn_every":2,"effective_executed_layers":14}
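The listed parameters are consistent with a schedule where the 4 layers outside the 2-layer shared core run once and the core runs 5 times. A minimal sketch of that depth arithmetic (helper name is my own, not from the submission):

```python
def effective_executed_layers(num_layers, loop_core_layers, loop_repeats):
    # Layers outside the shared core execute once; the core executes loop_repeats times.
    unique_once = num_layers - loop_core_layers
    return unique_once + loop_core_layers * loop_repeats

# Matches the submission's "effective_executed_layers": 14
print(effective_executed_layers(num_layers=6, loop_core_layers=2, loop_repeats=5))
```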
RoPE
Partial rotary positional embeddings applied only to a subset of dimensions.
parameters: {"dimensions":16}
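With `dimensions: 16`, only the first 16 dimensions of each head vector are rotated and the rest pass through unchanged. A minimal pure-Python sketch of partial RoPE under that assumption (the frequency schedule and dimension layout are standard RoPE conventions, not confirmed by the submission):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first rot_dims dimensions (in pairs); leave the rest untouched.
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))  # standard RoPE frequencies over the rotated slice
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Because the rotation is norm-preserving on the rotated slice and the identity elsewhere, attention scores beyond the first 16 dimensions are position-independent.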
KV head count
Uses fewer KV heads than query heads (grouped-query attention).
parameters: {"num_heads":10,"num_kv_heads":5}
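With 10 query heads and 5 KV heads, each KV head is shared by a group of 2 query heads. A sketch of the usual grouped-query head mapping (the grouping is the standard contiguous scheme; the submission does not specify it):

```python
def kv_head_for_query(q_head, num_heads=10, num_kv_heads=5):
    # Each KV head serves a contiguous group of query heads,
    # halving the KV-cache and KV projection size here.
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads  # 2 query heads per KV head
    return q_head // group_size
```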
XSA
Includes XSA attention extras over the last few tokens.
parameters: {"last_n":4}
Bigram features
Adds token-side bigram vocabulary and embedding features.
parameters: {"bigram_vocab_size":2048,"bigram_dim":128}
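The bigram feature maps each (previous, current) token pair into a 2048-entry bigram vocabulary whose 128-d embeddings augment the token features. The submission does not state the pair-to-bucket mapping; a plausible hashed sketch:

```python
def bigram_ids(tokens, bigram_vocab_size=2048):
    # Hash each (previous, current) token pair into the bigram vocab;
    # the looked-up 128-d bigram embedding would be added to the token features.
    # The hash scheme here is illustrative, not the submission's.
    ids = [0]  # first token has no predecessor; reserve bucket 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(hash((prev, cur)) % bigram_vocab_size)
    return ids
```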
Quantization
int6 QAT
bits: 6
scope: all
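int6 QAT keeps weights in float during training but snaps them to the signed 6-bit grid in the forward pass (gradients typically flow through via a straight-through estimator). A minimal symmetric per-tensor fake-quantization sketch; the actual scaling scheme is an assumption:

```python
def fake_quant_int6(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to the signed
    # int6 grid [-32, 31] but keep them as floats, as in a QAT forward pass.
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in w) / qmax
    if scale == 0:
        return list(w)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in w]
```

At int6, each weight needs at most 6 bits before entropy coding, which is what lets the artifact fit the size budget after zstd.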
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
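EMA with decay 0.997 maintains a slow-moving average of the weights, typically used for evaluation instead of the raw training weights. A minimal per-step update sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of weights: the EMA copy trails the
    # training weights and is usually the copy that gets evaluated.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```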
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"exact":true}
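`"exact": true` presumably means every token is scored with substantial left context rather than cold-started at chunk boundaries. A sketch of the usual sliding-window scheduling (the `stride` value and helper name are mine, not from the submission):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=256):
    # Each step scores `stride` new tokens while the window supplies up to
    # window - stride tokens of left context, so after the first chunk no
    # token is scored without context.
    spans = []  # (ctx_start, score_start, score_end) triples
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```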
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
layerwise LN scale
parameters: {"ln_scale":1}
Other
other
Late quantization-aware training applied after initial training.
parameters: {"late_qat":1,"qat_threshold":0.1}

Novel Contributions

  • Looped transformer with a shared recurrent core
  • Partial RoPE with LN scaling
  • Late QAT for int6 artifact fitting
  • XSA attention over the last 4 tokens
  • Bigram token-side features
  • Demonstrates a non-record recurrent-depth design point under the 10-minute and 16MB constraints