PR #334

open

Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB)

by nathon-lee
val_bpb: 1.2207
Architecture: GPT
Optimizer: Muon
Artifact Size: 15.4 MB

Training Techniques

Architecture
Partial RoPE
Applies rotary position encoding to only a subset of head dimensions.
parameters: {"dimensions":16,"total_head_dims":64}
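A minimal numpy sketch of partial RoPE as configured here (rotary encoding applied to the first 16 of 64 head dimensions, the rest passed through; the frequency base of 10000 and the pair layout are assumptions):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` dimensions of each head;
    leave the remaining dimensions position-free.
    x: (seq_len, head_dim) activations for one head."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    # Rotation frequencies for the rotated pairs only (assumed base).
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]                # rotated pairs
    out = x.copy()
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rope_dims] = x1 * sin + x2 * cos
    return out
```

The untouched 48 dimensions carry position-independent content, which is the usual motivation for rotating only a subset.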
SmearGate
Per-dimension gate blending current and previous token embeddings.
parameters: null
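One plausible reading of SmearGate, sketched in numpy (the sigmoid parameterization and convex blend are assumptions; only "per-dimension gate blending current and previous token" is from the card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    """Blend each token with its predecessor via a learned per-dimension
    gate g in (0, 1):  y_t = (1 - g) * x_t + g * x_{t-1}.
    x: (seq_len, dim); gate_logits: (dim,) learned parameters."""
    g = sigmoid(gate_logits)          # (dim,)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                    # first token has no predecessor
    return (1.0 - g) * x + g * prev
```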
BigramHash
Hash-based bigram context embeddings.
parameters: {"buckets":2048,"dim":64}
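A sketch of hash-based bigram embeddings with the stated 2048 buckets and 64 dims (the multiplicative hash and zero-padding of the first bigram are illustrative assumptions):

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=2048):
    """Look up a hashed-bigram embedding for each position.
    tokens: (seq_len,) int token ids; table: (buckets, dim) learned table."""
    prev = np.roll(tokens, 1)
    prev[0] = 0                                  # pad the first bigram (assumed)
    # Simple multiplicative hash of the (prev, cur) pair (illustrative).
    h = (prev * 1000003 + tokens) % buckets
    return table[h]                              # (seq_len, dim)
```

Hashing avoids materializing a vocab^2 bigram table; collisions are accepted as noise.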
U-Net skip connections
Encoder-decoder style skip connections with learnable weights.
parameters: null
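The U-Net skip pattern can be sketched as a last-in-first-out pairing of encoder outputs with decoder inputs, each scaled by a learnable weight (the additive combination is an assumption):

```python
def unet_forward(x, enc_layers, dec_layers, skip_weights):
    """U-Net style skips: save each encoder layer's output and add it,
    scaled by a learnable weight, to the matching decoder layer's input,
    pairing first encoder layer with last decoder layer."""
    skips = []
    for f in enc_layers:
        x = f(x)
        skips.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * skips.pop())    # last-in, first-out pairing
    return x
```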
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
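With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads (grouped-query attention). A numpy sketch, with an assumed causal mask:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """GQA: 4 KV heads each serve 8 // 4 = 2 query heads.
    q: (seq, n_heads, d); k, v: (seq, n_kv_heads, d)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)          # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape[-2:]), k=1).astype(bool)
    scores[:, mask] = -1e9                   # causal mask (assumed)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)
```

Halving KV heads halves the KV cache with little quality loss, the usual GQA trade-off.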
MLP3x
Uses a 3x ReluSquared MLP.
parameters: null
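The 3x ReluSquared MLP as a sketch (3x hidden expansion instead of the conventional 4x, with the relu(x)^2 activation; bias-free projections are an assumption):

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with a 3x hidden expansion and ReluSquared activation:
    relu(x @ w_in)^2 @ w_out.
    x: (seq, dim); w_in: (dim, 3*dim); w_out: (3*dim, dim)."""
    h = x @ w_in
    h = np.maximum(h, 0.0) ** 2          # ReluSquared
    return h @ w_out
```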
Regularization
LN scale
parameters: {"formula":"RMSNorm damped by 1/sqrt(layer+1)"}
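The stated formula, RMSNorm damped by 1/sqrt(layer+1), as a sketch (a learnable gain is omitted as an assumption):

```python
import numpy as np

def ln_scale(x, layer, eps=1e-6):
    """RMSNorm whose output is damped by 1/sqrt(layer + 1), so deeper
    layers contribute progressively smaller residual updates."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) / np.sqrt(layer + 1)
```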
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":true}
Adam
weight_decay: 0.04
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"used_for":"scalars/embeddings"}
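The Muon side of this split (Newton-Schulz orthogonalization of 2D gradient matrices, with Adam handling scalars and embeddings) can be sketched as follows; the quintic coefficients are taken from the public Muon implementation and should be treated as an assumption here:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix with the quintic
    Newton-Schulz iteration used by Muon (coefficients assumed from the
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)       # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                              # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

The iteration drives all singular values toward 1, so the update direction depends on the gradient's row/column spaces rather than its magnitude.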
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last 40% of training"}
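Both averaging schemes in one sketch: EMA with the stated decay of 0.997, and SWA as a plain mean of checkpoints from the last 40% of training:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    """Exponential moving average of weights:
    avg <- decay * avg + (1 - decay) * params."""
    return decay * avg + (1.0 - decay) * params

def swa_average(checkpoints, start_frac=0.6):
    """Stochastic weight averaging: uniform mean of checkpoints taken
    during the last 40% of training (from the 60% mark onward)."""
    start = int(len(checkpoints) * start_frac)
    return np.mean(checkpoints[start:], axis=0)
```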
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
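Sliding-window evaluation with stride 64, sketched below; the window length of 512 and the "score only the final stride tokens" convention are assumptions (only the stride is from the card):

```python
def sliding_window_eval(nll_fn, tokens, window=512, stride=64):
    """Slide a fixed context window over the sequence in steps of
    `stride`, scoring only the final `stride` tokens of each window so
    every scored token sees near-full left context.
    nll_fn(ctx, n_scored) -> summed NLL of the last n_scored tokens of ctx."""
    total_nll, n_scored = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        start = max(0, end - window)
        ctx = tokens[start:end]
        n = min(stride, len(ctx))
        total_nll += nll_fn(ctx, n)
        n_scored += n
    return total_nll / n_scored          # mean NLL per scored token
```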
Test-Time Training
full TTT
parameters: {"epochs":3,"frozen_blocks":2}
Initialization
OrthoInit
Orthogonal initialization with muP output-projection scaling.
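A sketch of orthogonal initialization via QR of a Gaussian matrix; the exact muP output-projection scale rule is not stated on the card, so the 1/sqrt(fan_in) damping below is an illustrative assumption:

```python
import numpy as np

def ortho_init(shape, out_proj=False, rng=None):
    """Orthogonal init via QR decomposition of a Gaussian matrix.
    Output projections get extra damping (assumed 1/sqrt(fan_in) here,
    standing in for the muP scale rule)."""
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))                  # fix QR sign ambiguity
    w = q[:rows, :cols] if rows >= cols else q[:cols, :rows].T
    return w / np.sqrt(rows) if out_proj else w
```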
LR Schedule
cosine warmdown
parameters: {"warmdown_steps":3000}
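Cosine warmdown with the stated 3000 warmdown steps: constant LR, then a cosine decay over the final steps (whether the floor is exactly zero is an assumption):

```python
import math

def cosine_warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Hold base_lr constant, then cosine-decay to zero over the final
    `warmdown_steps` steps of training."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    t = (step - start) / warmdown_steps       # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```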

Novel Contributions

  • 11-layer 512-dim GPT architecture with 8 attention heads and 4 KV heads
  • Partial RoPE applied to only 16 of 64 head dimensions
  • LN Scale using RMSNorm damped by 1/sqrt(layer+1)
  • SmearGate token blending mechanism
  • BigramHash context embeddings with 2048 buckets and 64 dimensions
  • U-Net style skip connections with learnable weights
  • Muon optimizer combined with Adam for embeddings/scalars
  • EMA plus SWA weight averaging
  • Uniform int5 quantization with zstd-22 compression
  • Sliding-window evaluation and full-model test-time training