PR #516

closed

Record: 11L NonTTT VR+GA MixedInt5/6: val_bpb=1.1428 (3-seed, 8xH100)

by Asukabot0
val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,203,334 bytes

Training Techniques

Architecture
Value Residual (ResFormer)
Caches layer-0 value vectors and mixes them into subsequent layers via a learnable lambda.
parameters: null
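
A minimal sketch of the value-residual mix, assuming a learnable scalar lambda squashed through a sigmoid (the exact parameterization is not stated in this record):

```python
import torch
import torch.nn as nn

class ValueResidualMix(nn.Module):
    """Mix layer-0 value vectors into a later layer's values (ResFormer-style)."""
    def __init__(self):
        super().__init__()
        self.lamb = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, v, v0):
        # v:  values computed by the current layer (B, H, T, D)
        # v0: values cached from layer 0           (B, H, T, D)
        lam = torch.sigmoid(self.lamb)  # keep the mix in (0, 1); an assumption
        return lam * v + (1 - lam) * v0
```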
Gated Attention
Per-head sigmoid gate on attention output to suppress attention sinks.
parameters: null
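
A sketch of the per-head output gate; computing the gate from the layer input with a small linear projection is an assumption:

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Per-head sigmoid gate on the attention output, letting a head scale
    itself toward zero instead of parking mass on a sink token."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, n_heads)  # one gate logit per head

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out: (B, T, H, D) per-head attention output; x: (B, T, d_model)
        gate = torch.sigmoid(self.gate_proj(x))  # (B, T, H), in (0, 1)
        return attn_out * gate.unsqueeze(-1)     # scale each head's output
```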
XSA
Uses the XSA4 attention variant in the base configuration.
parameters: {"variant":4}
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":"16/64"}
MLP3x
MLP hidden layer set to three times the model width.
parameters: {"multiplier":3}
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
Tied Embeddings
Input and output embeddings are tied.
parameters: null
Weight Averaging
EMA
Exponential moving average of model weights; the EMA copy is what gets exported.
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
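
Muon applies momentum, then orthogonalizes each 2D update with a Newton-Schulz iteration. A sketch using the hyperparameters above; the iteration coefficients are the commonly published ones, not taken from this PR:

```python
import torch

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a 2D matrix via a quintic Newton-Schulz
    iteration (the core of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)        # scale so the iteration converges
    if G.size(0) > G.size(1):
        X = X.T                      # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(p, grad, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon step, sketched with plain heavy-ball momentum and decoupled
    weight decay (the reference implementation differs in details)."""
    buf.mul_(momentum).add_(grad)
    p.mul_(1 - lr * weight_decay)
    p.add_(newton_schulz(buf), alpha=-lr)
```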
Quantization
mixed int5/int6
bits: null
scope: MLP middle layers int5; edge layers and attention int6
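
A sketch of the symmetric n-bit quantizer this implies; per-tensor scaling is an assumption (per-channel is equally plausible):

```python
import torch

def quantize_symmetric(w, bits):
    """Round to signed `bits`-wide integers with one scale per tensor.
    Per this record: bits=5 for middle MLP layers, 6 for edge layers
    and attention."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for int5, 31 for int6
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale
```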
Compression
zstd
level: 22
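
The quantized artifact is then compressed with zstd at its maximum standard level; a sketch assuming the Python `zstandard` bindings:

```python
import zstandard as zstd

def compress_artifact(raw: bytes) -> bytes:
    """Compress the packed weight bytes with zstd at level 22."""
    return zstd.ZstdCompressor(level=22).compress(raw)
```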
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
none
parameters: null
Regularization
EMA weights, LN Scale
parameters: {"ln_scale":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}

Novel Contributions

  • Non-TTT training and evaluation pipeline
  • Value Residual (ResFormer) integration
  • Gated Attention to suppress attention sinks
  • Mixed int5/int6 quantization for better compression
  • Sliding window evaluation with stride 64
  • EMA-weighted export