PR #1536

open

Record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean)

by dexhunter
val_bpb: 1.0775
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
attention
VarLen flash attention restricted to within-document boundaries using per-document cu_seqlens.
parameters: null
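The varlen entry above restricts attention to document boundaries via per-document cu_seqlens. A minimal sketch of how such offsets are derived from packed per-token document ids (the helper name is illustrative, not from the PR; the returned format is the flash-attn-style cumulative-length convention):

```python
def doc_cu_seqlens(doc_ids):
    """Cumulative sequence lengths at document boundaries.

    doc_ids: per-token document id for a packed sequence, e.g. [0, 0, 0, 1, 1, 2].
    Returns cu_seqlens as varlen attention kernels expect:
    [0, len(doc0), len(doc0) + len(doc1), ...].
    """
    cu = [0]
    for i in range(1, len(doc_ids)):
        if doc_ids[i] != doc_ids[i - 1]:  # new document starts here
            cu.append(i)
    cu.append(len(doc_ids))  # final boundary = total token count
    return cu
```

Passing these offsets to a varlen kernel means no query ever attends across a document boundary, even though documents are packed into one 8192-token sequence.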
depth recurrence
Parameter banking / layer reuse with triple-depth recurrence, creating virtual layers from fewer physical layers.
parameters: {"physical_layers":11,"virtual_layers":17,"loops":3}
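One schedule consistent with the stated numbers (11 physical, 17 virtual, 3 loops) is a middle bank of 3 layers executed 3 times between unshared pre/post stacks (4 + 3 + 4 physical; 4 + 9 + 4 virtual). This is an assumed layout, not confirmed by the PR:

```python
def banked_schedule(pre, bank, post, loops):
    """Return the sequence of physical-layer indices executed per forward pass.

    Unshared `pre` layers run once, the `bank` layers repeat `loops` times
    (weight reuse / depth recurrence), then unshared `post` layers run once.
    """
    sched = list(range(pre))                       # pre-bank layers, run once
    for _ in range(loops):                         # banked layers, reused
        sched += [pre + i for i in range(bank)]
    sched += [pre + bank + j for j in range(post)] # post-bank layers, run once
    return sched
```

With `banked_schedule(4, 3, 4, 3)` the pass traverses 17 virtual layers while touching only 11 distinct physical layers, matching the recorded parameters.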
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
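A sketch of partial RoPE under the recorded parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The pairing convention (first half with second half of the rotated slice) is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of head-dim vector `x` by
    position-dependent angles; leave the remaining dims position-free."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]                    # paired coordinates
        out[i] = a * c - b * s                      # 2D rotation
        out[i + half] = a * s + b * c
    return out
```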
MLP3x
MLP uses 4x expansion with SiLU gating and a PyTorch fallback implementation.
parameters: {"expansion":4}
U-Net skip connections
Skip gates / U-Net-style skip connections are included.
parameters: null
Gated Attention
Parallel residuals and skip-gated connections are used in the architecture.
parameters: null
Test-Time Training
LoRA TTT
Doc-independent LoRA test-time training with score-before-update behavior.
parameters: {"rank":96,"learning_rate":0.0001,"beta2":0.999,"weight_decay":0.5,"chunk_size":64}
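The score-before-update ordering listed in the contributions can be sketched as a chunked loop in which each chunk is evaluated with the current adapter state and only afterwards used to update it, so no token is ever scored by an adapter that has already trained on it (the function names are hypothetical placeholders):

```python
def ttt_score_then_update(chunks, score_fn, update_fn, state):
    """Per-document TTT loop sketch: score each chunk with the pre-update
    adapter state, then adapt on that chunk. `state` resets per document
    in a doc-independent variant."""
    scores = []
    for chunk in chunks:
        scores.append(score_fn(state, chunk))  # evaluate BEFORE updating
        state = update_fn(state, chunk)        # then take the LoRA step
    return scores, state
```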
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"newton_schulz_steps":5,"variant":"MuonEq-R"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
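The EMA entry corresponds to the standard exponential moving average of weights; a one-line sketch with the recorded decay of 0.997:

```python
def ema_update(avg, new, decay=0.997):
    """One EMA step: keep 99.7% of the running average, blend in 0.3%
    of the latest weight value. Applied elementwise over all parameters."""
    return decay * avg + (1.0 - decay) * new
```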
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
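For intuition on the mixed int6/int8 setting, here is a plain symmetric round-to-nearest fake-quantizer. Note this is only the quantization grid; GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here:

```python
def fake_quant(ws, bits):
    """Symmetric per-tensor fake quantization to `bits` bits (sketch).

    Maps each weight to the nearest point on a signed integer grid and
    back to float, showing the precision available at 6 vs 8 bits.
    """
    qmax = 2 ** (bits - 1) - 1           # e.g. 31 for int6, 127 for int8
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]
```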
Regularization
logit softcap
parameters: {"value":30}
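Logit softcapping with value 30 is the usual tanh-based bound: logits pass through nearly unchanged when small and saturate smoothly at ±30:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): near-identity for small
    values, asymptotically flat for large ones."""
    return cap * math.tanh(logit / cap)
```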
layerwise LN scale
parameters: null
LR Schedule
warmdown
parameters: {"final_fraction":0.667}
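A sketch of the warmdown schedule, reading final_fraction as the share of training spent in a linear decay to zero after a constant phase (that interpretation is an assumption; the PR only records the fraction):

```python
def warmdown_lr(step, total_steps, base_lr, final_fraction=0.667):
    """Constant LR for the first (1 - final_fraction) of training, then
    linear warmdown to 0 over the remaining steps."""
    start = int(total_steps * (1.0 - final_fraction))  # warmdown begins here
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```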
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192

Novel Contributions

  • VarLen flash attention restricted to within-document boundaries
  • Doc-independent LoRA test-time training with score-before-update behavior
  • Parameter banking with triple-depth recurrence
  • PyTorch MLP fallback replacing Triton/CUTLASS dependency
  • Muon momentum 0.97 with MuonEq-R optimizer variant
  • Mixed int6/int8 GPTQ quantization with SDClip