PR #1536

open

Record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean)

by dexhunter
val_bpb: 1.0775
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
attention
VarLen flash attention restricted to within-document boundaries using per-document cu_seqlens.
parameters: null
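The varlen entry above restricts attention to document boundaries via per-document cu_seqlens. A minimal sketch of how such offsets are derived from packed per-token document ids (the helper name is illustrative, not from the PR; the returned format is the flash-attn-style cumulative-length convention):

```python
def doc_cu_seqlens(doc_ids):
    """Cumulative sequence lengths at document boundaries.

    doc_ids: per-token document id for a packed sequence, e.g. [0, 0, 0, 1, 1, 2].
    Returns cu_seqlens as varlen attention kernels expect:
    [0, len(doc0), len(doc0) + len(doc1), ...].
    """
    cu = [0]
    for i in range(1, len(doc_ids)):
        if doc_ids[i] != doc_ids[i - 1]:  # new document starts here
            cu.append(i)
    cu.append(len(doc_ids))  # final boundary = total token count
    return cu
```

Passing these offsets to a varlen kernel means no query ever attends across a document boundary, even though documents are packed into one 8192-token sequence.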
depth recurrence
Parameter banking / layer reuse with triple-depth recurrence, creating virtual layers from fewer physical layers.
parameters: {"physical_layers":11,"virtual_layers":17,"loops":3}
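One schedule consistent with the stated numbers (11 physical, 17 virtual, 3 loops) is a middle bank of 3 layers executed 3 times between unshared pre/post stacks (4 + 3 + 4 physical; 4 + 9 + 4 virtual). This is an assumed layout, not confirmed by the PR:

```python
def banked_schedule(pre, bank, post, loops):
    """Return the sequence of physical-layer indices executed per forward pass.

    Unshared `pre` layers run once, the `bank` layers repeat `loops` times
    (weight reuse / depth recurrence), then unshared `post` layers run once.
    """
    sched = list(range(pre))                       # pre-bank layers, run once
    for _ in range(loops):                         # banked layers, reused
        sched += [pre + i for i in range(bank)]
    sched += [pre + bank + j for j in range(post)] # post-bank layers, run once
    return sched
```

With `banked_schedule(4, 3, 4, 3)` the pass traverses 17 virtual layers while touching only 11 distinct physical layers, matching the recorded parameters.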
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
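A sketch of partial RoPE under the recorded parameters: only the first 16 of 64 head dimensions are rotated, the rest pass through untouched. The pairing convention (first half with second half of the rotated slice) is an assumption:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of head-dim vector `x` by
    position-dependent angles; leave the remaining dims position-free."""
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]                    # paired coordinates
        out[i] = a * c - b * s                      # 2D rotation
        out[i + half] = a * s + b * c
    return out
```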
MLP3x
MLP uses 4x expansion with SiLU gating and a PyTorch fallback implementation.
parameters: {"expansion":4}
U-Net skip connections
Skip gates / U-Net-style skip connections are included.
parameters: null
Gated Attention
Parallel residuals and skip-gated connections are used in the architecture.
parameters: null
Test-Time Training
LoRA TTT
Doc-independent LoRA test-time training with score-before-update behavior.
parameters: {"rank":96,"learning_rate":0.0001,"beta2":0.999,"weight_decay":0.5,"chunk_size":64}
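The score-before-update ordering listed in the contributions can be sketched as a chunked loop in which each chunk is evaluated with the current adapter state and only afterwards used to update it, so no token is ever scored by an adapter that has already trained on it (the function names are hypothetical placeholders):

```python
def ttt_score_then_update(chunks, score_fn, update_fn, state):
    """Per-document TTT loop sketch: score each chunk with the pre-update
    adapter state, then adapt on that chunk. `state` resets per document
    in a doc-independent variant."""
    scores = []
    for chunk in chunks:
        scores.append(score_fn(state, chunk))  # evaluate BEFORE updating
        state = update_fn(state, chunk)        # then take the LoRA step
    return scores, state
```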
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"newton_schulz_steps":5,"variant":"MuonEq-R"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
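The EMA entry corresponds to the standard exponential moving average of weights; a one-line sketch with the recorded decay of 0.997:

```python
def ema_update(avg, new, decay=0.997):
    """One EMA step: keep 99.7% of the running average, blend in 0.3%
    of the latest weight value. Applied elementwise over all parameters."""
    return decay * avg + (1.0 - decay) * new
```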
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
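For intuition on the mixed int6/int8 setting, here is a plain symmetric round-to-nearest fake-quantizer. Note this is only the quantization grid; GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here:

```python
def fake_quant(ws, bits):
    """Symmetric per-tensor fake quantization to `bits` bits (sketch).

    Maps each weight to the nearest point on a signed integer grid and
    back to float, showing the precision available at 6 vs 8 bits.
    """
    qmax = 2 ** (bits - 1) - 1           # e.g. 31 for int6, 127 for int8
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]
```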
Regularization
logit softcap
parameters: {"value":30}
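Logit softcapping with value 30 is the usual tanh-based bound: logits pass through nearly unchanged when small and saturate smoothly at ±30:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): near-identity for small
    values, asymptotically flat for large ones."""
    return cap * math.tanh(logit / cap)
```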
layerwise LN scale
parameters: null
LR Schedule
warmdown
parameters: {"final_fraction":0.667}
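A sketch of the warmdown schedule, reading final_fraction as the share of training spent in a linear decay to zero after a constant phase (that interpretation is an assumption; the PR only records the fraction):

```python
def warmdown_lr(step, total_steps, base_lr, final_fraction=0.667):
    """Constant LR for the first (1 - final_fraction) of training, then
    linear warmdown to 0 over the remaining steps."""
    start = int(total_steps * (1.0 - final_fraction))  # warmdown begins here
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```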
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192

Novel Contributions

  • VarLen flash attention restricted to within-document boundaries
  • Doc-independent LoRA test-time training with score-before-update behavior
  • Parameter banking with triple-depth recurrence
  • PyTorch MLP fallback replacing Triton/CUTLASS dependency
  • Muon momentum 0.97 with MuonEq-R optimizer variant
  • Mixed int6/int8 GPTQ quantization with SDClip