PR #1696

open

[Record] Block Attention Residuals + Tuned Legal TTT — val_bpb 1.12242 (8xH100 primary)

by kings-crown
val_bpb
1.1224
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.72 MB

Training Techniques

Architecture
XSA
Cross-sequence attention applied to all 11 layers.
parameters: {"layers":11}
LeakyReLU
Uses LeakyReLU(0.5)^2 MLP activation.
parameters: {"mlp_multiplier":3}
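A minimal sketch of the activation, assuming "^2" means the LeakyReLU output is squared elementwise. The function name is illustrative and not taken from the PR.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, then squared."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


values = [leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)]
```

Squaring discards the sign, so with slope 0.5 a negative input yields a quarter of the output of the positive input with the same magnitude.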
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
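A sketch of what "16/64" could look like for one attention head, assuming the first 16 of 64 head dimensions are rotated and the remaining 48 pass through unchanged. The frequency base of 10000 and the half-split pairing of dimensions are assumptions, not details from the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Assumed frequency schedule: one angle per (i, i + half) pair.
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Because each pair undergoes a plain 2D rotation, the norm of the rotated slice is preserved and position 0 leaves the vector unchanged.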
weight tying
Tied token embeddings.
parameters: null
VE128
VE128 used on layers 9-10.
parameters: {"layers":[9,10]}
U-Net skip connections
Encoder-decoder U-Net skip handling is threaded through the banked layout.
parameters: null
Block Attention Residuals
Depth-attention residual routing over detached block boundary source banks.
parameters: {"blocks":2,"mix":0.25,"temperature":1.1}
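The routing itself is not spelled out on this page, so the following is only one plausible reading: a per-layer depth query scores the stored block-boundary states, a temperature-scaled softmax turns the scores into weights, and the weighted sum is added to the hidden state scaled by `mix`. The additive form, the function name, and the dot-product scoring are all assumptions.

```python
import math


def depth_attention_mix(hidden, sources, query, mix=0.25, temperature=1.1):
    """Blend block-boundary states into `hidden` via softmax over depth.

    `sources` holds one vector per block boundary. In an autograd framework
    these would be detached so no gradient flows back through them.
    """
    scores = [sum(q * s for q, s in zip(query, src)) / temperature for src in sources]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    routed = [
        sum(w * src[d] for w, src in zip(weights, sources))
        for d in range(len(hidden))
    ]
    # Assumed combination rule: additive residual scaled by `mix`.
    return [h + mix * r for h, r in zip(hidden, routed)], weights


out, weights = depth_attention_mix(
    hidden=[1.0, 2.0],
    sources=[[0.0, 4.0], [2.0, 0.0]],  # two block boundaries ("blocks": 2)
    query=[0.0, 0.0],                  # zero-initialized depth query
)
```

With a zero-initialized query every score is zero, so the softmax starts out uniform over the sources and the routing only becomes selective as the query is trained.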
Quantization
mixed int6
bits: 6
scope: attention/MLP banks
late QAT
bits: null
scope: full model
Compression
lzma
level: null
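A rough sketch of how 6-bit quantization and lzma compression can fit together, assuming symmetric quantization with a single scale per tensor and one code stored per byte. The PR's actual scale granularity, bit packing, and lzma settings are not given here, so all of those choices are illustrative.

```python
import lzma


def quantize_int6(weights):
    """Map floats to integer codes in [-31, 31] with one shared scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int6(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.31, -0.15, 0.0, 0.08]
codes, scale = quantize_int6(weights)

# Shift codes into 0..63 so each fits in one byte, then compress the byte stream.
payload = lzma.compress(bytes(c + 32 for c in codes))
restored = [b - 32 for b in lzma.decompress(payload)]
```

On real weight tensors the limited code alphabet is what gives the compressor something to exploit; on a four-element toy input like this one the lzma container overhead dominates.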
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.003,"epochs":3,"freeze_blocks":1,"chunk_tokens":32768}
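The "score-first" ordering can be illustrated with a small loop, assuming it means each chunk is scored with the current weights before the model trains on that same chunk, so no score depends on tokens the model has already adapted to. `score_fn` and `train_fn` are hypothetical callbacks; block freezing and the optimizer are left out.

```python
def split_chunks(tokens, chunk_tokens=32768):
    """Cut the token stream into consecutive chunks of at most `chunk_tokens`."""
    return [tokens[i:i + chunk_tokens] for i in range(0, len(tokens), chunk_tokens)]


def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=3):
    """Score every chunk before adapting on it; return the mean per-token loss."""
    total_loss = 0.0
    for chunk in split_chunks(tokens, chunk_tokens):
        total_loss += score_fn(chunk)   # summed loss, computed before any update on this chunk
        for _ in range(epochs):
            train_fn(chunk)             # adaptation only affects later chunks' scores
    return total_loss / len(tokens)


# Toy callbacks that only record the order of operations.
events = []


def toy_score(chunk):
    events.append(("score", chunk[0]))
    return float(len(chunk))


def toy_train(chunk):
    events.append(("train", chunk[0]))


mean_loss = score_first_ttt(list(range(4)), toy_score, toy_train, chunk_tokens=2, epochs=3)
```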
Evaluation
sliding window eval
parameters: {"single_pass":true,"non_overlapping_segments":true}
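One way to read "single pass, non-overlapping segments" is that every token is scored exactly once while each scored segment still sees earlier tokens as context. The sketch below assumes that reading; `window` and `stride` are hypothetical names and their values are not stated in the PR.

```python
def scoring_segments(num_tokens, window, stride):
    """Return (context_start, score_start, score_end) triples.

    Scored ranges [score_start, score_end) tile the sequence without overlap;
    the context reaches back at most `window` tokens from `score_end`.
    """
    segments = []
    for score_start in range(0, num_tokens, stride):
        score_end = min(score_start + stride, num_tokens)
        context_start = max(0, score_end - window)
        segments.append((context_start, score_start, score_end))
    return segments


segments = scoring_segments(num_tokens=10, window=6, stride=4)
```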
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
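For reference, the two averaging rules in their standard form, assuming the EMA is updated every step with decay 0.997 and SWA takes a running mean of snapshots every 50 steps. How the PR combines the two, and what "Tight" restricts, is not described here.

```python
def ema_update(ema, weights, decay=0.997):
    """Exponential moving average: keep `decay` of the old value."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def swa_update(avg, weights, num_averaged):
    """Running mean over snapshots; `num_averaged` counts snapshots already in `avg`."""
    return [a + (w - a) / (num_averaged + 1) for a, w in zip(avg, weights)]


def is_swa_step(step, swa_interval=50):
    """Assumed snapshot rule: take one every `swa_interval` steps."""
    return step > 0 and step % swa_interval == 0
```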
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
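A warmdown is commonly implemented as a learning-rate multiplier that stays at 1.0 and then falls to 0 over the final steps. The sketch assumes a linear ramp over the last 3500 steps; the shape of the PR's ramp and its total step count are not given on this page.

```python
def warmdown_scale(step, total_steps, warmdown_steps=3500):
    """LR multiplier: 1.0 until the warmdown begins, then linear to 0.0."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(remaining, 0) / warmdown_steps
```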
cosine decay
parameters: {"applied_to":"TTT"}
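The standard cosine decay formula, shown with the TTT learning rate of 0.003 as the starting value. Whether the schedule runs over chunks or optimizer steps, and whether it decays to exactly zero, are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.003):
    """Cosine decay from `base_lr` at step 0 to 0.0 at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (0.5 * (1.0 + math.cos(math.pi * progress)))
```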
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"muon_momentum_warmup_steps":1500}
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT"}
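A sketch of one SGD-with-momentum update at momentum 0.9, assuming the common formulation without dampening or Nesterov correction. The function name is illustrative.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """Return updated (weights, velocity) after one step.

    velocity <- momentum * velocity + grad
    weights  <- weights - lr * velocity
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

With a constant gradient the velocity grows toward grad / (1 - momentum), which is ten times the raw gradient at momentum 0.9.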
Regularization
logit softcap
parameters: {"value":30}
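Logit softcapping is usually written as a scaled tanh. A minimal sketch, assuming that form with the cap of 30 listed above:

```python
import math


def softcap(logit, cap=30.0):
    """Squash a logit smoothly so its magnitude never exceeds `cap`."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large ones saturate near plus or minus 30.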
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
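The scale factors implied by 1/sqrt(layer+1), assuming zero-indexed layers and the 11-layer depth mentioned under XSA. Where the factor is applied inside each block is not specified here.

```python
import math

NUM_LAYERS = 11  # taken from the XSA entry; layers assumed zero-indexed

# One multiplier per layer: 1.0 for the first layer, shrinking with depth.
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(NUM_LAYERS)]
```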

Novel Contributions

  • Block Attention Residuals integrated into the parameter-banked architecture
  • Detached depth source banks with zero-initialized depth queries for depth routing
  • Tuned legal score-first TTT with a retuned learning rate (0.003) and frozen blocks (freeze_blocks: 1)
  • XSA extended to all 11 layers
  • Single-pass non-overlapping sliding evaluation and TTT scoring