PR #1696

open

[Record] Block Attention Residuals + Tuned Legal TTT — val_bpb 1.12242 (8xH100 primary)

by kings-crown

val_bpb

1.1224

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15.72 MB

Training Techniques

Architecture

XSA

Cross-sequence attention applied to all 11 layers.

parameters: {"layers":11}

LeakyReLU

Uses LeakyReLU(0.5)^2 MLP activation.

parameters: {"mlp_multiplier":3}
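
A minimal sketch of the activation named above, assuming that "0.5" is the negative slope of the leaky ReLU and that "^2" squares its output. The function names and the `model_dim` argument are hypothetical and not taken from the PR.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring (assumed reading of LeakyReLU(0.5)^2)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_width(model_dim: int, mlp_multiplier: int = 3) -> int:
    """Hidden width of the MLP, assuming the multiplier scales the model width."""
    return model_dim * mlp_multiplier
```

Under this reading the squared output is never negative: an input of -2.0 is first scaled to -1.0 and then squared to 1.0.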

Partial RoPE

Partial rotary position embeddings applied to a subset of dimensions.

parameters: {"dimensions":"16/64"}
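
As a rough illustration of "16/64", the sketch below rotates 16 of 64 head dimensions and passes the remaining 48 through unchanged. Which 16 dimensions are rotated (here the first 16), the adjacent-pair layout, and the frequency base of 10000 are assumptions rather than details confirmed by the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    if rotary_dims % 2 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Each adjacent pair (2i, 2i+1) rotates at its own frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved, and position 0 leaves the input unchanged.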

weight tying

Tied token embeddings.

parameters: null

VE128

VE128 is used on layers 9 and 10.

parameters: {"layers":[9,10]}

U-Net skip connections

Encoder-decoder U-Net skip handling is threaded through the banked layout.

parameters: null

Block Attention Residuals

Depth-attention residual routing over detached block boundary source banks.

parameters: {"blocks":2,"mix":0.25,"temperature":1.1}
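
One possible reading of the routing described above is sketched here: a depth query scores each stored block-boundary state, a temperature-scaled softmax turns the scores into weights, and the weighted sum is blended into the current hidden state. The convex blend `(1 - mix) * hidden + mix * routed`, the dot-product scoring, and all names are assumptions, and the PR's actual formulation may differ.

```python
import math


def depth_attention_mix(hidden, sources, query, mix=0.25, temperature=1.1):
    """Blend `hidden` with a softmax-weighted sum of earlier block-boundary states."""
    # Score each stored source against the depth query.
    scores = [sum(q * s for q, s in zip(query, src)) / temperature for src in sources]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(sc - peak) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over depth. In an autograd framework the sources would be
    # detached here so no gradient flows back into the earlier blocks.
    routed = [sum(w * src[d] for w, src in zip(weights, sources))
              for d in range(len(hidden))]
    # Assumed blend rule: convex combination controlled by `mix`.
    return [(1.0 - mix) * h + mix * r for h, r in zip(hidden, routed)]
```

With a zero-initialized query every score is zero, so the softmax is uniform and the routed term starts out as the plain mean of the sources.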

Quantization

mixed int6

bits: 6

scope: attention/MLP banks
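
For orientation, a 6-bit code can be produced with symmetric absmax scaling, as in the sketch below. The per-row granularity, the symmetric range of [-31, 31], and the function names are assumptions. The PR's "mixed" scheme is not specified in this summary.

```python
def quantize_int6_row(row):
    """Symmetric absmax quantization of one row to integers in [-31, 31]."""
    qmax = 31  # 6 signed bits span [-32, 31]; this sketch uses the symmetric part
    absmax = max(abs(v) for v in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]
```

Under these assumptions the round-trip error of any element is at most half a quantization step, that is `scale / 2`.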

late QAT

bits: null

scope: full model

Compression

lzma

level: null
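
Since no compression level is recorded, the sketch below uses the default preset of Python's standard `lzma` module to measure a compressed payload. Treating 1 MB as 10**6 bytes and the helper name are assumptions. How the PR serializes its artifact is not described here.

```python
import lzma


def compressed_size_mb(payload: bytes) -> float:
    """Size of the LZMA-compressed payload in megabytes (1 MB = 10**6 bytes here)."""
    return len(lzma.compress(payload)) / 1e6


def roundtrip_ok(payload: bytes) -> bool:
    """Check that decompression restores the original bytes exactly."""
    return lzma.decompress(lzma.compress(payload)) == payload
```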

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.003,"epochs":3,"freeze_blocks":1,"chunk_tokens":32768}
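
The ordering implied by "score-first" can be shown with a toy model: each chunk is scored with the current weights before any update that uses it. The single-scalar model, the squared-error loss, and plain SGD without momentum are stand-ins chosen for brevity. Block freezing and the 32768-token chunking are not modelled.

```python
def score_first_ttt(chunks, weight=0.0, learning_rate=0.003, epochs=3):
    """Score each chunk with the current weight, then adapt on that same chunk."""
    losses = []
    for chunk in chunks:
        # 1) Score first: the recorded loss is computed before any update
        #    that has seen this chunk's targets.
        losses.append(sum((weight - t) ** 2 for t in chunk) / len(chunk))
        # 2) Then adapt, so only later chunks benefit from the update.
        for _ in range(epochs):
            grad = sum(2.0 * (weight - t) for t in chunk) / len(chunk)
            weight -= learning_rate * grad
    return losses, weight
```

Because scoring precedes adaptation, the loss reported for a chunk reflects only what was learned from earlier chunks.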

Evaluation

sliding window eval

parameters: {"single_pass":true,"non_overlapping_segments":true}
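
A single pass over non-overlapping segments means every token is scored exactly once. The helper below shows one way to produce such a tiling. The function name and the segment length are hypothetical.

```python
def non_overlapping_segments(num_tokens: int, segment_len: int):
    """Return (start, end) pairs that tile [0, num_tokens) without overlap."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [(start, min(start + segment_len, num_tokens))
            for start in range(0, num_tokens, segment_len)]
```

For example, 10 tokens with a segment length of 4 yield the segments (0, 4), (4, 8), and (8, 10).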

Weight Averaging

EMA + Tight SWA

parameters: {"ema_decay":0.997,"swa_interval":50}
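
The two averaging parameters can be read as follows: an exponential moving average with decay 0.997 updated every step, and a stochastic weight average built from snapshots taken every 50 steps. How the PR combines the two, and what "Tight" refers to, is not stated. The sketch treats them as independent pieces with hypothetical names.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: keep `decay` of the running average, add the rest from `weights`."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def is_swa_step(step, swa_interval=50):
    """True on steps where a snapshot would be taken for SWA."""
    return step > 0 and step % swa_interval == 0


def swa_average(snapshots):
    """Uniform average over equally shaped weight snapshots."""
    n = len(snapshots)
    return [sum(column) / n for column in zip(*snapshots)]
```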

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}
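
A warmdown of 3500 steps is commonly implemented as a learning-rate multiplier that stays at 1.0 and then falls over the final 3500 steps. The linear shape, the decay to exactly zero, and the `total_steps` argument are assumptions in this sketch.

```python
def warmdown_multiplier(step, total_steps, warmdown_steps=3500):
    """LR multiplier: 1.0 until the final `warmdown_steps`, then linear to 0."""
    start = max(total_steps - warmdown_steps, 0)
    if step < start:
        return 1.0
    return max(total_steps - step, 0) / max(total_steps - start, 1)
```

With a hypothetical run of 10000 steps, the multiplier is 1.0 through step 6500, 0.5 at step 8250, and 0.0 at step 10000.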

cosine decay

parameters: {"applied_to":"TTT"}
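
A standard cosine decay is sketched below for reference. Using the TTT learning rate of 0.003 as the starting value, decaying to zero, and the step granularity are assumptions, since the summary only records that cosine decay is applied to TTT.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.003, min_lr=0.0):
    """Cosine decay from `base_lr` at step 0 to `min_lr` at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```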

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: {"muon_momentum_warmup_steps":1500}

SGD

weight_decay: null

momentum: 0.9

other_params: {"used_for":"TTT"}

Regularization

logit softcap

parameters: {"value":30}
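
Logit softcapping is often written as `cap * tanh(logit / cap)`, which is close to the identity for small logits and bounds every output to the interval [-cap, cap]. The tanh form is an assumption here. The summary records only the cap value of 30.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] using tanh."""
    return cap * math.tanh(logit / cap)
```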

layerwise LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
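
The scale rule can be evaluated directly, as in the sketch below. It assumes that `layer` is a zero-based index. Where the factor is applied inside each layer is not specified in this summary.

```python
import math


def layer_norm_scale(layer):
    """Scale factor 1/sqrt(layer + 1) for a zero-based layer index."""
    return 1.0 / math.sqrt(layer + 1)
```

Under the zero-based assumption the first layer is left unscaled (factor 1.0) and the factor shrinks monotonically with depth, reaching 0.5 at layer index 3.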

Novel Contributions

Block Attention Residuals integrated into the parameter-banked architecture
Detached depth source banks with zero-initialized depth queries for depth routing
Tuned legal score-first TTT with improved LR and frozen blocks
XSA extended to all 11 layers
Single-pass non-overlapping sliding evaluation and TTT scoring