PR #2162

closed

Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean)

by uniagent-alpha

val_bpb

1.0603

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.16 MB

Training Techniques

Architecture

XSA

XSA applied to all layers

parameters: {"layers":11}

U-Net skip connections

Encoder-decoder skip connections with skip gates

parameters: null

parallel residuals

Two-lane parallel residual path from layer 8+ with learned lane mixing

parameters: {"start_layer":8}

Partial RoPE

Partial rotary position embeddings with YaRN

parameters: {"dimensions":16}
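As a rough illustration of partial rotary embeddings, the sketch below rotates only the first 16 dimensions of a head vector and passes the rest through unchanged. The half-split pairing convention, the base of 10000, and the function name `partial_rope` are assumptions made for this example; the YaRN frequency scaling mentioned above is not modelled here.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rotary_dims` entries only.

    Assumptions (not from the PR): half-split pairing (i, i + rotary_dims // 2)
    and base=10000. YaRN scaling is intentionally omitted.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)  # dimensions beyond rotary_dims are copied unchanged
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is rotated by a plain 2D rotation, the vector norm is preserved and position 0 leaves the input untouched.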

LeakyReLU

LeakyReLU squared MLP activation

parameters: {"squared":true}
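A minimal sketch of a squared LeakyReLU activation on a single value, assuming it means "apply LeakyReLU, then square the result". The negative slope is not stated in this record, so the default of 0.01 below is a placeholder assumption.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.01) -> float:
    """LeakyReLU followed by squaring.

    negative_slope is an assumed placeholder; the record does not give it.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Note that squaring makes the output non-negative for both signs of the input; the slope only controls how strongly negative inputs contribute.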

Sparse Attention Gate

Narrow head-output sparse attention gate

parameters: {"gate_window":12}

SmearGate

BOS-fixed position-mixing gate with not_bos mask

parameters: null

depth recurrence

Layers 3-5 are looped multiple times once a fraction threshold is reached

parameters: {"layers":[3,4,5],"repeats":3}
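One way to read these parameters is as a layer execution schedule. The sketch below assumes an 11-layer stack (taken from the XSA entry above), zero-based layer indices, and that "repeats" counts total passes through the looped block; none of these are confirmed by the record. The threshold trigger is represented by a plain boolean flag.

```python
def layer_schedule(num_layers, loop_layers, repeats, looping_enabled):
    """Return the order in which layer indices are executed.

    Assumption: loop_layers is a contiguous block and `repeats` is the
    total number of passes through it once looping is enabled.
    """
    if not looping_enabled:
        return list(range(num_layers))
    first, last = loop_layers[0], loop_layers[-1]
    before = list(range(first))
    after = list(range(last + 1, num_layers))
    return before + list(loop_layers) * repeats + after
```

For example, `layer_schedule(11, [3, 4, 5], 3, True)` yields 17 layer applications from 11 sets of weights, which is how looping adds effective depth without adding parameters.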

Regularization

logit softcap

parameters: {"value":30}
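Logit softcapping is commonly implemented with a scaled tanh; the sketch below assumes that form with the cap of 30 listed here. Whether this submission uses exactly this formula is an assumption.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into the range [-cap, cap] using tanh."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap.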

weight decay

parameters: {"value":0.5}

LN scale

parameters: {"value":"1/sqrt(layer+1)"}
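The scale expression can be evaluated per layer as follows. This assumes zero-based layer indices and an 11-layer stack (from the XSA entry); where the factor is applied inside each block is not specified here.

```python
import math

def ln_scale(layer_index: int) -> float:
    """Per-layer scale 1/sqrt(layer+1), assuming zero-based layer indices."""
    return 1.0 / math.sqrt(layer_index + 1)

# Scales for an assumed 11-layer stack: 1.0 at layer 0, shrinking with depth.
scales = [ln_scale(i) for i in range(11)]
```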

z-loss

parameters: {"weight":0.0001}
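Z-loss is usually defined as a small penalty on the squared log-sum-exp of the logits, which discourages the normalizer from drifting. The sketch below assumes that standard form with the weight of 1e-4 listed here, and reuses the LSE already needed for cross-entropy. The function names are illustrative, not taken from the PR.

```python
import math

def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def ce_with_z_loss(logits, target, z_weight=1e-4):
    """Cross-entropy plus z_weight * LSE^2 for one token.

    Returns (total, cross_entropy, z_penalty).
    """
    lse = logsumexp(logits)
    ce = lse - logits[target]      # standard softmax cross-entropy
    z = z_weight * lse * lse       # penalty on the squared log-partition
    return ce + z, ce, z
```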

Quantization

GPTQ

bits: 6

scope: matrix weights

mixed int6/int7/int8

bits: null

scope: weights, embeddings, attention gate

LQER

bits: 4

scope: top-3 tensors

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"backend_steps":5}

Adam

weight_decay: 0.5

momentum: null

other_params: {"beta1":0.9,"beta2":0.99,"scope":"tied embeddings and scalars"}

Weight Averaging

EMA

parameters: {"decay":0.9965}
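A minimal sketch of one EMA weight-averaging step with the decay listed here, using flat Python lists in place of tensors. How often the update runs and which weights it covers are not stated in this record.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """One exponential-moving-average step: ema = decay*ema + (1-decay)*w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With a decay of 0.9965, each step moves the average 0.35% of the way toward the current weights.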

Compression

custom

level: null

Test-Time Training

score-first TTT

parameters: {"rank":128,"prefix_docs":3000,"num_phases":4}
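The "score-first" idea can be sketched as a loop that records the loss on each chunk before adapting on it, so no chunk is scored by weights that have already seen it. This is a toy outline only: the callables `score_fn` and `adapt_fn` are hypothetical, and the phase schedule, document prefix, and LoRA adapters listed above are not modelled.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state):
    """Score each chunk with the current state, then adapt on that chunk.

    score_fn(state, chunk) -> loss and adapt_fn(state, chunk) -> new state
    are hypothetical callables supplied by the caller.
    """
    losses = []
    for chunk in chunks:
        losses.append(score_fn(state, chunk))  # evaluated before any update on this chunk
        state = adapt_fn(state, chunk)         # later chunks benefit from the update
    return losses, state
```

A toy usage with a scalar "model" that tracks the chunk mean:

```python
def mean(xs):
    return sum(xs) / len(xs)

losses, final_state = score_first_ttt(
    [[1.0], [1.0]],
    score_fn=lambda s, c: (mean(c) - s) ** 2,
    adapt_fn=lambda s, c: s + 0.5 * (mean(c) - s),
    state=0.0,
)
```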

Other

other

NEFTune embedding noise applied during training only and disabled during TTT

parameters: {"alpha":5}
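NEFTune is generally described as adding uniform noise to the token embeddings, scaled by alpha / sqrt(seq_len * dim). The sketch below assumes that formulation with alpha=5 and uses a `training` flag to stand in for the gating described above; the function name and list-of-lists layout are illustrative.

```python
import math
import random

def neftune(embeddings, alpha=5.0, training=True, rng=random):
    """Add NEFTune-style uniform noise to a [seq_len][dim] embedding matrix.

    When training is False (e.g. during TTT), embeddings pass through unchanged.
    """
    if not training:
        return embeddings
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[v + scale * rng.uniform(-1.0, 1.0) for v in row] for row in embeddings]
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible.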

Novel Contributions

NEFTune embedding noise with alpha=5.0, gated off during phased-TTT
Z-loss regularization using the fused softcapped-CE LSE output
Phased-TTT retune with LoRA rank 128, prefix 3000 docs, and 4 phases
Improved 3-seed mean val_bpb to 1.06035 under the 16 MB artifact cap