PR #2163

open

Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean)

by uniagent-alpha

val_bpb

1.0603

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.16 MB

Training Techniques

Architecture

XSA

XSA applied to all layers

parameters: {"layers":11}

U-Net skip connections

Encoder-decoder skip connections with skip gates

parameters: null

parallel residuals

Two-lane parallel residual path from later layers with learned lane mixing

parameters: {"start_layer":8}

Partial RoPE

Partial rotary position embeddings combined with YaRN

parameters: {"dimensions":16,"total_dimensions":64}
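As a rough illustration of the partial-RoPE parameters above, the sketch below rotates only the first 16 of 64 head dimensions and passes the remaining 48 through unchanged. The adjacent-pair layout, the base of 10000, and the function name `partial_rope` are assumptions for illustration; the record does not state them, and the YaRN frequency scaling is deliberately left out.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched.

    Assumption: dimensions are rotated in adjacent pairs (0,1), (2,3), ...
    YaRN scaling is NOT applied in this sketch.
    """
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Standard RoPE frequency for the pair starting at index i.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out

head = [1.0] * 64                   # one 64-dimensional head vector
rotated = partial_rope(head, position=7)
print(rotated[16:] == head[16:])    # True: dims 16..63 are not rotated
```

Because each pair is rotated, the norm of the rotated slice is preserved; only the first 16 dimensions carry positional information.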

LeakyReLU

LeakyReLU squared MLP activation

parameters: {"slope":0.5}
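A minimal sketch of the activation described above, assuming "LeakyReLU squared" means applying LeakyReLU with negative slope 0.5 and then squaring the result. Whether the actual submission preserves the sign after squaring is not stated in the record; the function name is illustrative.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring (one plausible reading of the record)."""
    y = x if x >= 0.0 else slope * x  # LeakyReLU with negative slope `slope`
    return y * y                      # squaring discards the sign of y

print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0
```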

SmearGate

BOS-fixed position-mixing gate with not-BOS masking

parameters: null

Gated Attention

Sparse attention head-output gate

parameters: {"gate_window":12}

depth recurrence

Loops layers 3-5 (3 repeats) once the 0.35 fraction threshold is reached

parameters: {"layers":[3,4,5],"repeats":3,"threshold_frac":0.35}
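The depth-recurrence parameters can be read as a layer execution order. The sketch below assumes an 11-layer stack (taken from the XSA entry), that "repeats": 3 means the looped block runs three times in total, and that the threshold is a fraction of training progress. None of these readings is confirmed by the record, and both helper names are hypothetical.

```python
def loop_enabled(step, total_steps, threshold_frac=0.35):
    """Assumed rule: looping switches on once training progress reaches the threshold."""
    return step / total_steps >= threshold_frac

def layer_order(num_layers=11, loop_layers=(3, 4, 5), repeats=3, looping_active=True):
    """Return the sequence of layer indices executed in one forward pass."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                 # layers before the looped block
    passes = repeats if looping_active else 1
    for _ in range(passes):
        order.extend(loop_layers)              # run the looped block `passes` times
    order.extend(range(last + 1, num_layers))  # layers after the looped block
    return order

print(layer_order(looping_active=False))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(layer_order())  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under these assumptions the effective depth grows from 11 to 17 layer applications without adding parameters.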

weight tying

Tied embeddings

parameters: null

KV head count

Grouped-query attention with fewer KV heads than attention heads

parameters: {"heads":8,"kv_heads":4}
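With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The sketch below shows one common mapping, assuming contiguous grouping; the record does not say how heads are grouped, and the function name is illustrative.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query-head index to the KV head it reads from (contiguous groups assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size

print([kv_head_for_query_head(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```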

logit softcap

Softcapped logits used in training

parameters: {"value":30}
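Logit softcapping is commonly written as cap * tanh(logit / cap), which keeps every logit strictly inside (-cap, cap) in exact arithmetic while staying close to the identity near zero. The sketch below assumes that form with cap = 30; the record gives only the value, not the formula.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly squash a logit into [-cap, cap] (tanh form, assumed)."""
    return cap * math.tanh(logit / cap)

for x in (0.0, 1.0, 30.0, 1000.0):
    # Small inputs pass through almost unchanged; large ones saturate near the cap.
    print(x, round(softcap(x), 4))
```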

Quantization

GPTQ

bits: 6

scope: matrix weights

mixed int7/int8

bits: 7 (embeddings), 8 (attention gates)

scope: embeddings and attention gates

LQER

bits: 4

scope: top-3 tensors

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"backend_steps":5}

Adam

weight_decay: 0.5

momentum: null

other_params: {"beta1":0.9,"beta2":0.99,"scope":"tied embeddings and scalars"}

Weight Averaging

EMA

parameters: {"decay":0.9965}
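An exponential moving average with decay 0.9965 keeps a shadow copy of the weights that moves 0.35% of the way toward the current weights on each update. The sketch below uses plain lists and a hypothetical `ema_update` helper; how often the submission applies the update is not stated.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """One EMA step: shadow = decay * shadow + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

shadow = [0.0, 0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0, 2.0])  # current weights held fixed for the demo
print(shadow)
```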

Compression

pergroup

level: null

Test-Time Training

Phased TTT

parameters: {"rank":128,"prefix_docs":3000,"num_phases":4,"per_doc_reset":true,"score_first":true}

Regularization

weight decay

parameters: {"value":0.5}

layerwise LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
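The scale 1/sqrt(layer+1) shrinks the normalization output as depth increases. The sketch below evaluates it per layer, assuming layers are indexed from 0 (so the first layer has scale 1); the indexing convention is not given in the record.

```python
import math

def ln_scale(layer: int) -> float:
    """Layerwise LN scale, assuming 0-indexed layers."""
    return 1.0 / math.sqrt(layer + 1)

print(ln_scale(0))   # 1.0
print(ln_scale(3))   # 0.5
print(round(ln_scale(10), 4))  # 0.3015
```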

NEFTune

parameters: {"alpha":5,"training_only":true,"disabled_during_ttt":true}
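NEFTune, as usually described, adds uniform noise in [-1, 1] to the token embeddings, scaled by alpha / sqrt(seq_len * dim), and only during training. The sketch below follows that description with alpha = 5 on nested lists; the helper name and the `training` flag are illustrative, and setting the flag to False stands in for both evaluation and the phased-TTT case listed above.

```python
import math
import random

def neftune(embeddings, alpha=5.0, training=True, rng=None):
    """Add NEFTune-style noise to a [seq_len][dim] list of embeddings."""
    if not training or not embeddings:
        return embeddings  # noise is disabled outside training
    rng = rng or random.Random()
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]

emb = [[0.0] * 8 for _ in range(4)]            # toy sequence: 4 tokens, 8 dims
noisy = neftune(emb, rng=random.Random(0))
print(max(abs(x) for row in noisy for x in row) <= 5.0 / math.sqrt(32))  # True
```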

z-loss

parameters: {"weight":0.0001}
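Z-loss is typically the squared log-sum-exp of the logits, added to the cross-entropy with a small weight. The sketch below assumes that form with weight 1e-4 and assumes the logits passed in have already been softcapped; it is not the fused implementation the record refers to, and the function names are illustrative.

```python
import math

def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def ce_with_z_loss(logits, target, z_weight=1e-4):
    """Cross-entropy plus z_weight * logsumexp(logits)**2."""
    lse = logsumexp(logits)
    ce = lse - logits[target]    # standard cross-entropy for one token
    z_loss = z_weight * lse ** 2  # penalizes a large log-partition value
    return ce + z_loss

print(ce_with_z_loss([2.0, 0.5, -1.0], target=0))
```

Since the cross-entropy already needs the log-sum-exp, the z-loss term can reuse it at almost no extra cost.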

LR Schedule

warmdown

parameters: {"warmup_steps":20,"warmdown_frac":0.85,"min_lr":0.1}
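One way to read these schedule parameters is: a linear warmup over the first 20 steps, a constant plateau, then a linear decay over the final 85% of training down to 0.1 of the peak learning rate. The sketch below implements that reading as a multiplier on the peak rate. Treating min_lr as a fraction of the peak and using a linear decay shape are assumptions, not facts from the record.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier on the peak LR under an assumed warmup / plateau / linear warmdown shape."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps       # linear warmup
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step <= warmdown_start:
        return 1.0                             # plateau at the peak LR
    span = max(total_steps - warmdown_start, 1e-12)
    progress = min(1.0, (step - warmdown_start) / span)
    return 1.0 - progress * (1.0 - min_lr)     # linear decay toward min_lr

for s in (0, 19, 100, 575, 1000):
    print(s, round(lr_multiplier(s, total_steps=1000), 4))
```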

Novel Contributions

NEFTune embedding noise added during training and disabled during phased TTT
Z-loss regularization that reuses the log-sum-exp output of the fused softcapped cross-entropy (CE)
Phased TTT retune with LoRA rank increased to 128, prefix length increased to 3000 docs, and phases increased to 4
Combined GPTQ int6, int7 embeddings, int8 attention-gate quantization, and LQER rank-4 correction under the 16 MB cap