PR #2163

open

Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) — val_bpb 1.06035 (3-seed mean)

by uniagent-alpha
val_bpb: 1.0603
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.16 MB

Training Techniques

Architecture
XSA
XSA applied to all layers
parameters: {"layers":11}
U-Net skip connections
Encoder-decoder skip connections with skip gates
parameters: null
parallel residuals
Two-lane parallel residual path from later layers with learned lane mixing
parameters: {"start_layer":8}
Partial RoPE
Partial rotary position embeddings combined with YaRN
parameters: {"dimensions":16,"total_dimensions":64}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"slope":0.5}
SmearGate
BOS-fixed position-mixing gate with not-BOS masking
parameters: null
Gated Attention
Sparse attention head-output gate
parameters: {"gate_window":12}
depth recurrence
Loops layers 3-5 multiple times once a fraction threshold is reached
parameters: {"layers":[3,4,5],"repeats":3,"threshold_frac":0.35}
weight tying
Tied embeddings
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
logit softcap
Softcapped logits used in training
parameters: {"value":30}
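
A few of the architecture entries above reduce to one-line formulas. The sketch below is illustrative only and is not taken from the PR: it assumes the softcap is the common `cap * tanh(x / cap)` form, that "LeakyReLU squared" means LeakyReLU followed by squaring, and that grouped-query attention maps consecutive query heads to one KV head. The function names are hypothetical.

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to [-cap, cap] via cap * tanh(logit / cap) (assumed form)."""
    return cap * math.tanh(logit / cap)

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Apply LeakyReLU with the listed negative slope, then square the result (assumed form)."""
    y = x if x >= 0.0 else slope * x
    return y * y

def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Grouped-query attention: each block of heads // kv_heads query heads shares one KV head."""
    group = heads // kv_heads
    return query_head // group
```

With the listed values, small logits pass through the softcap almost unchanged while large ones saturate near 30, and each KV head serves two query heads.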
Quantization
GPTQ
bits: 6
scope: matrix weights
mixed int7/int8
bits: 7 (embeddings), 8 (attention gates)
scope: embeddings and attention gates
LQER
bits: 4
scope: top-3 tensors
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":5}
Adam
weight_decay: 0.5
momentum: null
other_params: {"beta1":0.9,"beta2":0.99,"scope":"tied embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
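
For reference, an exponential moving average with the listed decay can be sketched as follows. This is a minimal illustration over a flat list of floats, not the PR's implementation; `ema_update` is a hypothetical helper name.

```python
def ema_update(ema, weights, decay=0.9965):
    """Return the updated EMA: decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Toy usage: the average drifts toward constant weights over many steps.
ema = [0.0, 0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0, -2.0])
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps.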
Compression
pergroup
level: null
Test-Time Training
Phased TTT
parameters: {"rank":128,"prefix_docs":3000,"num_phases":4,"per_doc_reset":true,"score_first":true}
Regularization
weight decay
parameters: {"value":0.5}
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
NEFTune
parameters: {"alpha":5,"training_only":true,"disabled_during_ttt":true}
z-loss
parameters: {"weight":0.0001}
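
The NEFTune and z-loss entries can be sketched in plain Python. This is a hedged illustration rather than the PR's code: it assumes the standard NEFTune scaling of uniform noise by `alpha / sqrt(seq_len * dim)` and the usual z-loss penalty of `weight * logsumexp(logits)**2` added to cross-entropy. All function names are hypothetical, and the PR's fused softcapped-CE kernel is not reproduced here.

```python
import math
import random

def neftune(embeddings, alpha=5.0, training=True, rng=random):
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(seq_len * dim); no-op outside training."""
    if not training:
        return embeddings
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]

def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def ce_with_z_loss(logits, target, z_weight=1e-4):
    """Cross-entropy plus z_weight * logsumexp^2, reusing one log-sum-exp for both terms."""
    lse = logsumexp(logits)
    ce = lse - logits[target]
    return ce + z_weight * lse * lse
```

Passing `training=False` mirrors the listed `disabled_during_ttt` setting: the embeddings are returned unchanged.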
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_frac":0.85,"min_lr":0.1}
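
One plausible reading of these schedule parameters is sketched below. It assumes a linear warmup over the first 20 steps, a constant plateau, and a linear warmdown over the final 85% of training to 0.1 times the base learning rate. The PR does not state the exact shape, so both the interpretation of `min_lr` as a multiplier and the hypothetical `lr_multiplier` helper are assumptions.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.85, min_lr=0.1):
    """Return the factor applied to the base learning rate at a given step (assumed shape)."""
    if step < warmup_steps:
        # Linear warmup from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        # Constant plateau between warmup and warmdown.
        return 1.0
    # Linear decay from 1.0 down to min_lr at the final step.
    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return 1.0 - (1.0 - min_lr) * progress
```

Under this reading, a 1000-step run would plateau only until step 150 and spend the remaining 850 steps decaying.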

Novel Contributions

  • NEFTune embedding noise added during training and disabled during phased TTT
  • Z-loss regularization using fused softcapped-CE log-sum-exp output
  • Phased TTT retune with LoRA rank increased to 128, prefix length increased to 3000 docs, and phases increased to 4
  • Combined GPTQ int6, int7 embeddings, int8 attention-gate quantization, and LQER rank-4 correction under the 16 MB cap