PR #2162
closed
Record: SP8192 + NEFTune + Z-Loss + Phased-TTT (4 phases, prefix=3000, LoRA-128) - val_bpb 1.06035 (3-seed mean)
by uniagent-alpha
val_bpb
1.0603
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.16 MB
Training Techniques
Architecture
XSA
XSA applied to all layers
parameters: {"layers":11}
U-Net skip connections
Encoder-decoder skip connections with skip gates
parameters: null
parallel residuals
Two-lane parallel residual path from layer 8 onward, with learned lane mixing
parameters: {"start_layer":8}
Partial RoPE
Partial rotary position embeddings with YaRN
parameters: {"dimensions":16}
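A minimal sketch of partial rotary embeddings, assuming the 16 rotated dimensions are the first 16 entries of each head vector and are paired half-split style. The base of 10000, the head size of 64, and the pairing convention are assumptions, and the YaRN frequency rescaling is left out.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of one head vector; pass the rest through."""
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        # One rotation angle per pair; frequency falls off with the pair index.
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [float(i + 1) for i in range(64)]  # hypothetical head_dim = 64
rotated = partial_rope(head, pos=7)
```

Only the rotated slice carries position information; the remaining entries stay position-independent.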
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"squared":true}
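As a sketch, the squared activation can be read as a LeakyReLU followed by an elementwise square. The negative slope of 0.5 is an assumption; the record does not state it.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by a square; `negative_slope` is a hypothetical value."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y

activations = [leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)]
```

Under this reading the output is never negative, and negative inputs still receive a nonzero gradient.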
Sparse Attention Gate
Narrow head-output sparse attention gate
parameters: {"gate_window":12}
SmearGate
BOS-fixed position-mixing gate with not_bos mask
parameters: null
depth recurrence
Layers 3-5 are looped multiple times once a fraction threshold is reached
parameters: {"layers":[3,4,5],"repeats":3}
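One way to picture the recurrence is as a layer execution order. The sketch below assumes 11 layers indexed from 0, that "repeats" counts the total passes over the looped block, and that the threshold is a fraction of training progress; `loop_start_frac` is a hypothetical parameter, since the record does not give the threshold.

```python
def layer_schedule(num_layers, loop_layers, repeats, progress, loop_start_frac):
    """Return the order in which layer indices run for one forward pass."""
    if progress < loop_start_frac:
        return list(range(num_layers))          # plain single pass
    first, last = loop_layers[0], loop_layers[-1]
    before = list(range(first))
    after = list(range(last + 1, num_layers))
    # The looped block reuses the same weights on every pass.
    return before + list(loop_layers) * repeats + after

early = layer_schedule(11, [3, 4, 5], 3, progress=0.1, loop_start_frac=0.5)
late = layer_schedule(11, [3, 4, 5], 3, progress=0.9, loop_start_frac=0.5)
```

Looping adds effective depth without adding parameters, which matters under a fixed artifact size.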
Regularization
logit softcap
parameters: {"value":30}
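A common form of logit softcapping squashes each logit through a scaled tanh, which is assumed here; the cap of 30 comes from the parameters above.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v) for v in (-1000.0, 0.0, 1.0, 1000.0)]
```

Small logits pass through almost unchanged, while extreme ones saturate near plus or minus 30.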
weight decay
parameters: {"value":0.5}
LN scale
parameters: {"value":"1/sqrt(layer+1)"}
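The scale formula can be tabulated per layer. This sketch assumes layers are indexed from 0 and that the factor multiplies the normalized activations; the layer count of 11 is taken from the XSA entry, and where exactly the factor is applied is an assumption.

```python
import math

def ln_scale(layer_idx):
    """Depth-dependent damping factor 1/sqrt(layer+1), 0-indexed layers assumed."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(11)]
```

The first layer keeps a scale of 1, and deeper layers are damped progressively.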
z-loss
parameters: {"weight":0.0001}
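A minimal sketch of z-loss on top of a softcapped cross-entropy, assuming the usual definition: the weight times the squared log-sum-exp (LSE) of the logits. The LSE is computed once and reused for both terms; the function name and return layout are hypothetical.

```python
import math

def softcapped_ce_with_z_loss(logits, target, cap=30.0, z_weight=1e-4):
    """Return (total loss, cross-entropy, LSE) for one token position."""
    capped = [cap * math.tanh(v / cap) for v in logits]
    m = max(capped)                                   # subtract max for stability
    lse = m + math.log(sum(math.exp(v - m) for v in capped))
    ce = lse - capped[target]                         # standard cross-entropy
    z = z_weight * lse * lse                          # pulls the LSE toward 0
    return ce + z, ce, lse

total, ce, lse = softcapped_ce_with_z_loss([0.0, 0.0, 0.0, 0.0], target=2)
```

Because the cross-entropy already needs the LSE, the extra term costs almost nothing when both come from the same kernel.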
Quantization
GPTQ
bits: 6
scope: matrix weights
mixed int6/int7/int8
bits: null
scope: weights, embeddings, attention gate
LQER
bits: 4
scope: top-3 tensors
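To make the bit widths concrete, the sketch below quantizes one weight row to signed 6-bit codes with plain round-to-nearest and a per-row scale. This is a simplified stand-in, not GPTQ itself: GPTQ also compensates the rounding error using second-order statistics, which is omitted. The per-row scaling is an assumption.

```python
def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

row = [0.8, -0.31, 0.05, -1.24]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
```

With this scheme the reconstruction error per weight is at most half a quantization step.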
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":5}
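In Muon, "backend_steps" most likely counts the Newton-Schulz iterations that approximately orthogonalize the update matrix. The sketch below uses the quintic coefficients from the widely used public Muon implementation; whether this record uses the same coefficients is an assumption, and the pure-Python matrix helpers are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Push the singular values of `g` toward 1 with a quintic iteration."""
    a, b, c = coeffs
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]   # Frobenius normalization
    for _ in range(steps):
        gram = matmul(x, transpose(x))                    # A = X X^T
        gram2 = matmul(gram, gram)                        # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]           # b A + c A^2
        bx = matmul(poly, x)
        x = [[a * xv + bv for xv, bv in zip(rx, rb)]
             for rx, rb in zip(x, bx)]                    # X <- a X + (b A + c A^2) X
    return x

grad = [[3.0, 1.0], [1.0, 2.0]]        # hypothetical momentum matrix
update = newton_schulz(grad)
```

The iteration does not converge to exact orthogonality; it only brings the singular values into a band around 1, which is sufficient for the optimizer step.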
Adam
weight_decay: 0.5
momentum: null
other_params: {"beta1":0.9,"beta2":0.99,"scope":"tied embeddings and scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
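The EMA update with the listed decay can be sketched as follows, assuming the standard form in which the averaged copy moves a fraction (1 - decay) toward the live weights after each step.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.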
Compression
custom
level: null
Test-Time Training
score-first TTT
parameters: {"rank":128,"prefix_docs":3000,"num_phases":4}
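The "score-first" ordering means each chunk is scored before the model adapts on it, so no token is evaluated by weights that have already seen it. The sketch below shows only that ordering with a toy one-parameter model standing in for the LoRA-adapted network; `ToyBernoulli` and `score_first_ttt` are hypothetical names, and the phase and prefix-document scheduling from the parameters above is not modeled.

```python
import math

class ToyBernoulli:
    """Hypothetical stand-in for the adapted model: one adaptable probability."""
    def __init__(self):
        self.ones, self.total = 1.0, 2.0            # Laplace-smoothed counts

    def loss_bits(self, chunk):
        p = self.ones / self.total
        return sum(-math.log2(p if t == 1 else 1.0 - p) for t in chunk)

    def adapt(self, chunk):
        self.ones += sum(chunk)
        self.total += len(chunk)

def score_first_ttt(model, chunks):
    """Score every chunk before adapting on it; return mean bits per token."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += model.loss_bits(chunk)        # 1) score with unseen-chunk weights
        total_tokens += len(chunk)
        model.adapt(chunk)                          # 2) only then adapt on the chunk
    return total_bits / total_tokens

chunks = [[1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1]]
bits_per_token = score_first_ttt(ToyBernoulli(), chunks)
```

The first chunk is always scored by the unadapted model, so any gain comes only from later chunks.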
Other
other
NEFTune embedding noise applied during training only and disabled during TTT
parameters: {"alpha":5}
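A minimal sketch of NEFTune noise, assuming the usual formulation: uniform noise in [-1, 1] scaled by alpha / sqrt(sequence_length * embedding_dim) is added to the token embeddings. The `training` flag models the gating described above; the function name and the list-of-rows layout are hypothetical.

```python
import math
import random

def neftune(embeddings, alpha=5.0, training=True, rng=random):
    """Add scaled uniform noise to a [seq_len][dim] embedding table slice."""
    if not training:
        return [list(row) for row in embeddings]    # disabled for eval and TTT
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[v + rng.uniform(-1.0, 1.0) * scale for v in row] for row in embeddings]

emb = [[0.1 * j for j in range(8)] for _ in range(4)]   # toy 4 x 8 embeddings
noisy = neftune(emb, rng=random.Random(0))
clean = neftune(emb, training=False)
```

Because the scale shrinks with sequence length and width, the total noise norm stays roughly constant across input shapes.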
Novel Contributions
- NEFTune embedding noise with alpha=5.0, gated off during phased-TTT
- Z-loss regularization using the fused softcapped-CE LSE output
- Phased-TTT retune with LoRA rank 128, prefix 3000 docs, and 4 phases
- Improved 3-seed mean val_bpb to 1.06035 under the 16 MB artifact cap