PR #1467

open

Non-record: XSA-11 + Parallel Residual (L7+) + Depth Recurrence — val_bpb 1.1056 (1-seed, 1×H100)

by PhamPhuHoa-23
val_bpb: 1.1056
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.65 MB

Training Techniques

Architecture
XSA
Applied XSA to all 11 layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding component
parameters: {"dimensions":112,"size":3072}
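The PR gives only the table size (3072) and width (112) for the bigram hash embedding, not the hash itself. A minimal sketch, assuming a multiplicative hash of (previous, current) token-id pairs and a padding id of 0 for the first position:

```python
import random

TABLE_SIZE = 3072   # "size" from the parameters
BIGRAM_DIM = 112    # "dimensions" from the parameters

random.seed(0)
# Hypothetical learned table, randomly initialised here for illustration.
bigram_table = [[random.gauss(0, 0.02) for _ in range(BIGRAM_DIM)]
                for _ in range(TABLE_SIZE)]

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # The actual hash function is not specified in the PR; this
    # multiplicative mix is a placeholder with the same interface.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % TABLE_SIZE

def bigram_features(tokens: list[int]) -> list[list[float]]:
    # One 112-dim bigram vector per position; the first token pairs
    # with an assumed padding id of 0.
    out, prev = [], 0
    for t in tokens:
        out.append(bigram_table[bigram_bucket(prev, t)])
        prev = t
    return out
```

In a model this feature would typically be added to (or concatenated with) the ordinary token embedding before the first block.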
Partial RoPE
Rotary positional embedding applied partially
parameters: {"dimensions":16,"base_dimensions":64}
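Per the parameters, only 16 of the 64 head dimensions get rotary treatment; the rest pass through unrotated. A sketch of that partial rotation on a single head vector (standard RoPE base of 10000 assumed, as the PR does not state it):

```python
import math

HEAD_DIM = 64    # base_dimensions
ROPE_DIM = 16    # dimensions actually rotated (8 cos/sin pairs)

def partial_rope(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    # Rotate the first ROPE_DIM dims of one head vector by a
    # position-dependent angle; leave the remaining dims unchanged.
    out = list(x)
    for i in range(ROPE_DIM // 2):
        theta = pos / (base ** (2 * i / ROPE_DIM))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Restricting rotation to a prefix of the head keeps some channels position-agnostic, which is the usual motivation for partial RoPE.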
VE128
Value residual enhancement in selected layers
parameters: {"layers":[9,10]}
SmearGate
Position-mixing gate
parameters: null
Parallel Residual
Parallel residual connections in later layers
parameters: {"start_layer":7}
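In a parallel-residual block the attention and MLP branches both read the block input and their outputs are summed, rather than the MLP reading the post-attention state. A sketch of the switch at start_layer 7 (attn/mlp are stand-in callables; norms omitted for brevity):

```python
def block(x, attn, mlp, layer, start_layer=7):
    # From start_layer on, both branches read the same input and
    # their outputs are summed (parallel residual).
    if layer >= start_layer:
        return x + attn(x) + mlp(x)
    # Earlier layers keep the standard sequential residual.
    h = x + attn(x)
    return h + mlp(h)
```

With toy branches attn(x)=2x and mlp(x)=3x, a sequential block maps 1 to 12 while a parallel block maps 1 to 6, which makes the structural difference easy to check.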
Depth Recurrence
Depth recurrence activated during training
parameters: {"layers":[4,5],"start_step":3000}
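The parameters say layers 4 and 5 become recurrent once training reaches step 3000. The PR does not state how many extra passes are taken; the sketch below assumes a single weight-tied repeat of each recurrent layer:

```python
RECUR_LAYERS = {4, 5}
START_STEP = 3000

def forward(x, layers, step):
    # Run the stack; after START_STEP, layers 4 and 5 are applied a
    # second time with the same weights (depth recurrence). One extra
    # pass is an assumption; the PR does not give the repeat count.
    for i, layer in enumerate(layers):
        x = layer(x)
        if step >= START_STEP and i in RECUR_LAYERS:
            x = layer(x)
    return x
```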
Regularization
Layerwise LN Scale
Per-layer LayerNorm output scaling
parameters: {"formula":"1/sqrt(layer+1)"}
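The stated formula scales each layer's LayerNorm output by 1/sqrt(layer+1) (0-indexed), so deeper layers make progressively smaller additive contributions to the residual stream:

```python
import math

def ln_scale(layer: int) -> float:
    # 1/sqrt(layer+1) with 0-indexed layers, per the PR's formula.
    return 1.0 / math.sqrt(layer + 1)

# Scales for the 11-layer stack: 1.0 at layer 0 down to ~0.30 at layer 10.
scales = [ln_scale(l) for l in range(11)]
```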
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
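With decay 0.997 and an SWA sample every 50 steps, both averages can be tracked side by side. A minimal sketch over flat parameter lists (real code would operate on tensors):

```python
class AveragedWeights:
    # Tracks an EMA every step and an SWA running mean sampled every
    # swa_every steps. Per the contributions list, the EMA checkpoint
    # was the one ultimately selected.
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = list(params)
        self.swa = list(params)
        self.decay = ema_decay
        self.swa_every = swa_every
        self.n_swa = 1  # init weights count as the first SWA sample

    def update(self, params, step):
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if step % self.swa_every == 0:
            self.n_swa += 1
            self.swa = [s + (p - s) / self.n_swa
                        for s, p in zip(self.swa, params)]
```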
Quantization
GPTQ
bits: 6
scope: all
STE QAT
bits: null
scope: all
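STE QAT fake-quantizes weights in the forward pass while the backward pass copies gradients through the rounding op unchanged (the straight-through estimator). The bit width is listed as null; 6 bits is assumed below to match the GPTQ setting, and is an assumption:

```python
def fake_quant(w: float, bits: int = 6, max_abs: float = 1.0) -> float:
    # Forward pass of symmetric fake quantization: clip, snap to a
    # bits-wide grid, return the dequantized value. During training the
    # backward pass would pass the gradient straight through (STE);
    # only the forward rounding is shown here.
    levels = 2 ** (bits - 1) - 1          # 31 levels per side for 6 bits
    step = max_abs / levels
    clipped = max(-max_abs, min(max_abs, w))
    return round(clipped / step) * step
```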
Compression
Brotli
level: 11
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
LR Schedule
Warmdown
parameters: {"warmdown_steps":3500}
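A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 3500 steps. The PR gives only the warmdown length, so the total step count below is a placeholder:

```python
TOTAL_STEPS = 10000       # placeholder; not stated in the PR
WARMDOWN_STEPS = 3500

def lr_mult(step: int) -> float:
    # 1.0 until the warmdown window begins, then linear decay to 0.
    start = TOTAL_STEPS - WARMDOWN_STEPS
    if step < start:
        return 1.0
    return max(0.0, (TOTAL_STEPS - step) / WARMDOWN_STEPS)
```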
Evaluation
Sliding Window Eval
parameters: {"stride":64}
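With a stride of 64, each evaluation window shifts by 64 tokens and only the newly covered tokens are scored, so almost every token is predicted with near-full left context. The window length below is an assumption; the PR specifies only the stride:

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 64):
    # Yield (start, end, score_from) spans: the model runs on
    # [start, end) but loss is counted only on [score_from, end),
    # the tokens not already scored by an earlier window.
    spans = []
    start, scored = 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored))
        scored = end
        start += stride
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, and after the first window each span contributes only its stride-sized tail of fresh tokens.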

Novel Contributions

  • XSA applied across all 11 layers
  • Parallel residual connections starting at layer 7
  • Depth recurrence in layers 4 and 5 with delayed activation
  • Final checkpoint taken from the EMA rather than the SWA average
  • Full Hessian GPTQ int6 quantization with AR self-generated calibration sequences
  • Sliding window evaluation with stride 64