PR #928

open

Non-record: XSA-all + mHC + Full QAT (val_bpb=1.1211)

by autocode-rayes
val_bpb: 1.1211
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95 MB

Training Techniques

Architecture
XSA
Cross-sequence attention applied to all 11 layers instead of only the last 4 layers.
parameters: {"layers":11}
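
The layer-coverage change above amounts to a different layer mask. A minimal sketch (the helper name and `"last4"` baseline mode are hypothetical, inferred from "only the last 4 layers"):

```python
def xsa_layer_mask(n_layers: int, mode: str) -> list:
    """Which transformer blocks get cross-sequence attention.
    "last4" is the assumed baseline (XSA in the final 4 blocks only);
    "all" enables it in every block, as this PR does for all 11 layers."""
    if mode == "all":
        return [True] * n_layers
    return [i >= n_layers - 4 for i in range(n_layers)]
```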
Other
Manifold-constrained hyper-connections with learnable alpha/beta residual mixing per block under a norm constraint.
parameters: {"extra_params":22}
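
A minimal sketch of what the alpha/beta mixing could look like, assuming scalar alpha/beta per block renormalized to unit norm (the class name and exact constraint are assumptions, but 2 scalars per block x 11 blocks matches the 22 extra parameters listed):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Hypothetical per-block residual mixer: y = a*x + b*f(x),
    with (a, b) projected onto the unit circle at each forward pass
    so the mixing weights stay on a fixed-norm manifold."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
        w = torch.stack([self.alpha, self.beta])
        w = w / w.norm()  # norm constraint on the mixing coefficients
        return w[0] * x + w[1] * fx
```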
Quantization
QAT
bits: 6
scope: all
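
Six-bit QAT over all weights typically means fake quantization during training with a straight-through estimator; a sketch under that assumption (per-tensor symmetric scaling is a guess, the PR does not specify the scheme):

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through
    estimator: the forward pass sees 6-bit values, the backward pass
    sees the identity, so gradients update the full-precision weights."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return w + (q * scale - w).detach()
```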
Evaluation
sliding window eval
parameters: {"stride":64}
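
With stride 64, evaluation windows overlap so each token is scored once with long left context; a sketch of the scheduling (the window size and helper are assumptions, only the stride comes from the listing):

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Yield (ctx_start, ctx_end, n_new) triples: each window scores
    only its n_new trailing tokens, so every token is counted exactly
    once while earlier tokens serve as context."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield (start, end, end - prev_end)
        prev_end = end
        if end == n_tokens:
            break
```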
Test-Time Training
full TTT
parameters: {"enabled":1}
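
Full TTT generally means adapting the model on already-seen evaluation text before scoring what comes next; a generic sketch under that assumption (the loop, learning rate, and the convention that the model's forward returns its training loss are all hypothetical):

```python
import copy
import torch

def ttt_eval(model, chunks, lr=1e-4):
    """Adapt a throwaway copy of the model on each chunk with one SGD
    step, then score the following chunk with the adapted weights.
    The original model is left untouched."""
    m = copy.deepcopy(model)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    losses = []
    for prev, nxt in zip(chunks, chunks[1:]):
        loss = m(prev)              # assumed: forward returns the loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            losses.append(m(nxt).item())
    return losses
```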

Novel Contributions

  • XSA applied to all 11 layers
  • Manifold-constrained hyper-connections with 22 extra parameters
  • Full-training QAT from step 1
  • Parallel Muon optimizer stack
  • Sliding window evaluation and legal TTT improvement