PR #290 (open)

Record: 11L + Partial XSA + TTT + BatchOpt (val_bpb=1.1354)

by ibarrajo

val_bpb: 1.1354
Architecture: 11L Transformer
Optimizer: Muon + AdamW
Artifact Size: 15.85 MB

Training Techniques

Quantization: int6 (bits: 6, scope: all)
Compression: zstd (level: 22)
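The weights are stored as 6-bit integers and then zstd-compressed at level 22. A minimal sketch of symmetric int6 quantization, assuming a single per-tensor scale (the PR may quantize per-channel; bit-packing and the zstd step are omitted here):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: codes lie in [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

In the artifact the 6-bit codes would additionally be bit-packed before zstd compression, which is what brings the 11-layer model under the 16MB limit.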
Architecture
XSA: partial exclusive self-attention applied only to the last 3 layers to debias self-attention efficiently in a GQA-aware way (layers: 3)
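"Exclusive" self-attention is not defined in this summary; one plausible reading is that the diagonal is masked so a token never attends to itself, removing the self-token bias. A sketch under that assumption (single head, no GQA), together with the partial application to the last 3 of 11 layers:

```python
import numpy as np

def attention(q, k, v, exclude_self=False):
    """Causal softmax attention. With exclude_self=True the diagonal is
    masked so a token cannot attend to itself (assumed reading of XSA;
    the first token keeps its diagonal since it has no other target)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal mask
    if exclude_self:
        np.fill_diagonal(mask, False)
        mask[0, 0] = True                          # token 0 must attend somewhere
    scores = np.where(mask, scores, -1e30)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Partial application: only the last 3 of 11 layers use XSA.
use_xsa = [layer >= 11 - 3 for layer in range(11)]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(q, k, v, exclude_self=True)  # row 1 can only see position 0
```

The diagonal-masking interpretation is an assumption; the PR's actual XSA may differ, and a GQA-aware version would share key/value heads across query groups.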
RoPE: extended positional encoding using a larger RoPE base (base: 50000)
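Raising the RoPE base from the usual 10000 to 50000 slows the rotation of every frequency band, stretching the positional wavelengths. A minimal sketch of the rotation (pairwise layout assumed; some implementations split halves instead):

```python
import numpy as np

def rope_freqs(head_dim, base):
    """Per-pair rotation frequencies; a larger base gives longer wavelengths."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, base=50000):
    """x: (seq, head_dim). Rotate consecutive dimension pairs by
    position-dependent angles; norms are preserved."""
    seq, d = x.shape
    angles = np.outer(np.arange(seq), rope_freqs(d, base))  # (seq, d//2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
y = apply_rope(x)
```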
SmearGate: custom gating mechanism used in the base architecture
BigramHash: bigram hashing into 2048 buckets, used in the base architecture
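Bigram hashing maps each (previous token, current token) pair into one of 2048 buckets, typically to index an auxiliary embedding table. A sketch with a multiplicative hash (the multiplier is an arbitrary odd constant, not taken from the PR):

```python
def bigram_bucket(prev_token, token, buckets=2048):
    """Hash a (prev, current) token pair into a fixed number of buckets.
    2654435761 is Knuth's multiplicative constant (illustrative choice)."""
    return (prev_token * 2654435761 + token) % buckets

# One bucket id per adjacent token pair in a sequence.
ids = [bigram_bucket(a, b) for a, b in zip(range(100), range(1, 101))]
```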
Test-Time Training: full TTT (epochs: 3, learning_rate: 0.002, freeze_blocks: 2)
Optimizer
Muon: weight_decay 0.04, momentum 0.99 (warmed up from 0.92 over 1500 steps)
AdamW: weight_decay 0.04, learning_rate 0.025
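The Muon momentum is not fixed at 0.99 but warmed up from 0.92 over the first 1500 steps. A sketch of that schedule, assuming linear interpolation (Muon's Newton-Schulz orthogonalization step is not shown):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm the momentum coefficient from `start` to `end`
    over `warmup_steps`, then hold it constant."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```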
Weight Averaging: SWA (checkpoints_averaged: 7)
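SWA here uniformly averages the parameters of 7 checkpoints from late in training. A minimal sketch over parameter dicts (uniform weighting assumed):

```python
import numpy as np

def swa_average(checkpoints):
    """Uniform average of parameter dicts from the last k checkpoints."""
    keys = checkpoints[0].keys()
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints) for k in keys}

# 7 toy checkpoints whose single tensor is filled with 0..6; average is 3.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```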
Evaluation: sliding window eval (stride: 64)
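Sliding-window evaluation advances a 2048-token window by 64 tokens at a time and scores only the newly exposed tokens, so almost every token is evaluated with near-full left context. A sketch of the span bookkeeping (the forward passes themselves are omitted):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, n_scored) spans: each window ends `stride` tokens
    past the previous one; only the new tokens are scored, the rest of the
    window serves as context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(5000)  # every token is scored exactly once
```

The smaller the stride, the more forward passes are needed, but the better the context each scored token sees.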
Sequence Length: train 2048, eval 2048
LR Schedule: warmdown (warmdown_iters: 3000, warmup_steps: 1500)
Initialization: OrthoInit (orthogonal initialization used in the base architecture)

Novel Contributions

  • Partial XSA applied to the last 3 layers
  • Test-time training with 3-epoch full-model SGD and early block freezing
  • Batch size optimization to 524K tokens for more gradient updates
  • RoPE base increased to 50K
  • Sliding-window evaluation with stride 64
  • Int6 quantization with zstd-22 compression under the 16MB limit
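The 524K-token batches (524288 = 2^19 tokens, i.e. 256 sequences of length 2048) would normally be reached via gradient accumulation. A sketch of the arithmetic; the per-device micro-batch size of 32 is a hypothetical value, not stated in the PR:

```python
def accumulation_steps(target_tokens=524288, seq_len=2048, micro_batch=32):
    """Gradient-accumulation steps needed to reach ~524K tokens per update.
    micro_batch (sequences per forward pass) is an assumed value."""
    tokens_per_micro = seq_len * micro_batch   # 2048 * 32 = 65536 tokens
    return target_tokens // tokens_per_micro
```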