PR #371 (closed)

Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1401

by mrdavtan
val_bpb: 1.1401
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.4 MB

Training Techniques

Architecture
XSA
Uses XSA in the last 4 layers of an 11-layer transformer stack.
parameters: {"layers":4}
U-Net skip connections
Adds skip connections before decoder blocks for a U-Net-like transformer structure.
parameters: null
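A minimal sketch of the U-Net-style wiring, assuming a mirrored encoder/decoder pairing with a learned projection on each skip (neither detail is specified in the PR):

```python
import torch.nn as nn

class UNetStack(nn.Module):
    def __init__(self, n_layers=11, d_model=512, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.half = n_layers // 2  # 11 layers: 5 encoder, 1 middle, 5 decoder
        self.skip_proj = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(self.half)
        )

    def forward(self, x):
        stash = []
        for i, layer in enumerate(self.layers):
            d = i - (len(self.layers) - self.half)  # decoder index, or negative
            if d >= 0:
                # Add the mirrored encoder activation before the decoder block.
                x = x + self.skip_proj[d](stash[self.half - 1 - d])
            x = layer(x)
            if i < self.half:
                stash.append(x)
        return x
```
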
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16}
SmearGate
Gating mechanism using interpolation between current and previous activations.
parameters: null
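The PR describes SmearGate only as interpolation between current and previous activations; below is a minimal sketch with a learned sigmoid gate, whose exact parameterization is an assumption:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq, d_model); shift right so each position sees
        # its predecessor, then blend with a per-position gate in [0, 1].
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))
        return g * x + (1 - g) * prev
```
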
BigramHash
Bigram-based hashing feature using XOR-based hashing with large primes and a learned output scalar.
parameters: null
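A sketch of the bigram hash feature. Only the XOR-of-prime-multiples scheme and the learned output scalar come from the PR; the specific primes, bucket count, and zero initialization are illustrative:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    def __init__(self, d_model, n_buckets=1 << 16):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, d_model)
        # Learned output scalar; starting at zero (an assumption) lets
        # the feature fade in during training.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, tokens):
        # tokens: (batch, seq) integer ids; hash each (prev, current) pair.
        prev = torch.cat([tokens[:, :1], tokens[:, :-1]], dim=1)
        h = (tokens * 1000003) ^ (prev * 999983)  # XOR of prime-multiplied ids
        return self.scale * self.table(h % self.n_buckets)
```
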
Weight Averaging
EMA
Exponential moving average of the model weights, maintained alongside training.
parameters: {"decay":0.997}
Test-Time Training
TTT
Fine-tunes the model on the evaluation data itself before scoring.
parameters: {"epochs":3,"optimizer":"SGD"}
Regularization
LN Scale
Scales LayerNorm outputs by a depth-dependent factor.
parameters: {"scale_rule":"1/sqrt(layer+1)"}
Initialization
OrthoInit
Orthogonal initialization used for model weights.
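A minimal version, assuming orthogonal init is applied to every Linear weight with biases zeroed:

```python
import torch.nn as nn

def ortho_init(module):
    # Orthogonal weights for all Linear layers; zero biases.
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(ortho_init)
```
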
Quantization
int6
Late quantization-aware training to int6 with absmax scaling and a straight-through estimator.
bits: 6
scope: all
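A sketch of the int6 fake-quantization step with absmax scaling and a straight-through estimator; "late" QAT means enabling this only for the final phase of training:

```python
import torch

def fake_quant_int6(w, bits=6):
    # Symmetric absmax quantization: forward uses quantized weights,
    # backward passes gradients through unchanged (STE).
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                   # identity gradient w.r.t. w
```
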
Compression
zstd
Compresses the serialized model artifact with zstd.
level: 22
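Artifact packing is then a one-liner with the zstandard package (level 22 is zstd's maximum standard level; file names are illustrative):

```python
import zstandard as zstd

# Compress the serialized checkpoint at maximum level.
data = open("model_int6.bin", "rb").read()
blob = zstd.ZstdCompressor(level=22).compress(data)
open("model_int6.bin.zst", "wb").write(blob)
```
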
Evaluation
sliding window eval
Scores tokens with a sliding context window rather than disjoint chunks.
parameters: {"stride":32}

Novel Contributions

  • 11-layer transformer with XSA in the last 4 layers
  • EMA with decay 0.997
  • Test-time training with 3-epoch SGD
  • U-Net style skip connections
  • Partial RoPE on 16 of 64 dimensions
  • LayerNorm scaling by 1/sqrt(layer+1)
  • SmearGate and BigramHash additions
  • OrthoInit initialization
  • Late int6 QAT with absmax STE
  • Sliding-window evaluation with stride 32