PR #1528

open

Non-record: 11L s2048 4h on 1xA100 — 1.1104 BPB

by xiehuanyi
val_bpb
1.1104
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,040,603 bytes

Training Techniques

Architecture
XSA
XSA-all attention variant used throughout the layer stack
parameters: {"last_n":11}
BigramHash
Bigram hash embedding/feature component
parameters: {"vocab_size":2048}
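A bigram hash embedding can be sketched as hashing each (previous, current) token pair into a small bucket table and looking up a learned feature row. The multiplier constant and the zero-id start token below are illustrative assumptions; the PR only specifies vocab_size=2048.

```python
import numpy as np

def bigram_hash_ids(tokens, vocab_size=2048, mult=1000003):
    """Map each (prev, cur) token pair to a bucket in [0, vocab_size)."""
    ids = []
    prev = 0  # assumed start-of-sequence placeholder id
    for t in tokens:
        ids.append(((prev * mult) ^ t) % vocab_size)
        prev = t
    return ids

# Embedding table indexed by the hashed bigram id; its rows would be
# added to the usual token embedding as a cheap local-context feature.
emb = np.zeros((2048, 8))
feats = emb[bigram_hash_ids([5, 17, 17, 3])]  # shape (4, 8)
```

The same bigram always hashes to the same bucket, so collisions are the only source of feature sharing.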
Partial RoPE
Partial rotary positional embedding
parameters: {"dimensions":16,"denominator":64}
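Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unrotated. A minimal sketch, assuming the listed denominator=64 is the frequency base and dimensions=16 is the rotated prefix (the exact frequency layout in the PR is not specified):

```python
import numpy as np

def partial_rope(x, rot_dims=16, denom=64.0):
    """Rotate the first `rot_dims` dims of x (shape (T, D)) by
    position-dependent angles; leave dims rot_dims..D untouched."""
    T, D = x.shape
    half = rot_dims // 2
    inv_freq = 1.0 / (denom ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), inv_freq)          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Because each (x1, x2) pair is a pure rotation, the norm of the rotated prefix is preserved while the unrotated tail carries position-free content.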
SmearGate
SmearGate gating mechanism
parameters: null
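The PR gives no parameters for SmearGate. A common reading of "smear" gating, sketched here as an assumption rather than the PR's exact mechanism, is a learned gate that mixes each position's activation with the previous position's:

```python
import numpy as np

def smear_gate(x, gate):
    """x: (T, D) activations; gate: (T, 1) values in [0, 1]
    (e.g. a sigmoid output). Each position additively mixes in
    the previous position's activation, scaled by its gate."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x + gate * prev
```

Position 0 has no predecessor, so it passes through unchanged.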
U-Net skip connections
U-Net style skip connections in the model
parameters: null
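U-Net style skips pair early layers with their mirrored late layers: first-half activations are pushed on a stack and reinjected before the matching second-half layer. The additive combine rule below is an assumption; the PR does not state how skip activations are merged.

```python
def unet_forward(x, layers):
    """Run an even-length layer list with U-Net pairing: stash the
    output of each first-half layer, then add it back (plain sum,
    assumed) before the mirrored second-half layer runs."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i >= n // 2 and skips:
            x = x + skips.pop()   # last-in, first-out pairing
        x = layer(x)
        if i < n // 2:
            skips.append(x)
    return x
```

The stack discipline means layer i is paired with layer n-1-i, so shallow features reach the deepest decoder layers.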
LeakyReLU
LeakyReLU squared activation
parameters: {"slope":0.5,"squared":true}
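Taking the listed parameters literally (slope=0.5, squared=true), the activation squares the output of a LeakyReLU. Note that a plain square folds the negative branch onto small positives; whether the PR restores the sign is not stated, so the literal version is shown:

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared.
    Negative inputs map to small positive outputs under this
    literal reading."""
    y = x if x > 0 else slope * x
    return y * y
```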
Regularization
LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw":true}
Weight Averaging
EMA
parameters: {"decay":0.997,"start_fraction":0.2}
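The deferred EMA (decay 0.997, starting 20% into training, per the PR's parameters and its note about avoiding random-init contamination) can be sketched as:

```python
class DeferredEMA:
    """Exponential moving average of weights that only begins after
    start_fraction of training, so near-random early weights never
    enter the average."""
    def __init__(self, decay=0.997, start_fraction=0.2, total_steps=1000):
        self.decay = decay
        self.start = int(start_fraction * total_steps)
        self.avg = None
    def update(self, step, params):
        if step < self.start:
            return                     # deferred: ignore early steps
        if self.avg is None:
            self.avg = list(params)    # seed from current weights
        else:
            self.avg = [self.decay * a + (1 - self.decay) * p
                        for a, p in zip(self.avg, params)]
```

On a 4-hour run the 20% threshold also keeps the EMA window meaningful when total step counts shrink.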
SWA
parameters: null
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: null
Compression
lzma
level: 9
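The artifact pipeline quantizes weights to 6 bits and compresses with LZMA at level 9. The sketch below uses per-row round-to-nearest as a stand-in for GPTQ (which additionally corrects rounding error column by column using second-order information); the symmetric range [-31, 31] is an assumed 6-bit layout.

```python
import lzma
import numpy as np

def quantize_int6(w):
    """Per-row symmetric 6-bit quantization: scale each row so its
    max magnitude maps to 31, then round and clip."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
q, scale = quantize_int6(w)
blob = lzma.compress(q.tobytes(), preset=9)  # level-9 LZMA, as in the PR
```

Quantized integers are far more compressible than float32, which is how the checkpoint fits under the size budget.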
Evaluation
sliding window eval
parameters: {"stride":64}
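Sliding-window evaluation with stride 64 scores each token with near-full left context: after the first window, the window slides by the stride and only the newly exposed tokens are scored. A planning sketch (the 2048 window matches the PR's eval length; the span bookkeeping is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, score_end) triples. The first
    window scores everything it covers; later windows slide by
    `stride` and score only the new `stride` tokens."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        step = window if score_start == 0 else stride
        score_end = min(score_start + step, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```

Every token is scored exactly once, and all tokens past the first window see at least window - stride tokens of context.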
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
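A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps. The constant-then-linear shape is an assumption; the PR lists only warmdown_steps=3500.

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR until the warmdown region, then linear decay to
    zero over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```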
Other
other
Deferred EMA start to avoid random-init contamination on shorter runs
parameters: {"start_fraction":0.2}
other
PyTorch SDP flash-backend fallback used when FA3 is unavailable
parameters: null
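The FA3-to-SDP fallback can be sketched as an import probe: use FlashAttention-3 when its package is present (it targets Hopper GPUs), otherwise fall back to PyTorch's scaled_dot_product_attention, which picks its own flash backend on supported GPUs such as the A100. The module name probed below is an assumption about how the PR detects FA3.

```python
def pick_attention():
    """Prefer FA3 when importable; otherwise fall back to PyTorch SDPA."""
    try:
        import flash_attn_interface  # assumed FA3 module name; noqa: F401
        return "fa3"
    except ImportError:
        return "torch_sdpa"
```

The actual attention call would then dispatch on the returned tag.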

Novel Contributions

  • Longer context training at seq_len=2048
  • Extended training time to 4 hours on a single A100
  • A100-compatible fallback from FA3 to PyTorch SDP flash backend
  • Deferred EMA start for shorter runs
  • Int6 GPTQ + LZMA compressed submission under 16 MiB
  • Sliding window evaluation with stride 64