val_bpb: 1.1104 (validation bits per byte)
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,040,603 bytes

Training Techniques

Architecture
- XSA: XSA-all attention variant used in the stack (last_n: 11)
- BigramHash: bigram hash embedding/feature component (vocab_size: 2048)
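A minimal sketch of the bigram-hash idea: hash each adjacent token pair into one of 2048 buckets and use the bucket id to index an auxiliary embedding table added to the regular token embedding. The multiplier and mixing scheme below are illustrative assumptions; the card does not specify the hash function.

```python
import numpy as np

def bigram_hash_ids(tokens, vocab_size=2048, mult=1000003):
    """Hash each (prev, cur) token pair into one of `vocab_size` buckets.

    The multiplier and padding choice are illustrative; the actual hash
    used by the model is not specified in the card.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], tokens[:-1]))  # pad position 0 with token 0
    return (prev * mult + tokens) % vocab_size

# The resulting ids would index an auxiliary (2048, d_model) embedding table.
ids = bigram_hash_ids([5, 17, 17, 9])
```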
- Partial RoPE: partial rotary positional embedding (dimensions: 16, denominator: 64)
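A sketch of partial RoPE: rotate only the first 16 channels of each head and pass the rest through unchanged. Interpreting the card's "denominator" as the RoPE frequency base (64 instead of the usual 10000) is an assumption.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=64.0):
    """Apply rotary embedding to the first `rot_dims` channels only.

    x: (seq_len, head_dim). Channels beyond `rot_dims` are passed through.
    Treating "denominator" as the frequency base is an assumption.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

Rotation preserves the norm of each rotated channel pair, so the unrotated tail and position 0 are left exactly as-is.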
- SmearGate: SmearGate gating mechanism
- U-Net skip connections: U-Net-style skip connections in the model
- LeakyReLU: squared LeakyReLU activation (slope: 0.5, squared: true)

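A sketch of the squared-LeakyReLU activation. Whether the model squares with or without restoring the sign on negative inputs is not stated in the card; this version restores the sign so the activation stays monotonic.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """Square the leaky-rectified value, keeping its sign for negative
    inputs (sign restoration is an assumption; the card only gives
    slope=0.5 and squared=true)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * y * y
```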
Regularization
- LN scale

Optimizer
- Muon (weight_decay: 0.04, momentum: unspecified, adamw: true)

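For context, Muon orthogonalizes each 2D weight-matrix update with a quintic Newton-Schulz iteration (non-matrix parameters typically fall back to AdamW, consistent with `adamw: true` here). The coefficients below follow the commonly published Muon reference implementation; the exact values this run used are not stated in the card.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by Muon-style optimizers (reference coefficients;
    an assumption about this run's exact settings)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # spectral norm <= Frobenius norm <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X
```

The iteration drives the singular values of the update toward 1, so the step direction is roughly orthogonal rather than gradient-scaled.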
Weight Averaging
- EMA (decay: 0.997, start_fraction: 0.2)
- SWA

Quantization
- GPTQ (bits: 6, scope: all)
- late QAT (bits and scope unspecified)

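To illustrate the 6-bit storage format only: a per-row symmetric round-to-nearest quantizer. GPTQ itself chooses the rounding column by column using second-order (Hessian) information, which this sketch deliberately omits.

```python
import numpy as np

def quantize_rtn_int6(W):
    """Per-row symmetric round-to-nearest 6-bit quantization
    (illustrates the int6 format; not GPTQ's Hessian-aware rounding)."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed int6
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```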
Compression
- LZMA (level: 9)

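The compression step maps directly onto Python's standard-library `lzma` module at preset 9 (how the weights are serialized before compression is not specified in the card):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized weight bytes with LZMA at preset 9."""
    return lzma.compress(raw, preset=9)

blob = compress_artifact(b"\x00" * 4096)
restored = lzma.decompress(blob)
```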
Evaluation
- sliding window eval (stride: 64)

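A sketch of stride-64 sliding-window evaluation: after the first window, each new window re-reads a full window of context but scores only its final `stride` tokens, so every later token is judged with maximal left context. `nll_fn(ctx)` is a stand-in for the model, returning one NLL per position; the card does not publish the model interface.

```python
import numpy as np

def sliding_window_nll(tokens, nll_fn, window=2048, stride=64):
    """Mean NLL with overlapping windows; only the last `stride`
    positions of each window after the first are scored."""
    n = len(tokens)
    losses = np.empty(n)
    first = min(window, n)
    losses[:first] = nll_fn(tokens[:first])   # startup: score the whole first window
    pos = first
    while pos < n:
        end = min(pos + stride, n)
        ctx = tokens[end - window:end]        # full-length context ending at `end`
        losses[pos:end] = nll_fn(ctx)[-(end - pos):]
        pos = end
    return losses.mean()
```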
Sequence Length
- train_length: 2048
- eval_length: 2048

LR Schedule
- warmdown (warmdown_steps: 3500)

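A warmdown schedule holds the learning rate constant and then decays it linearly over the final `warmdown_steps`. A minimal sketch, assuming linear decay to zero (the decay shape, and whether the run also used a warmup, are not stated in the card):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last `warmdown_steps`.
    Linear-to-zero is an assumption; the card gives only warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```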
Other
- Deferred EMA start to avoid random-init contamination on shorter runs (start_fraction: 0.2)
- PyTorch SDP flash-backend fallback used when FA3 is unavailable

Novel Contributions
- Longer context training at seq_len=2048
- Extended training time to 4 hours on a single A100
- A100-compatible fallback from FA3 to PyTorch SDP flash backend
- Deferred EMA start for shorter runs
- Int6 GPTQ + LZMA compressed submission under 16 MiB
- Sliding window evaluation with stride 64