| Field | Value |
| --- | --- |
| val_bpb | 1.1642 |
| Architecture | Transformer |
| Optimizer | Muon/AdamW |
| Artifact Size | 15,635,201 bytes |
## Training Techniques

### Architecture
- **XSA**: applied on the last 4 layers (parameters: `{"layers": 4}`).
- **MLP3x**: 3x MLP blocks (parameters: `{"multiplier": 3}`).
- **SmearGate**: included in the model (see the sketch after this list).
- **BigramHash**: bigram feature hashed into 2048 buckets (parameters: `{"buckets": 2048}`; sketch below).
- **Partial RoPE**: rotary embedding applied to a reduced slice of each head's dimensions (parameters: `{"dimensions": 16}`; sketch below).
- **GPTQ-lite**: clip search over quantization scales (see the sketch under Quantization below).
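The source does not define SmearGate. A minimal sketch, assuming it is a gated token-smearing layer in which each position mixes in a learned fraction of the previous token's embedding; the sigmoid gate and the `d_model` argument are assumptions:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Gated token smearing: add a learned fraction of the previous
    token's embedding to the current one. This reading of 'SmearGate'
    is an assumption; the source only names the technique."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)  # per-position scalar gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); position 0 reuses itself as "previous"
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))    # (batch, seq, 1)
        return x + g * prev
```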
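BigramHash is likewise only named. A sketch of one plausible reading: hash each (previous token, current token) pair into one of the 2048 buckets and add a learned embedding for that bucket to the input; the hash constant and `d_model` are illustrative:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash each bigram of token ids into `buckets` slots and look up
    an auxiliary embedding for the slot. The multiplicative hash
    constant is an illustrative assumption."""
    def __init__(self, buckets: int = 2048, d_model: int = 256):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer ids; position 0 pairs with itself
        prev = torch.cat([tokens[:, :1], tokens[:, :-1]], dim=1)
        h = (prev * 1000003 + tokens) % self.buckets
        return self.emb(h)  # (batch, seq, d_model), added to token embeddings
```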
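Partial RoPE is the standard trick of rotating only a leading slice of each head's dimensions and passing the rest through unrotated. A sketch with `rope_dims=16` as given in the parameters; the base of 10000 is a conventional assumption:

```python
import torch

def partial_rope(x: torch.Tensor, rope_dims: int = 16,
                 base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to the first `rope_dims`
    dimensions of each head; leave the remainder untouched.
    x: (batch, heads, seq, head_dim); rope_dims must be even."""
    rot, keep = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    seq = x.size(-2)
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)
    angles = torch.arange(seq, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()        # (seq, half), broadcast below
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, keep], dim=-1)
```

Keeping `head_dim - 16` dimensions unrotated leaves most channels position-agnostic while still giving attention a positional signal.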
### Weight Averaging
- **EMA** (sketch below).
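The source lists no EMA parameters. A minimal sketch of a standard exponential moving average over model parameters; the 0.999 decay is an illustrative placeholder:

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters. The 0.999 decay
    is a placeholder; the source lists no parameters for its EMA."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)  # s = decay*s + (1-decay)*p
```

Calling `update` after every optimizer step and evaluating/packaging the shadow copy is consistent with the "EMA-only" claim in the contributions list.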
### Quantization
- **mixed int6** (bits: 6, scope: all; sketch below).
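A sketch combining the int6 setting with the "GPTQ-lite clip search" named in the Architecture list, read here as a brute-force search over clipping fractions that minimizes reconstruction error under per-channel scales; the search grid is an assumption:

```python
import torch

def quantize_int6_with_clip_search(w: torch.Tensor, n_grid: int = 20):
    """Symmetric int6 quantization with per-channel (per-row) scales.
    A shared clip fraction of max|w| is chosen by grid search to
    minimize squared reconstruction error; this reading of 'GPTQ-lite
    clip search' and the grid itself are assumptions."""
    qmax = 2 ** (6 - 1) - 1                           # int6 range: [-31, 31]
    absmax = w.abs().amax(dim=1, keepdim=True)        # per output channel
    best_err, best_q, best_scale = None, None, None
    for frac in torch.linspace(0.5, 1.0, n_grid):
        scale = (absmax * frac / qmax).clamp(min=1e-8)
        q = (w / scale).round().clamp(-qmax, qmax)
        err = ((q * scale - w) ** 2).sum().item()
        if best_err is None or err < best_err:
            best_err, best_q, best_scale = err, q.to(torch.int8), scale
    return best_q, best_scale  # dequantize as q.float() * scale
```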
### Compression
- **zstd** (level: null; sketch below).
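Packaging then reduces to compressing the serialized quantized weights. A sketch using the `zstandard` package with the library default level, since the source gives `level: null`:

```python
import zstandard as zstd  # pip install zstandard

def pack_artifact(weight_bytes: bytes) -> bytes:
    """Compress serialized (already int6-quantized) weights with zstd
    at the library default level (the source specifies no level)."""
    return zstd.ZstdCompressor().compress(weight_bytes)

def unpack_artifact(blob: bytes) -> bytes:
    """Inverse of pack_artifact."""
    return zstd.ZstdDecompressor().decompress(blob)
```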
### Sequence Length
- train_length: 2048, eval_length: 2048
### Optimizer
- **Muon/AdamW** (weight_decay: 0.04; momentum and other params unspecified; sketch below).
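Only `weight_decay: 0.04` is given for the Muon/AdamW hybrid. A sketch assuming the common split from the public Muon recipe: Muon (momentum plus Newton-Schulz orthogonalization) for 2-D hidden weights, AdamW for everything else; learning rates, momentum, and the embedding heuristic are placeholders:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    the momentum matrix (coefficients from the public Muon writeup)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(p: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a 2-D weight: momentum buffer, then an
    orthogonalized step. lr and momentum here are placeholders."""
    buf.mul_(momentum).add_(p.grad)
    p.add_(newton_schulz5(buf), alpha=-lr)

def split_param_groups(model: torch.nn.Module):
    """Assumed split: Muon for 2-D hidden weights, AdamW (with the
    source's weight_decay=0.04) for embeddings, norms, and 1-D params.
    The 'emb' name check is a heuristic, not from the source."""
    muon_params = [p for n, p in model.named_parameters()
                   if p.ndim == 2 and "emb" not in n]
    adamw_params = [p for n, p in model.named_parameters()
                    if p.ndim != 2 or "emb" in n]
    adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.04)
    return muon_params, adamw
```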
### Evaluation
- **sliding window eval** (parameters: `{"stride": 64}`; sketch below).
### Regularization
- **layerwise LN scale** (sketch below).
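"Layerwise LN scale" is only named in the source. A sketch of one plausible reading: a trainable scalar per layer applied to the LayerNorm output:

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm followed by a learnable per-layer scalar. That this
    is what 'layerwise LN scale' means is an assumption; the source
    does not define it."""
    def __init__(self, d_model: int, init_scale: float = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.norm(x)
```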
### LR Schedule
- **warmdown3500** (parameters: `{"warmdown_steps": 3500}`; sketch below).
## Novel Contributions
- Merged top-stack recipe built from public leaderboard lineage
- 11-layer model with XSA on the last 4 layers
- EMA-only training
- 3x MLP blocks
- SmearGate integration
- BigramHash with 2048 buckets
- Mixed int6 quantization with zstd compression
- Sliding-window evaluation with stride 64
- Partial RoPE with ROPE_DIMS=16
- Layerwise LN scaling
- GPTQ-lite clip search
- Clean rerun package with strict runtime gates for uninterrupted 8x H100 execution