PR #1215
open
12L rANS + LeakyReLU(0.95)² + Soft XSA (1.1601 BPB, non_record_16mb)
by turbo-indubitable
val_bpb
1.1601
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,912,601 bytes
Training Techniques
Compression
rANS
level: null
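The summary names rANS but leaves the level null and does not describe the per-tensor adaptive modeling. As background, a minimal static rANS coder (encode plus decode with byte-wise renormalization) can be sketched as below; the frequency table is passed in explicitly, and `scale_bits=12` is an assumption, not a parameter from the PR.

```python
def rans_encode(symbols, freqs, scale_bits=12):
    """Static rANS encoder. `freqs` are integer frequencies that must
    sum to 1 << scale_bits. Emits renormalization bytes plus a 4-byte
    flush of the final 32-bit state."""
    M, L = 1 << scale_bits, 1 << 23
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == M, "frequencies must sum to 1 << scale_bits"
    x, out = L, []
    for s in reversed(symbols):           # rANS encodes in reverse order
        f = freqs[s]
        x_max = ((L >> scale_bits) << 8) * f
        while x >= x_max:                 # renormalize: stream out low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + cum[s] + (x % f)
    for _ in range(4):                    # flush the final state
        out.append(x & 0xFF)
        x >>= 8
    return out

def rans_decode(stream, n, freqs, scale_bits=12):
    """Decode `n` symbols; reads the byte stream back-to-front."""
    M, L = 1 << scale_bits, 1 << 23
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    stream = list(stream)
    x = 0
    for _ in range(4):                    # rebuild the flushed state
        x = (x << 8) | stream.pop()
    out = []
    for _ in range(n):
        slot = x & (M - 1)
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        out.append(s)
        x = freqs[s] * (x >> scale_bits) + slot - cum[s]
        while x < L and stream:           # renormalize: pull bytes back in
            x = (x << 8) | stream.pop()
    return out
```

The adaptive per-tensor variant in the PR would fit a frequency table to each tensor's code distribution before encoding; that modeling step is not shown here.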
Quantization
mixed int5/int6
bits: null
scope: all weights
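The listing gives a mixed int5/int6 scheme over all weights but not the quantizer itself. A minimal sketch of symmetric per-tensor quantization at a given bit width follows; which tensors get 5 vs. 6 bits, and whether scaling is per-tensor or per-channel, are not stated in the summary.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit codes.
    Returns integer codes and the scale needed to dequantize.
    Per-tensor absmax scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax if w.size else 1.0
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```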
Architecture
LeakyReLU
LeakyReLU squared activation with slope 0.95
parameters: {"slope":0.95,"power":2}
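The listed parameters {"slope": 0.95, "power": 2} suggest a LeakyReLU followed by an elementwise square. A sketch under one assumption: since squaring would destroy the sign of the negative branch, the sign of the LeakyReLU output is re-applied here to keep the activation monotonic; the PR may instead square directly.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.95, power=2):
    """LeakyReLU(slope) followed by |.|**power with the sign re-applied.
    The sign-preserving step is an assumption, not confirmed by the PR."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * np.abs(y) ** power
```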
XSA
Soft XSA with learned per-head alpha on all layers using a position-tiled Triton kernel
parameters: {"layers":12,"learned_per_head_alpha":true}
BigramHash
Removed bigram vocabulary / bigram hash component
parameters: {"bigram_vocab_size":0}
Regularization
magnitude pruning
parameters: {"prune_pct":0.1,"scope":"2D tensors > 65536 params"}
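The pruning entry above (10% of entries, restricted to 2D tensors with more than 65536 parameters) can be sketched as plain magnitude pruning; the "safety valve" mentioned in the contributions list is not described in the summary and is omitted here.

```python
import numpy as np

def magnitude_prune(tensor, prune_pct=0.10, min_params=65536):
    """Zero out the smallest-magnitude prune_pct of entries. Applies only
    to 2-D tensors with more than `min_params` parameters, matching the
    listed scope. Returns the pruned tensor and the keep-mask."""
    if tensor.ndim != 2 or tensor.size <= min_params:
        return tensor, np.ones_like(tensor, dtype=bool)
    k = int(tensor.size * prune_pct)
    if k == 0:
        return tensor, np.ones_like(tensor, dtype=bool)
    # k-th smallest absolute value is the pruning threshold
    threshold = np.partition(np.abs(tensor).ravel(), k - 1)[k - 1]
    mask = np.abs(tensor) > threshold
    return tensor * mask, mask
```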
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
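The SWA parameters above (start_frac=0.4, every=50) imply collecting an equal-weight running average of checkpoints from 40% of training onward, every 50 steps. A minimal sketch, assuming equal weighting over numpy parameter arrays:

```python
import numpy as np

class SWAAverager:
    """Running equal-weight average of parameter snapshots, collected
    every `every` steps once `start_frac` of training has elapsed.
    Equal weighting is an assumption; the PR does not state it."""
    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start = int(total_steps * start_frac)
        self.every = every
        self.n = 0          # number of snapshots averaged so far
        self.avg = None

    def maybe_update(self, step, params):
        if step < self.start or (step - self.start) % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [p.copy() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n    # incremental running mean
```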
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"backend_steps":5}
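The `backend_steps: 5` above is consistent with Muon's Newton-Schulz orthogonalization of the momentum matrix. A sketch of that backend step, using the quintic coefficients from the public Muon implementation (an assumption that this run uses the same ones):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via the quintic Newton-Schulz
    iteration used by Muon; `steps=5` matches the listed backend_steps.
    Pushes singular values toward 1 without computing an exact SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so all s.v. <= 1
    transpose = X.shape[0] > X.shape[1]   # iterate on the wide orientation
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X
```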
Evaluation
sliding window eval
parameters: {"stride":64}
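Sliding-window eval with stride 64 means each window scores only the tokens not covered by the previous window, so every token is scored exactly once with long context. The span bookkeeping can be sketched as below; a window of 2048 is an assumption borrowed from the training sequence length.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (window_start, window_end, n_scored) triples: the first
    window scores all its tokens, each later window scores only the
    `stride` new tokens beyond the previous window's end."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```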
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
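The two listed parameters (warmup_steps=20, warmdown_iters=3000) suggest a trapezoidal multiplier: short linear warmup, a flat middle, then linear warmdown to zero over the final 3000 iterations. The flat middle and linear shapes are assumptions; only the two parameters appear in the summary.

```python
def lr_multiplier(step, total_iters, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: warmup -> 1.0 plateau -> warmdown to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_start = total_iters - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_iters - step) / warmdown_iters)
    return 1.0
```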
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
CMS n-gram eval cache with bilateral confidence gating and product mixing
parameters: {"orders":"3-7","hashes":4,"counters":"64M"}
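The CMS above is a count-min sketch: n-gram counts stored in 4 hashed counter rows (64M counters in the run), queried by taking the minimum across rows, which upper-bounds the true count. The bilateral confidence gating and product mixing are not specified in the summary; a minimal count-min sketch alone, with a tiny width for illustration and blake2b-with-salt hashing as an assumption:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch with `depth` hash rows (4, matching the listing).
    add() increments one counter per row; query() takes the row-wise min,
    an upper bound on the true count."""
    def __init__(self, width=1 << 16, depth=4):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        # Per-row salt gives depth independent hash functions (assumed scheme).
        h = hashlib.blake2b(key, digest_size=8, salt=bytes([row] * 8))
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for r in range(len(self.rows)):
            self.rows[r][self._idx(r, key)] += count

    def query(self, key):
        return min(row[self._idx(r, key)] for r, row in enumerate(self.rows))
```

In the eval cache, one sketch per n-gram order (3 through 7) would be populated from the stream and its count estimates mixed into the model's next-token distribution.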
Novel Contributions
- Per-tensor adaptive rANS replacing zstd-22
- LeakyReLU(0.95)² activation sweep winner
- Soft XSA with learned per-head alpha on all layers
- Position-tiled Triton kernel for XSA with custom backward
- BigramHash removal to save artifact space
- 10% magnitude pruning with safety valve for fitting under 16MB
- Bilateral confidence-gated CMS n-gram eval cache