PR #1215
open
12L rANS + LeakyReLU(0.95)² + Soft XSA (1.1601 BPB, non_record_16mb)
by turbo-indubitable
val_bpb
1.1601
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,912,601 bytes
Training Techniques
Compression
rANS
level: null
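The summary names rANS but leaves the level null and does not describe the per-tensor adaptive modeling. As background, a minimal static rANS coder (encode plus decode with byte-wise renormalization) can be sketched as below; the frequency table is passed in explicitly, and `scale_bits=12` is an assumption, not a parameter from the PR.

```python
def rans_encode(symbols, freqs, scale_bits=12):
    """Static rANS encoder. `freqs` are integer frequencies that must
    sum to 1 << scale_bits. Emits renormalization bytes plus a 4-byte
    flush of the final 32-bit state."""
    M, L = 1 << scale_bits, 1 << 23
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == M, "frequencies must sum to 1 << scale_bits"
    x, out = L, []
    for s in reversed(symbols):           # rANS encodes in reverse order
        f = freqs[s]
        x_max = ((L >> scale_bits) << 8) * f
        while x >= x_max:                 # renormalize: stream out low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + cum[s] + (x % f)
    for _ in range(4):                    # flush the final state
        out.append(x & 0xFF)
        x >>= 8
    return out

def rans_decode(stream, n, freqs, scale_bits=12):
    """Decode `n` symbols; reads the byte stream back-to-front."""
    M, L = 1 << scale_bits, 1 << 23
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    stream = list(stream)
    x = 0
    for _ in range(4):                    # rebuild the flushed state
        x = (x << 8) | stream.pop()
    out = []
    for _ in range(n):
        slot = x & (M - 1)
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        out.append(s)
        x = freqs[s] * (x >> scale_bits) + slot - cum[s]
        while x < L and stream:           # renormalize: pull bytes back in
            x = (x << 8) | stream.pop()
    return out
```

The adaptive per-tensor variant in the PR would fit a frequency table to each tensor's code distribution before encoding; that modeling step is not shown here.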
Quantization
mixed int5/int6
bits: null
scope: all weights
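The listing gives a mixed int5/int6 scheme over all weights but not the quantizer itself. A minimal sketch of symmetric per-tensor quantization at a given bit width follows; which tensors get 5 vs. 6 bits, and whether scaling is per-tensor or per-channel, are not stated in the summary.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit codes.
    Returns integer codes and the scale needed to dequantize.
    Per-tensor absmax scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax if w.size else 1.0
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```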
Architecture
LeakyReLU
LeakyReLU squared activation with slope 0.95
parameters: {"slope":0.95,"power":2}
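The listed parameters {"slope": 0.95, "power": 2} suggest a LeakyReLU followed by an elementwise square. A sketch under one assumption: since squaring would destroy the sign of the negative branch, the sign of the LeakyReLU output is re-applied here to keep the activation monotonic; the PR may instead square directly.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.95, power=2):
    """LeakyReLU(slope) followed by |.|**power with the sign re-applied.
    The sign-preserving step is an assumption, not confirmed by the PR."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * np.abs(y) ** power
```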
XSA
Soft XSA with learned per-head alpha on all layers using a position-tiled Triton kernel
parameters: {"layers":12,"learned_per_head_alpha":true}
BigramHash
Removed bigram vocabulary / bigram hash component
parameters: {"bigram_vocab_size":0}
Regularization
magnitude pruning
parameters: {"prune_pct":0.1,"scope":"2D tensors > 65536 params"}
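The pruning entry above (10% of entries, restricted to 2D tensors with more than 65536 parameters) can be sketched as plain magnitude pruning; the "safety valve" mentioned in the contributions list is not described in the summary and is omitted here.

```python
import numpy as np

def magnitude_prune(tensor, prune_pct=0.10, min_params=65536):
    """Zero out the smallest-magnitude prune_pct of entries. Applies only
    to 2-D tensors with more than `min_params` parameters, matching the
    listed scope. Returns the pruned tensor and the keep-mask."""
    if tensor.ndim != 2 or tensor.size <= min_params:
        return tensor, np.ones_like(tensor, dtype=bool)
    k = int(tensor.size * prune_pct)
    if k == 0:
        return tensor, np.ones_like(tensor, dtype=bool)
    # k-th smallest absolute value is the pruning threshold
    threshold = np.partition(np.abs(tensor).ravel(), k - 1)[k - 1]
    mask = np.abs(tensor) > threshold
    return tensor * mask, mask
```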
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
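The SWA parameters above (start_frac=0.4, every=50) imply collecting an equal-weight running average of checkpoints from 40% of training onward, every 50 steps. A minimal sketch, assuming equal weighting over numpy parameter arrays:

```python
import numpy as np

class SWAAverager:
    """Running equal-weight average of parameter snapshots, collected
    every `every` steps once `start_frac` of training has elapsed.
    Equal weighting is an assumption; the PR does not state it."""
    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start = int(total_steps * start_frac)
        self.every = every
        self.n = 0          # number of snapshots averaged so far
        self.avg = None

    def maybe_update(self, step, params):
        if step < self.start or (step - self.start) % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [p.copy() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n    # incremental running mean
```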
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"backend_steps":5}
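The `backend_steps: 5` above is consistent with Muon's Newton-Schulz orthogonalization of the momentum matrix. A sketch of that backend step, using the quintic coefficients from the public Muon implementation (an assumption that this run uses the same ones):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via the quintic Newton-Schulz
    iteration used by Muon; `steps=5` matches the listed backend_steps.
    Pushes singular values toward 1 without computing an exact SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so all s.v. <= 1
    transpose = X.shape[0] > X.shape[1]   # iterate on the wide orientation
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X
```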
Evaluation
sliding window eval
parameters: {"stride":64}
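Sliding-window eval with stride 64 means each window scores only the tokens not covered by the previous window, so every token is scored exactly once with long context. The span bookkeeping can be sketched as below; a window of 2048 is an assumption borrowed from the training sequence length.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (window_start, window_end, n_scored) triples: the first
    window scores all its tokens, each later window scores only the
    `stride` new tokens beyond the previous window's end."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```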
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
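The two listed parameters (warmup_steps=20, warmdown_iters=3000) suggest a trapezoidal multiplier: short linear warmup, a flat middle, then linear warmdown to zero over the final 3000 iterations. The flat middle and linear shapes are assumptions; only the two parameters appear in the summary.

```python
def lr_multiplier(step, total_iters, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: warmup -> 1.0 plateau -> warmdown to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_start = total_iters - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_iters - step) / warmdown_iters)
    return 1.0
```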
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
CMS n-gram eval cache with bilateral confidence gating and product mixing
parameters: {"orders":"3-7","hashes":4,"counters":"64M"}
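The CMS above is a count-min sketch: n-gram counts stored in 4 hashed counter rows (64M counters in the run), queried by taking the minimum across rows, which upper-bounds the true count. The bilateral confidence gating and product mixing are not specified in the summary; a minimal count-min sketch alone, with a tiny width for illustration and blake2b-with-salt hashing as an assumption:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch with `depth` hash rows (4, matching the listing).
    add() increments one counter per row; query() takes the row-wise min,
    an upper bound on the true count."""
    def __init__(self, width=1 << 16, depth=4):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, row, key):
        # Per-row salt gives depth independent hash functions (assumed scheme).
        h = hashlib.blake2b(key, digest_size=8, salt=bytes([row] * 8))
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for r in range(len(self.rows)):
            self.rows[r][self._idx(r, key)] += count

    def query(self, key):
        return min(row[self._idx(r, key)] for r, row in enumerate(self.rows))
```

In the eval cache, one sketch per n-gram order (3 through 7) would be populated from the stream and its count estimates mixed into the model's next-token distribution.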
Novel Contributions
- Per-tensor adaptive rANS replacing zstd-22
- LeakyReLU(0.95)² activation sweep winner
- Soft XSA with learned per-head alpha on all layers
- Position-tiled Triton kernel for XSA with custom backward
- BigramHash removal to save artifact space
- 10% magnitude pruning with safety valve for fitting under 16MB
- Bilateral confidence-gated CMS n-gram eval cache