PR #1027

open

Non-record: LeakyReLU² + BigramHash + Int5/Int6 + SlidingWindow — val_bpb 1.3036 (1×H100)

by Syed-M-Zeeshan
val_bpb: 1.3036
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,893,048 bytes

Training Techniques

Architecture
LeakyReLU
Uses a LeakyReLU(0.5)^2 activation in the MLP instead of the standard ReLU^2.
parameters: {"negative_slope":0.5}
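A minimal sketch of the activation above, assuming it is applied elementwise to the MLP pre-activations (shown here as a scalar function in plain Python; the real model would use a tensor version):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU with the PR's negative_slope of 0.5, then squared;
    # replaces the standard ReLU^2 MLP activation.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Note that unlike ReLU^2, this is nonzero for negative inputs (e.g. -2.0 maps to 1.0), so gradients flow on both sides of zero.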
BigramHash
Adds hashed bigram embeddings as cheap n-gram features.
parameters: {"buckets":1536,"dim":128}
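A sketch of how hashed bigram features like these typically work, using the PR's 1536 buckets and dim 128; the specific hash function below is an assumption, not the PR's:

```python
BUCKETS, DIM = 1536, 128  # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    # Cheap multiplicative hash of the (prev, cur) token pair into a
    # fixed number of buckets; the exact hash used in the PR is unknown.
    return ((prev_tok * 1000003) ^ cur_tok) % buckets

def bigram_buckets(tokens: list[int]) -> list[int]:
    # One bucket id per position from 1 onward; each id would index a
    # learned (BUCKETS x DIM) embedding table whose rows are added to
    # the token embeddings as cheap n-gram features.
    return [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```

Hash collisions are tolerated by design: the table costs only 1536 x 128 parameters regardless of vocabulary size.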
U-Net skip connections
Uses U-Net style skip connections in the model.
parameters: null
GQA
Uses grouped query attention.
parameters: {"kv_heads":4,"heads":8}
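The grouped-query-attention head sharing implied by the parameters above (8 query heads, 4 KV heads) can be sketched as:

```python
HEADS, KV_HEADS = 8, 4        # from the PR's parameters
GROUP = HEADS // KV_HEADS     # query heads sharing each KV head (2)

def kv_head_for(query_head: int) -> int:
    # Consecutive query heads share one KV head, shrinking KV
    # projections and cache by a factor of GROUP.
    return query_head // GROUP

def expand_kv(kv_heads: list) -> list:
    # Repeat each per-KV-head tensor so it lines up with the query
    # heads; standard way GQA is evaluated with full-head attention.
    return [kv for kv in kv_heads for _ in range(GROUP)]
```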
Quantization
mixed int5/int6
bits: 5 and 6 (mixed)
scope: MLP weights and attention weights
fp16
bits: 16
scope: tied embeddings
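A generic sketch of symmetric round-to-nearest quantization at a given bit width; the PR's exact scheme (scale granularity, which tensors get int5 vs int6) is not specified here, and per the listing the fp16 tied embeddings stay unquantized:

```python
def quantize_symmetric(weights: list[float], bits: int):
    # Symmetric quantization: scale so the largest-magnitude weight
    # maps to qmax, then round; qmax is 15 for int5, 31 for int6.
    qmax = 2 ** (bits - 1) - 1
    scale = max((abs(w) for w in weights), default=1.0) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]
```

Reconstruction error per weight is bounded by half the scale, which is why the PR's pre-quantization TTT (fitting weights before rounding) can help.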
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last 40% of warmdown","frequency":"every 10 steps"}
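A minimal sketch of the two weight-averaging schemes combined above: EMA with the PR's 0.997 decay, and SWA as a running mean of checkpoints (the PR samples every 10 steps during the last 40% of warmdown; the scheduling logic is omitted here):

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997) -> None:
    # Exponential moving average of the weights, updated every step.
    for k in ema:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

class SWA:
    # Stochastic weight averaging: an incremental running mean over
    # periodically sampled checkpoints.
    def __init__(self):
        self.avg, self.n = {}, 0

    def update(self, params: dict) -> None:
        self.n += 1
        for k, v in params.items():
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (v - prev) / self.n
```

How the EMA and SWA averages are combined into the final artifact (e.g. which one is quantized) is not stated in the listing.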
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"scalar/embed params and pre-quantization TTT"}
Compression
lzma
level: extreme
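The artifact compression above maps directly onto Python's standard-library lzma module; "extreme" corresponds to the PRESET_EXTREME flag (shown here at preset level 9 as an assumption):

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # LZMA at the highest preset with the extreme flag, trading extra
    # compression time for a smaller artifact.
    return lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
```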
Evaluation
sliding window eval
parameters: {"stride":64}
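A sketch of sliding-window evaluation with the PR's stride of 64, assuming the window equals the 1024-token eval length: each window after the first scores only its newest stride tokens, so every token is conditioned on up to window − stride tokens of left context instead of being cut off at chunk boundaries.

```python
def sliding_eval_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    # Returns (ctx_start, ctx_end, score_start) triples: the model runs
    # on [ctx_start, ctx_end) but only tokens [score_start, ctx_end)
    # contribute to the bpb total.
    spans, scored = [], 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored))
        scored = end
    return spans
```

The cost is roughly window/stride (here 16x) more forward-pass tokens than non-overlapping evaluation.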
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","chunk_size":32768,"pre_quantization":true}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"iterations":1050,"warmdown":150}
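A sketch of the warmdown schedule above with the PR's 1050 total iterations and 150 warmdown steps; the constant-then-linear-to-zero shape is an assumption based on the common warmdown pattern:

```python
def lr_scale(step: int, iterations: int = 1050, warmdown: int = 150) -> float:
    # LR multiplier: constant at 1.0, then linear decay to 0 over the
    # final `warmdown` steps.
    start = iterations - warmdown
    if step < start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown)
```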
Regularization
logit softcap
parameters: {"softcap":30}
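Logit softcapping with the PR's cap of 30 is the standard tanh bound (as popularized by Gemma-style models), sketched here as a scalar function:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Smoothly bounds logits to (-cap, cap): near-identity for small
    # values, saturating for large ones, which regularizes the output
    # distribution.
    return cap * math.tanh(logit / cap)
```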

Novel Contributions

  • LeakyReLU(0.5)^2 activation
  • BigramHash(1536, dim=128) features
  • Mixed Int5/Int6 quantization with FP16 tied embeddings
  • EMA plus SWA weight averaging
  • AdamW pre-quantization test-time training
  • LZMA artifact compression
  • Sliding window evaluation with stride 64