PR #569 (open)

Record: 11L VRL + LeakyReLU² + Full GPTQ (3-seed mean val_bpb=1.1175)

by gowtham0992
val_bpb: 1.1175
Architecture: Transformer
Optimizer: Muon (matrix params), AdamW (embeddings and scalars)
Artifact Size: ≤15.94 MB

Training Techniques

Quantization
Full GPTQ
bits: 6
scope: all large weights (MLP, attention, bigram, VE projections); int8 for embeddings
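To make the int6 export concrete, here is a minimal sketch of the rounding/clipping half of the quantizer. It is only the per-channel symmetric step: real GPTQ additionally orders columns by the Hessian and redistributes rounding error via a Cholesky inverse, which is omitted here. The clip at the 0.9995 magnitude quantile mirrors the STE clip described under QAT-export alignment; the exact function layout is an assumption.

```python
def quantize_int6(channel, clip_q=0.9995):
    """Symmetric int6 fake-quantization of one weight channel (sketch).

    Clips at the 0.9995 magnitude quantile, then rounds to the
    symmetric int6 grid [-31, 31]. The Hessian-aware column ordering
    and error compensation of full GPTQ are intentionally omitted.
    """
    mags = sorted(abs(w) for w in channel)
    clip = mags[min(len(mags) - 1, int(clip_q * len(mags)))]
    qmax = 31  # symmetric int6: -32 left unused
    scale = clip / qmax if clip > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in channel]
    return q, scale
```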
QAT-export alignment
STE clip at quantile(0.9995), matched to the GPTQ export quantizer
2% magnitude pruning post-quantization
scope: int6 weights
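The post-quantization pruning step zeroes the smallest weights by magnitude so that runs of zeros in the int6 tensors compress better under zstd. A minimal sketch (the tie-breaking behavior at the threshold is an assumption):

```python
def prune_smallest(weights, fraction=0.02):
    """Zero the smallest `fraction` of weights by absolute magnitude.

    Applied after quantization: extra zeros improve zstd
    compressibility of the serialized int6 tensors.
    """
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```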
Architecture
Value Residual Learning (VRL)
Layer 0's V output added to all subsequent layers via learned sigmoid gates
parameters: {"learned_alphas":10,"sigmoid_init":0}
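A minimal sketch of the gated value-residual mix. One learned alpha per subsequent layer (10 total for 11 layers); with sigmoid_init=0 the gate starts at 0.5, an even blend. The convex-combination form below is an assumption; the PR only states that layer 0's V output is added via learned sigmoid gates.

```python
import math

def vrl_mix(v_layer, v0, alpha):
    """Blend the current layer's value vectors with layer 0's values.

    gate = sigmoid(alpha); alpha is learned per layer. At
    sigmoid_init = 0 the gate is 0.5. The exact mixing form
    ((1 - g) * v_l + g * v_0) is an assumption.
    """
    gate = 1.0 / (1.0 + math.exp(-alpha))
    return [(1.0 - gate) * a + gate * b for a, b in zip(v_layer, v0)]
```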
LeakyReLU(0.5)²
Replaces relu²; preserves negative gradient flow and doubles effective MLP capacity
parameters: {"negative_slope":0.5}
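The activation itself is a one-liner; the point is that, unlike relu², the negative branch (slope 0.5) keeps a nonzero gradient for x < 0:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5)^2: squared leaky ReLU.

    relu(x)^2 has zero output and zero gradient for x < 0; squaring
    a leaky ReLU keeps the negative side active, which the PR
    credits with doubling effective MLP capacity.
    """
    y = x if x > 0 else negative_slope * x
    return y * y
```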
XSA-all
Exclusive Self Attention on all 11 layers
parameters: {"layers":11}
SmearGate
Learned interpolation between current and previous token
parameters: null
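A minimal sketch of the smear gate, assuming a scalar sigmoid gate and a convex interpolation; the PR only says "learned interpolation between current and previous token", so the exact parameterization here is an assumption.

```python
import math

def smear_gate(tokens, g):
    """Interpolate each token's embedding with the previous token's.

    s = sigmoid(g); out_t = (1 - s) * x_t + s * x_{t-1}, with the
    first token passed through unchanged. Scalar gate g is assumed;
    a per-dimension gate would work the same way.
    """
    s = 1.0 / (1.0 + math.exp(-g))
    out = [tokens[0]]
    for t in range(1, len(tokens)):
        out.append([(1 - s) * c + s * p
                    for c, p in zip(tokens[t], tokens[t - 1])])
    return out
```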
BigramHash
2048 buckets, dim=128, projected to model_dim=512
parameters: {"buckets":2048,"dim":128,"model_dim":512}
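The bucketing step can be sketched as follows: each (prev, cur) token pair hashes into one of 2048 buckets, each bucket indexes a 128-dim embedding, and a learned matrix projects that to model_dim=512. The specific hash function (blake2b of the packed pair) is an assumption; the PR does not name one.

```python
import hashlib

def bigram_bucket(prev_tok, cur_tok, buckets=2048):
    """Hash a (prev, cur) token bigram into one of `buckets` buckets.

    The bucket indexes a 128-dim embedding table that is then
    projected to model_dim=512 (projection not shown). blake2b is
    an assumed stand-in for whatever hash the PR actually uses.
    """
    key = prev_tok.to_bytes(4, "little") + cur_tok.to_bytes(4, "little")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % buckets
```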
Partial RoPE + NTK-aware scaling
Partial Rotary Positional Embeddings on 16 of 64 head dims with NTK-aware scaling (base=10000)
parameters: {"partial_dims":[16,64],"ntk_base":10000}
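A minimal sketch of the partial rotation, assuming the first 16 of the 64 head dims are the rotated slice (the PR does not specify the layout). With scale=1.0 this reduces to plain RoPE at base 10000; NTK-aware scaling would stretch the base for longer contexts.

```python
import math

def partial_rope(q, pos, rot_dims=16, ntk_base=10000, scale=1.0):
    """Rotate the first `rot_dims` dims of a 64-dim head; leave the
    remaining dims untouched.

    `scale` multiplies the frequency base (NTK-aware scaling);
    scale = 1.0 is plain RoPE. Rotating the leading dims is an
    assumption about this PR's layout.
    """
    base = ntk_base * scale
    out = list(q)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[2 * i], q[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```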
LN Scale
Per-layer learned scale on attention and MLP outputs
parameters: null
Shared Value Embedding
Dim=128, shared between layers 9 and 10 with per-layer learned scales
parameters: {"dim":128,"layers":[9,10]}
Tied embeddings
Weight tying with init std=0.005
parameters: {"init_std":0.005}
Initialization
OrthoInit
Orthogonal initialization for matrix weights, zero-init for output projections
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup":"0.92 to 0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997,"frequency":"every step"}
Tight SWA
parameters: {"frequency":"every 50 steps","condition":"when LR scale < 0.2"}
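Both averaging schemes above reduce to a few lines. EMA runs every step at decay 0.997; Tight SWA averages checkpoints collected every 50 steps once the LR scale drops below 0.2 (the collection condition lives in the training loop and is not shown here).

```python
def ema_update(ema, params, decay=0.997):
    """One per-step EMA update: ema <- decay * ema + (1 - decay) * p."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def swa_average(snapshots):
    """Tight SWA: plain mean over checkpoint snapshots, which the
    training loop collects every 50 steps when LR scale < 0.2."""
    n = len(snapshots)
    return [sum(col) / n for col in zip(*snapshots)]
```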
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"type":"cosine decay"}
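The schedule can be sketched as a constant LR followed by cosine decay to zero over the final 3500 steps; the constant-then-decay shape is an assumption, since the PR only names "warmdown" with cosine decay and warmdown_steps=3500.

```python
import math

def lr_scale(step, total_steps, warmdown_steps=3500):
    """LR multiplier: 1.0 until the last `warmdown_steps`, then
    cosine decay from 1.0 to 0.0 (assumed shape)."""
    if step < total_steps - warmdown_steps:
        return 1.0
    t = (step - (total_steps - warmdown_steps)) / warmdown_steps
    return 0.5 * (1.0 + math.cos(math.pi * t))
```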
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
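Stride-64 sliding-window eval gives every token near-maximal left context while scoring it exactly once: the first window scores everything it covers, and each later window scores only its final 64 tokens. A sketch of the window/score-range bookkeeping (the exact indexing convention is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (begin, score_from, end) triples for sliding-window eval.

    Windows advance by `stride`; tokens in [score_from, end) count
    toward val_bpb, so each token is scored once with up to
    window - stride tokens of left context.
    """
    wins = []
    begin = 0
    while True:
        end = min(begin + window, n_tokens)
        score_from = begin if begin == 0 else end - stride
        wins.append((begin, score_from, end))
        if end >= n_tokens:
            return wins
        begin += stride
```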

Novel Contributions

  • First non-TTT Value Residual Learning (VRL) result on a standard architecture
  • Use of LeakyReLU(0.5)² activation replacing relu² to preserve negative gradient flow and double effective MLP capacity
  • Full GPTQ implementation with Hessian-aware int6 quantization and Cholesky inverse error compensation
  • QAT-export alignment with STE clip quantile(0.9995) matching GPTQ export quantizer
  • 2% magnitude pruning post-quantization for improved zstd compressibility
  • Extending Exclusive Self Attention (XSA) to all 11 layers
  • Combination of EMA, Tight SWA, and Late QAT for improved training stability and quantization robustness
  • Custom raw binary serialization with no torch.save overhead