PR #569 (open)

Record: 11L VRL + LeakyReLU² + Full GPTQ (3-seed mean val_bpb=1.1175)

by gowtham0992
val_bpb: 1.1175
Architecture: Transformer
Optimizer: Muon (matrix params), AdamW (embeddings and scalars)
Artifact Size: ≤15.94 MB

Training Techniques

Quantization
Full GPTQ
bits: 6
scope: all large weights (MLP, attention, bigram, VE projections); int8 for embeddings
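To make the int6 export concrete, here is a minimal sketch of the rounding/clipping half of the quantizer. It is only the per-channel symmetric step: real GPTQ additionally orders columns by the Hessian and redistributes rounding error via a Cholesky inverse, which is omitted here. The clip at the 0.9995 magnitude quantile mirrors the STE clip described under QAT-export alignment; the exact function layout is an assumption.

```python
def quantize_int6(channel, clip_q=0.9995):
    """Symmetric int6 fake-quantization of one weight channel (sketch).

    Clips at the 0.9995 magnitude quantile, then rounds to the
    symmetric int6 grid [-31, 31]. The Hessian-aware column ordering
    and error compensation of full GPTQ are intentionally omitted.
    """
    mags = sorted(abs(w) for w in channel)
    clip = mags[min(len(mags) - 1, int(clip_q * len(mags)))]
    qmax = 31  # symmetric int6: -32 left unused
    scale = clip / qmax if clip > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in channel]
    return q, scale
```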
QAT-export alignment
STE clip at quantile(0.9995), matched to the GPTQ export quantizer
2% magnitude pruning post-quantization
scope: int6 weights
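The post-quantization pruning step zeroes the smallest weights by magnitude so that runs of zeros in the int6 tensors compress better under zstd. A minimal sketch (the tie-breaking behavior at the threshold is an assumption):

```python
def prune_smallest(weights, fraction=0.02):
    """Zero the smallest `fraction` of weights by absolute magnitude.

    Applied after quantization: extra zeros improve zstd
    compressibility of the serialized int6 tensors.
    """
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```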
Architecture
Value Residual Learning (VRL)
Layer 0's V output added to all subsequent layers via learned sigmoid gates
parameters: {"learned_alphas":10,"sigmoid_init":0}
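A minimal sketch of the gated value-residual mix. One learned alpha per subsequent layer (10 total for 11 layers); with sigmoid_init=0 the gate starts at 0.5, an even blend. The convex-combination form below is an assumption; the PR only states that layer 0's V output is added via learned sigmoid gates.

```python
import math

def vrl_mix(v_layer, v0, alpha):
    """Blend the current layer's value vectors with layer 0's values.

    gate = sigmoid(alpha); alpha is learned per layer. At
    sigmoid_init = 0 the gate is 0.5. The exact mixing form
    ((1 - g) * v_l + g * v_0) is an assumption.
    """
    gate = 1.0 / (1.0 + math.exp(-alpha))
    return [(1.0 - gate) * a + gate * b for a, b in zip(v_layer, v0)]
```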
LeakyReLU(0.5)²
Replaces relu²; preserves negative gradient flow and doubles effective MLP capacity
parameters: {"negative_slope":0.5}
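The activation itself is a one-liner; the point is that, unlike relu², the negative branch (slope 0.5) keeps a nonzero gradient for x < 0:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5)^2: squared leaky ReLU.

    relu(x)^2 has zero output and zero gradient for x < 0; squaring
    a leaky ReLU keeps the negative side active, which the PR
    credits with doubling effective MLP capacity.
    """
    y = x if x > 0 else negative_slope * x
    return y * y
```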
XSA-all
Exclusive Self Attention on all 11 layers
parameters: {"layers":11}
SmearGate
Learned interpolation between current and previous token
parameters: null
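A minimal sketch of the smear gate, assuming a scalar sigmoid gate and a convex interpolation; the PR only says "learned interpolation between current and previous token", so the exact parameterization here is an assumption.

```python
import math

def smear_gate(tokens, g):
    """Interpolate each token's embedding with the previous token's.

    s = sigmoid(g); out_t = (1 - s) * x_t + s * x_{t-1}, with the
    first token passed through unchanged. Scalar gate g is assumed;
    a per-dimension gate would work the same way.
    """
    s = 1.0 / (1.0 + math.exp(-g))
    out = [tokens[0]]
    for t in range(1, len(tokens)):
        out.append([(1 - s) * c + s * p
                    for c, p in zip(tokens[t], tokens[t - 1])])
    return out
```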
BigramHash
2048 buckets, dim=128, projected to model_dim=512
parameters: {"buckets":2048,"dim":128,"model_dim":512}
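The bucketing step can be sketched as follows: each (prev, cur) token pair hashes into one of 2048 buckets, each bucket indexes a 128-dim embedding, and a learned matrix projects that to model_dim=512. The specific hash function (blake2b of the packed pair) is an assumption; the PR does not name one.

```python
import hashlib

def bigram_bucket(prev_tok, cur_tok, buckets=2048):
    """Hash a (prev, cur) token bigram into one of `buckets` buckets.

    The bucket indexes a 128-dim embedding table that is then
    projected to model_dim=512 (projection not shown). blake2b is
    an assumed stand-in for whatever hash the PR actually uses.
    """
    key = prev_tok.to_bytes(4, "little") + cur_tok.to_bytes(4, "little")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % buckets
```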
Partial RoPE + NTK-aware scaling
Partial Rotary Positional Embeddings on 16 of 64 head dims with NTK-aware scaling (base=10000)
parameters: {"partial_dims":[16,64],"ntk_base":10000}
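A minimal sketch of the partial rotation, assuming the first 16 of the 64 head dims are the rotated slice (the PR does not specify the layout). With scale=1.0 this reduces to plain RoPE at base 10000; NTK-aware scaling would stretch the base for longer contexts.

```python
import math

def partial_rope(q, pos, rot_dims=16, ntk_base=10000, scale=1.0):
    """Rotate the first `rot_dims` dims of a 64-dim head; leave the
    remaining dims untouched.

    `scale` multiplies the frequency base (NTK-aware scaling);
    scale = 1.0 is plain RoPE. Rotating the leading dims is an
    assumption about this PR's layout.
    """
    base = ntk_base * scale
    out = list(q)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[2 * i], q[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```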
LN Scale
Per-layer learned scale on attention and MLP outputs
parameters: null
Shared Value Embedding
Dim=128, shared between layers 9 and 10 with per-layer learned scales
parameters: {"dim":128,"layers":[9,10]}
Tied embeddings
Weight tying with init std=0.005
parameters: {"init_std":0.005}
Initialization
OrthoInit
Orthogonal initialization for matrix weights, zero-init for output projections
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup":"0.92 to 0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997,"frequency":"every step"}
Tight SWA
parameters: {"frequency":"every 50 steps","condition":"when LR scale < 0.2"}
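Both averaging schemes above reduce to a few lines. EMA runs every step at decay 0.997; Tight SWA averages checkpoints collected every 50 steps once the LR scale drops below 0.2 (the collection condition lives in the training loop and is not shown here).

```python
def ema_update(ema, params, decay=0.997):
    """One per-step EMA update: ema <- decay * ema + (1 - decay) * p."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def swa_average(snapshots):
    """Tight SWA: plain mean over checkpoint snapshots, which the
    training loop collects every 50 steps when LR scale < 0.2."""
    n = len(snapshots)
    return [sum(col) / n for col in zip(*snapshots)]
```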
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"type":"cosine decay"}
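The schedule can be sketched as a constant LR followed by cosine decay to zero over the final 3500 steps; the constant-then-decay shape is an assumption, since the PR only names "warmdown" with cosine decay and warmdown_steps=3500.

```python
import math

def lr_scale(step, total_steps, warmdown_steps=3500):
    """LR multiplier: 1.0 until the last `warmdown_steps`, then
    cosine decay from 1.0 to 0.0 (assumed shape)."""
    if step < total_steps - warmdown_steps:
        return 1.0
    t = (step - (total_steps - warmdown_steps)) / warmdown_steps
    return 0.5 * (1.0 + math.cos(math.pi * t))
```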
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
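Stride-64 sliding-window eval gives every token near-maximal left context while scoring it exactly once: the first window scores everything it covers, and each later window scores only its final 64 tokens. A sketch of the window/score-range bookkeeping (the exact indexing convention is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (begin, score_from, end) triples for sliding-window eval.

    Windows advance by `stride`; tokens in [score_from, end) count
    toward val_bpb, so each token is scored once with up to
    window - stride tokens of left context.
    """
    wins = []
    begin = 0
    while True:
        end = min(begin + window, n_tokens)
        score_from = begin if begin == 0 else end - stride
        wins.append((begin, score_from, end))
        if end >= n_tokens:
            return wins
        begin += stride
```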

Novel Contributions

  • First non-TTT Value Residual Learning (VRL) result on a standard architecture
  • Use of LeakyReLU(0.5)² activation replacing relu² to preserve negative gradient flow and double effective MLP capacity
  • Full GPTQ implementation with Hessian-aware int6 quantization and Cholesky inverse error compensation
  • QAT-export alignment with STE clip quantile(0.9995) matching GPTQ export quantizer
  • 2% magnitude pruning post-quantization for improved zstd compressibility
  • Extending Exclusive Self Attention (XSA) to all 11 layers
  • Combination of EMA, Tight SWA, and Late QAT for improved training stability and quantization robustness
  • Custom raw binary serialization with no torch.save overhead