PR #1130

open

Record: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)

by Gusanidas
val_bpb
1.1140
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.88 MB

Training Techniques

Architecture
Residual lambdas
Learnable per-sublayer residual scaling with exponential recency bias across layers.
parameters: {"init":"sqrt(1.1)"}
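A minimal sketch of the residual-lambda idea in pure Python: each sublayer's skip path is scaled by a learnable lambda, initialized per the reported sqrt(1.1). The exponential recency-bias decay factor (0.9) and the layer count are illustrative assumptions, not from the PR.

```python
import math

# Hedged sketch of "residual lambdas": each sublayer's residual branch
# is scaled by a learnable lambda. Init follows the reported
# parameters ({"init": "sqrt(1.1)"}); the exponential recency bias
# across layers (decay 0.9, deeper layers kept stronger) is assumed.
N_LAYERS = 12
INIT = math.sqrt(1.1)
DECAY = 0.9  # assumed recency-bias factor, not from the PR

# Earlier layers' lambdas decay exponentially; the last layer keeps INIT.
lambdas = [INIT * DECAY ** (N_LAYERS - 1 - l) for l in range(N_LAYERS)]

def residual_step(x, sublayer_out, lam):
    """One residual connection with a learnable scale on the skip path."""
    return lam * x + sublayer_out
```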
BigramHash
Expanded hash-based embedding table to reduce collision ratio.
parameters: {"buckets":6144}
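A hedged sketch of how a hash-based bigram embedding lookup could work. The bucket count (6144) is from the reported parameters; the mixing multiplier and hashing scheme are illustrative assumptions.

```python
# Hedged sketch of a hash-based bigram embedding lookup. The bucket
# count (6144) is from the PR; the fixed odd multiplier is assumed.
N_BUCKETS = 6144

def bigram_bucket(prev_token: int, token: int) -> int:
    # Mix the two token ids, then reduce modulo the table size.
    # A larger table lowers the collision ratio between bigrams.
    return (prev_token * 1000003 + token) % N_BUCKETS

# Each bucket indexes a row of a learned (N_BUCKETS, d_model) embedding
# table; bigrams that collide share a row.
```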
VE196
Increased value embedding dimension on selected layers.
parameters: {"dimensions":196,"layers":[5,9,10]}
XSA
Exclusive self-attention applied to the last 7 layers.
parameters: {"layers":7}
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
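One plausible reading of "LeakyReLU squared" as a sketch: apply LeakyReLU with the reported negative_slope of 0.5, then square while preserving sign so the activation stays monotone. The sign-preserving choice is an assumption; plain squaring would fold the negative branch back to positive values.

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # Hedged sketch: LeakyReLU followed by a sign-preserving square.
    # The sign-preserving variant is an assumption, not confirmed
    # by the PR.
    l = x if x >= 0.0 else negative_slope * x
    return l * abs(l)
```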
Cache + backout
Cached layer 7 hidden state is subtracted back via a learnable gate before the LM head.
parameters: {"layer":7}
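The cache-and-backout path can be sketched with scalars standing in for hidden-state vectors: the layer-7 output is cached, and a learnable gate subtracts it back out before the LM head. The gate value (0.5) and toy layers are illustrative assumptions.

```python
# Hedged sketch of "cache + backout": the hidden state after layer 7
# is cached, and before the LM head a learnable gate subtracts it back
# out of the final hidden state. Gate init (0.5) is assumed.

def forward(x, layers, backout_layer=7, gate=0.5):
    cached = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == backout_layer:
            cached = x  # cache the layer-7 output
    # subtract the cached state, scaled by the learnable gate,
    # before the LM head sees the representation
    return x - gate * cached

# Toy layers: each adds 1.0 to a scalar "hidden state".
layers = [lambda h: h + 1.0] * 10
out = forward(0.0, layers)  # cached = 8.0, final = 10.0 - 0.5 * 8.0
```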
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":"16/64"}
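A sketch of the 16/64 split: rotary embeddings are applied to the first 16 of 64 head dimensions (eight complex pairs) and the remaining 48 pass through unchanged. The rotation-frequency base (10000) is the usual RoPE default and an assumption here.

```python
import math

# Hedged sketch of partial RoPE: rotate only the first 16 of 64 head
# dims, matching the reported "16/64" split; the rest are untouched.
HEAD_DIM = 64
ROPE_DIM = 16  # dimensions that receive rotary embeddings

def partial_rope(vec, pos, base=10000.0):
    out = list(vec)
    for i in range(0, ROPE_DIM, 2):  # rotate pairs (i, i+1)
        theta = pos * base ** (-i / ROPE_DIM)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out  # dims ROPE_DIM..HEAD_DIM-1 pass through unchanged
```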
MLP3x
3× MLP hidden-dimension expansion.
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"split_early_late_lr":true,"matrix_lr_early":0.036,"matrix_lr_late":0.044,"scalar_lr_early":0.028,"scalar_lr_late":0.018}
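The split early/late learning-rate banks can be sketched as optimizer parameter groups: matrix (2-D) parameters go to Muon, scalar/vector parameters to Adam, and each bank splits by depth with the LRs reported above. The layer-depth cutoff (6) is an assumption.

```python
# Hedged sketch of split early/late LR banks. The four LRs are from
# the PR's other_params; the early/late cutoff (layer 6) is assumed.
MATRIX_LR_EARLY, MATRIX_LR_LATE = 0.036, 0.044
SCALAR_LR_EARLY, SCALAR_LR_LATE = 0.028, 0.018
EARLY_CUTOFF = 6  # assumed boundary between "early" and "late" layers

def param_groups(params):
    """params: list of (layer_index, ndim) describing each parameter."""
    lrs = {"muon_early": MATRIX_LR_EARLY, "muon_late": MATRIX_LR_LATE,
           "adam_early": SCALAR_LR_EARLY, "adam_late": SCALAR_LR_LATE}
    groups = {name: [] for name in lrs}
    for layer, ndim in params:
        bank = "muon" if ndim >= 2 else "adam"   # matrices -> Muon
        phase = "early" if layer < EARLY_CUTOFF else "late"
        groups[f"{bank}_{phase}"].append((layer, ndim))
    return [{"name": k, "lr": lrs[k], "params": v}
            for k, v in groups.items()]
```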
Quantization
GPTQ
bits: 6
scope: full model
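For orientation, this is the 6-bit grid the weights land on. GPTQ itself performs Hessian-aware, column-by-column error compensation; the sketch below shows only plain round-to-nearest symmetric quantization to the same signed 6-bit range, which is a simplification.

```python
# Hedged sketch of the signed 6-bit grid. Real GPTQ compensates
# rounding error using second-order (Hessian) information; this is
# only round-to-nearest onto the same grid.
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed 6-bit

def quantize(weights):
    scale = max(abs(w) for w in weights) / QMAX
    q = [max(-QMAX - 1, min(QMAX, round(w / scale))) for w in weights]
    deq = [qi * scale for qi in q]  # dequantized approximation
    return q, deq
```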
late QAT
bits: null
scope: full model
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
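The reported LN scale schedule as a one-liner: the LayerNorm gain at layer l is set to 1/sqrt(l + 1), damping deeper layers.

```python
import math

# Sketch of the reported LayerNorm scale schedule: gain at layer l
# is 1/sqrt(l + 1), so deeper layers are progressively damped.
def ln_scale(layer: int) -> float:
    return 1.0 / math.sqrt(layer + 1)
```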
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
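A sketch of combining the two averages with the reported hyperparameters: an EMA with decay 0.997 updated every step, plus an SWA running mean that folds in a snapshot every 50 steps. How the two averages are merged at the end is not specified in the PR, so the sketch keeps them separate.

```python
# Hedged sketch of EMA + SWA weight averaging. ema_decay=0.997 and
# swa_every=50 are from the PR; scalars stand in for weight tensors.
EMA_DECAY = 0.997
SWA_EVERY = 50

def run_averaging(weights_per_step):
    ema, swa, n_swa = None, None, 0
    for step, w in enumerate(weights_per_step, start=1):
        # EMA: exponentially decayed average, updated every step
        ema = w if ema is None else EMA_DECAY * ema + (1 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:  # SWA: running mean of snapshots
            n_swa += 1
            swa = w if swa is None else swa + (w - swa) / n_swa
    return ema, swa
```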
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
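The stride-64 evaluation can be sketched as follows: each forward pass sees a full context window, but only the last `stride` tokens of the window are scored, so every scored token gets (near-)full left context. The window length (256) is an illustrative assumption.

```python
# Hedged sketch of sliding-window eval with stride 64 (from the PR);
# window length 256 is assumed. Only tokens [score_from, end) of each
# window contribute to the loss, so each keeps long left context.

def eval_windows(n_tokens, window=256, stride=64):
    """Return (start, end, score_from) spans covering all tokens."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))  # score tokens [pos, end)
        pos = end
    return spans
```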
Other
other
Coprime-stride multi-shard data loader for batch diversity.
parameters: null
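The coprime-stride idea in miniature: stepping through the shard list with a stride coprime to the shard count visits every shard exactly once per cycle, so consecutive batches draw from well-separated shards. The shard count and stride below are illustrative.

```python
import math

# Hedged sketch of a coprime-stride multi-shard order: because
# gcd(stride, n_shards) == 1, i * stride mod n_shards enumerates
# every shard exactly once per cycle, interleaving distant shards.

def shard_order(n_shards, stride):
    assert math.gcd(n_shards, stride) == 1, "stride must be coprime"
    return [(i * stride) % n_shards for i in range(n_shards)]
```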
other
Train-data GPTQ calibration performed within the training budget.
parameters: {"calibration_time_seconds":14}
other
MiLe margin loss with entropy-weighted cross-entropy and gamma=0.75.
parameters: {"gamma":0.75}
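A sketch of the entropy-weighting component only: each token's cross-entropy is scaled by its predictive entropy raised to the reported gamma=0.75, down-weighting tokens the model is already confident about. The exact MiLe margin term is not specified in the PR and is omitted here.

```python
import math

# Hedged sketch of entropy-weighted cross-entropy in the spirit of the
# described MiLe loss. gamma=0.75 is from the PR; the margin term is
# not specified there, so only the entropy weighting is shown.
GAMMA = 0.75

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def weighted_ce(probs, target):
    """probs: predicted distribution; target: index of the true token."""
    ce = -math.log(probs[target])
    return entropy(probs) ** GAMMA * ce
```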

Novel Contributions

  • Residual lambdas for per-sublayer residual scaling
  • Split early/late learning-rate banks for Muon and Adam parameters
  • Train-budget GPTQ calibration using training data
  • Coprime-stride multi-shard data loader
  • Expanded BigramHash table
  • Larger value embeddings on selected layers
  • XSA extended to the last 7 layers
  • MiLe margin loss
  • Cache-and-backout residual path
  • Flash Attention 3 integration
  • Tuned batch size for the training budget