PR #1130
openRecord: 1.1140 BPB — ResidLambdas + Split-LR + Train-Budget GPTQ + Coprime Loader (12-seed mean)
by Gusanidas
val_bpb
1.1140
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.88 MB
Training Techniques
Architecture
Residual lambdas
Learnable per-sublayer residual scaling with exponential recency bias across layers.
parameters: {"init":"sqrt(1.1)"}
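A minimal sketch of the residual-lambda idea: each sublayer gets a learnable scalar that scales the residual stream before the sublayer output is added, so a contribution from k sublayers ago is scaled by lam**k, giving the exponential bias across depth. Whether lam multiplies the stream or the branch, and whether it is per-sublayer or shared, is an assumption; only the sqrt(1.1) init is stated in the PR.

```python
import math

LAM_INIT = math.sqrt(1.1)  # init value stated in the PR

def residual_step(x, sublayer_out, lam=LAM_INIT):
    # Scale the residual stream by the learnable scalar lam, then add the
    # sublayer output. Repeated application makes older contributions
    # scale as lam**k, the exponential depth bias described above.
    return [lam * xi + si for xi, si in zip(x, sublayer_out)]
```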
BigramHash
Expanded hash-based embedding table to reduce collision ratio.
parameters: {"buckets":6144}
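The bucket lookup can be sketched as follows: the (previous, current) token pair is hashed into one of 6144 embedding rows, and a larger table lowers the chance two distinct bigrams collide. The mixing constant is illustrative, not the PR's actual hash function.

```python
def bigram_bucket(prev_tok: int, cur_tok: int, n_buckets: int = 6144) -> int:
    # Hash the bigram (prev_tok, cur_tok) to an embedding-table row.
    # 1000003 is an arbitrary odd multiplier for mixing (an assumption);
    # the mask keeps the intermediate in 32 bits.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % n_buckets
```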
VE196
Increased value embedding dimension on selected layers.
parameters: {"dimensions":196,"layers":[5,9,10]}
XSA
Exclusive self-attention applied to the last 7 layers.
parameters: {"layers":7}
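One possible reading of "exclusive" self-attention, sketched as a mask: each position attends only to strictly earlier positions, with its own position masked out. This interpretation is an assumption; the PR does not define XSA beyond naming it and applying it to the last 7 layers.

```python
def xsa_mask(n: int):
    # Boolean attention mask under the strict-causal reading: row i is
    # True at column j only for j < i, so a token never attends to itself.
    # Position 0 attends to nothing under this reading and would need a
    # fallback (e.g. attending to a sink token) in a real model.
    return [[j < i for j in range(n)] for i in range(n)]
```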
LeakyReLU
LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
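The activation, written out elementwise: apply LeakyReLU with slope 0.5 on the negative side, then square. Note that plain squaring discards the sign of the negative branch; whether the PR squares directly or preserves sign is an assumption.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    # y = LeakyReLU(x), then y**2. With slope 0.5, a negative input x
    # yields (0.5 * x) ** 2, a small positive value.
    y = x if x >= 0 else negative_slope * x
    return y * y
```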
Cache + backout
Cached layer 7 hidden state is subtracted back via a learnable gate before the LM head.
parameters: {"layer":7}
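The backout step, sketched on flat vectors: the layer-7 hidden state is cached during the forward pass and a gated copy is subtracted from the final hidden state just before the LM head. The gate is learnable; the 0.1 init here is illustrative, not from the PR.

```python
def backout(final_h, cached_h, gate: float = 0.1):
    # Subtract the gated cached hidden state (layer 7 in this record)
    # from the final hidden state before projecting to logits.
    return [f - gate * c for f, c in zip(final_h, cached_h)]
```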
Partial RoPE
Rotary positional embeddings applied to 16 of the 64 head dimensions; the remaining dimensions pass through unrotated.
parameters: {"dimensions":"16/64"}
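A pure-Python sketch of partial RoPE on a single head vector: only the first 16 dimensions are rotated in pairs, the rest pass through. The pairing convention and the frequency schedule over the rotated dims are assumptions; the PR states only the 16/64 split.

```python
import math

def partial_rope(vec, pos, rot_dims: int = 16, base: float = 10000.0):
    # Rotate adjacent pairs within the first rot_dims coordinates by a
    # position-dependent angle; coordinates rot_dims..end are unchanged.
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```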
MLP3x
MLP hidden layer expanded to 3x the model width.
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"split_early_late_lr":true,"matrix_lr_early":0.036,"matrix_lr_late":0.044,"scalar_lr_early":0.028,"scalar_lr_late":0.018}
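The split early/late learning-rate banks can be sketched as a simple lookup: matrix (Muon) and scalar (Adam-style) parameters each get one LR for the early phase of training and a different one for the late phase. The 50% switch point is an assumption; the PR states only the four LR values.

```python
def lr_for(group: str, frac_done: float, split: float = 0.5) -> float:
    # Two LR "banks" per parameter group, switched at a fraction of the
    # training budget. Values are from the record; split=0.5 is assumed.
    early = {"matrix": 0.036, "scalar": 0.028}
    late = {"matrix": 0.044, "scalar": 0.018}
    table = early if frac_done < split else late
    return table[group]
```

Note the asymmetry: the matrix LR rises late while the scalar LR falls, which is exactly what the record's `other_params` encode.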
Quantization
GPTQ
bits: 6
scope: full model
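For orientation, a symmetric 6-bit quantize/dequantize round-trip on a flat weight list. This is only the bit-width arithmetic; GPTQ proper additionally uses Hessian-aware, column-by-column error correction, which is not shown here.

```python
def quantize_dequantize(w, bits: int = 6):
    # Symmetric per-tensor quantization: map weights to signed integers
    # in [-(2**(bits-1)), 2**(bits-1) - 1], then map back.
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = max(abs(x) for x in w) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```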
late QAT
bits: null
scope: full model
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
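The LN scale rule as stated: the LayerNorm gain at layer index `layer` is initialized (or fixed) to 1/sqrt(layer+1), so deeper layers start with smaller scale. Whether this is an init or a fixed scale is not stated; the formula is from the record.

```python
import math

def ln_scale(layer_idx: int) -> float:
    # LayerNorm gain per the record's rule: 1 / sqrt(layer + 1).
    return 1.0 / math.sqrt(layer_idx + 1)
```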
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
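The two averages sketched on flat weight lists: an EMA with decay 0.997 updated every step, and an SWA running mean that takes a snapshot every 50 steps. Combining them (e.g. which one feeds evaluation) is not specified; both parameter values are from the record.

```python
def ema_update(ema, w, decay: float = 0.997):
    # Exponential moving average of the weights, one step.
    return [decay * e + (1 - decay) * wi for e, wi in zip(ema, w)]

class SWA:
    # Stochastic weight averaging: uniform running mean of snapshots
    # taken every `every` steps.
    def __init__(self):
        self.n = 0
        self.avg = None

    def maybe_update(self, step, w, every: int = 50):
        if step % every:
            return
        self.n += 1
        self.avg = list(w) if self.avg is None else [
            a + (wi - a) / self.n for a, wi in zip(self.avg, w)]
```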
Compression
lzma
level: null
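The artifact compression step, using Python's standard `lzma` module. The record leaves the level/preset unstated, so the default preset is used here.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # LZMA-compress the serialized model bytes; preset not stated in the
    # record, so lzma's default is used.
    return lzma.compress(raw)
```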
Evaluation
sliding window eval
parameters: {"stride":64}
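Strided sliding-window evaluation, sketched as span bookkeeping: windows advance by 64 tokens, each window scores only the tokens not covered by the previous window, so every scored token gets near-full left context. The context length of 256 here is illustrative; only stride=64 is stated.

```python
def sliding_eval_spans(n_tokens: int, ctx: int = 256, stride: int = 64):
    # Return (begin, end, n_scored) per window: tokens in [begin, end)
    # are fed to the model, but only the last n_scored of them (those
    # past the previous window's end) contribute to the loss.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```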
Other
other
Coprime-stride multi-shard data loader for batch diversity.
parameters: null
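The coprime-stride idea, sketched for shard scheduling: pick a stride coprime with the shard count so consecutive batches cycle through every shard before any repeats, diversifying each batch's data sources. The search-upward-from-a-desired-stride heuristic is an assumption; the PR states only the technique's name and purpose.

```python
import math

def coprime_stride(n_shards: int, desired: int) -> int:
    # Smallest stride >= desired that is coprime with n_shards.
    s = desired
    while math.gcd(s, n_shards) != 1:
        s += 1
    return s

def shard_order(n_shards: int, stride: int):
    # Because gcd(stride, n_shards) == 1, this visits every shard once
    # before repeating.
    return [(i * stride) % n_shards for i in range(n_shards)]
```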
other
Train-data GPTQ calibration performed within the training budget.
parameters: {"calibration_time_seconds":14}
other
MiLe margin loss with entropy-weighted cross-entropy and gamma=0.75.
parameters: {"gamma":0.75}
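One plausible form of the entropy-weighted term, sketched on a single token's probability vector: the cross-entropy is scaled by the predictive entropy raised to gamma, down-weighting confidently-predicted tokens. This weighting form is an assumption; the record states only the MiLe name and gamma=0.75.

```python
import math

def mile_weighted_ce(probs, target: int, gamma: float = 0.75) -> float:
    # Cross-entropy on the target token, scaled by entropy**gamma.
    ce = -math.log(probs[target])
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    return (ent ** gamma) * ce
```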
Novel Contributions
- Residual lambdas for per-sublayer residual scaling
- Split early/late learning-rate banks for Muon and Adam parameters
- Train-budget GPTQ calibration using training data
- Coprime-stride multi-shard data loader
- Expanded BigramHash table
- Larger value embeddings on selected layers
- XSA extended to the last 7 layers
- MiLe margin loss
- Cache-and-backout residual path
- Flash Attention 3 integration
- Tuned batch size for the training budget