PR #1586
openRecord: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean)
by dexhunter
val_bpb
1.0749
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.93 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: MLP and attention weight matrices
GPTQ
bits: 7
scope: token embeddings
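The quantization entries above pair low-bit weights with per-layer clipping. Below is a minimal numpy sketch of symmetric round-to-nearest quantization where the clip threshold is expressed in units of the weight standard deviation (a `clip_sigma` knob, one value per layer). This is a simplified stand-in: real GPTQ additionally does a Hessian-weighted error-correction pass, which is omitted here, and the exact adaptive clip rule used in the PR is an assumption.

```python
import numpy as np

def quantize_clipped(w, bits, clip_sigma):
    """Symmetric round-to-nearest quantization with a sigma-based clip.

    Simplified stand-in for GPTQ: the Hessian-weighted error-correction
    pass of the real algorithm is omitted.
    """
    clip = clip_sigma * w.std()                 # per-layer clip threshold
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6-bit, 63 for 7-bit
    scale = clip / qmax
    # Use the full two's-complement range [-qmax-1, qmax].
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
# Hypothetical per-layer setting; the PR tunes different clip_sigmas
# for MLP vs. attention layers.
q, s = quantize_clipped(w, bits=6, clip_sigma=3.0)
w_hat = dequantize(q, s)
```

Within the clip range the reconstruction error is at most half a quantization step (`scale / 2`); only the rare weights beyond `clip_sigma` standard deviations are clipped harder.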
Architecture
weight tying
Tied input and output token embedding matrices
parameters: null
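Weight tying reuses one embedding matrix both to look up token vectors and to produce output logits, so the unembedding matrix costs nothing extra in the artifact. A minimal sketch (shapes are illustrative, not taken from the PR):

```python
import numpy as np

vocab, d_model = 1000, 64                       # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.02, (vocab, d_model))     # the single shared matrix

def embed(token_ids):
    return E[token_ids]                         # (seq, d_model)

def unembed(hidden):
    return hidden @ E.T                         # (seq, vocab) logits
```

With tying, only `E` has to be stored and quantized once, which interacts directly with the size budget.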
LeakyReLU
Uses LeakyReLU activation in the MLP
parameters: {"slope":0.5}
Partial RoPE
Rotary positional embeddings applied to only a subset of each head's dimensions
parameters: {"dimensions":"16/64"}
depth recurrence
Triple recurrence with selected layers looped multiple times
parameters: {"layers":"3-5","loops":2}
U-Net skip connections
Sigmoid-gated U-Net style skip connections
parameters: null
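A U-Net style skip saves an early-layer activation and mixes it back into a matching late layer; here the mix is modulated by a learned sigmoid gate. A minimal sketch assuming a scalar gate per skip (the gate's shape in the PR is not specified):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedSkip:
    """Adds a saved early activation into a later layer's input,
    scaled by sigmoid(gate_logit). Scalar gate is an assumption."""
    def __init__(self, gate_logit=0.0):
        self.gate_logit = gate_logit    # learned; 0.0 -> gate of 0.5

    def __call__(self, x_deep, x_skip):
        g = sigmoid(self.gate_logit)
        return x_deep + g * x_skip
```

Initializing the gate logit near zero starts the skip at half strength, letting training open or close each connection.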
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.5
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.9965}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
Test-Time Training
LoRA TTT
parameters: {"rank":96,"learning_rate":0.0001,"chunk_size":48,"weight_decay":0.5,"score_first":true,"doc_independent":true}
Compression
Brotli
level: 11
Novel Contributions
- Per-layer adaptive GPTQ clipping with different clip_sigmas for MLP and attention layers
- int7 token embeddings to reduce artifact size while preserving quality
- Systematic tuning of MATRIX_LR to 0.026
- Combining GPTQ quantization with doc-independent LoRA test-time training under the size budget