PR #1759

open

Non-record: SP8192 + LoRA on tied embedding (1.07994, 1 seed)

by yijieyuan
val_bpb
1.0799
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 8
scope: tied embedding
Architecture
weight tying
Tied token embeddings used in the model.
parameters: null
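With weight tying, the token-embedding matrix is reused as the output projection, so the model stores a single (vocab × d_model) matrix. A minimal sketch (class name and initialization are illustrative, not from this submission):

```python
class TiedLM:
    # Weight tying: the token-embedding matrix doubles as the output
    # head, so embedding and unembedding share one parameter matrix.
    def __init__(self, vocab: int, d_model: int):
        # Toy deterministic init; a real model would train these weights.
        self.emb = [[0.01 * (i + j) for j in range(d_model)] for i in range(vocab)]

    def embed(self, token: int):
        return self.emb[token]

    def logits(self, h):
        # Output head reuses self.emb: logit_t = <h, emb[t]>.
        return [sum(a * b for a, b in zip(h, row)) for row in self.emb]
```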
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: {"slope":0.5}
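The record's slope of 0.5 is unusually large for a LeakyReLU; the activation itself is standard:

```python
def leaky_relu(x: float, slope: float = 0.5) -> float:
    # LeakyReLU: identity for positive inputs, scaled-down pass-through
    # for negative inputs (slope 0.5 taken from this record).
    return x if x > 0 else slope * x
```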
depth recurrence
Recurrent reuse of layers to create virtual depth.
parameters: {"layers":3,"activate_at_frac":0.35}
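A minimal sketch of depth recurrence, assuming "layers": 3 means three passes over the shared stack and "activate_at_frac": 0.35 means recurrence is switched on once training passes 35% of total steps (both interpretations are assumptions, not confirmed by the record):

```python
def recurrent_forward(x, layers, recurrences: int = 3, active: bool = True):
    # Reuse the same layer stack several times to create virtual depth.
    # `active` would be flipped on once training passes the
    # activate_at_frac (0.35) point, per the assumed semantics above.
    passes = recurrences if active else 1
    for _ in range(passes):
        for layer in layers:
            x = layer(x)
    return x
```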
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
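Partial RoPE rotates only the first 16 of 64 per-head dimensions and passes the rest through unrotated. A sketch over a flat feature list (pairing convention and base frequency are standard RoPE assumptions, not taken from the record):

```python
import math

def partial_rope(x, pos: int, rot_dims: int = 16, base: float = 10000.0):
    # x: per-head feature vector (64 dims here). Rotate only the first
    # `rot_dims` dimensions, treated as rot_dims // 2 complex pairs;
    # dimensions beyond rot_dims are returned unchanged.
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x1 * s + x2 * c
    return out
```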
U-Net skip connections
Skip connections used in the architecture.
parameters: null
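U-Net-style skips save activations from the first half of the stack and add them back at mirrored positions in the second half. A minimal sketch (additive combination is an assumption; concatenation is the other common choice):

```python
def unet_forward(x, down_layers, up_layers):
    # Save the output of each "down" layer, then add it back to the
    # input of the mirrored "up" layer (last saved pairs with first up).
    saved = []
    for layer in down_layers:
        x = layer(x)
        saved.append(x)
    for layer in up_layers:
        x = layer(x + saved.pop())
    return x
```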
Regularization
logit softcap
parameters: {"value":30}
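Logit softcapping smoothly bounds the final logits with a tanh, using the record's cap of 30; near zero it is approximately the identity:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Smoothly squash logits into (-cap, cap); for |logit| << cap the
    # function is close to the identity, so small logits pass through.
    return cap * math.tanh(logit / cap)
```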
Weight Averaging
EMA
parameters: {"decay":0.9965}
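EMA weight averaging maintains a shadow copy of the parameters updated after every step with the record's decay of 0.9965:

```python
def ema_update(avg, new, decay: float = 0.9965):
    # Shadow weights: avg <- decay * avg + (1 - decay) * new, applied
    # elementwise after each optimizer step; evaluation uses `avg`.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```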
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
AdamW
weight_decay: 0.095
momentum: 0.9
other_params: {"mlr":0.022}
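A single scalar AdamW step for reference, plugging in the record's weight decay (0.095), momentum (0.9), and "mlr" (0.022) as the learning rate; the second-moment beta and epsilon are conventional defaults, not from the record:

```python
def adamw_step(w, g, m, v, t, lr=0.022, b1=0.9, b2=0.95, eps=1e-8, wd=0.095):
    # One AdamW step with decoupled weight decay: the wd * w term is
    # applied directly to the weight, outside the adaptive update.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)   # bias correction, t is 1-indexed
    vhat = v / (1 - b2 ** t)
    w = w - lr * (mhat / (vhat ** 0.5 + eps) + wd * w)
    return w, m, v
```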
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
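A sketch of the score-first TTT loop, assuming "score-first" means each chunk is scored with the current weights before the model adapts to it (so evaluation never sees a chunk it has already trained on); the hook names are illustrative:

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs: int = 3):
    # For each evaluation chunk: record its loss under the current
    # weights FIRST, then take `epochs` adaptation passes over it
    # (e.g. gradient steps at lr=0.005 per the record) before moving on.
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))
        for _ in range(epochs):
            update_fn(chunk)
    return losses
```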
Evaluation
sliding window eval
parameters: null
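Sliding-window evaluation gives every token long left context by overlapping windows but scoring each token exactly once. A sketch of the window/score-range bookkeeping (window and stride sizes are illustrative, not from the record):

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 512):
    # Yield (start, score_start, end): the window covers [start, end)
    # for context, but only tokens in [score_start, end) are scored,
    # skipping the overlap already scored by the previous window.
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_start = start if start == 0 else start + (window - stride)
        yield (start, score_start, end)
        if end == n_tokens:
            break
        start += stride
```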
LR Schedule
warmdown
parameters: {"warmdown":0.72}
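A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final fraction of training (0.72 here); linear decay is assumed, as the record does not state the decay shape:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.72) -> float:
    # Constant LR for the first (1 - warmdown_frac) of training, then
    # a linear ramp down to zero over the remaining steps.
    decay_start = int(total_steps * (1.0 - warmdown_frac))
    if step < decay_start:
        return base_lr
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return base_lr * (1.0 - frac)
```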
Other
other
Rank-1 int8 LoRA residual added to the tied token embedding after GPTQ rounding.
parameters: {"rank":1}
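The rank-1 residual idea can be sketched as: take the error left over after GPTQ rounding of the embedding, fit its best rank-1 approximation, and store the two factors in int8. The power-iteration fit and symmetric int8 scheme below are illustrative choices, not necessarily what the submission does:

```python
def rank1_int8_residual(residual, iters: int = 50):
    # residual: n x d list-of-rows matrix (full-precision embedding
    # minus its GPTQ-rounded version). Returns (scale_u, u_int8,
    # scale_v, v_int8) such that residual ~= (su*u8) outer (sv*v8).
    n, d = len(residual), len(residual[0])
    v = [1.0] * d
    u = [0.0] * n
    for _ in range(iters):
        # Power iteration for the top singular pair; sigma folds into v.
        u = [sum(residual[i][j] * v[j] for j in range(d)) for i in range(n)]
        norm = sum(x * x for x in u) ** 0.5
        if norm == 0:
            break
        u = [x / norm for x in u]
        v = [sum(residual[i][j] * u[i] for i in range(n)) for j in range(d)]

    def q8(vec):
        # Symmetric per-vector int8 quantization of a factor.
        s = max(abs(x) for x in vec) / 127.0
        if s == 0:
            s = 1.0
        return s, [max(-127, min(127, round(x / s))) for x in vec]

    su, u8 = q8(u)
    sv, v8 = q8(v)
    return su, u8, sv, v8
```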
other
Hessian-weighted shrinkage during GPTQ rounding with an extended zero-zone for low-Hessian columns.
parameters: {"thresh":0.55,"h_cutoff":0.5}
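A per-weight sketch of the extended zero-zone, assuming "h_cutoff": 0.5 is a threshold on the (normalized) Hessian diagonal and "thresh": 0.55 widens the round-to-zero interval for those low-Hessian columns; both interpretations are guesses at the record's semantics:

```python
def shrinkage_round(w: float, h_diag: float, scale: float,
                    thresh: float = 0.55, h_cutoff: float = 0.5) -> int:
    # GPTQ-style rounding with an extended zero-zone: when a column's
    # Hessian diagonal is small (low quantization sensitivity), any
    # weight with |w/scale| < thresh is shrunk to the zero level
    # instead of being rounded to the nearest grid point.
    q = w / scale
    if h_diag < h_cutoff and abs(q) < thresh:
        return 0
    return round(q)
```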

Novel Contributions

  • Rank-1 int8 LoRA residual on the tied token embedding
  • Hessian-weighted shrinkage in GPTQ rounding for low-Hessian columns
  • Applied both additions only at the GPTQ quantization stage on the tied embedding
  • Single-seed non-record extension of the bigbag SOTA stack