PR #1759

open

Non-record: SP8192 + LoRA on tied embedding (1.07994, 1 seed)

by yijieyuan
val_bpb
1.0799
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 8
scope: tied embedding
Architecture
weight tying
Tied token embeddings used in the model.
parameters: null
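With weight tying, the token-embedding matrix is reused as the output projection, so the model stores a single (vocab × d_model) matrix. A minimal sketch (class name and initialization are illustrative, not from this submission):

```python
class TiedLM:
    # Weight tying: the token-embedding matrix doubles as the output
    # head, so embedding and unembedding share one parameter matrix.
    def __init__(self, vocab: int, d_model: int):
        # Toy deterministic init; a real model would train these weights.
        self.emb = [[0.01 * (i + j) for j in range(d_model)] for i in range(vocab)]

    def embed(self, token: int):
        return self.emb[token]

    def logits(self, h):
        # Output head reuses self.emb: logit_t = <h, emb[t]>.
        return [sum(a * b for a, b in zip(h, row)) for row in self.emb]
```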
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: {"slope":0.5}
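The record's slope of 0.5 is unusually large for a LeakyReLU; the activation itself is standard:

```python
def leaky_relu(x: float, slope: float = 0.5) -> float:
    # LeakyReLU: identity for positive inputs, scaled-down pass-through
    # for negative inputs (slope 0.5 taken from this record).
    return x if x > 0 else slope * x
```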
depth recurrence
Recurrent reuse of layers to create virtual depth.
parameters: {"layers":3,"activate_at_frac":0.35}
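A minimal sketch of depth recurrence, assuming "layers": 3 means three passes over the shared stack and "activate_at_frac": 0.35 means recurrence is switched on once training passes 35% of total steps (both interpretations are assumptions, not confirmed by the record):

```python
def recurrent_forward(x, layers, recurrences: int = 3, active: bool = True):
    # Reuse the same layer stack several times to create virtual depth.
    # `active` would be flipped on once training passes the
    # activate_at_frac (0.35) point, per the assumed semantics above.
    passes = recurrences if active else 1
    for _ in range(passes):
        for layer in layers:
            x = layer(x)
    return x
```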
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
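Partial RoPE rotates only the first 16 of 64 per-head dimensions and passes the rest through unrotated. A sketch over a flat feature list (pairing convention and base frequency are standard RoPE assumptions, not taken from the record):

```python
import math

def partial_rope(x, pos: int, rot_dims: int = 16, base: float = 10000.0):
    # x: per-head feature vector (64 dims here). Rotate only the first
    # `rot_dims` dimensions, treated as rot_dims // 2 complex pairs;
    # dimensions beyond rot_dims are returned unchanged.
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x1 * s + x2 * c
    return out
```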
U-Net skip connections
Skip connections used in the architecture.
parameters: null
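U-Net-style skips save activations from the first half of the stack and add them back at mirrored positions in the second half. A minimal sketch (additive combination is an assumption; concatenation is the other common choice):

```python
def unet_forward(x, down_layers, up_layers):
    # Save the output of each "down" layer, then add it back to the
    # input of the mirrored "up" layer (last saved pairs with first up).
    saved = []
    for layer in down_layers:
        x = layer(x)
        saved.append(x)
    for layer in up_layers:
        x = layer(x + saved.pop())
    return x
```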
Regularization
logit softcap
parameters: {"value":30}
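Logit softcapping smoothly bounds the final logits with a tanh, using the record's cap of 30; near zero it is approximately the identity:

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Smoothly squash logits into (-cap, cap); for |logit| << cap the
    # function is close to the identity, so small logits pass through.
    return cap * math.tanh(logit / cap)
```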
Weight Averaging
EMA
parameters: {"decay":0.9965}
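EMA weight averaging maintains a shadow copy of the parameters updated after every step with the record's decay of 0.9965:

```python
def ema_update(avg, new, decay: float = 0.9965):
    # Shadow weights: avg <- decay * avg + (1 - decay) * new, applied
    # elementwise after each optimizer step; evaluation uses `avg`.
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```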
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
AdamW
weight_decay: 0.095
momentum: 0.9
other_params: {"mlr":0.022}
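A single scalar AdamW step for reference, plugging in the record's weight decay (0.095), momentum (0.9), and "mlr" (0.022) as the learning rate; the second-moment beta and epsilon are conventional defaults, not from the record:

```python
def adamw_step(w, g, m, v, t, lr=0.022, b1=0.9, b2=0.95, eps=1e-8, wd=0.095):
    # One AdamW step with decoupled weight decay: the wd * w term is
    # applied directly to the weight, outside the adaptive update.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)   # bias correction, t is 1-indexed
    vhat = v / (1 - b2 ** t)
    w = w - lr * (mhat / (vhat ** 0.5 + eps) + wd * w)
    return w, m, v
```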
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_size":32000}
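A sketch of the score-first TTT loop, assuming "score-first" means each chunk is scored with the current weights before the model adapts to it (so evaluation never sees a chunk it has already trained on); the hook names are illustrative:

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs: int = 3):
    # For each evaluation chunk: record its loss under the current
    # weights FIRST, then take `epochs` adaptation passes over it
    # (e.g. gradient steps at lr=0.005 per the record) before moving on.
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))
        for _ in range(epochs):
            update_fn(chunk)
    return losses
```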
Evaluation
sliding window eval
parameters: null
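Sliding-window evaluation gives every token long left context by overlapping windows but scoring each token exactly once. A sketch of the window/score-range bookkeeping (window and stride sizes are illustrative, not from the record):

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 512):
    # Yield (start, score_start, end): the window covers [start, end)
    # for context, but only tokens in [score_start, end) are scored,
    # skipping the overlap already scored by the previous window.
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_start = start if start == 0 else start + (window - stride)
        yield (start, score_start, end)
        if end == n_tokens:
            break
        start += stride
```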
LR Schedule
warmdown
parameters: {"warmdown":0.72}
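A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final fraction of training (0.72 here); linear decay is assumed, as the record does not state the decay shape:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_frac: float = 0.72) -> float:
    # Constant LR for the first (1 - warmdown_frac) of training, then
    # a linear ramp down to zero over the remaining steps.
    decay_start = int(total_steps * (1.0 - warmdown_frac))
    if step < decay_start:
        return base_lr
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return base_lr * (1.0 - frac)
```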
Other
other
Rank-1 int8 LoRA residual added to the tied token embedding after GPTQ rounding.
parameters: {"rank":1}
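The rank-1 residual idea can be sketched as: take the error left over after GPTQ rounding of the embedding, fit its best rank-1 approximation, and store the two factors in int8. The power-iteration fit and symmetric int8 scheme below are illustrative choices, not necessarily what the submission does:

```python
def rank1_int8_residual(residual, iters: int = 50):
    # residual: n x d list-of-rows matrix (full-precision embedding
    # minus its GPTQ-rounded version). Returns (scale_u, u_int8,
    # scale_v, v_int8) such that residual ~= (su*u8) outer (sv*v8).
    n, d = len(residual), len(residual[0])
    v = [1.0] * d
    u = [0.0] * n
    for _ in range(iters):
        # Power iteration for the top singular pair; sigma folds into v.
        u = [sum(residual[i][j] * v[j] for j in range(d)) for i in range(n)]
        norm = sum(x * x for x in u) ** 0.5
        if norm == 0:
            break
        u = [x / norm for x in u]
        v = [sum(residual[i][j] * u[i] for i in range(n)) for j in range(d)]

    def q8(vec):
        # Symmetric per-vector int8 quantization of a factor.
        s = max(abs(x) for x in vec) / 127.0
        if s == 0:
            s = 1.0
        return s, [max(-127, min(127, round(x / s))) for x in vec]

    su, u8 = q8(u)
    sv, v8 = q8(v)
    return su, u8, sv, v8
```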
other
Hessian-weighted shrinkage during GPTQ rounding with an extended zero-zone for low-Hessian columns.
parameters: {"thresh":0.55,"h_cutoff":0.5}
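A per-weight sketch of the extended zero-zone, assuming "h_cutoff": 0.5 is a threshold on the (normalized) Hessian diagonal and "thresh": 0.55 widens the round-to-zero interval for those low-Hessian columns; both interpretations are guesses at the record's semantics:

```python
def shrinkage_round(w: float, h_diag: float, scale: float,
                    thresh: float = 0.55, h_cutoff: float = 0.5) -> int:
    # GPTQ-style rounding with an extended zero-zone: when a column's
    # Hessian diagonal is small (low quantization sensitivity), any
    # weight with |w/scale| < thresh is shrunk to the zero level
    # instead of being rounded to the nearest grid point.
    q = w / scale
    if h_diag < h_cutoff and abs(q) < thresh:
        return 0
    return round(q)
```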

Novel Contributions

  • Rank-1 int8 LoRA residual on the tied token embedding
  • Hessian-weighted shrinkage in GPTQ rounding for low-Hessian columns
  • Applied both additions only at the GPTQ quantization stage on the tied embedding
  • Single-seed non-record extension of the bigbag SOTA stack