| val_bpb | Architecture | Optimizer | Artifact Size |
| ------- | ------------ | --------- | ------------- |
| 1.1248  | Transformer  | Muon      | —             |
## Training Techniques

### Quantization
- **int6** (`bits: 6`, `scope: all`)
- **QAT** (`bits: null`, `scope: all`)
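The record names int6 quantization-aware training but gives no mechanism. A minimal sketch of symmetric per-tensor int6 fake quantization, the usual QAT forward pass (the straight-through backward is omitted; everything here is an assumption, not the submission's code):

```python
def fake_quant_int6(weights):
    """Fake-quantize a list of floats to a symmetric 6-bit grid (-32..31)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 31.0  # largest magnitude maps to +/-31
    quantized = [max(-32, min(31, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]  # dequantize back to float
```

During QAT the model trains against these rounded weights, so the final int6 artifact behaves like the network seen during training.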
### Architecture
- **SmearGate**: custom gating mechanism used in the model
- **BigramHash**: bigram hashing embedding/vocabulary mechanism (`vocab_size: 2048`)
- **MLP3x**: three-layer MLP blocks (`layers: 3`)
- **XSA**: XSA applied to the last layers of the model (`last_n_layers: 4`)
- **RoPE**: rotary positional embeddings with NTK scaling and partial application (`sequence_length: 2048`)
- **Partial RoPE**: applies RoPE to only a subset of dimensions (`dimensions: 16` of `total_dimensions: 64`)
- **Tied embeddings**: uses tied embeddings / value embeddings
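Partial RoPE above rotates only 16 of 64 head dimensions and leaves the rest untouched. A sketch under that reading (plain RoPE with the conventional base 10000; the NTK scaling of the base mentioned above is omitted, and the adjacent-pair scheme is an assumption):

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` by position-dependent
    angles; the remaining dimensions pass through unchanged."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        theta = position * base ** (-2.0 * i / rotary_dims)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out
```

Because each pair is a pure rotation, the vector's norm is preserved while relative positions become recoverable from dot products on the rotated slice.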
### Initialization
- **OrthoInit**: orthogonal initialization with muP
### Optimizer
- **Muon**: `weight_decay: 0.04`, `momentum: 0.99` (warmup: `momentum_warmup_start: 0.92`, `momentum_warmup_steps: 1500`)
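The Muon entry reports a momentum warmup from 0.92 to 0.99 over 1500 steps, but not the warmup's shape. A linear interpolation is one plausible reading:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Assumed linear momentum warmup; held at `end` after warmup."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```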
### Weight Averaging
- **EMA** (`decay: 0.997`)
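EMA weight averaging with decay 0.997 keeps a shadow copy of the weights that moves a fraction (1 − decay) toward the live weights after every optimizer step; evaluation uses the shadow copy. A minimal sketch (flat weight lists stand in for real parameter tensors):

```python
def ema_update(shadow, weights, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```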
### Evaluation
- **Sliding window eval** (`stride: 64`)
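Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context. A sketch of the window bookkeeping (the window length of 2048 is borrowed from the `sequence_length` above and is an assumption here):

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Return (start, end, first_scored) spans: the model reads tokens
    [start, end) but only positions [first_scored, end) count toward loss."""
    spans = []
    pos = 0
    while pos < num_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, num_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans
```

A small stride costs more forward passes but keeps the evaluated context close to the full window for every token.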
### Test-Time Training
- **Online logit bias** (`learning_rate: 0.1`, `momentum: 0.9`)
### LR Schedule
- **Warmdown** (`warmdown_iters: 3000`)
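The schedule names a warmdown over the final 3000 iterations but not its shape; a constant-then-linear-to-zero decay is the common form and is assumed here:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Assumed shape: hold base_lr, then decay linearly to 0 at the end."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```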
### Regularization
- **Weight decay** (`muon_wd: 0.04`, `adam_wd: 0.04`)
### Other
- **Online learned logit bias**: bias vector updated online during validation to correct the logits, using the exact cross-entropy gradient (`olb_lr: 0.1`, `olb_momentum: 0.9`)
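The OLB description pins down the update: the gradient of cross-entropy with respect to an additive logit bias is softmax(logits + bias) minus the one-hot target. A sketch with the reported lr 0.1 and momentum 0.9 (the standard SGD-with-momentum formulation is an assumption):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def olb_step(bias, velocity, logits, target, lr=0.1, momentum=0.9):
    """One online-logit-bias update after observing one validation token."""
    probs = softmax([l + b for l, b in zip(logits, bias)])
    # Exact cross-entropy gradient w.r.t. the bias: probs - onehot(target).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    bias = [b - lr * v for b, v in zip(bias, velocity)]
    return bias, velocity
```

The bias vector adds no model parameters and each update is a single softmax plus vector ops, which matches the "zero-parameter, near-zero-compute" claim below.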
## Novel Contributions
- Online logit bias (OLB) learned during sliding window evaluation
- Exact cross-entropy gradient update for the bias vector
- Zero-parameter, near-zero-compute test-time correction
- Combination of QAT, TTT-style evaluation adaptation, and value/tied embeddings