PR #1430
openRecord: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)
by renqianluo
val_bpb
0.3964
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.90MB
Training Techniques
Architecture
BigramHash
Causal backoff n-gram mixer using hashed n-gram probabilities up to order 22.
parameters: {"order":22,"buckets":4000000}
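The record does not include the mixer's code; as a minimal sketch, one plausible reading of "hashed n-gram probabilities up to order 22 with causal backoff" is a fixed table of 4,000,000 buckets keyed by hashed (order, context) pairs, with prediction backing off from the longest matching context to shorter ones. All names below are hypothetical; a dict stands in for the dense count table.

```python
MAX_ORDER = 22        # values from the record's parameters
BUCKETS = 4_000_000

def bucket(key: tuple) -> int:
    # Hash an (order, *context) key into a fixed table slot.
    return hash(key) % BUCKETS

class HashedNgram:
    def __init__(self):
        self.counts = {}  # sparse stand-in for a dense count table

    def update(self, tokens):
        # Count every n-gram of order 1..MAX_ORDER ending at each position.
        for i in range(len(tokens)):
            for order in range(1, MAX_ORDER + 1):
                if i + 1 < order:
                    break  # not enough history for this order
                ctx = tuple(tokens[i + 1 - order:i])
                hist = self.counts.setdefault(bucket((order,) + ctx), {})
                hist[tokens[i]] = hist.get(tokens[i], 0) + 1

    def predict(self, tokens):
        # Causal backoff: try the longest context first, then shorter orders.
        for order in range(min(MAX_ORDER, len(tokens) + 1), 0, -1):
            ctx = tuple(tokens[len(tokens) - order + 1:])
            hist = self.counts.get(bucket((order,) + ctx))
            if hist:
                total = sum(hist.values())
                return {t: c / total for t, c in hist.items()}
        return {}
```

Hash collisions between different contexts sharing a bucket are tolerated by design; the bucket count trades accuracy against the artifact-size budget.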
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"steps":24,"lr_start":0.432,"lr_end":0.001,"beta1":0.6,"beta2":0.5,"batch_size":128}
LR Schedule
cosine decay
parameters: {"lr_start":0.432,"lr_end":0.001}
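With the schedule's endpoints and the optimizer's 24 steps from the record, a standard cosine decay (function name assumed) looks like:

```python
import math

def cosine_lr(step: int, steps: int = 24,
              lr_start: float = 0.432, lr_end: float = 0.001) -> float:
    # Cosine decay from lr_start at step 0 down to lr_end at the final step.
    t = step / max(steps - 1, 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))
```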
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":"0-9","second_pass_fraction":0.1,"floor_lr":0.0001}
Quantization
GPTQ
bits: 6
scope: model weights
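GPTQ proper compensates rounding error column-by-column using second-order (Hessian) information, which is beyond a short sketch. As a hedged stand-in, the snippet below shows only the symmetric 6-bit round-to-grid step that determines the storage cost (int6 range -32..31); names and scales are assumed, not from the record.

```python
import numpy as np

def quantize6(w, bits=6):
    # Symmetric round-to-nearest onto a 6-bit integer grid.
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                 # int6 values held in int8

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```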
Other
other
Per-sample SLOT optimization with sequence-specific hidden delta and logit bias.
parameters: {"hidden_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params":1536}
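The shapes imply 512 + 1024 = 1536 trainable values per sequence: a delta added to the hidden state before the output projection, plus a bias on the logits, with the base model frozen. A minimal NumPy sketch of one SLOT gradient step on cross-entropy (all function names assumed; the record's actual update rule is not given):

```python
import numpy as np

HIDDEN, VOCAB = 512, 1024   # shapes from the record's parameters

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_logits(hidden, W_out, delta, bias):
    # hidden: [T, HIDDEN] frozen activations; delta [HIDDEN] and bias [VOCAB]
    # are the only per-sample parameters.
    return (hidden + delta) @ W_out + bias          # [T, VOCAB]

def slot_step(hidden, W_out, targets, delta, bias, lr=0.01):
    # One SGD step on mean cross-entropy w.r.t. the 1536 per-sample params.
    p = softmax(slot_logits(hidden, W_out, delta, bias))
    p[np.arange(len(targets)), targets] -= 1.0      # dCE/dlogits
    p /= len(targets)
    bias -= lr * p.sum(axis=0)                      # [VOCAB]
    delta -= lr * (p @ W_out.T).sum(axis=0)         # [HIDDEN]
    return delta, bias
```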
other
Multi-token prediction with 2 heads and auxiliary loss.
parameters: {"heads":2,"loss_weight":0.1}
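With two heads, the natural reading is that head 1 predicts the next token, head 2 the token after that, and the second head's loss enters with weight 0.1. A sketch of the combined loss (function name assumed):

```python
def mtp_loss(losses_head1, losses_head2, loss_weight=0.1):
    # losses_head*: per-position cross-entropies from each prediction head.
    # The auxiliary (second-token) head is down-weighted by loss_weight.
    main = sum(losses_head1) / len(losses_head1)
    aux = sum(losses_head2) / len(losses_head2)
    return main + loss_weight * aux
```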
Novel Contributions
- Per-sample SLOT with sequence-specific hidden delta and logit bias
- Causal backoff n-gram mixer with entropy-adaptive blending
- Order-22 hashed n-gram table within artifact budget
- Test-time training with a second pass over the first 10% of chunks
- GPTQ int6 quantization with damp=0.005
- Multi-token prediction with two heads
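The "entropy-adaptive blending" in the contribution list is not specified further; one plausible form, sketched here purely as an assumption, gates the n-gram mixer by its own confidence: a low-entropy (peaked) n-gram distribution gets high weight, a near-uniform one falls back to the transformer.

```python
import math

def blend(p_ngram, p_model, max_entropy):
    # Entropy-adaptive gate (hypothetical): entropy 0 -> weight 1 on the
    # n-gram distribution; max_entropy -> weight 0 (transformer only).
    h = -sum(p * math.log(p) for p in p_ngram if p > 0)
    w = max(0.0, 1.0 - h / max_entropy)
    return [w * a + (1.0 - w) * b for a, b in zip(p_ngram, p_model)]
```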