PR #2019 (open)

Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)

by aquariouseworkman
val_bpb: 1.0585
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,985,934 bytes

Training Techniques

Quantization
  GPTQ-lite
    bits: 8
    scope: mixed-precision weights
  int8
    bits: 8
    scope: salient groups
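
A minimal sketch of the salient-group int8 idea, assuming per-group symmetric quantization and an activation-magnitude saliency criterion; function and parameter names are illustrative, not the PR's actual implementation:

```python
import torch

def quantize_salient_groups(weight, act_scale, group_size=128, salient_frac=0.1):
    """Per-group symmetric int8 quantization. Groups with the largest mean
    activation magnitude are marked salient (hypothetical criterion) and are
    kept exact in this sketch; the record's exact promotion policy may differ."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # importance of each input group, from mean activation magnitude
    imp = act_scale.reshape(in_features // group_size, group_size).abs().mean(dim=-1)
    n_salient = max(1, int(salient_frac * imp.numel()))
    salient = torch.zeros_like(imp, dtype=torch.bool)
    salient[imp.topk(n_salient).indices] = True

    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127)
    deq = q * scale
    # salient groups are left unquantized here ("promotion" to higher precision)
    deq[:, salient, :] = w[:, salient, :]
    return deq.reshape(out_features, in_features), salient
```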
Regularization
  logit softcap
    parameters: {"asymmetric": true}
Test-Time Training
  LoRA TTT
    parameters: {"rank": 56, "mask": "no_qv", "local_lr_mult": 0.75}
Architecture
  Gated Attention
    Smear gate and sparse attention gating used in the model pipeline
    parameters: {"smear_gate_enabled": true, "sparse_attn_gate_enabled": true, "sparse_attn_gate_scale": 0.5}
  Gated Attention
    QK gain initialization for attention heads
    parameters: {"qk_gain_init": 5.25}
Evaluation
  long-context eval
    parameters: {"context_length": 2560}
Sequence Length
  train_length: null
  eval_length: 2560
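
A minimal sketch of a bits-per-byte evaluation at the 2560-token context, assuming the model returns logits for a batch of token ids and that the data loader supplies the byte count of each sequence; the harness shown here is hypothetical:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, token_stream, context_length=2560, device="cuda"):
    """token_stream yields (seq, n_bytes) where seq has context_length + 1
    token ids; returns negative log-likelihood in bits per byte."""
    total_nll, total_bytes = 0.0, 0
    for seq, n_bytes in token_stream:
        seq = seq.to(device)
        logits = model(seq[:-1].unsqueeze(0))           # (1, T, vocab)
        nll = F.cross_entropy(logits.squeeze(0), seq[1:], reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes
    return total_nll / math.log(2) / total_bytes        # bits per byte
```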
Other
  Phased TTT with a larger global-TTT prefix during evaluation
    parameters: {"phases": 3, "prefix_docs": 3000}
Optimizer
  AdamW
    weight_decay: 0.5
    momentum: null
    other_params: {"beta2": 0.99, "grad_clip_norm": 0.3, "min_lr": 0.1, "matrix_lr": 0.026, "warmdown_frac": 0.85}
LR Schedule
  warmdown
    parameters: {"warmdown_frac": 0.85}

Novel Contributions

  • AWQ-lite mixed-precision GPTQ with salient-group int8 promotion
  • Asymmetric Logit Rescale with learnable positive/negative softcap during TTT evaluation
  • no_qv TTT mask that disables Q/V LoRA while keeping K/MLP/O
  • Long-context evaluation at sequence length 2560
  • Phased TTT with a larger global-TTT prefix and reduced LoRA rank