PR #2019 (open)

Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)

by aquariouseworkman
val_bpb: 1.0585
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,985,934 bytes

Training Techniques

Quantization
  GPTQ-lite
    bits: 8
    scope: mixed-precision weights
  int8
    bits: 8
    scope: salient groups
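
A minimal sketch of the salient-group int8 idea, assuming per-group symmetric quantization and an activation-magnitude saliency criterion; function and parameter names are illustrative, not the PR's actual implementation:

```python
import torch

def quantize_salient_groups(weight, act_scale, group_size=128, salient_frac=0.1):
    """Per-group symmetric int8 quantization. Groups with the largest mean
    activation magnitude are marked salient (hypothetical criterion) and are
    kept exact in this sketch; the record's exact promotion policy may differ."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # importance of each input group, from mean activation magnitude
    imp = act_scale.reshape(in_features // group_size, group_size).abs().mean(dim=-1)
    n_salient = max(1, int(salient_frac * imp.numel()))
    salient = torch.zeros_like(imp, dtype=torch.bool)
    salient[imp.topk(n_salient).indices] = True

    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127)
    deq = q * scale
    # salient groups are left unquantized here ("promotion" to higher precision)
    deq[:, salient, :] = w[:, salient, :]
    return deq.reshape(out_features, in_features), salient
```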
Regularization
  logit softcap
    parameters: {"asymmetric": true}
Test-Time Training
  LoRA TTT
    parameters: {"rank": 56, "mask": "no_qv", "local_lr_mult": 0.75}
Architecture
  Gated Attention
    Smear gate and sparse attention gating used in the model pipeline
    parameters: {"smear_gate_enabled": true, "sparse_attn_gate_enabled": true, "sparse_attn_gate_scale": 0.5}
  Gated Attention
    QK gain initialization for attention heads
    parameters: {"qk_gain_init": 5.25}
Evaluation
  long-context eval
    parameters: {"context_length": 2560}
Sequence Length
  train_length: null
  eval_length: 2560
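
A minimal sketch of a bits-per-byte evaluation at the 2560-token context, assuming the model returns logits for a batch of token ids and that the data loader supplies the byte count of each sequence; the harness shown here is hypothetical:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb(model, token_stream, context_length=2560, device="cuda"):
    """token_stream yields (seq, n_bytes) where seq has context_length + 1
    token ids; returns negative log-likelihood in bits per byte."""
    total_nll, total_bytes = 0.0, 0
    for seq, n_bytes in token_stream:
        seq = seq.to(device)
        logits = model(seq[:-1].unsqueeze(0))           # (1, T, vocab)
        nll = F.cross_entropy(logits.squeeze(0), seq[1:], reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes
    return total_nll / math.log(2) / total_bytes        # bits per byte
```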
Other
  Phased TTT with a larger global-TTT prefix during evaluation
    parameters: {"phases": 3, "prefix_docs": 3000}
Optimizer
  AdamW
    weight_decay: 0.5
    momentum: null
    other_params: {"beta2": 0.99, "grad_clip_norm": 0.3, "min_lr": 0.1, "matrix_lr": 0.026, "warmdown_frac": 0.85}
LR Schedule
  warmdown
    parameters: {"warmdown_frac": 0.85}

Novel Contributions

  • AWQ-lite mixed-precision GPTQ with salient-group int8 promotion
  • Asymmetric Logit Rescale with learnable positive/negative softcap during TTT evaluation
  • no_qv TTT mask that disables Q/V LoRA while keeping K/MLP/O
  • Long-context evaluation at sequence length 2560
  • Phased TTT with a larger global-TTT prefix and reduced LoRA rank