PR #614

closed

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3

val_bpb: 0.6864
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15.53 MB

Training Techniques

Architecture
  • MLP3x: uses a 3x MLP expansion in the model architecture.
  • SmearGate: includes SmearGate as an architectural component.
  • BigramHash: adds BigramHash with 2048 buckets/features ({"dimensions": 2048}).
  • U-Net skip connections: uses U-Net style skip connections in the model.
  • GQA: grouped-query attention with 8 query heads and 4 key/value heads ({"query_heads": 8, "kv_heads": 4}).
  • FlashAttention-3: uses flash attention for causal attention, with a Rotary cache clone fix for CUDA graph compatibility.
  • K-Projection LoRA: applies LoRA to K projections in addition to Q/V, with a reduced learning-rate multiplier ({"k_lr_multiplier": 0.3}).
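A minimal NumPy sketch of the K-Projection LoRA idea: the frozen K projection gets a trainable low-rank delta, and the K-side adapters train at 0.3x the rate of Q/V. The shapes, `alpha` scaling, and base learning rate here are illustrative assumptions, not the record's actual configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen projection W plus a trainable low-rank delta (B @ A), scaled by alpha/rank."""
    rank = A.shape[0]
    return x @ W.T + (x @ A.T) @ B.T * (alpha / rank)

rng = np.random.default_rng(0)
d_model, rank = 64, 8
x = rng.normal(size=(4, d_model))            # 4 token embeddings (illustrative)
W_k = rng.normal(size=(d_model, d_model))    # frozen K projection weight
A = rng.normal(size=(rank, d_model)) * 0.01  # LoRA down-projection (trainable)
B = np.zeros((d_model, rank))                # LoRA up-projection, zero-init (trainable)

k = lora_forward(x, W_k, A, B)  # identical to x @ W_k.T at init, since B is zero

# K-side adapters train at a reduced rate, per the record's k_lr_multiplier of 0.3:
base_lr = 0.01  # assumed for illustration
adapter_lrs = {"q": base_lr, "v": base_lr, "k": base_lr * 0.3}
```

Zero-initializing B means the adapted model starts out exactly equal to the base model; only the optimizer's per-group learning rates distinguish K from Q/V.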
Optimizer
  • Muon: weight_decay: null, momentum: null, other_params: {"compiled": true, "newton_schulz": true}.
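Muon's distinguishing step is orthogonalizing each momentum-averaged gradient matrix with a Newton-Schulz iteration. The sketch below uses the classic cubic iteration on a synthetic matrix with a known spectrum; Muon's actual implementation uses a tuned quintic polynomial, so treat this as an illustration of the principle rather than the exact update.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=25, eps=1e-7):
    """Approximate the nearest orthogonal matrix to G (its polar factor) with the
    classic cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Frobenius normalization first keeps all singular values <= 1 so the
    iteration converges (each singular value is driven toward 1)."""
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Synthetic gradient-like matrix with known, well-separated singular values.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U @ np.diag([3.0, 1.0, 0.5, 0.1]) @ V.T

O = newton_schulz_orthogonalize(G)  # O @ O.T is approximately the identity
```

Replacing the raw update with its orthogonalized form equalizes the scale of all update directions, which is the property Muon exploits.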
Weight Averaging
  • EMA ({"decay": 0.999, "every_steps": 10})
  • SWA
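The EMA entry's two parameters compose straightforwardly: blend the live weights into a shadow copy with decay 0.999, but only on every 10th optimizer step. A plain-Python sketch (parameter names and training-loop shape are assumptions):

```python
def maybe_update_ema(ema, params, step, decay=0.999, every_steps=10):
    """Blend current weights into the EMA copy every `every_steps` optimizer steps."""
    if step % every_steps == 0:
        for name, value in params.items():
            ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

params = {"w": 1.0}   # stand-in for trained weights
ema = {"w": 0.0}
for step in range(1, 101):  # 100 optimizer steps -> 10 EMA updates
    ema = maybe_update_ema(ema, params, step)
```

After k updates toward a fixed target the shadow weight equals 1 - decay^k, so updating only every 10 steps effectively trades a slower-moving average for a 10x cheaper bookkeeping cost.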
Quantization
  • int6 (bits: 6, scope: all)
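A hedged sketch of symmetric per-tensor int6 quantization as applied to all weights here: values map to the signed 6-bit range with a single float scale. The record presumably packs the 6-bit codes for its 15.53 MB artifact; this sketch stores them in int8 for simplicity.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: codes in [-31, 31] plus a float scale."""
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)  # round-trip error is at most scale / 2 per weight
```

Per-channel scales or asymmetric zero points are common refinements; whether the record uses them is not stated.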
Compression
  • zstd (level: 22)
Test-Time Training
  • LoRA TTT ({"rank_qkv": 8, "rank_lm_head": 16, "learning_rate": 0.01, "epochs": 6, "batch_docs_per_gpu": 64, "temperature": 0.98, "deadline_seconds": 550, "per_layer_lr_multipliers": {"lm_head": 2, "v": 1.5, "q": 0.5, "k": 0.3, "bias": 3}})
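The per-layer learning-rate multipliers can be turned into optimizer parameter groups by matching parameter names against the multiplier keys. The name-matching scheme and parameter names below are assumptions for illustration; only the multiplier table and the 0.01 base rate come from the record.

```python
TTT_LR_MULTIPLIERS = {"lm_head": 2, "v": 1.5, "q": 0.5, "k": 0.3, "bias": 3}

def ttt_learning_rate(param_name, base_lr=0.01, multipliers=TTT_LR_MULTIPLIERS):
    """Scale the base TTT learning rate by the first matching per-layer multiplier.
    Matches whole dotted-name components (e.g. 'k_lora' matches key 'k') to avoid
    accidental substring hits like the 'k' in 'blocks'."""
    parts = param_name.split(".")
    for key, mult in multipliers.items():
        if any(p == key or p.startswith(key + "_") for p in parts):
            return base_lr * mult
    return base_lr

lrs = {name: ttt_learning_rate(name) for name in [
    "blocks.0.attn.q_lora.A",   # hypothetical parameter names
    "blocks.0.attn.k_lora.A",
    "blocks.0.attn.v_lora.A",
    "lm_head.lora.A",
]}
```

Note the K adapters end up at 0.3x the base rate during TTT as well, consistent with the K-Projection LoRA entry above.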
LR Schedule
  • cosine decay ({"per_step": true})
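Per-step cosine decay in its standard form, sketched below; the 0.01 base rate is taken from the TTT parameters, while the step count and floor value are illustrative assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Per-step cosine decay from base_lr at step 0 to min_lr at the final step."""
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Evaluate the schedule across an assumed 100-step TTT run.
lrs = [cosine_lr(s, 100) for s in range(101)]
```

Decaying per step (rather than per epoch) gives a smooth ramp even over the short 6-epoch TTT runs this record uses.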
Evaluation
  • min-NLL epoch selection ({"select_best_epoch_per_document": true})
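Min-NLL epoch selection reduces to a per-document argmin over the NLLs recorded after each TTT epoch. A plain-Python sketch (data layout is an assumption):

```python
def select_best_epochs(nll_per_doc):
    """For each document, pick the TTT epoch whose NLL is lowest, guarding
    against late-epoch overfitting on that document."""
    picks = []
    for doc_nlls in nll_per_doc:  # NLL of one document after each TTT epoch
        best_epoch = min(range(len(doc_nlls)), key=doc_nlls.__getitem__)
        picks.append((best_epoch, doc_nlls[best_epoch]))
    return picks

# Illustrative NLLs across 3 epochs:
# doc 0 keeps improving; doc 1 overfits after epoch 1.
nlls = [[2.0, 1.8, 1.7],
        [2.2, 1.9, 2.1]]
picks = select_best_epochs(nlls)
```

Because the choice is per document, a document that overfits early keeps its epoch-1 score while others still benefit from all 6 epochs.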

Novel Contributions

  • K-Projection LoRA applied to K projections with a 0.3x learning-rate multiplier
  • Min-NLL epoch selection across TTT epochs to avoid late-epoch overfitting
  • FlashAttention-3 causal attention integration
  • Rotary cache clone fix for CUDA graph compatibility