PR #614

closed

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3

val_bpb: 0.6864
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15.53 MB

Training Techniques

Architecture
  • MLP3x: uses a 3x MLP expansion in the model architecture.
  • SmearGate: includes SmearGate as an architectural component.
  • BigramHash: adds BigramHash with 2048 buckets/features ({"dimensions": 2048}).
  • U-Net skip connections: uses U-Net style skip connections in the model.
  • GQA: grouped-query attention with 8 query heads and 4 key/value heads ({"query_heads": 8, "kv_heads": 4}).
  • FlashAttention-3: uses flash attention for causal attention, with a Rotary cache clone fix for CUDA graph compatibility.
  • K-Projection LoRA: applies LoRA to K projections in addition to Q/V, with a reduced learning-rate multiplier ({"k_lr_multiplier": 0.3}).
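A minimal NumPy sketch of the K-Projection LoRA idea: the frozen K projection gets a trainable low-rank delta, and the K-side adapters train at 0.3x the rate of Q/V. The shapes, `alpha` scaling, and base learning rate here are illustrative assumptions, not the record's actual configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen projection W plus a trainable low-rank delta (B @ A), scaled by alpha/rank."""
    rank = A.shape[0]
    return x @ W.T + (x @ A.T) @ B.T * (alpha / rank)

rng = np.random.default_rng(0)
d_model, rank = 64, 8
x = rng.normal(size=(4, d_model))            # 4 token embeddings (illustrative)
W_k = rng.normal(size=(d_model, d_model))    # frozen K projection weight
A = rng.normal(size=(rank, d_model)) * 0.01  # LoRA down-projection (trainable)
B = np.zeros((d_model, rank))                # LoRA up-projection, zero-init (trainable)

k = lora_forward(x, W_k, A, B)  # identical to x @ W_k.T at init, since B is zero

# K-side adapters train at a reduced rate, per the record's k_lr_multiplier of 0.3:
base_lr = 0.01  # assumed for illustration
adapter_lrs = {"q": base_lr, "v": base_lr, "k": base_lr * 0.3}
```

Zero-initializing B means the adapted model starts out exactly equal to the base model; only the optimizer's per-group learning rates distinguish K from Q/V.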
Optimizer
  • Muon: weight_decay: null, momentum: null, other_params: {"compiled": true, "newton_schulz": true}.
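Muon's distinguishing step is orthogonalizing each momentum-averaged gradient matrix with a Newton-Schulz iteration. The sketch below uses the classic cubic iteration on a synthetic matrix with a known spectrum; Muon's actual implementation uses a tuned quintic polynomial, so treat this as an illustration of the principle rather than the exact update.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=25, eps=1e-7):
    """Approximate the nearest orthogonal matrix to G (its polar factor) with the
    classic cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Frobenius normalization first keeps all singular values <= 1 so the
    iteration converges (each singular value is driven toward 1)."""
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Synthetic gradient-like matrix with known, well-separated singular values.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U @ np.diag([3.0, 1.0, 0.5, 0.1]) @ V.T

O = newton_schulz_orthogonalize(G)  # O @ O.T is approximately the identity
```

Replacing the raw update with its orthogonalized form equalizes the scale of all update directions, which is the property Muon exploits.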
Weight Averaging
  • EMA ({"decay": 0.999, "every_steps": 10})
  • SWA
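The EMA entry's two parameters compose straightforwardly: blend the live weights into a shadow copy with decay 0.999, but only on every 10th optimizer step. A plain-Python sketch (parameter names and training-loop shape are assumptions):

```python
def maybe_update_ema(ema, params, step, decay=0.999, every_steps=10):
    """Blend current weights into the EMA copy every `every_steps` optimizer steps."""
    if step % every_steps == 0:
        for name, value in params.items():
            ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

params = {"w": 1.0}   # stand-in for trained weights
ema = {"w": 0.0}
for step in range(1, 101):  # 100 optimizer steps -> 10 EMA updates
    ema = maybe_update_ema(ema, params, step)
```

After k updates toward a fixed target the shadow weight equals 1 - decay^k, so updating only every 10 steps effectively trades a slower-moving average for a 10x cheaper bookkeeping cost.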
Quantization
  • int6 (bits: 6, scope: all)
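A hedged sketch of symmetric per-tensor int6 quantization as applied to all weights here: values map to the signed 6-bit range with a single float scale. The record presumably packs the 6-bit codes for its 15.53 MB artifact; this sketch stores them in int8 for simplicity.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: codes in [-31, 31] plus a float scale."""
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)  # round-trip error is at most scale / 2 per weight
```

Per-channel scales or asymmetric zero points are common refinements; whether the record uses them is not stated.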
Compression
  • zstd (level: 22)
Test-Time Training
  • LoRA TTT ({"rank_qkv": 8, "rank_lm_head": 16, "learning_rate": 0.01, "epochs": 6, "batch_docs_per_gpu": 64, "temperature": 0.98, "deadline_seconds": 550, "per_layer_lr_multipliers": {"lm_head": 2, "v": 1.5, "q": 0.5, "k": 0.3, "bias": 3}})
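The per-layer learning-rate multipliers can be turned into optimizer parameter groups by matching parameter names against the multiplier keys. The name-matching scheme and parameter names below are assumptions for illustration; only the multiplier table and the 0.01 base rate come from the record.

```python
TTT_LR_MULTIPLIERS = {"lm_head": 2, "v": 1.5, "q": 0.5, "k": 0.3, "bias": 3}

def ttt_learning_rate(param_name, base_lr=0.01, multipliers=TTT_LR_MULTIPLIERS):
    """Scale the base TTT learning rate by the first matching per-layer multiplier.
    Matches whole dotted-name components (e.g. 'k_lora' matches key 'k') to avoid
    accidental substring hits like the 'k' in 'blocks'."""
    parts = param_name.split(".")
    for key, mult in multipliers.items():
        if any(p == key or p.startswith(key + "_") for p in parts):
            return base_lr * mult
    return base_lr

lrs = {name: ttt_learning_rate(name) for name in [
    "blocks.0.attn.q_lora.A",   # hypothetical parameter names
    "blocks.0.attn.k_lora.A",
    "blocks.0.attn.v_lora.A",
    "lm_head.lora.A",
]}
```

Note the K adapters end up at 0.3x the base rate during TTT as well, consistent with the K-Projection LoRA entry above.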
LR Schedule
  • cosine decay ({"per_step": true})
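Per-step cosine decay in its standard form, sketched below; the 0.01 base rate is taken from the TTT parameters, while the step count and floor value are illustrative assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Per-step cosine decay from base_lr at step 0 to min_lr at the final step."""
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Evaluate the schedule across an assumed 100-step TTT run.
lrs = [cosine_lr(s, 100) for s in range(101)]
```

Decaying per step (rather than per epoch) gives a smooth ramp even over the short 6-epoch TTT runs this record uses.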
Evaluation
  • min-NLL epoch selection ({"select_best_epoch_per_document": true})
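Min-NLL epoch selection reduces to a per-document argmin over the NLLs recorded after each TTT epoch. A plain-Python sketch (data layout is an assumption):

```python
def select_best_epochs(nll_per_doc):
    """For each document, pick the TTT epoch whose NLL is lowest, guarding
    against late-epoch overfitting on that document."""
    picks = []
    for doc_nlls in nll_per_doc:  # NLL of one document after each TTT epoch
        best_epoch = min(range(len(doc_nlls)), key=doc_nlls.__getitem__)
        picks.append((best_epoch, doc_nlls[best_epoch]))
    return picks

# Illustrative NLLs across 3 epochs:
# doc 0 keeps improving; doc 1 overfits after epoch 1.
nlls = [[2.0, 1.8, 1.7],
        [2.2, 1.9, 2.1]]
picks = select_best_epochs(nlls)
```

Because the choice is per document, a document that overfits early keeps its epoch-1 score while others still benefit from all 6 epochs.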

Novel Contributions

  • K-Projection LoRA applied to K projections with a 0.3x learning-rate multiplier
  • Min-NLL epoch selection across TTT epochs to avoid late-epoch overfitting
  • FlashAttention-3 causal attention integration
  • Rotary cache clone fix for CUDA graph compatibility