val_bpb: 0.7227
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.45 MB
Training Techniques
Architecture
- MLP3x: an MLP block with a 3x hidden expansion and ReLU-squared activation (no recorded parameters).
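A minimal numpy sketch of such a block, assuming "3x" means the hidden layer is three times the model width (the dimensions and weight scales below are illustrative, not the submission's):

```python
import numpy as np

def relu_squared(x):
    # ReLU-squared: max(x, 0)^2, a cheap smooth activation
    return np.maximum(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    # Two-layer MLP: expand d_model -> 3*d_model, activate, project back
    return relu_squared(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d = 8
w_in = rng.standard_normal((d, 3 * d)) * 0.1
w_out = rng.standard_normal((3 * d, d)) * 0.1
y = mlp3x_forward(rng.standard_normal((4, d)), w_in, w_out)
```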
- SmearGate: adds a SmearGate to the model architecture (no recorded parameters).
- BigramHash: hashed bigram features for token-pair interactions (table size: 2048).
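A sketch of the idea: hash each (previous, current) token pair into a fixed-size embedding table. Only the table size of 2048 comes from the recorded parameters; the mixing constants and feature width below are illustrative.

```python
import numpy as np

TABLE_SIZE = 2048  # from the recorded {"size": 2048}

def bigram_bucket(prev_token: int, cur_token: int) -> int:
    # Hash the (previous, current) token pair into a fixed-size table.
    # Mixing constants are illustrative, not the submission's exact hash.
    h = (prev_token * 1000003 + cur_token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % TABLE_SIZE

def bigram_embed(tokens, table):
    # One hashed-bigram feature vector per position; the first position
    # pairs with a placeholder token 0.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]

table = np.zeros((TABLE_SIZE, 16))
feats = bigram_embed([5, 17, 256], table)
```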
- U-Net skip connections: encoder/decoder-style skip connections across the layer stack (no recorded parameters).
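A minimal sketch of depth-wise U-Net wiring in a layer stack: the first half of the layers push their inputs onto a stack, the second half pop and add them. The toy `layer_fn` stands in for a real transformer block; the layer count is an assumption.

```python
import numpy as np

def unet_stack(x, n_layers=6):
    # Encoder half remembers activations; decoder half adds the matching
    # skip before running its layer, mirroring U-Net skip connections.
    def layer_fn(h):
        return h + 0.1 * np.tanh(h)  # stand-in for a transformer block
    skips = []
    half = n_layers // 2
    for i in range(n_layers):
        if i < half:
            skips.append(x)        # encoder side: save activation
        else:
            x = x + skips.pop()    # decoder side: add matching skip
        x = layer_fn(x)
    return x

out = unet_stack(np.ones((2, 4)))
```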
- tied embeddings: the input embedding and output projection share one weight matrix (no recorded parameters).
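Weight tying in a nutshell, with illustrative sizes: one matrix `E` serves as both the embedding lookup and, transposed, the output projection, halving those parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 16
E = rng.standard_normal((vocab, d_model)) * 0.02  # single shared matrix

def embed(token_ids):
    return E[token_ids]   # input embedding: rows of E

def logits(hidden):
    return hidden @ E.T   # output projection: the same E, transposed

h = embed(np.array([3, 7]))
z = logits(h)
```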
Quantization
- int6: 6-bit quantization of all weights, with FP16 passthrough for embeddings and control tensors.
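A sketch of symmetric per-tensor int6 quantization, assuming signed integers in [-31, 31] with a single float scale per tensor (the scheme's exact granularity is not recorded, so this is illustrative):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric int6: integers in [-31, 31] plus one float scale per tensor.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```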
Compression
- zstd (level 22)
Optimizer
- Muon: Newton-Schulz orthogonalization, compiled (weight decay and momentum not recorded).
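Muon's core step approximately orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration. A numpy sketch, using the coefficients and 5-step count from the public Muon reference implementation (an assumption; this submission's exact settings are not recorded):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of g
    # toward 1 without an explicit SVD. Coefficients from the Muon reference.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization for stability
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
u = newton_schulz_orthogonalize(rng.standard_normal((8, 16)))
sv = np.linalg.svd(u, compute_uv=False)  # should cluster near 1
```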
- AdamW: fused implementation (weight decay and momentum not recorded).
Weight Averaging
- EMA (decay: 0.999, applied every 10 steps)
- SWA (averaged over 11 checkpoints)
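Both averaging schemes can be sketched in a few lines; the decay of 0.999, the 10-step interval, and the 11-checkpoint count come from the recorded parameters, while the toy training loop is illustrative.

```python
import numpy as np

def ema_update(ema, params, decay=0.999):
    # Exponential moving average of the weights (decay 0.999).
    return decay * ema + (1.0 - decay) * params

def swa_average(checkpoints):
    # SWA: plain mean over the saved checkpoints.
    return np.mean(checkpoints, axis=0)

params = np.zeros(4)
ema = np.ones(4)
for step in range(1, 101):
    params = params + 0.01          # stand-in for an optimizer step
    if step % 10 == 0:              # every_steps = 10
        ema = ema_update(ema, params)

swa = swa_average([np.full(4, float(i)) for i in range(11)])
```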
Test-Time Training
- LoRA TTT (rank 8 on Q/V projections, rank 16 on the LM head; learning rate 0.01; 6 epochs; 64 documents per GPU per batch)
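The LoRA building block that such TTT adapts per document: a frozen base weight plus a trainable low-rank update. Rank 8 matches the recorded `rank_qv`; the layer sizes and initialization scale below are assumptions.

```python
import numpy as np

class LoRALinear:
    # Frozen base weight plus a trainable low-rank update B @ A.
    # B starts at zero, so the layer is a no-op before any TTT steps.
    def __init__(self, w, rank=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w.shape
        self.w = w                                    # frozen during TTT
        self.a = rng.standard_normal((rank, d_in)) * 0.01
        self.b = np.zeros((d_out, rank))              # zero init

    def __call__(self, x):
        return x @ self.w.T + x @ self.a.T @ self.b.T

rng = np.random.default_rng(1)
base = rng.standard_normal((32, 64))
layer = LoRALinear(base, rank=8)
x = rng.standard_normal((4, 64))
y = layer(x)
```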
LR Schedule
- warmdown + cosine decay (warmdown steps: 6000; per-step cosine decay)
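One plausible reading of this schedule, sketched below: hold the peak rate, linearly "warm down" over the final 6000 steps, and multiply by a per-step cosine factor over the whole run. Only `warmdown_steps=6000` and the per-step cosine decay come from the recorded parameters; the peak rate, total step count, and the exact way the two factors combine are assumptions.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmdown_steps=6000):
    # Per-step cosine factor over the full run, times a linear warmdown
    # to zero over the final warmdown_steps.
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        ramp = 1.0
    else:
        ramp = max(0.0, (total_steps - step) / warmdown_steps)
    return peak_lr * ramp * cosine

total = 20000
lrs = [lr_at(s, total) for s in range(total + 1)]
```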
Sequence Length
- train: 1024; eval: not recorded
Regularization
- gradient clipping (max norm: 1)
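Global-norm gradient clipping with `max_norm = 1` (the recorded value) can be sketched as: compute the norm over all gradient tensors jointly, then rescale every tensor by the same factor when that norm exceeds the cap.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Joint norm over all gradient tensors; scale everything down
    # by max_norm / total_norm when the total exceeds max_norm.
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.full(4, 3.0), np.full(9, 4.0)]
clipped, norm = clip_grad_norm(grads)
```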
Other
- Late QAT during warmdown.
- FlashAttention-3 integration for faster causal attention on H100.
- Rotary cache .clone() fix to resolve a CUDA graph conflict with FlashAttention-3.
Novel Contributions
- FlashAttention-3 integration for faster attention on H100
- Rotary cache .clone() fix for CUDA graph compatibility with FlashAttention-3
- LoRA-based test-time training with per-document adaptation
- Per-layer learning rates for LoRA and bias parameters during TTT
- Score-every-epoch backward-looking evaluation compliant with Issue #402
- Late QAT combined with int6 quantization and zstd compression