PR #548

closed

Record: Loqui Auris — 10L + LoRA TTT (mean val_bpb=1.0865, 2 seeds)

by LoquiAurisView on GitHub
val_bpb: 1.0865
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.81 MB

Training Techniques

Architecture
SmearGate
Learned gate blending each token's representation with the previous token's.
parameters: null
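A minimal sketch of the SmearGate idea: blend each position with the previous position through a learned gate. The per-channel sigmoid parameterization here is an assumption; the record does not specify the gate's form.

```python
import numpy as np

def smear_gate(x, w_gate):
    """Blend each position with the previous position via a learned gate.

    x:      (seq, dim) token representations
    w_gate: (dim,) per-channel gate parameters (hypothetical
            parameterization; the record does not specify one)
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                              # no previous token at position 0
    g = 1.0 / (1.0 + np.exp(-(x * w_gate)))    # sigmoid gate in [0, 1]
    return g * x + (1.0 - g) * prev

x = np.random.randn(16, 64)
w = np.zeros(64)                               # gate = 0.5 -> even blend
y = smear_gate(x, w)
```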
BigramHash
Bigram hashing feature with 4096 buckets projected into model dimension.
parameters: {"buckets":4096,"dim":128}
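A sketch of the BigramHash feature with the record's 4096 buckets and dim-128 projection. The hash mixing constant and the BOS placeholder are illustrative assumptions; the record does not specify the hash function.

```python
import numpy as np

BUCKETS, DIM = 4096, 128   # from the record's parameters

def bigram_hash_features(tokens, table):
    """Map each (prev, cur) token bigram to a bucket, then to an embedding row."""
    feats = np.zeros((len(tokens), DIM))
    prev = 0                                    # placeholder "BOS" previous token
    for i, cur in enumerate(tokens):
        h = (prev * 1000003 + cur) % BUCKETS    # simple multiplicative hash
        feats[i] = table[h]
        prev = cur
    return feats

table = np.random.randn(BUCKETS, DIM)
feats = bigram_hash_features([5, 17, 5, 17], table)
```

The resulting feature rows would be added to (or concatenated with) the token embeddings before the first layer.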
MLP3x
3x feed-forward expansion in the MLP.
parameters: {"layers":10,"d_model":512,"heads":8,"kv_heads":4}
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
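Grouped-query attention with the record's 8 query heads over 4 KV heads can be sketched by repeating each KV head across its group, a common reference formulation:

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 32   # head counts from the record

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves HEADS // KV_HEADS query heads.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM)
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)            # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # softmax over key positions
    return w @ v

q = np.random.randn(HEADS, 10, HEAD_DIM)
k = np.random.randn(KV_HEADS, 10, HEAD_DIM)
v = np.random.randn(KV_HEADS, 10, HEAD_DIM)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves the KV cache without reducing the number of query heads.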
weight tying
Embedding and LM head weights tied: the LM head reuses the token embedding matrix as its linear projection.
parameters: null
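Weight tying in a nutshell: one matrix serves as both the input embedding table and (transposed) the output projection, so the LM head adds no parameters.

```python
import numpy as np

VOCAB, DIM = 1000, 64

E = np.random.randn(VOCAB, DIM) * 0.02   # token embedding table

def embed(ids):
    return E[ids]                         # input side: rows of E

def lm_head(h):
    return h @ E.T                        # output side: same weights, transposed

h = embed([3, 7])
logits = lm_head(h)
```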
RoPE
Rotary positional encoding.
parameters: {"persistent":false}
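A standard RoPE sketch: channel pairs are rotated by position-dependent angles, so relative positions show up as phase differences in attention dot products. The base of 10000 is the conventional default, not stated in the record.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional encoding over (seq, dim) with even dim."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    ang = np.outer(np.arange(seq), freqs)       # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)

x = np.random.randn(8, 16)
y = rope(x)
```

Because each pair is only rotated, RoPE preserves vector norms, which the test below checks.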
U-Net skips
Skip connections between symmetric layer pairs.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500,"adamw_weight_decay":0.01}
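Muon's distinguishing step is orthogonalizing the momentum matrix before applying it as an update. Below is an illustrative stand-in using the classical cubic Newton-Schulz iteration; Muon itself uses a tuned quintic variant with different coefficients, so treat this as a sketch of the idea rather than the record's exact optimizer.

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Approximately orthogonalize G (drive all singular values toward 1)."""
    X = G / (np.linalg.norm(G, 2) + 1e-7)   # scale spectral norm to <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X     # cubic Newton-Schulz step
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 64))            # stand-in for a momentum matrix
O = newton_schulz(G)
```

The update `-matrix_lr * O` then replaces the raw momentum for 2-D weight matrices, while scalar/vector parameters (and here, per `adamw_weight_decay`, presumably an AdamW-handled subset) are updated conventionally.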
Weight Averaging
EMA
parameters: {"decay":0.997}
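EMA weight averaging with the record's decay of 0.997 is a one-line update per step; the evaluated model uses the averaged weights, not the raw training weights.

```python
import numpy as np

DECAY = 0.997   # from the record

def ema_update(avg, params, decay=DECAY):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}

avg = {"w": np.zeros(4)}
for step in range(1000):
    params = {"w": np.ones(4)}          # pretend training has converged to 1.0
    avg = ema_update(avg, params)
```

With a constant target, the average after n steps is exactly 1 - decay^n, which the test verifies.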
Quantization
int6
bits: 6
scope: MLP and attention weights
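A sketch of symmetric per-tensor int6 quantization (the record does not state the scaling granularity, so per-tensor is an assumption). The int6 codes fit in an int8 array, which the record then compresses with zstd at level 22; the compression step is omitted here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: map weights to integer levels in [-31, 31]."""
    scale = np.abs(w).max() / 31.0       # map max |w| to the int6 extreme
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
```

The round-to-nearest error is bounded by half a quantization step (scale / 2), which the test checks.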
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"targets":["Q","V","LM head"],"epochs":2}
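LoRA test-time training freezes the base weight W and takes gradient steps only on a rank-8 delta A @ B per document. The sketch below shows one SGD step on the factors of a single linear layer with the record's rank and learning rate; the loss and gradient here are illustrative, not the record's exact per-document TTT objective.

```python
import numpy as np

RANK, LR = 8, 0.01   # from the record

def lora_sgd_step(W, A, B, x, grad_out):
    """One SGD step on the LoRA factors of W_eff = W + A @ B.

    W stays frozen; only A (d_out x rank) and B (rank x d_in) move.
    For y = W_eff @ x and upstream gradient grad_out = dL/dy:
      dL/dA = grad_out (outer) (B @ x),  dL/dB = (A.T @ grad_out) (outer) x
    """
    A_new = A - LR * np.outer(grad_out, B @ x)
    B_new = B - LR * np.outer(A.T @ grad_out, x)
    return A_new, B_new

d_out, d_in = 64, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02
A = np.zeros((d_out, RANK))              # standard LoRA init: A @ B = 0 at start
B = rng.standard_normal((RANK, d_in)) * 0.01
x = rng.standard_normal(d_in)
grad_out = rng.standard_normal(d_out)
A, B = lora_sgd_step(W, A, B, x, grad_out)
```

In the record this runs for 2 epochs per document on the Q, V, and LM head projections, with the adapters reset between documents.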
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmup + warmdown cosine schedule
parameters: {"warmup_steps":20,"warmdown_iterations":3000}
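One plausible reading of "warmup + warmdown cosine schedule" with the record's parameters: a 20-step linear warmup, a flat plateau, then a cosine decay over the final 3000 iterations. The linear warmup and flat middle are assumptions; the record only names the schedule.

```python
import math

WARMUP_STEPS, WARMDOWN_ITERS = 20, 3000   # from the record

def lr_scale(step, total_steps):
    """Multiplier on the base learning rate at a given step."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS          # linear warmup
    warmdown_start = total_steps - WARMDOWN_ITERS
    if step < warmdown_start:
        return 1.0                                # flat plateau
    t = (step - warmdown_start) / WARMDOWN_ITERS
    return 0.5 * (1.0 + math.cos(math.pi * t))    # cosine warmdown to 0

total = 5000
```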
Initialization
OrthoInit
Orthogonal initialization.
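Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix, with a sign fix so the result is uniformly distributed over orthogonal matrices; a minimal sketch:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization via QR of a Gaussian matrix."""
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix for a uniform orthogonal distribution
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

W = orthogonal_init((128, 64))
```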

Novel Contributions

  • 10-layer Transformer with SmearGate, BigramHash, and U-Net skip connections
  • EMA weight averaging with decay 0.997
  • Per-document LoRA test-time training on Q, V, and LM head
  • Batched TTT across 64 documents per GPU on 8 GPUs
  • Fix for torch.compile graph caching by resetting Dynamo and using a fresh uncompiled model for TTT
  • Int6 quantization of MLP and attention weights with zstd compression