PR #548

closed

Record: Loqui Auris — 10L + LoRA TTT (mean val_bpb=1.0865, 2 seeds)

by LoquiAurisView on GitHub
val_bpb: 1.0865
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.81 MB

Training Techniques

Architecture
SmearGate
Learned gate blending each token's representation with the previous token's.
parameters: null
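A minimal sketch of the SmearGate idea: blend each position with the previous position through a learned gate. The per-channel sigmoid parameterization here is an assumption; the record does not specify the gate's form.

```python
import numpy as np

def smear_gate(x, w_gate):
    """Blend each position with the previous position via a learned gate.

    x:      (seq, dim) token representations
    w_gate: (dim,) per-channel gate parameters (hypothetical
            parameterization; the record does not specify one)
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                              # no previous token at position 0
    g = 1.0 / (1.0 + np.exp(-(x * w_gate)))    # sigmoid gate in [0, 1]
    return g * x + (1.0 - g) * prev

x = np.random.randn(16, 64)
w = np.zeros(64)                               # gate = 0.5 -> even blend
y = smear_gate(x, w)
```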
BigramHash
Bigram hashing feature with 4096 buckets projected into model dimension.
parameters: {"buckets":4096,"dim":128}
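A sketch of the BigramHash feature with the record's 4096 buckets and dim-128 projection. The hash mixing constant and the BOS placeholder are illustrative assumptions; the record does not specify the hash function.

```python
import numpy as np

BUCKETS, DIM = 4096, 128   # from the record's parameters

def bigram_hash_features(tokens, table):
    """Map each (prev, cur) token bigram to a bucket, then to an embedding row."""
    feats = np.zeros((len(tokens), DIM))
    prev = 0                                    # placeholder "BOS" previous token
    for i, cur in enumerate(tokens):
        h = (prev * 1000003 + cur) % BUCKETS    # simple multiplicative hash
        feats[i] = table[h]
        prev = cur
    return feats

table = np.random.randn(BUCKETS, DIM)
feats = bigram_hash_features([5, 17, 5, 17], table)
```

The resulting feature rows would be added to (or concatenated with) the token embeddings before the first layer.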
MLP3x
3x feed-forward expansion in the MLP.
parameters: {"layers":10,"d_model":512,"heads":8,"kv_heads":4}
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
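Grouped-query attention with the record's 8 query heads over 4 KV heads can be sketched by repeating each KV head across its group, a common reference formulation:

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 32   # head counts from the record

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves HEADS // KV_HEADS query heads.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM)
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)            # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # softmax over key positions
    return w @ v

q = np.random.randn(HEADS, 10, HEAD_DIM)
k = np.random.randn(KV_HEADS, 10, HEAD_DIM)
v = np.random.randn(KV_HEADS, 10, HEAD_DIM)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves the KV cache without reducing the number of query heads.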
weight tying
Embedding and LM head weights tied: the LM head reuses the token embedding matrix as its linear projection.
parameters: null
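Weight tying in a nutshell: one matrix serves as both the input embedding table and (transposed) the output projection, so the LM head adds no parameters.

```python
import numpy as np

VOCAB, DIM = 1000, 64

E = np.random.randn(VOCAB, DIM) * 0.02   # token embedding table

def embed(ids):
    return E[ids]                         # input side: rows of E

def lm_head(h):
    return h @ E.T                        # output side: same weights, transposed

h = embed([3, 7])
logits = lm_head(h)
```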
RoPE
Rotary positional encoding.
parameters: {"persistent":false}
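A standard RoPE sketch: channel pairs are rotated by position-dependent angles, so relative positions show up as phase differences in attention dot products. The base of 10000 is the conventional default, not stated in the record.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional encoding over (seq, dim) with even dim."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    ang = np.outer(np.arange(seq), freqs)       # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)

x = np.random.randn(8, 16)
y = rope(x)
```

Because each pair is only rotated, RoPE preserves vector norms, which the test below checks.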
U-Net skips
Skip connections between symmetric layer pairs.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500,"adamw_weight_decay":0.01}
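Muon's distinguishing step is orthogonalizing the momentum matrix before applying it as an update. Below is an illustrative stand-in using the classical cubic Newton-Schulz iteration; Muon itself uses a tuned quintic variant with different coefficients, so treat this as a sketch of the idea rather than the record's exact optimizer.

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Approximately orthogonalize G (drive all singular values toward 1)."""
    X = G / (np.linalg.norm(G, 2) + 1e-7)   # scale spectral norm to <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X     # cubic Newton-Schulz step
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 64))            # stand-in for a momentum matrix
O = newton_schulz(G)
```

The update `-matrix_lr * O` then replaces the raw momentum for 2-D weight matrices, while scalar/vector parameters (and here, per `adamw_weight_decay`, presumably an AdamW-handled subset) are updated conventionally.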
Weight Averaging
EMA
parameters: {"decay":0.997}
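EMA weight averaging with the record's decay of 0.997 is a one-line update per step; the evaluated model uses the averaged weights, not the raw training weights.

```python
import numpy as np

DECAY = 0.997   # from the record

def ema_update(avg, params, decay=DECAY):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}

avg = {"w": np.zeros(4)}
for step in range(1000):
    params = {"w": np.ones(4)}          # pretend training has converged to 1.0
    avg = ema_update(avg, params)
```

With a constant target, the average after n steps is exactly 1 - decay^n, which the test verifies.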
Quantization
int6
bits: 6
scope: MLP and attention weights
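A sketch of symmetric per-tensor int6 quantization (the record does not state the scaling granularity, so per-tensor is an assumption). The int6 codes fit in an int8 array, which the record then compresses with zstd at level 22; the compression step is omitted here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric int6 quantization: map weights to integer levels in [-31, 31]."""
    scale = np.abs(w).max() / 31.0       # map max |w| to the int6 extreme
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)
```

The round-to-nearest error is bounded by half a quantization step (scale / 2), which the test checks.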
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"targets":["Q","V","LM head"],"epochs":2}
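LoRA test-time training freezes the base weight W and takes gradient steps only on a rank-8 delta A @ B per document. The sketch below shows one SGD step on the factors of a single linear layer with the record's rank and learning rate; the loss and gradient here are illustrative, not the record's exact per-document TTT objective.

```python
import numpy as np

RANK, LR = 8, 0.01   # from the record

def lora_sgd_step(W, A, B, x, grad_out):
    """One SGD step on the LoRA factors of W_eff = W + A @ B.

    W stays frozen; only A (d_out x rank) and B (rank x d_in) move.
    For y = W_eff @ x and upstream gradient grad_out = dL/dy:
      dL/dA = grad_out (outer) (B @ x),  dL/dB = (A.T @ grad_out) (outer) x
    """
    A_new = A - LR * np.outer(grad_out, B @ x)
    B_new = B - LR * np.outer(A.T @ grad_out, x)
    return A_new, B_new

d_out, d_in = 64, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02
A = np.zeros((d_out, RANK))              # standard LoRA init: A @ B = 0 at start
B = rng.standard_normal((RANK, d_in)) * 0.01
x = rng.standard_normal(d_in)
grad_out = rng.standard_normal(d_out)
A, B = lora_sgd_step(W, A, B, x, grad_out)
```

In the record this runs for 2 epochs per document on the Q, V, and LM head projections, with the adapters reset between documents.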
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
warmup + warmdown cosine schedule
parameters: {"warmup_steps":20,"warmdown_iterations":3000}
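One plausible reading of "warmup + warmdown cosine schedule" with the record's parameters: a 20-step linear warmup, a flat plateau, then a cosine decay over the final 3000 iterations. The linear warmup and flat middle are assumptions; the record only names the schedule.

```python
import math

WARMUP_STEPS, WARMDOWN_ITERS = 20, 3000   # from the record

def lr_scale(step, total_steps):
    """Multiplier on the base learning rate at a given step."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS          # linear warmup
    warmdown_start = total_steps - WARMDOWN_ITERS
    if step < warmdown_start:
        return 1.0                                # flat plateau
    t = (step - warmdown_start) / WARMDOWN_ITERS
    return 0.5 * (1.0 + math.cos(math.pi * t))    # cosine warmdown to 0

total = 5000
```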
Initialization
OrthoInit
Orthogonal initialization.
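Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix, with a sign fix so the result is uniformly distributed over orthogonal matrices; a minimal sketch:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal initialization via QR of a Gaussian matrix."""
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix for a uniform orthogonal distribution
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

W = orthogonal_init((128, 64))
```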

Novel Contributions

  • 10-layer Transformer with SmearGate, BigramHash, and U-Net skip connections
  • EMA weight averaging with decay 0.997
  • Per-document LoRA test-time training on Q, V, and LM head
  • Batched TTT across 64 documents per GPU on 8 GPUs
  • Fix for torch.compile graph caching by resetting Dynamo and using a fresh uncompiled model for TTT
  • Int6 quantization of MLP and attention weights with zstd compression