PR #595

closed

Record: Loqui Auris — 10L + SWA + Standard TTT (val_bpb=1.1100)

by LoquiAurisView on GitHub
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.69 MB

Training Techniques

Architecture
SmearGate
Learned gate that blends each token's representation with the previous token's.
parameters: null
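A minimal sketch of the SmearGate idea described above: a learned sigmoid gate mixes each position with the previous position's representation. The gate parameterization (`gate_weight`, `gate_bias`) is illustrative; the record does not publish the exact form.

```python
import numpy as np

def smear_gate(x, gate_weight, gate_bias):
    """Blend each position with the previous position's representation.

    x: (seq_len, d_model) token representations.
    gate_weight (d_model, 1) / gate_bias (1,): assumed per-position
    sigmoid gate parameters (names are illustrative).
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0  # no previous token at position 0
    g = 1.0 / (1.0 + np.exp(-(x @ gate_weight + gate_bias)))  # (seq_len, 1)
    return (1.0 - g) * x + g * prev
```

With zero gate parameters the gate sits at 0.5, i.e. an even mix of current and previous token.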
BigramHash
Bigram hashing feature with 4096 buckets projected to model dimension.
parameters: {"buckets":4096,"projection_dim":512}
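A sketch of the BigramHash feature: each (previous, current) token pair is hashed into one of 4096 buckets, which indexes a learned table of model-dimension vectors. The hash function, multiplier, and padding id are assumptions; the record only specifies 4096 buckets projected to dimension 512.

```python
import numpy as np

BUCKETS = 4096        # from the record
PROJECTION_DIM = 512  # model dimension

def bigram_hash_features(token_ids, table, mult=1000003):
    """Hash each (previous, current) token pair into a bucket and
    look up a learned d_model-sized vector from `table`.

    table: (BUCKETS, PROJECTION_DIM) learned embedding table.
    """
    feats = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed padding id for the first position
    for i, tok in enumerate(token_ids):
        bucket = (prev * mult + tok) % BUCKETS
        feats[i] = table[bucket]
        prev = tok
    return feats
```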
MLP3x
Feed-forward network expanded to 3x hidden size.
parameters: {"layers":10,"d_model":512,"heads":8,"kv_heads":4,"mlp_multiplier":3}
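The 3x feed-forward expansion can be sketched as below; the activation is assumed (ReLU here), since the record only states the multiplier.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Feed-forward block with hidden size 3 * d_model
    (mlp_multiplier=3 in the record)."""
    h = np.maximum(x @ w_in, 0.0)  # (seq, 3 * d_model), assumed ReLU
    return h @ w_out               # project back to (seq, d_model)
```

With d_model=512 this gives a 1536-wide hidden layer per block.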
tied embeddings
The input embedding matrix is shared with the output logit projection (weight tying).
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
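A minimal grouped-query attention sketch matching the record's head counts: 8 query heads share 4 K/V heads, so each K/V head serves 2 query heads.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where query heads share fewer K/V heads.

    q: (n_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head),
    with heads=8, kv_heads=4 as in the record.
    """
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    causal = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores[:, causal] = -1e9  # mask future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v
```

Sharing K/V heads shrinks the KV cache by 2x here without reducing the number of query heads.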
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"checkpoints_averaged":29,"checkpoint_interval_steps":50,"start_frac":0.5}
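The averaging step itself is a uniform mean over checkpoint parameter dicts; the record takes 29 checkpoints every 50 steps over the final half of training (start_frac=0.5). A minimal sketch:

```python
def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts (SWA).

    checkpoints: list of {param_name: value} dicts with identical keys.
    In the record, 29 checkpoints are averaged before quantization.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```

In practice a running average (updated as each checkpoint is saved) avoids holding all 29 checkpoints in memory at once.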
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings/norms/gates FP16/FP32 passthrough
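A sketch of the mixed-precision scheme in the scope line: MLP tensors fake-quantized to int5, attention tensors to int6, and everything else passed through. Symmetric per-tensor scales are an assumption; the record does not specify the quantizer's granularity.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-tensor fake quantization to `bits` (sketch; the
    record's exact scheme, e.g. per-channel scales, is not given)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def quantize_model(params):
    """MLP tensors -> int5, attention -> int6, everything else passes
    through unquantized, mirroring the record's scope line.
    The name-matching rules here are illustrative."""
    out = {}
    for name, w in params.items():
        if "mlp" in name:
            out[name] = quantize_dequantize(w, 5)
        elif "attn" in name:
            out[name] = quantize_dequantize(w, 6)
        else:
            out[name] = w  # embeddings/norms/gates passthrough
    return out
```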
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
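One common way to realize sliding-window evaluation with these values: advance a 2048-token window in steps of 64, scoring only the final 64 positions of each window so every token is scored once with near-full context. This scoring scheme is an assumption; the record gives only stride and seq_len.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (start, end, n_scored) evaluation windows: each window
    spans up to seq_len tokens, and only the final `stride` positions
    are scored (earlier positions serve as context)."""
    windows = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - seq_len)
        end = min(pos + stride, n_tokens)
        windows.append((start, end, end - pos))
        pos = end
    return windows
```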
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"epochs":10,"weight_decay":0,"gradient_clipping":1}
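The core of the TTT loop is a standard AdamW update with global-norm gradient clipping, using the record's hyperparameters (lr=5e-4, weight_decay=0, clipping=1); the outer loop would run this for 10 epochs over the eval text on the dequantized weights. A minimal sketch of one update:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0, clip=1.0):
    """One AdamW step with global-norm gradient clipping.

    w: parameters, g: gradient, (m, v): first/second moment state,
    t: 1-based step count for bias correction.
    """
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)  # gradient clipping to norm 1
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    # decoupled weight decay (zero in the record's TTT config)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```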
Initialization
OrthoInit
Orthogonal initialization.
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":3000}
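The warmdown schedule is a trapezoid: linear warmup over 20 steps, a constant plateau, then a linear decay to zero over the final 3000 iterations. The base LR below reuses the record's matrix_lr=0.02 for illustration.

```python
def lr_schedule(step, total_steps, base_lr=0.02,
                warmup_steps=20, warmdown_iterations=3000):
    """Trapezoidal schedule: linear warmup, constant plateau, linear
    warmdown to zero over the final warmdown_iterations steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining < warmdown_iterations:
        return base_lr * remaining / warmdown_iterations
    return base_lr
```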
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}

Novel Contributions

  • Standard AdamW test-time training applied to the quantized-then-dequantized model weights
  • 10-layer Transformer with SmearGate, BigramHash, and U-Net skip connections
  • SWA over 29 checkpoints before quantization
  • Mixed int5/int6 quantization with FP16/FP32 passthrough for selected tensors