PR #595

closed

Record: Loqui Auris — 10L + SWA + Standard TTT (val_bpb=1.1100)

by LoquiAurisView on GitHub
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.69 MB

Training Techniques

Architecture
SmearGate
Learned gate that blends each token's representation with the previous token's.
parameters: null
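A minimal sketch of the SmearGate idea described above: a learned sigmoid gate mixes each position with the previous position's representation. The gate parameterization (`gate_weight`, `gate_bias`) is illustrative; the record does not publish the exact form.

```python
import numpy as np

def smear_gate(x, gate_weight, gate_bias):
    """Blend each position with the previous position's representation.

    x: (seq_len, d_model) token representations.
    gate_weight (d_model, 1) / gate_bias (1,): assumed per-position
    sigmoid gate parameters (names are illustrative).
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0  # no previous token at position 0
    g = 1.0 / (1.0 + np.exp(-(x @ gate_weight + gate_bias)))  # (seq_len, 1)
    return (1.0 - g) * x + g * prev
```

With zero gate parameters the gate sits at 0.5, i.e. an even mix of current and previous token.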
BigramHash
Bigram hashing feature with 4096 buckets projected to model dimension.
parameters: {"buckets":4096,"projection_dim":512}
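A sketch of the BigramHash feature: each (previous, current) token pair is hashed into one of 4096 buckets, which indexes a learned table of model-dimension vectors. The hash function, multiplier, and padding id are assumptions; the record only specifies 4096 buckets projected to dimension 512.

```python
import numpy as np

BUCKETS = 4096        # from the record
PROJECTION_DIM = 512  # model dimension

def bigram_hash_features(token_ids, table, mult=1000003):
    """Hash each (previous, current) token pair into a bucket and
    look up a learned d_model-sized vector from `table`.

    table: (BUCKETS, PROJECTION_DIM) learned embedding table.
    """
    feats = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed padding id for the first position
    for i, tok in enumerate(token_ids):
        bucket = (prev * mult + tok) % BUCKETS
        feats[i] = table[bucket]
        prev = tok
    return feats
```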
MLP3x
Feed-forward network expanded to 3x hidden size.
parameters: {"layers":10,"d_model":512,"heads":8,"kv_heads":4,"mlp_multiplier":3}
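The 3x feed-forward expansion can be sketched as below; the activation is assumed (ReLU here), since the record only states the multiplier.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Feed-forward block with hidden size 3 * d_model
    (mlp_multiplier=3 in the record)."""
    h = np.maximum(x @ w_in, 0.0)  # (seq, 3 * d_model), assumed ReLU
    return h @ w_out               # project back to (seq, d_model)
```

With d_model=512 this gives a 1536-wide hidden layer per block.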
tied embeddings
The input embedding matrix is shared with the output logit projection (weight tying).
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
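A minimal grouped-query attention sketch matching the record's head counts: 8 query heads share 4 K/V heads, so each K/V head serves 2 query heads.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where query heads share fewer K/V heads.

    q: (n_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head),
    with heads=8, kv_heads=4 as in the record.
    """
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    causal = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores[:, causal] = -1e9  # mask future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v
```

Sharing K/V heads shrinks the KV cache by 2x here without reducing the number of query heads.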
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"checkpoints_averaged":29,"checkpoint_interval_steps":50,"start_frac":0.5}
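The averaging step itself is a uniform mean over checkpoint parameter dicts; the record takes 29 checkpoints every 50 steps over the final half of training (start_frac=0.5). A minimal sketch:

```python
def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts (SWA).

    checkpoints: list of {param_name: value} dicts with identical keys.
    In the record, 29 checkpoints are averaged before quantization.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```

In practice a running average (updated as each checkpoint is saved) avoids holding all 29 checkpoints in memory at once.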
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"warmup_momentum_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6, embeddings/norms/gates FP16/FP32 passthrough
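A sketch of the mixed-precision scheme in the scope line: MLP tensors fake-quantized to int5, attention tensors to int6, and everything else passed through. Symmetric per-tensor scales are an assumption; the record does not specify the quantizer's granularity.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-tensor fake quantization to `bits` (sketch; the
    record's exact scheme, e.g. per-channel scales, is not given)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def quantize_model(params):
    """MLP tensors -> int5, attention -> int6, everything else passes
    through unquantized, mirroring the record's scope line.
    The name-matching rules here are illustrative."""
    out = {}
    for name, w in params.items():
        if "mlp" in name:
            out[name] = quantize_dequantize(w, 5)
        elif "attn" in name:
            out[name] = quantize_dequantize(w, 6)
        else:
            out[name] = w  # embeddings/norms/gates passthrough
    return out
```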
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
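One common way to realize sliding-window evaluation with these values: advance a 2048-token window in steps of 64, scoring only the final 64 positions of each window so every token is scored once with near-full context. This scoring scheme is an assumption; the record gives only stride and seq_len.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (start, end, n_scored) evaluation windows: each window
    spans up to seq_len tokens, and only the final `stride` positions
    are scored (earlier positions serve as context)."""
    windows = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - seq_len)
        end = min(pos + stride, n_tokens)
        windows.append((start, end, end - pos))
        pos = end
    return windows
```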
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"epochs":10,"weight_decay":0,"gradient_clipping":1}
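The core of the TTT loop is a standard AdamW update with global-norm gradient clipping, using the record's hyperparameters (lr=5e-4, weight_decay=0, clipping=1); the outer loop would run this for 10 epochs over the eval text on the dequantized weights. A minimal sketch of one update:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0, clip=1.0):
    """One AdamW step with global-norm gradient clipping.

    w: parameters, g: gradient, (m, v): first/second moment state,
    t: 1-based step count for bias correction.
    """
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)  # gradient clipping to norm 1
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    # decoupled weight decay (zero in the record's TTT config)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```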
Initialization
OrthoInit
Orthogonal initialization.
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":3000}
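The warmdown schedule is a trapezoid: linear warmup over 20 steps, a constant plateau, then a linear decay to zero over the final 3000 iterations. The base LR below reuses the record's matrix_lr=0.02 for illustration.

```python
def lr_schedule(step, total_steps, base_lr=0.02,
                warmup_steps=20, warmdown_iterations=3000):
    """Trapezoidal schedule: linear warmup, constant plateau, linear
    warmdown to zero over the final warmdown_iterations steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining < warmdown_iterations:
        return base_lr * remaining / warmdown_iterations
    return base_lr
```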
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}

Novel Contributions

  • Standard AdamW test-time training applied to the quantized-then-dequantized model weights
  • 10-layer Transformer with SmearGate, BigramHash, and U-Net skip connections
  • SWA over 29 checkpoints before quantization
  • Mixed int5/int6 quantization with FP16/FP32 passthrough for selected tensors