PR #81

Status: open

Record: SwiGLU + MLP 3x + Int6 + LoRA TTT, val_bpb=1.1670 (8xH100)

by polarizedfortnite-cpu
val_bpb: 1.1670
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.83MB

Training Techniques

Architecture
MLP3x
Increased MLP expansion from 2x to 3x to add nonlinear capacity.
parameters: {"mlp_mult":3}
SwiGLU
Replaced relu^2 with SwiGLU activation.
parameters: {"mlp_hidden_dim":1024}
KV head count
Used grouped-query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
depth
Added one extra transformer layer over the baseline.
parameters: {"layers":10}
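The MLP3x + SwiGLU combination above can be sketched in a few lines of numpy. This is an illustrative sketch, not the record's actual code: the tensor names and the toy d_model=8 are invented here; the record itself uses mlp_mult=3 with mlp_hidden_dim=1024.

```python
import numpy as np

def silu(z):
    """SiLU / swish: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: the gate branch passes through SiLU and multiplies
    the linear 'up' branch before the down-projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 8                      # toy size for illustration only
hidden = 3 * d_model             # mlp_mult=3: expansion widened from 2x to 3x
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))
w_gate = rng.normal(size=(d_model, hidden))
w_up = rng.normal(size=(d_model, hidden))
w_down = rng.normal(size=(hidden, d_model))
out = swiglu_mlp(x, w_gate, w_up, w_down)   # shape (2, d_model)
```

Compared with relu^2, SwiGLU adds a second projection (the gate), so the per-block parameter count grows with both the activation swap and the 3x expansion.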
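The grouped-query attention entry (heads=8, kv_heads=4) amounts to broadcasting each KV head across a group of query heads. A minimal sketch, with toy sequence length and head dimension invented for illustration:

```python
import numpy as np

heads, kv_heads, T, d_head = 8, 4, 5, 16   # heads/kv_heads as in this record
group = heads // kv_heads                  # 2 query heads share each KV head
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, T, d_head))
k = rng.normal(size=(kv_heads, T, d_head))
v = rng.normal(size=(kv_heads, T, d_head))

# Broadcast each KV head to its group of query heads
k_full = np.repeat(k, group, axis=0)       # (heads, T, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full                     # (heads, T, d_head)
```

Halving the KV heads halves the K/V projection weights and the KV cache while leaving the query side at full width, which is what lets the extra layer fit under the artifact cap.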
Quantization
STE QAT int6
bits: 6
scope: all weights except tied embeddings
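The int6 fake-quantization step can be sketched as symmetric per-tensor rounding; in actual QAT the rounding would be wrapped in a straight-through estimator (forward pass sees quantized weights, backward pass treats rounding as identity). The function below is an assumption-laden sketch, not the record's implementation:

```python
import numpy as np

def fake_quant_int6(w, qmax=31):
    """Symmetric per-tensor int6 fake-quantization: round onto a uniform
    grid of 2*qmax+1 levels, then rescale back to floats. With STE, the
    gradient would flow through this op unchanged."""
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_q, scale = fake_quant_int6(w)

# Every dequantized value sits on the int6 grid, and the worst-case
# rounding error is half a quantization step.
assert np.allclose(w_q / scale, np.round(w_q / scale))
assert np.abs(w - w_q).max() <= scale / 2 + 1e-9
```

Applying this only in the final quarter of training (as the contributions note) lets the earlier phase train at full precision before the weights adapt to the int6 grid.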
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8}
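The LoRA TTT mechanism can be sketched as a frozen weight plus a trainable rank-8 update, adapted at evaluation time. Dimensions and the "gradient step" below are stand-ins for illustration; only rank=8 comes from the record:

```python
import numpy as np

d, rank = 64, 8                        # rank: 8, as in this record
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))            # frozen base weight
A = rng.normal(size=(d, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d))                # trainable up-projection, zero-init

def adapted(x):
    """Adapted layer: frozen path plus low-rank update x @ A @ B."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(3, d))
before = adapted(x)                    # zero-init B: adapter starts as a no-op
B += 0.1 * rng.normal(size=B.shape)    # stand-in for a few TTT gradient steps
after = adapted(x)
```

Because only A and B (d*rank + rank*d parameters per adapted matrix) are updated at test time, the adapter cost is small relative to the frozen model, and the zero-initialized B guarantees evaluation starts from the trained model's exact behavior.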
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
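A warmdown schedule holds the base learning rate flat and then decays it linearly to zero over the final warmdown_iters steps. A minimal sketch; the 5000-iteration total below is hypothetical, while warmdown_iters=1200 and the 0.04 matrix LR come from this record:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=1200):
    """Hold base_lr, then decay linearly to zero over the last warmdown_iters steps."""
    remaining = total_iters - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * (max(remaining, 0) / warmdown_iters)

# e.g. with a hypothetical 5000-iteration run at the record's matrix_lr of 0.04:
assert warmdown_lr(0, 5000, 0.04) == 0.04      # flat phase
assert warmdown_lr(4400, 5000, 0.04) == 0.02   # halfway through warmdown
assert warmdown_lr(5000, 5000, 0.04) == 0.0    # final step
```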
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrices":"Muon","embeddings_scalars":"Adam","matrix_lr":0.04,"embed_lr":0.05}
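Per the split above, matrices are updated with Muon (lr 0.04) while embeddings and scalars use Adam (lr 0.05). Muon's core step approximately orthogonalizes each matrix update via a quintic Newton-Schulz iteration; the sketch below uses the coefficients from the public Muon implementation and is illustrative, not this run's code:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize an update matrix (the core Muon step).
    Quintic Newton-Schulz; coefficients from the public Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 16))          # stand-in for a momentum buffer
update = newton_schulz_orth(grad)
# Singular values are pushed toward 1 (approximate orthogonality)
sv = np.linalg.svd(update, compute_uv=False)
```

Embeddings and scalars are excluded because orthogonalization is only meaningful for 2D weight matrices, hence the Adam group in other_params.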

Novel Contributions

  • Combined MLP 3x expansion with SwiGLU activation in a compact Transformer.
  • Applied int6 quantization with zstd compression to fit a larger model under the artifact cap.
  • Used quantization-aware training with STE during the final quarter of training.
  • Introduced LoRA-based test-time training during evaluation to improve validation bpb.
  • Added an extra transformer layer and used grouped-query attention with 4 KV heads.