PR #64
openRecord: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)
by yesbhautik
val_bpb
1.1250
Architecture
Transformer
Optimizer
Muon
Artifact Size
under 16MB
Training Techniques
Architecture
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
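Per the parameters, rotary embeddings are applied to only the first 16 of the 64 head dimensions; the remaining dimensions pass through unrotated. A minimal pure-Python sketch of one head vector (the pairing and frequency schedule follow the standard RoPE convention; the real model applies this per head inside attention):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of head vector x at position pos;
    the remaining dimensions pass through unchanged."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        # frequency decays with dimension index, as in standard RoPE
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Only a quarter of each head carries positional phase, which is often enough for short-range order while leaving the rest of the head free for content.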
BigramHash
Adds a BigramHash local-context component.
parameters: {"vocab_size":4096,"embedding_dim":128}
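BigramHash gives the model a cheap local-context signal: each (previous, current) token pair is hashed into a 4096-entry table and the looked-up 128-dim embedding is added to the token's representation. A sketch under stated assumptions (the multiplicative hash constant and the additive combination are illustrative; the record specifies only the table and embedding sizes):

```python
def bigram_bucket(prev_tok, cur_tok, table_size=4096):
    """Map a (prev, cur) token pair to one of table_size buckets.
    The multiplier 1000003 is an illustrative choice, not from the record."""
    return (prev_tok * 1000003 + cur_tok) % table_size

def add_bigram_embeddings(tokens, hidden, table):
    """Add the bucket embedding for each bigram to the token's hidden vector.
    hidden: one vector per token; table: table_size x embedding_dim."""
    out = []
    for t, h in enumerate(hidden):
        prev = tokens[t - 1] if t > 0 else 0  # assumed padding for position 0
        e = table[bigram_bucket(prev, tokens[t])]
        out.append([a + b for a, b in zip(h, e)])
    return out
```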
SmearGate
Uses per-dimension SmearGate.
parameters: null
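The record doesn't spell out SmearGate's form; a common reading is a learned per-dimension gate that blends each position's activation with the previous position's. A sketch under that assumption, with the gate supplied directly as values in [0, 1] (in practice it would be a learned parameter passed through a sigmoid):

```python
def smear_gate(seq, gate):
    """seq: list of per-token vectors; gate: per-dimension blend in [0, 1].
    Each position mixes in the previous position, dimension by dimension."""
    out = [list(seq[0])]  # nothing to smear into the first position
    for t in range(1, len(seq)):
        out.append([(1 - g) * seq[t][d] + g * seq[t - 1][d]
                    for d, g in enumerate(gate)])
    return out
```

"Per-dimension" means each channel learns its own smear strength rather than one scalar for the whole vector.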
XSA
XSA is removed, freeing compute budget for additional training steps.
parameters: null
Regularization
LN Scale
Scales each layer's contribution by a depth-dependent factor.
parameters: {"scale_rule":"1/sqrt(layer+1)"}
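The 1/sqrt(layer+1) rule damps deeper layers progressively, which acts as a mild regularizer on the residual stream. A sketch of the factor and one way it could be applied (attaching it to the residual branch output is an assumption; the record gives only the rule itself):

```python
import math

def ln_scale(layer_idx):
    """Depth-dependent damping factor, layer_idx counted from 0."""
    return 1.0 / math.sqrt(layer_idx + 1)

def scaled_residual(x, branch_out, layer_idx):
    """Residual add with the branch output damped by depth (assumed placement)."""
    s = ln_scale(layer_idx)
    return [a + s * b for a, b in zip(x, branch_out)]
```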
Weight Averaging
EMA
parameters: {"decay":0.997}
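EMA with decay 0.997 keeps a shadow copy of the weights that is nudged toward the live weights after every optimizer step; the averaged copy is what gets quantized and shipped. The update is one line:

```python
def ema_update(shadow, params, decay=0.997):
    """Move the shadow (averaged) weights a small step toward the live weights."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```

With decay 0.997 the shadow averages over roughly the last 1/(1-0.997) ≈ 333 steps, smoothing out late-training noise.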
Quantization
GPTQ-lite
bits: 6
scope: mlp, attn, tok_emb
int6
bits: 6
scope: mlp, attn, tok_emb
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"orthoinit":true}
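Muon takes a momentum step, orthogonalizes the momentum matrix with a few Newton-Schulz iterations, and applies that as the update. A small pure-Python sketch using the standard quintic Newton-Schulz coefficients, with the record's momentum 0.99 and weight decay 0.04 (the learning rate here is a placeholder, and the real implementation runs per 2-D weight matrix on the accelerator):

```python
import math

def _matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def ns_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(x * x for row in G for x in row)) + 1e-7
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        A = _matmul(X, [list(col) for col in zip(*X)])  # A = X X^T
        AX = _matmul(A, X)
        AAX = _matmul(A, AX)
        X = [[a * X[i][j] + b * AX[i][j] + c * AAX[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum accumulation, orthogonalize, decayed apply.
    lr=0.02 is an illustrative value, not from the record."""
    M = [[momentum * m + g for m, g in zip(mr, gr)] for mr, gr in zip(M, G)]
    O = ns_orthogonalize(M)
    W = [[(1 - lr * weight_decay) * w - lr * o for w, o in zip(wr, orow)]
         for wr, orow in zip(W, O)]
    return W, M
```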
Initialization
OrthoInit
Orthogonal initialization.
Test-Time Training
full TTT
parameters: {"epochs":25,"learning_rate":0.012,"momentum":0.9,"freeze_blocks":0}
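Test-time training here runs 25 epochs of SGD with momentum over tokens that have already been scored, so the adaptation costs compute but leaks no future information. The update schedule matches the listed parameters (lr 0.012, momentum 0.9, no frozen blocks); the model and loss are abstracted behind a caller-supplied gradient function:

```python
def ttt_fit(params, grad_fn, epochs=25, lr=0.012, momentum=0.9):
    """SGD with heavy-ball momentum, the schedule used for test-time training.
    grad_fn(params) returns one gradient per parameter."""
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            velocity[i] = momentum * velocity[i] + g
            params[i] -= lr * velocity[i]
    return params
```

With momentum 0.9 the effective step size compounds to roughly lr/(1-momentum) ≈ 0.12, which is why 25 epochs counts as "aggressive".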
Evaluation
sliding window eval
parameters: {"stride":64}
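Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows: each window supplies a long context, but only its final 64 tokens contribute fresh scores, so every token is graded exactly once with near-maximal context. A sketch of the span bookkeeping (the window size of 512 is an assumed context length, not given in the record):

```python
def sliding_eval_spans(n_tokens, window=512, stride=64):
    """Return (ctx_start, score_start, score_end) triples covering all tokens."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))  # the first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        # later windows slide forward but only score their last `stride` tokens
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```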
Compression
zstd
level: 22
Other
other
Uses 11 layers, 512 model dimension, 8 heads, 4 KV heads, and 3x MLP expansion.
parameters: {"layers":11,"model_dim":512,"heads":8,"kv_heads":4,"mlp_hidden":1536}
Novel Contributions
- GPTQ-lite optimal clip percentile search during int6 quantization
- 25-epoch aggressive SGD test-time training on already-graded tokens
- Partial RoPE combined with LN Scale, with XSA removed to enable more training steps
- Per-dimension SmearGate combined with BigramHash local context
- Mixed int6 quantization of MLP, attention, and token embeddings with zstd-22 compression
- Muon optimizer with OrthoInit and U-Net skip connections