PR #1969 (open)

SP8192 CaseOps + WiderGate32 + GPTQ-int6 — val_bpb 1.08037 (3-seed mean)

by bsisduck

val_bpb: 1.0804
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
LeakyReLU
MLP uses a squared LeakyReLU activation with a 4x expansion of the 2048-dim hidden size (2048 → 8192 → 2048)
parameters: {"mlp_multiplier":4,"hidden_size":2048}
GQA
Grouped-query attention with 2:1 query-to-KV head grouping
parameters: {"heads":8,"kv_heads":4}
Partial RoPE
Partial rotary position embeddings applied to 16 of the 64 head dimensions
parameters: {"dimensions":16,"base":10000,"total_dimensions":64}
U-Net skip connections
Encoder-decoder skip connections with skip gates
parameters: null
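With parameters unspecified, one plausible minimal form is a single learned scalar gate per encoder/decoder pair:

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(()))  # one scalar per skip

    def forward(self, x, enc_activation):
        # decoder stream plus gated encoder activation
        return x + torch.sigmoid(self.gate) * enc_activation
```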
depth recurrence
Layers 3-5 are looped twice per forward pass for virtual depth expansion
parameters: {"loop_layers":[3,5],"num_loops":2,"virtual_layers":17}
SmearGate
Position-mixing gate widened to 32 dimensions
parameters: {"width":32}
Gated Attention
Per-head attention output gating widened from 12 to 32 input dimensions
parameters: {"gate_width":32}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"Polar-Express","ns_steps":5,"minimax_tuples":true}
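For reference, the stock Muon Newton-Schulz iteration with ns_steps=5; the Polar-Express variant replaces the fixed (a, b, c) tuple with per-step minimax-optimal tuples, whose values are not reproduced here:

```python
import torch

def newton_schulz(G, steps=5, abc=(3.4445, -4.7750, 2.0315)):
    a, b, c = abc
    X = G.float() / (G.norm() + 1e-7)
    flip = X.size(0) > X.size(1)              # iterate on the short side
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic NS update
    return X.T if flip else X
```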
Regularization
logit softcap
parameters: {"value":30}
Quantization
GPTQ
bits: 6
scope: all weights
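A minimal sketch of the GPTQ inner loop on a per-row symmetric int6 grid, without blocking or activation ordering; the PR's actual implementation details are not given:

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    # W: (rows, cols) weight; H: (cols, cols) Hessian from calibration data.
    # Quantize column by column, spreading each rounding error onto the
    # not-yet-quantized columns via the inverse Hessian.
    W, H = W.clone().float(), H.clone().float()
    H += damp * H.diagonal().mean() * torch.eye(H.size(0))
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    scale = W.abs().amax(dim=1) / qmax          # per-row symmetric scale
    Q = torch.empty_like(W)
    for i in range(W.size(1)):
        q = torch.clamp(torch.round(W[:, i] / scale), -qmax - 1, qmax)
        Q[:, i] = q
        err = (W[:, i] - q * scale) / Hinv[i, i]
        W[:, i:] -= err[:, None] * Hinv[i, i:][None, :]
    return Q, scale
```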
Compression
brotli
level: 11
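Packing the int6 codes into bytes is repo-specific; the compression step itself is the brotli Python binding at maximum quality:

```python
import brotli

packed = b"\x00" * 1024                     # placeholder for packed int6 codes
blob = brotli.compress(packed, quality=11)  # quality 11 = max compression
assert brotli.decompress(blob) == packed
```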
Test-Time Training
LoRA TTT
parameters: {"rank":96,"phases":1,"prefix_docs":2000}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"min_lr":0.1}

Novel Contributions

  • Wider attention output gates with GATE_WIDTH=32
  • Widened SmearGate to width 32
  • SP8192 CaseOps tokenizer with bijective case markers (see the sketch after this list)
  • GPTQ int6 quantization of all weights with brotli compression
  • Polar-Express Muon optimization setup
  • TTT with LoRA rank-96 on 2000 prefix docs
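
The CaseOps scheme is only named in this PR; below is a hypothetical illustration of bijective case marking, with the marker glyph and scheme invented here and the assumption that the marker never occurs in the corpus:

```python
UP = "\u2191"  # hypothetical marker: next character was uppercase

def encode_case(text: str) -> str:
    return "".join(UP + c.lower() if c.isupper() else c for c in text)

def decode_case(marked: str) -> str:
    out, up = [], False
    for c in marked:
        if c == UP:
            up = True
        else:
            out.append(c.upper() if up else c)
            up = False
    return "".join(out)

assert decode_case(encode_case("Hello World")) == "Hello World"  # round-trips
```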