PR #1672

open

Record: GDN-Hybrid + TMA Megakernel + Brotli-11 — val_bpb 1.01195 (3-seed mean)

by andrewbaggio1
val_bpb
1.0119
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15,765,907 bytes

Training Techniques

Architecture
GatedDeltaNet
Recurrent GDN layers used as the main backbone.
parameters: {"layers":5,"dim":512,"heads":8,"head_dim":64}
SWA
Sliding Window Attention with causal masking and weight sharing.
parameters: {"window":512,"qk_gain":5}
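The causal sliding-window constraint can be sketched as a banded attention mask (a minimal NumPy illustration; the function name and toy sizes are ours, not from the submission):

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = swa_mask(6, 3)
# Row 5 attends only to positions 3, 4, 5 (itself plus 2 tokens back).
```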
ReLU²
Fused MLP uses ReLU squared activation in a Triton megakernel.
parameters: {"block_m":128,"block_n":128,"block_k":64}
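Numerically, the fused kernel computes a plain two-matmul MLP with ReLU-squared in between; a reference (unfused) sketch in NumPy, with names of our choosing:

```python
import numpy as np

def relu_sq(x):
    """ReLU-squared activation: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # The record fuses these three ops into one Triton megakernel;
    # here they are separate NumPy calls for clarity.
    return relu_sq(x @ w_in) @ w_out

y = mlp(np.ones((1, 2)), np.ones((2, 3)), np.ones((3, 1)))
# each hidden unit: relu(2)^2 = 4; output: 4 + 4 + 4 = 12
```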
weight tying
Tied input/output embeddings.
parameters: null
SmearGate
Smear gate that mixes a learned fraction of the previous token's embedding into each token.
parameters: null
depth recurrence
Some layers are looped multiple times to create virtual depth.
parameters: {"physical_layers":11,"virtual_layers":17}
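Depth recurrence amounts to re-running some physical layers according to a schedule; a toy sketch (the helper name and schedule are ours, the real 11-to-17 layer schedule is not given in the record):

```python
def run_virtual_depth(x, layers, schedule):
    """Apply physical layers in the order given by `schedule`, so that
    len(schedule) virtual layers run over len(layers) physical ones."""
    for idx in schedule:
        x = layers[idx](x)
    return x

# Toy example: 2 physical layers unrolled into 3 virtual applications.
layers = [lambda v: v + 1, lambda v: v * 2]
out = run_virtual_depth(1, layers, schedule=[0, 1, 0])
# (1 + 1) * 2 + 1 = 5
```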
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
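With 16 of 64 head dimensions rotated, the remaining 48 pass through untouched; a minimal NumPy sketch for a single head vector (function name and the split-halves rotation layout are our assumptions):

```python
import numpy as np

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of the head dim; leave the rest as-is."""
    q = q.copy()
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = q[:half].copy(), q[half:rot_dims].copy()
    q[:half] = x1 * cos - x2 * sin
    q[half:rot_dims] = x1 * sin + x2 * cos
    return q

v = np.ones(64)
r = partial_rope(v, pos=7)
# dims 16..63 are position-independent; dims 0..15 are rotated, norm-preserving
```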
U-Net skip connections
Skip gates / U-Net-style skip connections are used.
parameters: null
BigramHash
Eval-time hash embedding based on previous/current token pairs.
parameters: {"size":16384}
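A bigram hash embedding indexes a table of the stated size by a hash of the (previous, current) token pair; a sketch with a hash function of our choosing (the record does not specify the hash or the embedding dimension):

```python
import numpy as np

def bigram_bucket(prev_tok: int, cur_tok: int, size: int = 16384) -> int:
    """Hash the (previous, current) token pair into one of `size` buckets."""
    return (prev_tok * 1000003 + cur_tok) % size  # 1000003: arbitrary odd multiplier

table = np.zeros((16384, 8))  # toy table, embedding dim 8
emb = table[bigram_bucket(17, 42)]  # looked up at eval time per token pair
```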
Regularization
logit softcap
parameters: {"value":30}
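The softcap smoothly bounds logits to (-30, 30) via a scaled tanh:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(np.asarray(logits) / cap)

softcap(1000.0)  # saturates just below 30
softcap(0.5)     # nearly the identity for small logits
```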
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"MuonEq-R","adamw_for":"scalars/embeddings"}
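Muon's core step orthogonalizes the momentum buffer with a quintic Newton-Schulz iteration before applying the update; the "MuonEq-R" variant above is not described in the record, so this sketch shows only the standard published step (coefficients are the usual Muon constants):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (the momentum buffer) with the
    quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

G = np.diag([3.0, 1.0, 0.5, 0.2])
O = newton_schulz(G)
# singular values of O are pushed toward 1, whitening the update direction
```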
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_checkpoint_range":"17-18"}
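The EMA half of the averaging is a simple exponential blend at the stated decay (function name is ours):

```python
def ema_update(ema, current, decay=0.997):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*current."""
    return [decay * e + (1 - decay) * c for e, c in zip(ema, current)]

ema = [1.0]
for _ in range(3):
    ema = ema_update(ema, [2.0])
# after n steps toward a fixed target t: ema = t - (t - ema0) * decay**n
```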
LR Schedule
warmdown
parameters: {"total_iterations":2100,"warmdown_iterations":1000,"schedule":"cosine"}
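Reading the parameters as a constant phase followed by a cosine warmdown over the final 1000 of 2100 iterations (the record does not spell out the exact formula, so this is one plausible sketch):

```python
import math

def lr_at(step, base_lr=1.0, total=2100, warmdown=1000):
    """Constant LR, then a cosine warmdown over the final `warmdown` steps."""
    start = total - warmdown
    if step < start:
        return base_lr
    frac = (step - start) / warmdown
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))

lr_at(0)     # full LR during the constant phase
lr_at(1600)  # halfway through the warmdown: half LR
lr_at(2100)  # decayed to zero at the end of training
```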
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
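Full GPTQ with Hessian repair is beyond a short sketch, but the bit-widths above are easy to illustrate with the round-to-nearest baseline that GPTQ improves upon (GPTQ additionally corrects each column's rounding error using second-order, i.e. Hessian, information):

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to `bits` bits per weight.
    A baseline only: GPTQ reduces the resulting error further using
    Hessian information (eigendecomposition-repaired in this record)."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.array([0.31, -0.07, 0.155])
q, s = quantize_rtn(w)
w_hat = q * s  # dequantized weights; error per entry is at most scale / 2
```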
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
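Stride-64 sliding-window evaluation scores each token exactly once while giving it as much left context as the window allows; a sketch of the span bookkeeping (helper name and toy sizes are ours):

```python
def strided_eval_spans(n_tokens, window=512, stride=64):
    """Yield (context_start, score_start, end): each step scores `stride`
    new tokens using up to `window` tokens of left context."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx = max(0, end - window)
        spans.append((ctx, score_start, end))
    return spans

spans = strided_eval_spans(200, window=128, stride=64)
# every token falls in exactly one scored region
```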
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs_per_chunk":3}
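"Score-first" means each chunk is evaluated before the model trains on it, keeping the evaluation causal; a toy sketch with a one-parameter model (the model, `score`, and `update` are illustrative stand-ins; the learning rate matches the record):

```python
def score_first_ttt(chunks, score, update, epochs_per_chunk=3):
    """Score each chunk *before* training on it, so evaluation stays causal;
    then take `epochs_per_chunk` gradient passes over the chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluate with current weights
        for _ in range(epochs_per_chunk):
            update(chunk)             # adapt weights for later chunks only
    return losses

# Toy model: scalar w fit to chunk values by SGD on squared error.
state = {"w": 0.0}
def score(c):  return (state["w"] - c) ** 2
def update(c): state["w"] += 0.005 * 2 * (c - state["w"])

losses = score_first_ttt([1.0, 1.0], score, update)
# the first chunk is scored untrained; the second benefits from adaptation
```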

Novel Contributions

  • Fused Triton Hopper TMA persistent MLP megakernel for the relu_sq forward pass
  • Brotli-11 artifact compression with size-descending tensor ordering
  • GPTQ int6 quantization with eigendecomposition-based Hessian repair
  • GDN-Hybrid architecture combining recurrent GatedDeltaNet layers with sliding window attention
  • Score-first test-time training, plus eval-time bigram-hash embedding and Tap-In-style retrieval improvements