PR #1672

open

Record: GDN-Hybrid + TMA Megakernel + Brotli-11 — val_bpb 1.01195 (3-seed mean)

by andrewbaggio1
val_bpb
1.0119
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15,765,907 bytes

Training Techniques

Architecture
GatedDeltaNet
Recurrent GDN layers used as the main backbone.
parameters: {"layers":5,"dim":512,"heads":8,"head_dim":64}
SWA
Sliding Window Attention with causal masking and weight sharing.
parameters: {"window":512,"qk_gain":5}
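The causal sliding-window constraint can be sketched as a banded attention mask (a minimal NumPy illustration; the function name and toy sizes are ours, not from the submission):

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = swa_mask(6, 3)
# Row 5 attends only to positions 3, 4, 5 (itself plus 2 tokens back).
```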
ReLU²
Fused MLP uses ReLU squared activation in a Triton megakernel.
parameters: {"block_m":128,"block_n":128,"block_k":64}
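Numerically, the fused kernel computes a plain two-matmul MLP with ReLU-squared in between; a reference (unfused) sketch in NumPy, with names of our choosing:

```python
import numpy as np

def relu_sq(x):
    """ReLU-squared activation: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # The record fuses these three ops into one Triton megakernel;
    # here they are separate NumPy calls for clarity.
    return relu_sq(x @ w_in) @ w_out

y = mlp(np.ones((1, 2)), np.ones((2, 3)), np.ones((3, 1)))
# each hidden unit: relu(2)^2 = 4; output: 4 + 4 + 4 = 12
```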
weight tying
Tied input/output embeddings.
parameters: null
SmearGate
Smear gate that mixes a learned fraction of the previous token's embedding into each token.
parameters: null
depth recurrence
Some layers are looped multiple times to create virtual depth.
parameters: {"physical_layers":11,"virtual_layers":17}
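Depth recurrence amounts to re-running some physical layers according to a schedule; a toy sketch (the helper name and schedule are ours, the real 11-to-17 layer schedule is not given in the record):

```python
def run_virtual_depth(x, layers, schedule):
    """Apply physical layers in the order given by `schedule`, so that
    len(schedule) virtual layers run over len(layers) physical ones."""
    for idx in schedule:
        x = layers[idx](x)
    return x

# Toy example: 2 physical layers unrolled into 3 virtual applications.
layers = [lambda v: v + 1, lambda v: v * 2]
out = run_virtual_depth(1, layers, schedule=[0, 1, 0])
# (1 + 1) * 2 + 1 = 5
```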
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
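With 16 of 64 head dimensions rotated, the remaining 48 pass through untouched; a minimal NumPy sketch for a single head vector (function name and the split-halves rotation layout are our assumptions):

```python
import numpy as np

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of the head dim; leave the rest as-is."""
    q = q.copy()
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = q[:half].copy(), q[half:rot_dims].copy()
    q[:half] = x1 * cos - x2 * sin
    q[half:rot_dims] = x1 * sin + x2 * cos
    return q

v = np.ones(64)
r = partial_rope(v, pos=7)
# dims 16..63 are position-independent; dims 0..15 are rotated, norm-preserving
```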
U-Net skip connections
Skip gates / U-Net-style skip connections are used.
parameters: null
BigramHash
Eval-time hash embedding based on previous/current token pairs.
parameters: {"size":16384}
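A bigram hash embedding indexes a table of the stated size by a hash of the (previous, current) token pair; a sketch with a hash function of our choosing (the record does not specify the hash or the embedding dimension):

```python
import numpy as np

def bigram_bucket(prev_tok: int, cur_tok: int, size: int = 16384) -> int:
    """Hash the (previous, current) token pair into one of `size` buckets."""
    return (prev_tok * 1000003 + cur_tok) % size  # 1000003: arbitrary odd multiplier

table = np.zeros((16384, 8))  # toy table, embedding dim 8
emb = table[bigram_bucket(17, 42)]  # looked up at eval time per token pair
```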
Regularization
logit softcap
parameters: {"value":30}
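The softcap smoothly bounds logits to (-30, 30) via a scaled tanh:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * np.tanh(np.asarray(logits) / cap)

softcap(1000.0)  # saturates just below 30
softcap(0.5)     # nearly the identity for small logits
```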
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"MuonEq-R","adamw_for":"scalars/embeddings"}
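Muon's core step orthogonalizes the momentum buffer with a quintic Newton-Schulz iteration before applying the update; the "MuonEq-R" variant above is not described in the record, so this sketch shows only the standard published step (coefficients are the usual Muon constants):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (the momentum buffer) with the
    quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

G = np.diag([3.0, 1.0, 0.5, 0.2])
O = newton_schulz(G)
# singular values of O are pushed toward 1, whitening the update direction
```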
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_checkpoint_range":"17-18"}
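The EMA half of the averaging is a simple exponential blend at the stated decay (function name is ours):

```python
def ema_update(ema, current, decay=0.997):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*current."""
    return [decay * e + (1 - decay) * c for e, c in zip(ema, current)]

ema = [1.0]
for _ in range(3):
    ema = ema_update(ema, [2.0])
# after n steps toward a fixed target t: ema = t - (t - ema0) * decay**n
```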
LR Schedule
warmdown
parameters: {"total_iterations":2100,"warmdown_iterations":1000,"schedule":"cosine"}
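Reading the parameters as a constant phase followed by a cosine warmdown over the final 1000 of 2100 iterations (the record does not spell out the exact formula, so this is one plausible sketch):

```python
import math

def lr_at(step, base_lr=1.0, total=2100, warmdown=1000):
    """Constant LR, then a cosine warmdown over the final `warmdown` steps."""
    start = total - warmdown
    if step < start:
        return base_lr
    frac = (step - start) / warmdown
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))

lr_at(0)     # full LR during the constant phase
lr_at(1600)  # halfway through the warmdown: half LR
lr_at(2100)  # decayed to zero at the end of training
```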
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
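Full GPTQ with Hessian repair is beyond a short sketch, but the bit-widths above are easy to illustrate with the round-to-nearest baseline that GPTQ improves upon (GPTQ additionally corrects each column's rounding error using second-order, i.e. Hessian, information):

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to `bits` bits per weight.
    A baseline only: GPTQ reduces the resulting error further using
    Hessian information (eigendecomposition-repaired in this record)."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.array([0.31, -0.07, 0.155])
q, s = quantize_rtn(w)
w_hat = q * s  # dequantized weights; error per entry is at most scale / 2
```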
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
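Stride-64 sliding-window evaluation scores each token exactly once while giving it as much left context as the window allows; a sketch of the span bookkeeping (helper name and toy sizes are ours):

```python
def strided_eval_spans(n_tokens, window=512, stride=64):
    """Yield (context_start, score_start, end): each step scores `stride`
    new tokens using up to `window` tokens of left context."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx = max(0, end - window)
        spans.append((ctx, score_start, end))
    return spans

spans = strided_eval_spans(200, window=128, stride=64)
# every token falls in exactly one scored region
```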
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs_per_chunk":3}
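"Score-first" means each chunk is evaluated before the model trains on it, keeping the evaluation causal; a toy sketch with a one-parameter model (the model, `score`, and `update` are illustrative stand-ins; the learning rate matches the record):

```python
def score_first_ttt(chunks, score, update, epochs_per_chunk=3):
    """Score each chunk *before* training on it, so evaluation stays causal;
    then take `epochs_per_chunk` gradient passes over the chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluate with current weights
        for _ in range(epochs_per_chunk):
            update(chunk)             # adapt weights for later chunks only
    return losses

# Toy model: scalar w fit to chunk values by SGD on squared error.
state = {"w": 0.0}
def score(c):  return (state["w"] - c) ** 2
def update(c): state["w"] += 0.005 * 2 * (c - state["w"])

losses = score_first_ttt([1.0, 1.0], score, update)
# the first chunk is scored untrained; the second benefits from adaptation
```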

Novel Contributions

  • Fused Triton Hopper TMA persistent MLP megakernel for the relu_sq forward pass
  • Brotli-11 artifact compression with size-descending tensor ordering
  • GPTQ int6 quantization with eigendecomposition-based Hessian repair
  • GDN-Hybrid architecture combining recurrent GatedDeltaNet layers with sliding window attention
  • Score-first test-time training, plus eval-time bigram-hash embedding and Tap-In-style retrieval improvements