PR #1555

open

Record: TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1 — val_bpb 1.07636 (3-seed mean)

by andrewbaggio1

val_bpb

1.0764

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.97 MB

Training Techniques

Architecture

LeakyReLU

Uses LeakyReLU(0.5) squared in the MLP.

parameters: {"negative_slope":0.5,"power":2}

ReLU²

The MLP nonlinearity squares the output of the LeakyReLU.

parameters: null
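
The two activation entries above describe one nonlinearity: a LeakyReLU with negative slope 0.5 whose output is then squared. A minimal scalar sketch follows. The function name is hypothetical, and plain squaring (which discards the sign of negative inputs) is an assumption, since the listing does not say whether the sign is preserved.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5, power: int = 2) -> float:
    """Apply LeakyReLU, then raise the result to `power` (assumed plain squaring)."""
    leaky = x if x >= 0.0 else negative_slope * x
    return leaky ** power


print(leaky_relu_squared(2.0))   # positive input: 2.0 squared
print(leaky_relu_squared(-2.0))  # negative input: (0.5 * -2.0) squared
```

Under this reading, negative pre-activations still contribute a small positive value instead of being zeroed as in plain ReLU².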

Partial RoPE

Applies rotary position embeddings to only part of the head dimensions.

parameters: {"dimensions":"16/64"}
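
As a rough illustration of partial RoPE with 16 of 64 head dimensions, the sketch below rotates only the first 16 entries of a head vector and passes the remaining 48 through unchanged. The half-split pairing of dimensions, the base of 10000, and the function name are assumptions; the listing only gives the 16/64 split.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Frequency for this pair (standard RoPE schedule, assumed here).
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]  # half-split pairing (assumption)
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])  # the last 48 dimensions carry no positional signal
```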

weight tying

Tied input and output embeddings.

parameters: null

depth recurrence

Loops selected layers multiple times to create virtual depth.

parameters: {"layers":"3-5","repeats":2,"virtual_layers":17,"physical_layers":11}
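
The parameters above are consistent with 11 physical layers plus two extra passes over layers 3-5 (11 + 3 × 2 = 17 virtual layers). The sketch below builds one such execution order. Zero-based layer indices, treating `repeats` as extra passes, and running the repeats back to back are all assumptions; the listing does not state the exact ordering.

```python
def layer_schedule(physical_layers=11, loop_start=3, loop_end=5, repeats=2):
    """Return the order in which physical layers run (hypothetical ordering)."""
    order = []
    for layer in range(physical_layers):
        order.append(layer)
        if layer == loop_end:
            # Re-run the looped block `repeats` more times after its first pass.
            block = list(range(loop_start, loop_end + 1))
            order.extend(block * repeats)
    return order


schedule = layer_schedule()
print(schedule)
print(len(schedule))  # number of virtual layers
```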

U-Net skip connections

Uses skip connections with gated U-Net-style routing.

parameters: null

GQA

Grouped-query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
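
With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. A small sketch of the index mapping, assuming contiguous grouping (the grouping order and the function name are not taken from the PR):

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it shares (contiguous groups assumed)."""
    assert heads % kv_heads == 0
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)
```

Halving the number of KV heads halves the key/value projection parameters relative to full multi-head attention at the same head count.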

MLP4x

Uses a 4x MLP expansion.

parameters: {"multiplier":4}

Regularization

layerwise LN scale

parameters: null

logit softcap

parameters: {"value":30}
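
A common way to implement a logit softcap is `cap * tanh(logit / cap)`, which is near-identity for small logits and saturates at ±cap. The sketch below uses that form with the listed value of 30; whether this PR uses exactly the tanh form is an assumption.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to the interval [-cap, cap] (tanh form assumed)."""
    return cap * math.tanh(logit / cap)


print(softcap(1.0))     # small logits pass through almost unchanged
print(softcap(1000.0))  # large logits saturate near the cap
```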

Optimizer

Muon

weight_decay: null

momentum: 0.97

other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
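
For orientation, the sketch below shows the five-step Newton-Schulz orthogonalization used by the standard Muon optimizer, written in plain Python on nested lists. The quintic coefficients are those of the public Muon reference implementation. The listing does not describe what the MuonEq-R variant changes, so this should be read as baseline Muon only, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Push the singular values of `grad` toward 1 (standard Muon iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(v * v for row in grad for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in grad]
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so the Gram matrix is small
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * g + c * g2 for g, g2 in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        update = matmul(poly, x)
        x = [[a * xv + uv for xv, uv in zip(rx, ru)] for rx, ru in zip(x, update)]
    return transpose(x) if tall else x


print(newton_schulz([[2.0, 0.0], [0.0, 2.0]]))
```

The iteration does not converge exactly to singular values of 1; it only brings them into a band around 1, which is the intended behavior in Muon.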

AdamW

weight_decay: 0.095

momentum: null

other_params: {"scope":"embeddings/scalars"}

Weight Averaging

EMA

parameters: {"decay":0.997}
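
An exponential moving average with decay 0.997 keeps 99.7% of the running average and blends in 0.3% of the current weights at each update. A minimal sketch on a flat list of weights (the function name and the per-step update cadence are assumptions):

```python
def ema_update(ema_weights, weights, decay=0.997):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
print(ema)
```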

Quantization

GPTQ

bits: 6

scope: attention/MLP matrices

GPTQ

bits: 8

scope: token embeddings

Compression

brotli

level: null

Evaluation

sliding window eval

parameters: {"stride":64}
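
In sliding-window evaluation, the window advances by the stride and only the tokens not already scored are counted, so each token is scored once with as much left context as fits. The sketch below computes the window boundaries. The context length of 128 in the usage line is a hypothetical value chosen for illustration; the listing only gives the stride of 64.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; earlier tokens in the
    window serve only as context.
    """
    assert 0 < stride <= context_len
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


for window in sliding_windows(num_tokens=256, context_len=128):
    print(window)
```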

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3}
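
One reading of "score-first" TTT is that each chunk is scored with the current weights before the model takes any gradient step on it, so the reported loss never benefits from training on the tokens being scored. The toy sketch below follows that reading with the listed hyperparameters. The one-parameter model, the squared-error loss, and the function name are all stand-ins for illustration, not the PR's implementation.

```python
def score_first_ttt(chunks, lr=0.005, momentum=0.9, epochs_per_chunk=3):
    """Score each chunk, then adapt on it with SGD plus momentum."""
    mu, velocity = 0.0, 0.0  # toy one-parameter "model" (hypothetical)
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score the chunk with the current parameters, before any update on it.
        total_loss += sum((x - mu) ** 2 for x in chunk)
        count += len(chunk)
        # 2) Adapt on the chunk that was just scored.
        for _ in range(epochs_per_chunk):
            grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            mu -= lr * velocity
    return total_loss / count, mu


mean_loss, final_mu = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
print(mean_loss, final_mu)
```

Later chunks benefit from adaptation on earlier ones, while the first chunk is always scored by the unadapted model.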

Other

other

Tap-In unigram matching with min_match=1 and cross-window matching.

parameters: {"min_match":1,"top_k":1000,"cross_window_weight":0.06}

other

Eval-time hash embedding trained through the TTT loop.

parameters: {"size":"16384x512"}

Novel Contributions

TMA Megakernel: a fused MLP forward kernel with asynchronous transfers that improves throughput, allowing more training steps within the time budget.
Tap-In min_match=1 unigram matching, lowering the retrieval threshold from 3 tokens to 1 token.
Improved parallel residuals ported from PR #1529.