PR #1555
openRecord: TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1 — val_bpb 1.07636 (3-seed mean)
by andrewbaggio1View on GitHub
val_bpb
1.0764
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.97 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU(0.5) squared in the MLP.
parameters: {"negative_slope":0.5,"power":2}
ReLU²
MLP nonlinearity includes square activation after LeakyReLU.
parameters: null
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
weight tying
Tied input and output embeddings.
parameters: null
depth recurrence
Loops selected layers multiple times to create virtual depth.
parameters: {"layers":"3-5","repeats":2,"virtual_layers":17,"physical_layers":11}
U-Net skip connections
Uses skip connections with gated U-Net-style routing.
parameters: null
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP4x
Uses a 4x MLP expansion.
parameters: {"multiplier":4}
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"scope":"embeddings/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs_per_chunk":3}
Other
other
Tap-In unigram matching with min_match=1 and cross-window matching.
parameters: {"min_match":1,"top_k":1000,"cross_window_weight":0.06}
other
Eval-time hash embedding trained through the TTT loop.
parameters: {"size":"16384x512"}
Novel Contributions
- TMA Megakernel fused MLP forward kernel with async transfers, improving throughput and enabling more training steps within the time budget.
- Tap-In min_match=1 unigram matching, lowering the retrieval threshold from 3 tokens to 1 token.
- Improved parallel residuals ported from PR #1529.