PR #891

closed

Non-record: Technique Taxonomy — Tier List, Interaction Effects, and BPB Verification Tools

by robbiebusinessacc

val_bpb

1.1428

Architecture

Transformer

Optimizer

Muon

Artifact Size

—

Training Techniques

Evaluation

sliding window eval

parameters: {"stride":64}
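One way to arrange a sliding-window evaluation with stride 64 is sketched below. It assumes each window spans at most 2048 tokens (the eval_length listed further down) and that only positions not already scored by an earlier window count toward the loss. The function name and return format are hypothetical, not taken from the PR.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions score_from..end-1
    contribute to the loss, so every position is scored exactly once.
    """
    scored_upto = 0
    end = min(window, n_tokens)
    while True:
        start = max(0, end - window)
        yield start, end, scored_upto
        scored_upto = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
```

After the first window, each scored position has close to a full window of left context, at the cost of roughly window/stride times more forward passes than non-overlapping evaluation.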

Quantization

int6

bits: 6

scope: MLP weights
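As a rough illustration of 6-bit weight quantization, the sketch below uses symmetric per-row scaling to integers in [-31, 31]. The per-row granularity, the symmetric range, and the function names are assumptions; the PR summary only states 6 bits applied to MLP weights.

```python
def quantize_int6(row):
    """Symmetric quantization of one weight row to integers in [-31, 31]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale


def dequantize_int6(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]
```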

QAT

bits: 6

scope: int6

Architecture

MLP3x

Expand MLP width from 2x to 3x.

parameters: null

SmearGate

Learned gate mixing current and previous token embeddings.

parameters: null
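The description above can be read as a gated blend of each token's embedding with its predecessor's. The sketch below assumes a per-dimension sigmoid gate and a zero vector before the first token; both choices, and the function name, are illustrative rather than taken from the PR.

```python
import math


def smear_gate(embeddings, gate_logits):
    """Blend each token embedding with the previous token's embedding.

    embeddings: list of per-token vectors (lists of floats).
    gate_logits: one learned logit per embedding dimension (assumed).
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    prev = [0.0] * len(gate_logits)  # assumed: nothing precedes the first token
    out = []
    for x in embeddings:
        out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
        prev = x
    return out
```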

BigramHash

Hashed bigram pair representations.

parameters: null
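A hashed bigram feature maps each (previous token, current token) pair to a bucket in a fixed-size embedding table. The sketch below shows only the bucketing step; the multiplier and the bucket count are hypothetical values chosen for illustration, not the PR's settings.

```python
def bigram_bucket(prev_token, token, num_buckets=4096):
    """Map an ordered token pair to a bucket index in [0, num_buckets)."""
    # 1000003 is an arbitrary large prime used as a mixing constant (assumed).
    return (prev_token * 1000003 + token) % num_buckets
```

The bucket index would then select a row of a learned embedding table, so distinct bigrams that collide share a representation.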

XSA

XSA applied to the last 4 layers.

parameters: {"layers":4}

LeakyReLU

LeakyReLU squared activation.

parameters: {"negative_slope":0.5}
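A minimal scalar sketch of this activation, assuming "squared" means the LeakyReLU output is squared (so negative inputs produce small positive outputs):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```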

Partial RoPE

Apply RoPE to a subset of dimensions.

parameters: {"dimensions":"16/64"}
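The "16/64" setting can be illustrated on a single head vector: 16 of the 64 dimensions are rotated by position-dependent angles and the other 48 pass through unchanged. The sketch assumes the first 16 dimensions are the rotated ones, a half-split pairing of dimensions, and a frequency base of 10000; none of these details are stated in the PR summary.

```python
import math


def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims entries of vec; leave the rest untouched."""
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        theta = pos * base ** (-2 * i / rope_dims)
        a, b = vec[i], vec[i + half]  # assumed pairing: dimension i with i + half
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```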

VE128

Value embeddings with dimension 128.

parameters: {"dimensions":128}

Regularization

weight decay

parameters: {"weight_decay":0.04}

LN scale

parameters: {"scale":"1/sqrt(layer)"}
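The scale expression can be written as a small helper. The sketch assumes layers are numbered from 1 and that the factor multiplies the normalization output of that layer; the summary gives only the formula.

```python
import math


def ln_scale(layer_index):
    """Return 1/sqrt(layer) for a 1-based layer index (assumed indexing)."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return 1.0 / math.sqrt(layer_index)
```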

Weight Averaging

EMA

parameters: {"decay":0.997}
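With decay 0.997, an exponential moving average of the weights is updated after each training step roughly as follows. This is the standard EMA recurrence on a flat list of weights; the function name is hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """Return the new EMA after blending in the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```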

SWA

parameters: null

Initialization

OrthoInit

Orthogonal initialization for better-conditioned matrices.

Compression

lzma

level: null
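Since no compression level is given, the sketch below measures an artifact's size with Python's standard lzma module at its default preset. The helper name is hypothetical, and the PR's actual packing format is not described in this summary.

```python
import lzma


def compressed_size(payload: bytes) -> int:
    """Return the size in bytes of payload after default-preset LZMA compression."""
    return len(lzma.compress(payload))
```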

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}
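A warmdown schedule holds the learning rate at its base value and then decays it over the final 3500 steps. The sketch assumes a linear decay to zero and takes the total step count as an argument, since neither is stated in the summary.

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Return 1.0 before the warmdown, then decay linearly to 0.0 (assumed shape)."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```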

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"chunk_size":256,"freeze_blocks":0}
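"Score-first" is sketched below as: each 256-token chunk is scored with the current weights before the model trains on it, so no token is predicted by weights that have already seen it. The two callbacks are hypothetical placeholders; the learning rate of 0.002 would live inside update_fn, and freeze_blocks 0 is assumed to mean that no blocks are frozen.

```python
def score_first_ttt(tokens, score_fn, update_fn, chunk_size=256):
    """Score each chunk, then adapt on it, in that order.

    score_fn(chunk) -> summed loss for the chunk, without changing weights.
    update_fn(chunk) -> runs one adaptation step on the chunk.
    """
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)  # weights have not seen this chunk yet
        update_fn(chunk)               # adapt only after the chunk is scored
    return total_loss
```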

Sequence Length

sequence_length

train_length: 2048

eval_length: 2048

Novel Contributions

Technique tier list with measured BPB deltas and source PRs
Interaction effects matrix showing sub-additive technique combinations
BPB verification checklist for formula and causal correctness
Collected n-gram legality rulings and organizer guidance in one place
Negative results index linking to prior research PRs
Parameter budget calculator with verified configurations
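For the formula part of the verification checklist, bits per byte is the total negative log-likelihood converted from nats to bits and divided by the number of bytes in the evaluated text. The sketch assumes the loss is summed (not averaged) in nats and that the byte count comes from the raw text rather than the token count; the function name is hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats to bits per byte of evaluated text."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

A common source of error is dividing by the number of tokens instead of the number of bytes, which gives bits per token and makes results incomparable across tokenizers.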