PR #891

closed

Non-record: Technique Taxonomy — Tier List, Interaction Effects, and BPB Verification Tools

by robbiebusinessacc
val_bpb
1.1428
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
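The sliding window entry above gives only the stride. As a rough sketch of how such an evaluation could be laid out, the helper below plans windows so that every token is scored exactly once while seeing as much left context as the window allows. The function name `sliding_windows`, the window length default (taken from eval_length: 2048), and the exact scoring rule are assumptions for illustration, not the PR's code.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) triples covering n_tokens.

    Tokens in [score_from, end) are scored in the window [start, end).
    Every token is scored exactly once; all windows after the first
    score only their newest `stride` tokens (hypothetical scheme).
    """
    if n_tokens <= 0:
        return
    # The first window has no earlier context, so it scores everything it holds.
    end = min(window, n_tokens)
    yield 0, end, 0
    scored = end
    while scored < n_tokens:
        new_end = min(scored + stride, n_tokens)
        start = max(0, new_end - window)  # keep at most `window` tokens of context
        yield start, new_end, scored
        scored = new_end


if __name__ == "__main__":
    for start, end, score_from in sliding_windows(2048 + 130):
        print(start, end, score_from)
```

A smaller stride gives each scored token more context on average, at the cost of more forward passes.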
Quantization
int6
bits: 6
scope: MLP weights
QAT
bits: 6
scope: int6
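The Quantization entries above list 6-bit weights but not the scheme. One common choice, shown here purely as an assumption, is symmetric per-row quantization with a single floating-point scale per row. The names `quantize_int6` and `dequantize`, and the clipping range [-31, 31], are illustrative and not taken from the PR.

```python
def quantize_int6(row):
    """Symmetric per-row quantization of floats to integers in [-31, 31].

    Assumed scheme: one scale per row, round to nearest, clip to the
    symmetric signed 6-bit range.
    """
    qmax = 31  # 2**(6 - 1) - 1
    max_abs = max((abs(w) for w in row), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int6(weights)
    print(q, scale, dequantize(q, scale))
```

With round-to-nearest, the reconstruction error per weight is at most half the scale, which is the error that quantization-aware training (the QAT entry) lets the model adapt to.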
Architecture
MLP3x
Expand MLP width from 2x to 3x.
parameters: null
SmearGate
Learned gate mixing current and previous token embeddings.
parameters: null
BigramHash
Hashed bigram pair representations.
parameters: null
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
LeakyReLU
LeakyReLU squared activation.
parameters: {"negative_slope":0.5}
Partial RoPE
Apply RoPE to a subset of dimensions.
parameters: {"dimensions":"16/64"}
VE128
Value embeddings with dimension 128.
parameters: {"dimensions":128}
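For the Partial RoPE entry in the Architecture list above ("16/64"), a minimal sketch of rotating only the first 16 of 64 head dimensions might look as follows. The frequency base of 10000, the rotate-half pairing of dimension i with i + 8, and the choice of the leading dimensions are all assumptions; the PR does not specify them.

```python
import math


def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries of vec.

    The remaining entries pass through unchanged. Pairing convention
    (i, i + rope_dims // 2) and `base` are illustrative assumptions.
    """
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s          # 2-D rotation of the pair (a, b)
        out[i + half] = a * s + b * c
    return out


if __name__ == "__main__":
    head = [float(i + 1) for i in range(64)]
    rotated = partial_rope(head, pos=5)
    print(rotated[:4], rotated[16:20])
```

Because each pair is rotated, the vector norm is preserved, and the untouched 48 dimensions carry position-independent content.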
Regularization
weight decay
parameters: {"weight_decay":0.04}
LN scale
parameters: {"scale":"1/sqrt(layer)"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
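The EMA entry above (decay 0.997) can be sketched as a running average of the weights that is updated after each optimizer step. The function name `ema_update` and the flat list representation of parameters are assumptions for illustration.

```python
def ema_update(ema, params, decay=0.997):
    """Return the new EMA weights: decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


if __name__ == "__main__":
    ema = [1.0, 2.0]
    for _ in range(3):
        ema = ema_update(ema, [0.0, 2.0])  # pretend the live weights stay fixed
    print(ema)
```

With decay 0.997, the average has an effective horizon of roughly 1 / (1 - 0.997), or about 333 steps.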
Initialization
OrthoInit
Orthogonal initialization for better-conditioned matrices.
Compression
lzma
level: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
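The warmdown entry above gives only the length of the schedule's tail. Assuming a constant learning rate followed by a linear decay to zero over the final 3500 steps, the multiplier could be computed as below. The linear shape, the function name `lr_multiplier`, and the `total_steps` argument are assumptions, not details from the PR.

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Learning-rate multiplier: 1.0 until warmdown starts, then linear to 0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


if __name__ == "__main__":
    total = 10000  # hypothetical run length
    for step in (0, 6500, 8250, 10000):
        print(step, lr_multiplier(step, total))
```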
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"chunk_size":256,"freeze_blocks":0}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048

Novel Contributions

  • Technique tier list with measured BPB deltas and source PRs
  • Interaction effects matrix showing sub-additive technique combinations
  • BPB verification checklist for formula and causal correctness
  • Collected n-gram legality rulings and organizer guidance in one place
  • Negative results index linking to prior research PRs
  • Parameter budget calculator with verified configurations
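
The checklist itself is not reproduced on this page. As a point of reference for the formula check it mentions, bits per byte is conventionally the total negative log-likelihood converted from nats to bits, divided by the number of bytes in the scored text. The sketch below assumes per-token losses in nats and a known byte count; the function name `bits_per_byte` is hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, total_bytes):
    """Compute BPB from per-token NLL values (in nats) and the text's byte count."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes


if __name__ == "__main__":
    # Four tokens, each costing exactly 1 bit, spanning 8 bytes of text.
    losses = [math.log(2)] * 4
    print(bits_per_byte(losses, total_bytes=8))
```

Two common mistakes this guards against are dividing by the token count instead of the byte count, and forgetting the nats-to-bits conversion.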