PR #1227

Status: open

Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie

by himanshudongre
val_bpb: 1.4841
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 16 MB

Training Techniques

Architecture
  • RoPE: partial RoPE embeddings using only 16 dimensions. parameters: {"dimensions":16}
  • weight tying: tied input and output embeddings.
  • GQA: compared full multi-head attention against grouped query attention. parameters: {"heads":6}
  • MLP3x: compared 2x MLP width against 3x MLP width. parameters: {"multiplier":2}
  • SSM: S4D-Lin hybrid replacing attention in the bottom layers. parameters: {"layers":2}
  • DenseFormer: dynamic weighted averaging across layers.
  • Monarch Matrices: structured low-rank/compressed matrix parameterization.
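The architecture rows are terse; as one concrete example, partial RoPE with {"dimensions":16} rotates only the first 16 dimensions of each head and passes the remaining dimensions through unchanged. A minimal numpy sketch (function name, shapes, and the rotate-first-half pairing are assumptions, not taken from the PR):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims`
    dimensions of x and leave the rest untouched.
    x: (seq_len, head_dim) array for a single attention head."""
    seq_len, head_dim = x.shape
    assert rope_dims % 2 == 0 and rope_dims <= head_dim
    half = rope_dims // 2
    # One frequency per rotated pair, as in standard RoPE.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Pass-through for the non-rotated tail of the head dimension.
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The point of the partial variant is that dimensions beyond 16 carry no positional rotation, so only a small slice of each head is position-dependent.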
Quantization
  • QAT: bits: 5, scope: all
  • STE QAT: bits: 5, scope: all
  • GPTQ: bits: 5, scope: all
  • int6: bits: 6, scope: all
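"bits: 5, scope: all" suggests fake-quantizing all weights to 5 bits during training. A minimal sketch of the fake-quant step used under a straight-through estimator, assuming symmetric per-tensor scaling (the PR does not specify the scheme):

```python
import numpy as np

def fake_quantize(w, bits=5):
    """Symmetric per-tensor fake quantization: snap weights to a
    signed (2**bits)-level integer grid, then de-quantize to float."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for 5 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Under a straight-through estimator the forward pass uses the
# quantized weights while gradients treat quantization as identity:
#   w_q = w + stop_gradient(fake_quantize(w) - w)
```

The rounding noise injected every step is the plausible mechanism behind the PR's claim that QAT acts as a regularizer.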
Optimizer
  • AdamW: weight_decay: 0.1, lr: 0.0003
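The optimizer row maps onto a standard decoupled-weight-decay Adam step (lr 3e-4, weight_decay 0.1). A single-parameter numpy sketch, with the betas/eps defaults assumed rather than stated in the PR:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update. Weight decay is decoupled: it is applied
    directly to w rather than folded into the gradient."""
    m = beta1 * m + (1 - beta1) * g              # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g          # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```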
LR Schedule
  • cosine decay
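The schedule row gives no parameters; the usual cosine decay from the base learning rate down to a floor looks like this (min_lr=0 and no warmup are assumptions):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0,
    min_lr at total_steps, following half a cosine in between."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```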
Regularization
  • weight decay: {"value":0.1}
  • logit softcap
  • LN scale
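The logit softcap row carries no parameters; the common formulation (popularized by Gemma 2) squashes logits through a scaled tanh. A sketch with an assumed cap of 30, since the PR states no value:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via cap * tanh(x / cap).
    Near zero it is approximately the identity; extremes saturate,
    which limits how confident any single logit can become."""
    return cap * np.tanh(logits / cap)
```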
Test-Time Training
  • score-first TTT: {"epochs":10,"optimizer":"SGD","schedule":"cosine decay","rank":8}
Evaluation
  • entropy-adaptive mixing
  • sliding window eval
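Sliding window eval typically means scoring a long sequence with overlapping context windows, counting loss only on each window's fresh tokens so every token is scored exactly once with the longest affordable context. The index bookkeeping, with window and stride values assumed for illustration:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (begin, end, score_from) spans. Each window starts
    `stride` tokens after the previous one; loss is taken only on
    tokens in [score_from, end), so spans tile [0, n_tokens) exactly."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```

Each scored token after the first window sees at least `window - stride` tokens of preceding context, which is what makes this evaluation less pessimistic than chopping the text into disjoint blocks.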
Other
  • Dirichlet CTW-6 Bayesian n-gram mixing for eval-time prediction. parameters: {"order":6}
  • Knowledge distillation from a larger teacher to a smaller student. parameters: {"temperature":2,"alpha":0.5}
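The distillation row's temperature 2 and alpha 0.5 match the standard Hinton-style objective: alpha blends hard-label cross-entropy with a temperature-softened KL term scaled by T². A single-example numpy sketch (the PR does not show its loss code, so treat this as the textbook form):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """alpha * CE(student, hard label)
       + (1 - alpha) * T^2 * KL(teacher_T || student_T),
    for one example with 1-D logit vectors."""
    hard = -np.log(softmax(student_logits)[target] + 1e-12)
    p_t = softmax(teacher_logits / T)
    log_ratio = np.log(p_t + 1e-12) - np.log(softmax(student_logits / T) + 1e-12)
    soft = (p_t * log_ratio).sum()          # KL divergence, >= 0
    return alpha * hard + (1 - alpha) * T * T * soft
```

The T² factor keeps the soft-target gradient magnitude comparable to the hard-label term as the temperature changes.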

Novel Contributions

  • Small-scale local experiments can be systematically misleading and even point in the opposite direction from GPU-scale results.
  • Quantization-aware training acts as a regularizer and can outperform float32 training.
  • PAQ-style logistic mixing is fundamentally broken for multi-class language modeling.
  • Dirichlet CTW Bayesian n-gram mixing is properly normalized and effective at eval time.
  • Knowledge distillation within a single competition run can improve the smaller student model.
  • Score-first test-time training with a cosine schedule provides a small but real gain.