PR #1227

Status: open

Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie

by himanshudongre
val_bpb: 1.4841
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 16 MB

Training Techniques

Architecture
  • RoPE: partial RoPE embeddings using only 16 dimensions. parameters: {"dimensions":16}
  • weight tying: tied input and output embeddings.
  • GQA: compared full multi-head attention against grouped query attention. parameters: {"heads":6}
  • MLP3x: compared 2x MLP width against 3x MLP width. parameters: {"multiplier":2}
  • SSM: S4D-Lin hybrid replacing attention in the bottom layers. parameters: {"layers":2}
  • DenseFormer: dynamic weighted averaging across layers.
  • Monarch Matrices: structured low-rank/compressed matrix parameterization.
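The architecture rows are terse; as one concrete example, partial RoPE with {"dimensions":16} rotates only the first 16 dimensions of each head and passes the remaining dimensions through unchanged. A minimal numpy sketch (function name, shapes, and the rotate-first-half pairing are assumptions, not taken from the PR):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims`
    dimensions of x and leave the rest untouched.
    x: (seq_len, head_dim) array for a single attention head."""
    seq_len, head_dim = x.shape
    assert rope_dims % 2 == 0 and rope_dims <= head_dim
    half = rope_dims // 2
    # One frequency per rotated pair, as in standard RoPE.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Pass-through for the non-rotated tail of the head dimension.
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

The point of the partial variant is that dimensions beyond 16 carry no positional rotation, so only a small slice of each head is position-dependent.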
Quantization
  • QAT: bits: 5, scope: all
  • STE QAT: bits: 5, scope: all
  • GPTQ: bits: 5, scope: all
  • int6: bits: 6, scope: all
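"bits: 5, scope: all" suggests fake-quantizing all weights to 5 bits during training. A minimal sketch of the fake-quant step used under a straight-through estimator, assuming symmetric per-tensor scaling (the PR does not specify the scheme):

```python
import numpy as np

def fake_quantize(w, bits=5):
    """Symmetric per-tensor fake quantization: snap weights to a
    signed (2**bits)-level integer grid, then de-quantize to float."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for 5 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Under a straight-through estimator the forward pass uses the
# quantized weights while gradients treat quantization as identity:
#   w_q = w + stop_gradient(fake_quantize(w) - w)
```

The rounding noise injected every step is the plausible mechanism behind the PR's claim that QAT acts as a regularizer.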
Optimizer
  • AdamW: weight_decay: 0.1, lr: 0.0003
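The optimizer row maps onto a standard decoupled-weight-decay Adam step (lr 3e-4, weight_decay 0.1). A single-parameter numpy sketch, with the betas/eps defaults assumed rather than stated in the PR:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update. Weight decay is decoupled: it is applied
    directly to w rather than folded into the gradient."""
    m = beta1 * m + (1 - beta1) * g              # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g          # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```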
LR Schedule
  • cosine decay
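The schedule row gives no parameters; the usual cosine decay from the base learning rate down to a floor looks like this (min_lr=0 and no warmup are assumptions):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0,
    min_lr at total_steps, following half a cosine in between."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```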
Regularization
  • weight decay: {"value":0.1}
  • logit softcap
  • LN scale
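The logit softcap row carries no parameters; the common formulation (popularized by Gemma 2) squashes logits through a scaled tanh. A sketch with an assumed cap of 30, since the PR states no value:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via cap * tanh(x / cap).
    Near zero it is approximately the identity; extremes saturate,
    which limits how confident any single logit can become."""
    return cap * np.tanh(logits / cap)
```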
Test-Time Training
  • score-first TTT: {"epochs":10,"optimizer":"SGD","schedule":"cosine decay","rank":8}
Evaluation
  • entropy-adaptive mixing
  • sliding window eval
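Sliding window eval typically means scoring a long sequence with overlapping context windows, counting loss only on each window's fresh tokens so every token is scored exactly once with the longest affordable context. The index bookkeeping, with window and stride values assumed for illustration:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (begin, end, score_from) spans. Each window starts
    `stride` tokens after the previous one; loss is taken only on
    tokens in [score_from, end), so spans tile [0, n_tokens) exactly."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```

Each scored token after the first window sees at least `window - stride` tokens of preceding context, which is what makes this evaluation less pessimistic than chopping the text into disjoint blocks.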
Other
  • Dirichlet CTW-6 Bayesian n-gram mixing for eval-time prediction. parameters: {"order":6}
  • Knowledge distillation from a larger teacher to a smaller student. parameters: {"temperature":2,"alpha":0.5}
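The distillation row's temperature 2 and alpha 0.5 match the standard Hinton-style objective: alpha blends hard-label cross-entropy with a temperature-softened KL term scaled by T². A single-example numpy sketch (the PR does not show its loss code, so treat this as the textbook form):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """alpha * CE(student, hard label)
       + (1 - alpha) * T^2 * KL(teacher_T || student_T),
    for one example with 1-D logit vectors."""
    hard = -np.log(softmax(student_logits)[target] + 1e-12)
    p_t = softmax(teacher_logits / T)
    log_ratio = np.log(p_t + 1e-12) - np.log(softmax(student_logits / T) + 1e-12)
    soft = (p_t * log_ratio).sum()          # KL divergence, >= 0
    return alpha * hard + (1 - alpha) * T * T * soft
```

The T² factor keeps the soft-target gradient magnitude comparable to the hard-label term as the temperature changes.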

Novel Contributions

  • Small-scale local experiments can be systematically misleading and even point in the opposite direction from GPU-scale results.
  • Quantization-aware training acts as a regularizer and can outperform float32 training.
  • PAQ-style logistic mixing is fundamentally broken for multi-class language modeling.
  • Dirichlet CTW Bayesian n-gram mixing is properly normalized and effective at eval time.
  • Knowledge distillation within a single competition run can improve the smaller student model.
  • Score-first test-time training with a cosine schedule provides a small but real gain.