PR #375
Non-record: Negative results & insights from 24hrs on 8xH100
by charmquark1984
val_bpb
1.1257
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5MB
Training Techniques
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
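The sliding-window evaluation above can be sketched as follows. `logprob_fn` is a hypothetical model hook (the PR's actual interface is not shown); the window advances by `stride` tokens and only the newest `stride` tokens are scored, so each scored token sees close to the full context:

```python
import math

def sliding_window_bpb(token_ids, logprob_fn, context_length=2048, stride=64):
    """Sliding-window eval with stride=64, context_length=2048 as listed.
    logprob_fn(window, n_score) is a hypothetical hook returning per-token
    log-probabilities (nats) for the last n_score tokens of the window."""
    total_nll, n_scored, pos = 0.0, 0, 0
    while pos < len(token_ids):
        end = min(pos + stride, len(token_ids))
        window = token_ids[max(0, end - context_length):end]
        n_score = end - pos                  # only the newly revealed tokens
        total_nll -= sum(logprob_fn(window, n_score))
        n_scored += n_score
        pos = end
    # bits per byte, assuming one token per byte in this toy sketch
    return total_nll / (math.log(2) * n_scored)

# Toy check: a uniform 256-way model must score exactly 8 bits per byte.
uniform = lambda window, n: [math.log(1.0 / 256.0)] * n
bpb = sliding_window_bpb(list(range(300)), uniform, context_length=128, stride=64)
```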
Quantization
int6
bits: 6
scope: all
int4
bits: 4
scope: all
mixed int4/int5
bits: null
scope: MLP and attention
QAT
bits: 4
scope: full-run
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start_step":null}
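The EMA update with decay 0.997 is the standard exponential average; a minimal sketch over plain lists of floats (SWA, by contrast, is a uniform average of checkpoints from `start_step` onward):

```python
def ema_update(ema, weights, decay=0.997):
    """In-place EMA of weights; decay=0.997 matches the entry above."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w
    return ema

ema = [0.0]
for _ in range(3):
    ema_update(ema, [1.0], decay=0.5)   # 0.0 -> 0.5 -> 0.75 -> 0.875
```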
Optimizer
Muon
weight_decay: 0.03
momentum: null
other_params: null
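For context, Muon orthogonalizes the momentum buffer of each 2-D weight with a quintic Newton-Schulz iteration before stepping; the coefficients below follow Keller Jordan's reference implementation, and the momentum value is an assumption (the PR lists it as null):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 (an approximate orthogonal
    factor) via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so singular values <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                             # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.03):
    """One Muon update for a 2-D weight: momentum, orthogonalize, then a
    decoupled weight-decay step (weight_decay=0.03 as listed; lr and
    momentum here are illustrative assumptions)."""
    buf[...] = momentum * buf + grad
    update = newton_schulz_orthogonalize(buf)
    W -= lr * (update + weight_decay * W)
    return W

# A scaled identity: singular values are pushed toward 1.
X = newton_schulz_orthogonalize(np.eye(3) * 2.0)
```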
Architecture
XSA
Attention/sequence modeling component used in the PR #315 base model.
parameters: null
MLP3x
Three-times wider MLP blocks in the base Transformer.
parameters: {"multiplier":3}
BigramHash
Hashes consecutive token pairs into learned embedding buckets.
parameters: {"buckets":4096}
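BigramHash with buckets=4096 can be sketched as below; the multiplier-based hash is an assumption, since the PR does not specify the hash function:

```python
import numpy as np

def bigram_hash_ids(tokens, buckets=4096, mult=1000003):
    """Hash each consecutive token pair (t-1, t) into one of `buckets`
    learned embedding buckets; the hash itself is an illustrative choice."""
    return [((a * mult) ^ b) % buckets for a, b in zip(tokens, tokens[1:])]

# Each position t >= 1 gets an extra learned embedding row for its bigram.
emb = np.zeros((4096, 8))                  # learned bucket embeddings
ids = bigram_hash_ids([5, 17, 17, 9])      # 3 bigrams for 4 tokens
extra = emb[ids]                            # added to the token embeddings
```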
Test-Time Training
causal TTT
parameters: {"learning_rate":0.0001,"chunk_size":32000}
causal TTT
parameters: {"learning_rate":0.01,"scope":"last 2 blocks MLP only"}
Reptile meta-learning TTT
parameters: {"inner_lr":0.1,"outer_lr":0.01,"inner_steps":3,"budget_fraction":0.2}
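The causal TTT variants above share one loop shape: score each chunk with the current weights, then take a gradient step on that chunk before moving on, so no chunk is scored after the model has trained on it. A minimal sketch with hypothetical `loss_and_grad` and `sgd_step` hooks:

```python
def causal_ttt_eval(tokens, model, loss_and_grad, sgd_step,
                    chunk_size=32000, lr=1e-4):
    """Causal test-time training: evaluate a chunk first, then adapt on it.
    chunk_size=32000 and lr=1e-4 match the first causal TTT entry above;
    model/loss_and_grad/sgd_step are hypothetical interfaces."""
    total_loss, n = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        loss, grads = loss_and_grad(model, chunk)   # score first (causal)
        total_loss += loss * len(chunk)
        n += len(chunk)
        sgd_step(model, grads, lr)                  # then adapt
    return total_loss / max(n, 1)
```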
Other
other
Multi-token prediction auxiliary heads predicting tokens 2+ steps ahead during training.
parameters: {"num_heads":2,"loss_weight":0.3}
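The multi-token prediction loss with num_heads=2 and loss_weight=0.3 can be sketched as follows; the exact head layout is an assumption:

```python
import numpy as np

def softmax_xent(logits, targets):
    """Mean cross-entropy over positions (logits: [T, V])."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def mtp_loss(main_logits, aux_logits_list, tokens, loss_weight=0.3):
    """Main head predicts token t+1; auxiliary head k predicts t+(k+2).
    Each auxiliary loss is weighted by loss_weight=0.3 as listed."""
    loss = softmax_xent(main_logits[:-1], tokens[1:])
    for k, aux_logits in enumerate(aux_logits_list):
        offset = k + 2                      # predict 2, 3, ... steps ahead
        loss += loss_weight * softmax_xent(aux_logits[:-offset], tokens[offset:])
    return loss
```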
other
Memory tokens: 64 learnable prefix embeddings prepended during training and evaluation.
parameters: {"num_tokens":64}
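Memory tokens amount to prepending 64 learnable embeddings to every sequence, at train and eval time alike; a minimal sketch (dimensions other than num_tokens=64 are illustrative):

```python
import numpy as np

d_model, num_tokens = 16, 64
memory = np.random.default_rng(0).normal(size=(num_tokens, d_model)) * 0.02

def with_memory_prefix(x):
    """Prepend the 64 learnable memory embeddings to a batch x: [B, T, D].
    The model attends to them, but their positions produce no loss."""
    prefix = np.broadcast_to(memory, (x.shape[0], num_tokens, d_model))
    return np.concatenate([prefix, x], axis=1)          # [B, 64 + T, D]
```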
other
Gradient-guided mixed-bit quantization based on accumulated squared gradients.
parameters: {"top_percent_int7":10,"middle_percent_int6":70,"bottom_percent_int5":20}
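The gradient-guided assignment above ranks weights by accumulated squared gradients and hands out bit widths by percentile, which can be sketched directly from the listed percentages:

```python
import numpy as np

def assign_bits(grad_sq_accum, top=10, middle=70, bottom=20):
    """Top 10% of weights by accumulated squared gradient get int7, the
    next 70% int6, the bottom 20% int5, per the parameters listed above."""
    flat = grad_sq_accum.ravel()
    order = np.argsort(-flat)              # descending sensitivity
    bits = np.empty(flat.size, dtype=np.int8)
    n_top = int(flat.size * top / 100)
    n_mid = int(flat.size * middle / 100)
    bits[order[:n_top]] = 7
    bits[order[n_top:n_top + n_mid]] = 6
    bits[order[n_top + n_mid:]] = 5
    return bits.reshape(grad_sq_accum.shape)
```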
other
Cautious weight decay that applies decay only when gradient and weight have the same sign.
parameters: null
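Cautious weight decay, as described, masks the decay term by sign agreement between gradient and weight; a minimal sketch (lr and wd values here are illustrative):

```python
import numpy as np

def cautious_weight_decay(w, grad, lr=0.02, wd=0.03):
    """Shrink a weight only where gradient and weight share a sign, so
    decay never fights the optimizer's own direction on that coordinate."""
    mask = (np.sign(grad) == np.sign(w)) & (w != 0)
    return w - lr * wd * w * mask
```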
other
1M batch size training.
parameters: {"train_batch_tokens":1048576}
other
786K batch size training.
parameters: {"train_batch_tokens":786432}
other
524K batch size training.
parameters: {"train_batch_tokens":524288}
other
cuDNN scaled dot-product attention backend instead of Flash SDP.
parameters: null
other
Canon layers from Allen-Zhu's Physics of Language Models.
parameters: {"K":3}
other
Full-run quantization-aware training with STE fake quantization throughout training.
parameters: null
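The STE fake quantization used in full-run QAT differs from the storage-time quantization above in that the forward pass sees the rounded grid while the backward pass treats rounding as the identity; in framework terms the output is `w + stop_gradient(q(w) - w)`. A numpy sketch of the forward pass:

```python
import numpy as np

def fake_quant_ste(w, bits=4):
    """Straight-through fake quantization: forward on the int4 grid,
    backward through `w` unchanged ((w_q - w) is treated as a constant)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w)                   # == w_q in the forward pass
```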
other
Flash Attention 3 / Hopper-native attention backend.
parameters: null
Regularization
weight decay
parameters: {"value":0.035}
weight decay
parameters: {"value":0.04}
weight decay
parameters: {"value":0.041}
weight decay
parameters: {"value":0.042}
weight decay
parameters: {"value":0.043}
weight decay
parameters: {"value":0.045}
weight decay
parameters: {"value":0.05}
label smoothing
parameters: {"value":0.05}
L1 regularization
parameters: {"lambda":0.0001}
L1 regularization
parameters: {"lambda":0.000001}
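Of the regularizers above, label smoothing with value 0.05 changes the loss itself; a minimal sketch using the uniform-mixture convention (the PR does not specify which smoothing variant was used):

```python
import numpy as np

def smoothed_xent(logits, target, eps=0.05):
    """Label-smoothed cross-entropy for one position: the target
    distribution puts 1-eps on the true token and spreads eps uniformly
    over the vocabulary (one common convention; an assumption here)."""
    V = logits.shape[-1]
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    soft = np.full(V, eps / V)
    soft[target] += 1.0 - eps
    return -(soft * logp).sum()
```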
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Systematic negative-results study of 13 techniques on top of the PR #315 base model
- Verified that EMA outperforms SWA by about 0.003 BPB
- Showed that weight decay can be used as a precise knob to control compressed artifact size
- Demonstrated that 786K batch size outperforms 524K batch size under the 10-minute wallclock constraint
- Found that Flash Attention 3 on Hopper yields better wallclock performance than slower attention backends in this setting
- Quantified the throughput cost of many techniques, showing that small per-step overheads can dominate final BPB
- Documented that the INT4 quantization gap is too large to be offset by its parameter-count advantage in this track