PR #375
Non-record: Negative results & insights from 24hrs on 8xH100
by charmquark1984
val_bpb
1.1257
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5MB
Training Techniques
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
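The sliding-window evaluation above can be sketched as follows. `logprob_fn` is a hypothetical model hook (the PR's actual interface is not shown); the window advances by `stride` tokens and only the newest `stride` tokens are scored, so each scored token sees close to the full context:

```python
import math

def sliding_window_bpb(token_ids, logprob_fn, context_length=2048, stride=64):
    """Sliding-window eval with stride=64, context_length=2048 as listed.
    logprob_fn(window, n_score) is a hypothetical hook returning per-token
    log-probabilities (nats) for the last n_score tokens of the window."""
    total_nll, n_scored, pos = 0.0, 0, 0
    while pos < len(token_ids):
        end = min(pos + stride, len(token_ids))
        window = token_ids[max(0, end - context_length):end]
        n_score = end - pos                  # only the newly revealed tokens
        total_nll -= sum(logprob_fn(window, n_score))
        n_scored += n_score
        pos = end
    # bits per byte, assuming one token per byte in this toy sketch
    return total_nll / (math.log(2) * n_scored)

# Toy check: a uniform 256-way model must score exactly 8 bits per byte.
uniform = lambda window, n: [math.log(1.0 / 256.0)] * n
bpb = sliding_window_bpb(list(range(300)), uniform, context_length=128, stride=64)
```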
Quantization
int6
bits: 6
scope: all
int4
bits: 4
scope: all
mixed int4/int5
bits: null
scope: MLP and attention
QAT
bits: 4
scope: full-run
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start_step":null}
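The EMA update with decay 0.997 is the standard exponential average; a minimal sketch over plain lists of floats (SWA, by contrast, is a uniform average of checkpoints from `start_step` onward):

```python
def ema_update(ema, weights, decay=0.997):
    """In-place EMA of weights; decay=0.997 matches the entry above."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w
    return ema

ema = [0.0]
for _ in range(3):
    ema_update(ema, [1.0], decay=0.5)   # 0.0 -> 0.5 -> 0.75 -> 0.875
```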
Optimizer
Muon
weight_decay: 0.03
momentum: null
other_params: null
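For context, Muon orthogonalizes the momentum buffer of each 2-D weight with a quintic Newton-Schulz iteration before stepping; the coefficients below follow Keller Jordan's reference implementation, and the momentum value is an assumption (the PR lists it as null):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 (an approximate orthogonal
    factor) via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize so singular values <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                             # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.03):
    """One Muon update for a 2-D weight: momentum, orthogonalize, then a
    decoupled weight-decay step (weight_decay=0.03 as listed; lr and
    momentum here are illustrative assumptions)."""
    buf[...] = momentum * buf + grad
    update = newton_schulz_orthogonalize(buf)
    W -= lr * (update + weight_decay * W)
    return W

# A scaled identity: singular values are pushed toward 1.
X = newton_schulz_orthogonalize(np.eye(3) * 2.0)
```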
Architecture
XSA
Attention/sequence modeling component used in the PR #315 base model.
parameters: null
MLP3x
Three-times wider MLP blocks in the base Transformer.
parameters: {"multiplier":3}
BigramHash
Hashes consecutive token pairs into learned embedding buckets.
parameters: {"buckets":4096}
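BigramHash with buckets=4096 can be sketched as below; the multiplier-based hash is an assumption, since the PR does not specify the hash function:

```python
import numpy as np

def bigram_hash_ids(tokens, buckets=4096, mult=1000003):
    """Hash each consecutive token pair (t-1, t) into one of `buckets`
    learned embedding buckets; the hash itself is an illustrative choice."""
    return [((a * mult) ^ b) % buckets for a, b in zip(tokens, tokens[1:])]

# Each position t >= 1 gets an extra learned embedding row for its bigram.
emb = np.zeros((4096, 8))                  # learned bucket embeddings
ids = bigram_hash_ids([5, 17, 17, 9])      # 3 bigrams for 4 tokens
extra = emb[ids]                            # added to the token embeddings
```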
Test-Time Training
causal TTT
parameters: {"learning_rate":0.0001,"chunk_size":32000}
causal TTT
parameters: {"learning_rate":0.01,"scope":"last 2 blocks MLP only"}
Reptile meta-learning TTT
parameters: {"inner_lr":0.1,"outer_lr":0.01,"inner_steps":3,"budget_fraction":0.2}
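The causal TTT variants above share one loop shape: score each chunk with the current weights, then take a gradient step on that chunk before moving on, so no chunk is scored after the model has trained on it. A minimal sketch with hypothetical `loss_and_grad` and `sgd_step` hooks:

```python
def causal_ttt_eval(tokens, model, loss_and_grad, sgd_step,
                    chunk_size=32000, lr=1e-4):
    """Causal test-time training: evaluate a chunk first, then adapt on it.
    chunk_size=32000 and lr=1e-4 match the first causal TTT entry above;
    model/loss_and_grad/sgd_step are hypothetical interfaces."""
    total_loss, n = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        loss, grads = loss_and_grad(model, chunk)   # score first (causal)
        total_loss += loss * len(chunk)
        n += len(chunk)
        sgd_step(model, grads, lr)                  # then adapt
    return total_loss / max(n, 1)
```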
Other
other
Multi-token prediction auxiliary heads predicting tokens 2+ steps ahead during training.
parameters: {"num_heads":2,"loss_weight":0.3}
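The multi-token prediction loss with num_heads=2 and loss_weight=0.3 can be sketched as follows; the exact head layout is an assumption:

```python
import numpy as np

def softmax_xent(logits, targets):
    """Mean cross-entropy over positions (logits: [T, V])."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def mtp_loss(main_logits, aux_logits_list, tokens, loss_weight=0.3):
    """Main head predicts token t+1; auxiliary head k predicts t+(k+2).
    Each auxiliary loss is weighted by loss_weight=0.3 as listed."""
    loss = softmax_xent(main_logits[:-1], tokens[1:])
    for k, aux_logits in enumerate(aux_logits_list):
        offset = k + 2                      # predict 2, 3, ... steps ahead
        loss += loss_weight * softmax_xent(aux_logits[:-offset], tokens[offset:])
    return loss
```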
other
Memory tokens: 64 learnable prefix embeddings prepended during training and evaluation.
parameters: {"num_tokens":64}
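Memory tokens amount to prepending 64 learnable embeddings to every sequence, at train and eval time alike; a minimal sketch (dimensions other than num_tokens=64 are illustrative):

```python
import numpy as np

d_model, num_tokens = 16, 64
memory = np.random.default_rng(0).normal(size=(num_tokens, d_model)) * 0.02

def with_memory_prefix(x):
    """Prepend the 64 learnable memory embeddings to a batch x: [B, T, D].
    The model attends to them, but their positions produce no loss."""
    prefix = np.broadcast_to(memory, (x.shape[0], num_tokens, d_model))
    return np.concatenate([prefix, x], axis=1)          # [B, 64 + T, D]
```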
other
Gradient-guided mixed-bit quantization based on accumulated squared gradients.
parameters: {"top_percent_int7":10,"middle_percent_int6":70,"bottom_percent_int5":20}
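The gradient-guided assignment above ranks weights by accumulated squared gradients and hands out bit widths by percentile, which can be sketched directly from the listed percentages:

```python
import numpy as np

def assign_bits(grad_sq_accum, top=10, middle=70, bottom=20):
    """Top 10% of weights by accumulated squared gradient get int7, the
    next 70% int6, the bottom 20% int5, per the parameters listed above."""
    flat = grad_sq_accum.ravel()
    order = np.argsort(-flat)              # descending sensitivity
    bits = np.empty(flat.size, dtype=np.int8)
    n_top = int(flat.size * top / 100)
    n_mid = int(flat.size * middle / 100)
    bits[order[:n_top]] = 7
    bits[order[n_top:n_top + n_mid]] = 6
    bits[order[n_top + n_mid:]] = 5
    return bits.reshape(grad_sq_accum.shape)
```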
other
Cautious weight decay that applies decay only when gradient and weight have the same sign.
parameters: null
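Cautious weight decay, as described, masks the decay term by sign agreement between gradient and weight; a minimal sketch (lr and wd values here are illustrative):

```python
import numpy as np

def cautious_weight_decay(w, grad, lr=0.02, wd=0.03):
    """Shrink a weight only where gradient and weight share a sign, so
    decay never fights the optimizer's own direction on that coordinate."""
    mask = (np.sign(grad) == np.sign(w)) & (w != 0)
    return w - lr * wd * w * mask
```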
other
1M batch size training.
parameters: {"train_batch_tokens":1048576}
other
786K batch size training.
parameters: {"train_batch_tokens":786432}
other
524K batch size training.
parameters: {"train_batch_tokens":524288}
other
cuDNN scaled dot-product attention backend instead of Flash SDP.
parameters: null
other
Canon layers from Allen-Zhu's Physics of Language Models.
parameters: {"K":3}
other
Full-run quantization-aware training with STE fake quantization throughout training.
parameters: null
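The STE fake quantization used in full-run QAT differs from the storage-time quantization above in that the forward pass sees the rounded grid while the backward pass treats rounding as the identity; in framework terms the output is `w + stop_gradient(q(w) - w)`. A numpy sketch of the forward pass:

```python
import numpy as np

def fake_quant_ste(w, bits=4):
    """Straight-through fake quantization: forward on the int4 grid,
    backward through `w` unchanged ((w_q - w) is treated as a constant)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w)                   # == w_q in the forward pass
```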
other
Flash Attention 3 / Hopper-native attention backend.
parameters: null
Regularization
weight decay
parameters: {"value":0.035}
weight decay
parameters: {"value":0.04}
weight decay
parameters: {"value":0.041}
weight decay
parameters: {"value":0.042}
weight decay
parameters: {"value":0.043}
weight decay
parameters: {"value":0.045}
weight decay
parameters: {"value":0.05}
label smoothing
parameters: {"value":0.05}
L1 regularization
parameters: {"lambda":0.0001}
L1 regularization
parameters: {"lambda":0.000001}
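Of the regularizers above, label smoothing with value 0.05 changes the loss itself; a minimal sketch using the uniform-mixture convention (the PR does not specify which smoothing variant was used):

```python
import numpy as np

def smoothed_xent(logits, target, eps=0.05):
    """Label-smoothed cross-entropy for one position: the target
    distribution puts 1-eps on the true token and spreads eps uniformly
    over the vocabulary (one common convention; an assumption here)."""
    V = logits.shape[-1]
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    soft = np.full(V, eps / V)
    soft[target] += 1.0 - eps
    return -(soft * logp).sum()
```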
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Systematic negative-results study of 13 techniques on top of the PR #315 base model
- Verified that EMA outperforms SWA by about 0.003 BPB
- Showed that weight decay can be used as a precise knob to control compressed artifact size
- Demonstrated that 786K batch size outperforms 524K batch size under the 10-minute wallclock constraint
- Found that Flash Attention 3 on Hopper yields better wallclock performance than slower attention backends in this setting
- Quantified the throughput cost of many techniques, showing that small per-step overheads can dominate final BPB
- Documented that the INT4 quantization gap is too large to be offset by its parameter-count advantage in this track