PR #73
closed
Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)
by NishantDahal
val_bpb
1.3281
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.3MB
Training Techniques
Architecture
MLP activation
Replaced ReLU² with SwiGLU gating in the MLP.
parameters: null
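The gating change can be sketched in plain Python. This is a minimal, framework-free illustration of SwiGLU versus the ReLU² baseline, not the PR's actual PyTorch module; the function names and the list-of-rows weight layout are assumptions for illustration.

```python
import math

def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """SwiGLU on a vector x: silu(x @ W_gate) * (x @ W_up).
    w_gate / w_up are weight matrices given as lists of rows."""
    gate = [silu(sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_gate)]
    up = [sum(xi * w for xi, w in zip(x, col)) for col in zip(*w_up)]
    return [g * u for g, u in zip(gate, up)]

def relu_squared(x):
    """Baseline activation that SwiGLU replaces: max(0, x)^2 elementwise."""
    return [max(0.0, v) ** 2 for v in x]
```

Note that SwiGLU needs two input projections (gate and up) where ReLU² needs one, which is part of why the hidden size had to shrink to stay under the artifact cap.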
MLP hidden size
Reduced MLP hidden dimension to fit artifact budget.
parameters: {"hidden_size":640}
layer recurrence
Applied the layer stack twice per forward pass (depth recurrence), doubling effective depth without adding parameters.
parameters: {"repeats":2}
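Depth recurrence with repeats=2 amounts to running the same parameterized stack twice in the forward pass. A minimal sketch, with layers modeled as plain callables rather than the PR's actual modules:

```python
def forward_with_recurrence(x, layers, repeats: int = 2):
    """Apply the same stack of layers `repeats` times.

    For repeats=2 this doubles effective depth while the parameter
    count (and hence artifact size) stays that of a single stack.
    """
    for _ in range(repeats):
        for layer in layers:
            x = layer(x)
    return x
```

Compute scales with `repeats` even though parameters do not, which is the trade this technique makes; per the Novel Contributions section below, it did not pay off here.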
LR Schedule
warmdown
Switched to a time-fraction based warmdown so the LR decays over the final fraction of wall-clock time rather than over a fixed iteration count.
parameters: {"warmdown_frac":0.2,"fix":"time-fraction based warmdown instead of iteration-based warmdown_iters=1200"}
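The fixed schedule can be sketched as a pure function of elapsed wall-clock fraction: the LR multiplier stays at 1.0 until the final `warmdown_frac` of the run, then decays linearly to 0. This is an illustrative reconstruction, not the exact code in train_gpt.py, and the function name is hypothetical:

```python
def lr_scale(elapsed_frac: float, warmdown_frac: float = 0.2) -> float:
    """LR multiplier given elapsed_frac in [0, 1] of the wall-clock budget.

    Constant at 1.0 for the first (1 - warmdown_frac) of the run, then
    linear warmdown to 0.0. Keying on time fraction avoids the bug where
    an iteration-based warmdown_iters can exceed the number of steps the
    wall-clock cap allows, making the LR decay from step 1.
    """
    start = 1.0 - warmdown_frac
    if elapsed_frac < start:
        return 1.0
    return max(0.0, (1.0 - elapsed_frac) / warmdown_frac)
```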
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Reduced the batch size to a quarter batch (131K tokens) to fit more optimizer steps into the fixed wall-clock budget.
parameters: {"train_batch_tokens":131072}
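The arithmetic behind the quarter batch, using only the numbers reported in this PR (131,072 tokens per batch, 1024-token sequences). The ~4x step count assumes tokens/sec throughput is roughly unchanged at the smaller batch, which is an assumption, not a measurement from the PR:

```python
tokens_per_batch = 131_072                    # quarter batch, from this PR
seq_len = 1024                                # train_length, from this PR

seqs_per_batch = tokens_per_batch // seq_len  # sequences per optimizer step
full_batch_tokens = 4 * tokens_per_batch      # implied stock batch size

# Under a fixed wall-clock cap with roughly constant tokens/sec,
# optimizer steps scale inversely with batch size:
step_multiplier = full_batch_tokens / tokens_per_batch
```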
other
Used gradient accumulation to increase effective batch size without increasing per-step memory.
parameters: {"grad_accum_steps":2}
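Gradient accumulation sums (suitably scaled) gradients over several micro-batches before each optimizer step, so the effective batch grows without the per-step memory footprint. A minimal sketch with scalar "gradients" standing in for real tensors; the function name and structure are illustrative, not the PR's code:

```python
def accumulate_and_step(micro_batch_grads, grad_accum_steps: int = 2):
    """Average gradients over grad_accum_steps micro-batches, then 'step'.

    Each micro-batch gradient is divided by grad_accum_steps so the
    accumulated sum equals the mean gradient over the effective batch.
    Returns the gradient used at each optimizer step.
    """
    accum = 0.0
    stepped_grads = []
    for i, grad in enumerate(micro_batch_grads, start=1):
        accum += grad / grad_accum_steps      # scale so the sum is a mean
        if i % grad_accum_steps == 0:
            stepped_grads.append(accum)       # optimizer.step() would go here
            accum = 0.0                       # optimizer.zero_grad() analogue
    return stepped_grads
```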
Regularization
weight decay
parameters: {"weight_decay":0.01}
Novel Contributions
- Identified and fixed a warmdown schedule bug in the stock train_gpt.py, where the iteration-based warmdown caused the LR to decay from step 1 under the wall-clock cap.
- Applied SwiGLU activation in place of ReLU².
- Used a quarter batch size to obtain more optimizer steps within the same wall-clock budget.
- Used gradient accumulation to increase the effective batch size.
- Explored reduced MLP hidden size to stay within the 16MB artifact cap.
- Reported negative results for weight decay and layer recurrence.