PR #73
closed
Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)
by NishantDahal
val_bpb
1.3281
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.3MB
Training Techniques
Architecture
MLP activation
Replaced ReLU² with SwiGLU gating in the MLP.
parameters: null
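The gating change can be sketched in plain Python. This is a minimal, framework-free illustration of SwiGLU versus the ReLU² baseline, not the PR's actual PyTorch module; the function names and the list-of-rows weight layout are assumptions for illustration.

```python
import math

def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """SwiGLU on a vector x: silu(x @ W_gate) * (x @ W_up).
    w_gate / w_up are weight matrices given as lists of rows."""
    gate = [silu(sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_gate)]
    up = [sum(xi * w for xi, w in zip(x, col)) for col in zip(*w_up)]
    return [g * u for g, u in zip(gate, up)]

def relu_squared(x):
    """Baseline activation that SwiGLU replaces: max(0, x)^2 elementwise."""
    return [max(0.0, v) ** 2 for v in x]
```

Note that SwiGLU needs two input projections (gate and up) where ReLU² needs one, which is part of why the hidden size had to shrink to stay under the artifact cap.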
MLP hidden size
Reduced MLP hidden dimension to fit artifact budget.
parameters: {"hidden_size":640}
layer recurrence
Applied the layer stack twice per forward pass (depth recurrence), doubling effective depth without adding parameters.
parameters: {"repeats":2}
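Depth recurrence with repeats=2 amounts to running the same parameterized stack twice in the forward pass. A minimal sketch, with layers modeled as plain callables rather than the PR's actual modules:

```python
def forward_with_recurrence(x, layers, repeats: int = 2):
    """Apply the same stack of layers `repeats` times.

    For repeats=2 this doubles effective depth while the parameter
    count (and hence artifact size) stays that of a single stack.
    """
    for _ in range(repeats):
        for layer in layers:
            x = layer(x)
    return x
```

Compute scales with `repeats` even though parameters do not, which is the trade this technique makes; per the Novel Contributions section below, it did not pay off here.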
LR Schedule
warmdown
Switched to a time-fraction based warmdown so the LR decays over the final fraction of wall-clock time rather than over a fixed iteration count.
parameters: {"warmdown_frac":0.2,"fix":"time-fraction based warmdown instead of iteration-based warmdown_iters=1200"}
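The fixed schedule can be sketched as a pure function of elapsed wall-clock fraction: the LR multiplier stays at 1.0 until the final `warmdown_frac` of the run, then decays linearly to 0. This is an illustrative reconstruction, not the exact code in train_gpt.py, and the function name is hypothetical:

```python
def lr_scale(elapsed_frac: float, warmdown_frac: float = 0.2) -> float:
    """LR multiplier given elapsed_frac in [0, 1] of the wall-clock budget.

    Constant at 1.0 for the first (1 - warmdown_frac) of the run, then
    linear warmdown to 0.0. Keying on time fraction avoids the bug where
    an iteration-based warmdown_iters can exceed the number of steps the
    wall-clock cap allows, making the LR decay from step 1.
    """
    start = 1.0 - warmdown_frac
    if elapsed_frac < start:
        return 1.0
    return max(0.0, (1.0 - elapsed_frac) / warmdown_frac)
```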
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Reduced the batch size to a quarter batch (131K tokens) to fit more optimizer steps into the fixed wall-clock budget.
parameters: {"train_batch_tokens":131072}
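The arithmetic behind the quarter batch, using only the numbers reported in this PR (131,072 tokens per batch, 1024-token sequences). The ~4x step count assumes tokens/sec throughput is roughly unchanged at the smaller batch, which is an assumption, not a measurement from the PR:

```python
tokens_per_batch = 131_072                    # quarter batch, from this PR
seq_len = 1024                                # train_length, from this PR

seqs_per_batch = tokens_per_batch // seq_len  # sequences per optimizer step
full_batch_tokens = 4 * tokens_per_batch      # implied stock batch size

# Under a fixed wall-clock cap with roughly constant tokens/sec,
# optimizer steps scale inversely with batch size:
step_multiplier = full_batch_tokens / tokens_per_batch
```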
other
Used gradient accumulation to increase effective batch size without increasing per-step memory.
parameters: {"grad_accum_steps":2}
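Gradient accumulation sums (suitably scaled) gradients over several micro-batches before each optimizer step, so the effective batch grows without the per-step memory footprint. A minimal sketch with scalar "gradients" standing in for real tensors; the function name and structure are illustrative, not the PR's code:

```python
def accumulate_and_step(micro_batch_grads, grad_accum_steps: int = 2):
    """Average gradients over grad_accum_steps micro-batches, then 'step'.

    Each micro-batch gradient is divided by grad_accum_steps so the
    accumulated sum equals the mean gradient over the effective batch.
    Returns the gradient used at each optimizer step.
    """
    accum = 0.0
    stepped_grads = []
    for i, grad in enumerate(micro_batch_grads, start=1):
        accum += grad / grad_accum_steps      # scale so the sum is a mean
        if i % grad_accum_steps == 0:
            stepped_grads.append(accum)       # optimizer.step() would go here
            accum = 0.0                       # optimizer.zero_grad() analogue
    return stepped_grads
```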
Regularization
weight decay
parameters: {"weight_decay":0.01}
Novel Contributions
- Identified and fixed a warmdown schedule bug in the stock train_gpt.py, where the iteration-based warmdown caused the LR to decay from step 1 under the wall-clock cap.
- Applied SwiGLU activation in place of ReLU².
- Used a quarter batch size to obtain more optimizer steps within the same wall-clock budget.
- Used gradient accumulation to increase the effective batch size.
- Explored reduced MLP hidden size to stay within the 16MB artifact cap.
- Reported negative results for weight decay and layer recurrence.