PR #470

Status: open

Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)

by leofeasby
val_bpb: 1.1454
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 13.9 MB

Training Techniques

Architecture
depth recurrence / weight sharing
A single transformer block is reused across 9 effective passes, forming a recurrent-style shared-weight stack.
parameters: {"layers":9}
U-Net skip connections
Learned skip connections inject earlier representations back into later passes across the shared-weight stack.
parameters: {"passes":9}
per-layer scaling
Layer-specific attention, MLP, and residual mixing scales are used to break symmetry across reused passes.
parameters: null
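The three entries above (one block reused for 9 passes, U-Net style skips across passes, per-pass scales) can be sketched as control flow. This is a toy sketch of the pass structure only; the gate placement, mixing form, and skip pairing are assumptions, not the PR's exact code:

```python
def shared_weight_forward(x, block, scales, skip_gates):
    """Run one shared block for len(scales) passes with U-Net style skips.

    block:      the single shared transformer block (same weights every pass)
    scales:     per-pass residual scale, breaking symmetry across reuses
    skip_gates: learned gates mixing early-pass output i into late pass n-1-i
    """
    n = len(scales)
    half = n // 2
    saved = []
    for i in range(n):
        if i >= n - half:
            # U-Net skip: inject the matching early representation
            x = x + skip_gates[i - (n - half)] * saved[n - 1 - i]
        if i < half:
            saved.append(x)
        x = x + scales[i] * block(x)  # same shared weights on every pass
    return x
```

With 9 passes and scalar stand-ins for tensors, this shows the recurrent stack shape: 4 early passes are saved, the middle pass is plain, and the last 4 passes each receive one gated skip.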
tied embeddings
Input token embeddings share weights with the output (unembedding) projection.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":16,"num_kv_heads":8}
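With 16 query heads and 8 KV heads, grouped-query attention shares each KV head across a group of query heads. A minimal sketch of the head mapping, assuming the common convention of consecutive grouping:

```python
def kv_head_for_query_head(q_head, num_heads=16, num_kv_heads=8):
    """Map a query head index to the KV head it shares (grouped-query attention).

    Here each KV head serves num_heads // num_kv_heads = 2 consecutive
    query heads; the grouping order is a conventional assumption.
    """
    group_size = num_heads // num_kv_heads
    return q_head // group_size
```

This halves the KV cache relative to full multi-head attention while keeping 16 distinct query projections.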
BigramHash
Includes a hash-based bigram table with 4096 entries.
parameters: {"entries":4096}
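A hash-based bigram table maps each (previous token, current token) pair to one of 4096 learned rows, typically added to the token embedding. The hash form and mixing constant below are illustrative assumptions; the PR only specifies the 4096-entry table:

```python
def bigram_bucket(prev_token, cur_token, num_entries=4096):
    """Map a (prev, cur) token pair to one of num_entries table rows.

    The multiplier 1000003 (a large prime) is an assumed mixing constant
    to spread pairs across buckets; collisions are tolerated by design.
    """
    h = prev_token * 1000003 + cur_token
    return h % num_entries
```

Each bucket would index a learned embedding row, giving the model cheap access to bigram statistics without a full vocab-squared table.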
Weight Averaging
Stochastic weight averaging (SWA) of parameter snapshots taken during late training.
parameters: {"snapshots":351,"start_step":32500,"freq":50}
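Given the listed parameters (351 snapshots, starting at step 32500, every 50 steps), the averaging reduces to a uniform elementwise mean over late-training checkpoints. A minimal sketch with flat parameter lists standing in for model state:

```python
def snapshot_steps(start_step=32500, freq=50, count=351):
    """Steps at which SWA snapshots are taken, per the PR parameters."""
    return [start_step + i * freq for i in range(count)]

def swa_average(snapshots):
    """Uniform elementwise average of parameter snapshots.

    snapshots: list of parameter vectors (here, plain lists of floats).
    """
    n = len(snapshots)
    return [sum(vals) / n for vals in zip(*snapshots)]
```

Note the last snapshot falls at step 32500 + 350 * 50 = 50000, i.e. the final training iteration listed below.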
LR Schedule
warmdown
parameters: {"warmdown_start_step":4000,"warmdown_iters":41000,"step_based":true}
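A step-based warmdown with these parameters holds the learning rate flat for 4000 steps, then decays it over 41000 steps, independent of wallclock time. The linear decay shape is an assumption; the PR only specifies the start step and duration:

```python
def lr_at_step(step, base_lr=1.0, warmdown_start=4000, warmdown_iters=41000):
    """Step-based warmdown: constant LR, then linear decay to zero.

    Keying the schedule on the step counter (not elapsed time) makes
    runs reproducible across hardware of different speeds.
    """
    if step < warmdown_start:
        return base_lr
    frac = min((step - warmdown_start) / warmdown_iters, 1.0)
    return base_lr * (1.0 - frac)
```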
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.04,"scope":"matrix params only"}
Other
other
Extended warmdown training regime with long low-learning-rate phase; majority of gains occur during warmdown.
parameters: {"iterations":50000,"max_wallclock_seconds":86400}

Novel Contributions

  • Shared-weight transformer with a single block reused across depth
  • U-Net style skip connections across recurrent passes
  • Per-layer scaling parameters to differentiate reused passes
  • Step-based warmdown control decoupled from wallclock time
  • Demonstration that most improvement occurs during extended warmdown
  • Use of longer training sequence length (2048) as a major lever