PR #420
Open · WIP: Shared-transformer + warmdown-aligned training (not final submis…
by leofeasby
val_bpb
1.1454
Architecture
Shared-weight Transformer
Optimizer
—
Artifact Size
13.9MB
Training Techniques
Architecture
weight sharing / depth recurrence
A single transformer block is reused across 9 effective passes instead of using independent layers.
parameters: {"layers":9}
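A minimal sketch of the depth-recurrence idea: one parameterized block is applied repeatedly, so depth grows without adding parameters. The block here is a stand-in arithmetic function, not the actual transformer block.

```python
# Depth recurrence: one shared block reused for 9 effective passes,
# instead of 9 independently parameterized layers.
def shared_block(x, w):
    # A single parameter set `w` is used on every pass.
    return [w * v + 0.1 for v in x]

def forward(x, w, passes=9):
    for _ in range(passes):  # same weights at every depth
        x = shared_block(x, w)
    return x
```

The parameter count is that of one block regardless of `passes`, which is what keeps the artifact small.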
tied embeddings
Token embeddings are tied.
parameters: null
BigramHash
Hash-based bigram embedding table with 4096 entries.
parameters: {"entries":4096}
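A sketch of how a hash-based bigram table can be indexed; the mixing hash below is an assumption (the PR does not specify one), but the 4096-bucket size matches the listed parameters. Collisions between bigrams are tolerated by design.

```python
NUM_ENTRIES = 4096  # size of the bigram embedding table

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Illustrative mixing hash (large odd multiplier to spread pairs);
    # the real hash function is an assumption.
    h = prev_tok * 1000003 + cur_tok
    return h % NUM_ENTRIES  # row index into the 4096-entry table

def bigram_indices(tokens):
    # One bucket index per position after the first token.
    return [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```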
GQA
Grouped-query attention with 2:1 query-to-KV head ratio.
parameters: {"num_heads":16,"num_kv_heads":8}
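The 2:1 grouping above can be shown as a head-mapping sketch: with 16 query heads and 8 KV heads, each KV head serves a group of 2 query heads. Values are taken directly from the listed parameters.

```python
NUM_HEADS = 16
NUM_KV_HEADS = 8
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head: int) -> int:
    # Query heads 0,1 -> KV head 0; heads 2,3 -> KV head 1; and so on.
    return query_head // GROUP_SIZE
```

Halving the KV heads halves the KV cache relative to full multi-head attention.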
MLP×5
Expanded MLP width with relu² activation.
parameters: {"mlp_mult":5}
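A sketch of the two pieces named here: the relu² activation (square of the positive part) and the 5× hidden-width expansion from `mlp_mult`.

```python
def relu2(x: float) -> float:
    # relu^2: zero for negative inputs, squared for positive inputs.
    return max(x, 0.0) ** 2

def mlp_hidden_width(d_model: int, mlp_mult: int = 5) -> int:
    # Hidden width of the expanded MLP, per the mlp_mult=5 parameter.
    return mlp_mult * d_model
```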
U-Net skip connections
Encoder-decoder style shared-core transformer with learned skip weights across depth.
parameters: null
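A sketch of the U-Net pattern over the shared core, under assumptions: the 9 passes split into 4 "encoder" passes, 1 middle pass, and 4 "decoder" passes, with one learned scalar per skip (the exact split and per-skip parameterization are assumptions; the PR states only learned skip weights across depth).

```python
def unet_forward(x, block, skip_weights, depth=4):
    saved = []
    for _ in range(depth):      # encoder passes: save activations
        x = block(x)
        saved.append(x)
    x = block(x)                # middle pass
    for i in range(depth):      # decoder passes: mix in weighted skips
        skip = saved[depth - 1 - i]  # pair deepest save with first decoder pass
        x = [xi + skip_weights[i] * si for xi, si in zip(x, skip)]
        x = block(x)
    return x
```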
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"weight_decay_applied_to":"matrix params only"}
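The "matrix params only" note can be sketched as optimizer param grouping: 2-D weight matrices get `weight_decay=0.04`, everything 1-D (biases, norm gains) gets none. The shape-based rule and the example names are illustrative; the PR's actual grouping may differ.

```python
def split_param_groups(named_shapes, weight_decay=0.04):
    # Decay 2-D matrices; exempt 1-D params (biases, norm gains, etc.).
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        (decay if len(shape) >= 2 else no_decay).append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

groups = split_param_groups({"attn.w_q": (768, 768), "norm.gain": (768,)})
```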
Weight Averaging
SWA
parameters: {"start_step":32500,"snapshots":351,"freq":50}
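The SWA parameters above are internally consistent: snapshots at steps 32500, 32550, …, every 50 steps, give exactly 351 snapshots by step 50000. A sketch of the running average (incremental mean, with weights as flat lists):

```python
START, FREQ = 32500, 50  # from the listed SWA parameters

def swa_update(avg, weights, n_snapshots):
    # Incremental mean over snapshots: avg <- avg + (w - avg) / n.
    return [a + (w - a) / n_snapshots for a, w in zip(avg, weights)]

def run_swa(weight_at_step, last_step=50000):
    avg, n = None, 0
    for step in range(START, last_step + 1, FREQ):
        n += 1
        w = weight_at_step(step)
        avg = list(w) if avg is None else swa_update(avg, w, n)
    return avg, n
```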
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_start_step":4000,"warmdown_iters":41000}
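A sketch of the step-based trigger: full LR until `WARMDOWN_START_STEP`, then decay to zero over `WARMDOWN_ITERS` steps. The linear decay shape is an assumption; the step-based (rather than wallclock-based) trigger is the mechanism named in the contributions below.

```python
WARMDOWN_START_STEP = 4000   # from the listed parameters
WARMDOWN_ITERS = 41000

def lr_at(step: int, base_lr: float) -> float:
    if step < WARMDOWN_START_STEP:
        return base_lr  # constant-LR phase
    # Assumed linear warmdown to zero over WARMDOWN_ITERS steps.
    progress = min((step - WARMDOWN_START_STEP) / WARMDOWN_ITERS, 1.0)
    return base_lr * (1.0 - progress)
```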
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Training schedule that times the warmdown so the low-LR phase falls within the wallclock budget.
parameters: {"iterations":50000,"max_wallclock_seconds":86400}
Novel Contributions
- Shared-weight transformer architecture with a single block reused across depth
- U-Net-style encoder-decoder structure with learned skip connections
- Step-based warmdown trigger (`WARMDOWN_START_STEP`) decoupled from wallclock time
- Observation that most gains occur during the low-LR warmdown phase
- Use of a 4096-entry hash-based bigram embedding table
- Long-context training at sequence length 2048
- Application of SWA during the late training phase