PR #470

Status: open

Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)

by leofeasby
val_bpb: 1.1454
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 13.9 MB

Training Techniques

Architecture
depth recurrence / weight sharing
A single transformer block is reused across 9 effective passes, forming a recurrent-style shared-weight stack.
parameters: {"layers":9}
U-Net skip connections
Learned skip connections inject earlier representations back into later passes across the shared-weight stack.
parameters: {"passes":9}
per-layer scaling
Layer-specific attention, MLP, and residual mixing scales are used to break symmetry across reused passes.
parameters: null
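The three entries above (one block reused for 9 passes, U-Net style skips across passes, per-pass scales) can be sketched as control flow. This is a toy sketch of the pass structure only; the gate placement, mixing form, and skip pairing are assumptions, not the PR's exact code:

```python
def shared_weight_forward(x, block, scales, skip_gates):
    """Run one shared block for len(scales) passes with U-Net style skips.

    block:      the single shared transformer block (same weights every pass)
    scales:     per-pass residual scale, breaking symmetry across reuses
    skip_gates: learned gates mixing early-pass output i into late pass n-1-i
    """
    n = len(scales)
    half = n // 2
    saved = []
    for i in range(n):
        if i >= n - half:
            # U-Net skip: inject the matching early representation
            x = x + skip_gates[i - (n - half)] * saved[n - 1 - i]
        if i < half:
            saved.append(x)
        x = x + scales[i] * block(x)  # same shared weights on every pass
    return x
```

With 9 passes and scalar stand-ins for tensors, this shows the recurrent stack shape: 4 early passes are saved, the middle pass is plain, and the last 4 passes each receive one gated skip.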
tied embeddings
Input token embeddings share weights with the output (unembedding) projection.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":16,"num_kv_heads":8}
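With 16 query heads and 8 KV heads, grouped-query attention shares each KV head across a group of query heads. A minimal sketch of the head mapping, assuming the common convention of consecutive grouping:

```python
def kv_head_for_query_head(q_head, num_heads=16, num_kv_heads=8):
    """Map a query head index to the KV head it shares (grouped-query attention).

    Here each KV head serves num_heads // num_kv_heads = 2 consecutive
    query heads; the grouping order is a conventional assumption.
    """
    group_size = num_heads // num_kv_heads
    return q_head // group_size
```

This halves the KV cache relative to full multi-head attention while keeping 16 distinct query projections.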
BigramHash
Includes a hash-based bigram table with 4096 entries.
parameters: {"entries":4096}
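A hash-based bigram table maps each (previous token, current token) pair to one of 4096 learned rows, typically added to the token embedding. The hash form and mixing constant below are illustrative assumptions; the PR only specifies the 4096-entry table:

```python
def bigram_bucket(prev_token, cur_token, num_entries=4096):
    """Map a (prev, cur) token pair to one of num_entries table rows.

    The multiplier 1000003 (a large prime) is an assumed mixing constant
    to spread pairs across buckets; collisions are tolerated by design.
    """
    h = prev_token * 1000003 + cur_token
    return h % num_entries
```

Each bucket would index a learned embedding row, giving the model cheap access to bigram statistics without a full vocab-squared table.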
Weight Averaging
Stochastic weight averaging (SWA) of parameter snapshots taken during late training.
parameters: {"snapshots":351,"start_step":32500,"freq":50}
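Given the listed parameters (351 snapshots, starting at step 32500, every 50 steps), the averaging reduces to a uniform elementwise mean over late-training checkpoints. A minimal sketch with flat parameter lists standing in for model state:

```python
def snapshot_steps(start_step=32500, freq=50, count=351):
    """Steps at which SWA snapshots are taken, per the PR parameters."""
    return [start_step + i * freq for i in range(count)]

def swa_average(snapshots):
    """Uniform elementwise average of parameter snapshots.

    snapshots: list of parameter vectors (here, plain lists of floats).
    """
    n = len(snapshots)
    return [sum(vals) / n for vals in zip(*snapshots)]
```

Note the last snapshot falls at step 32500 + 350 * 50 = 50000, i.e. the final training iteration listed below.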
LR Schedule
warmdown
parameters: {"warmdown_start_step":4000,"warmdown_iters":41000,"step_based":true}
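A step-based warmdown with these parameters holds the learning rate flat for 4000 steps, then decays it over 41000 steps, independent of wallclock time. The linear decay shape is an assumption; the PR only specifies the start step and duration:

```python
def lr_at_step(step, base_lr=1.0, warmdown_start=4000, warmdown_iters=41000):
    """Step-based warmdown: constant LR, then linear decay to zero.

    Keying the schedule on the step counter (not elapsed time) makes
    runs reproducible across hardware of different speeds.
    """
    if step < warmdown_start:
        return base_lr
    frac = min((step - warmdown_start) / warmdown_iters, 1.0)
    return base_lr * (1.0 - frac)
```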
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.04,"scope":"matrix params only"}
Other
other
Extended warmdown training regime with long low-learning-rate phase; majority of gains occur during warmdown.
parameters: {"iterations":50000,"max_wallclock_seconds":86400}

Novel Contributions

  • Shared-weight transformer with a single block reused across depth
  • U-Net style skip connections across recurrent passes
  • Per-layer scaling parameters to differentiate reused passes
  • Step-based warmdown control decoupled from wallclock time
  • Demonstration that most improvement occurs during extended warmdown
  • Use of longer training sequence length (2048) as a major lever