PR #420
Open · WIP: Shared-transformer + warmdown-aligned training (not final submis…
by leofeasby
val_bpb
1.1454
Architecture
Shared-weight Transformer
Optimizer
—
Artifact Size
13.9MB
Training Techniques
Architecture
weight sharing / depth recurrence
A single transformer block is reused across 9 effective passes instead of using independent layers.
parameters: {"layers":9}
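A minimal sketch of the depth-recurrence idea: one parameterized block is applied repeatedly, so depth grows without adding parameters. The block here is a stand-in arithmetic function, not the actual transformer block.

```python
# Depth recurrence: one shared block reused for 9 effective passes,
# instead of 9 independently parameterized layers.
def shared_block(x, w):
    # A single parameter set `w` is used on every pass.
    return [w * v + 0.1 for v in x]

def forward(x, w, passes=9):
    for _ in range(passes):  # same weights at every depth
        x = shared_block(x, w)
    return x
```

The parameter count is that of one block regardless of `passes`, which is what keeps the artifact small.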
tied embeddings
Token embeddings are tied.
parameters: null
BigramHash
Hash-based bigram embedding table with 4096 entries.
parameters: {"entries":4096}
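A sketch of how a hash-based bigram table can be indexed; the mixing hash below is an assumption (the PR does not specify one), but the 4096-bucket size matches the listed parameters. Collisions between bigrams are tolerated by design.

```python
NUM_ENTRIES = 4096  # size of the bigram embedding table

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Illustrative mixing hash (large odd multiplier to spread pairs);
    # the real hash function is an assumption.
    h = prev_tok * 1000003 + cur_tok
    return h % NUM_ENTRIES  # row index into the 4096-entry table

def bigram_indices(tokens):
    # One bucket index per position after the first token.
    return [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```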
GQA
Grouped-query attention with 2:1 query-to-KV head ratio.
parameters: {"num_heads":16,"num_kv_heads":8}
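The 2:1 grouping above can be shown as a head-mapping sketch: with 16 query heads and 8 KV heads, each KV head serves a group of 2 query heads. Values are taken directly from the listed parameters.

```python
NUM_HEADS = 16
NUM_KV_HEADS = 8
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head: int) -> int:
    # Query heads 0,1 -> KV head 0; heads 2,3 -> KV head 1; and so on.
    return query_head // GROUP_SIZE
```

Halving the KV heads halves the KV cache relative to full multi-head attention.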
MLP×5
Expanded MLP width with relu² activation.
parameters: {"mlp_mult":5}
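A sketch of the two pieces named here: the relu² activation (square of the positive part) and the 5× hidden-width expansion from `mlp_mult`.

```python
def relu2(x: float) -> float:
    # relu^2: zero for negative inputs, squared for positive inputs.
    return max(x, 0.0) ** 2

def mlp_hidden_width(d_model: int, mlp_mult: int = 5) -> int:
    # Hidden width of the expanded MLP, per the mlp_mult=5 parameter.
    return mlp_mult * d_model
```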
U-Net skip connections
Encoder-decoder style shared-core transformer with learned skip weights across depth.
parameters: null
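A sketch of the U-Net pattern over the shared core, under assumptions: the 9 passes split into 4 "encoder" passes, 1 middle pass, and 4 "decoder" passes, with one learned scalar per skip (the exact split and per-skip parameterization are assumptions; the PR states only learned skip weights across depth).

```python
def unet_forward(x, block, skip_weights, depth=4):
    saved = []
    for _ in range(depth):      # encoder passes: save activations
        x = block(x)
        saved.append(x)
    x = block(x)                # middle pass
    for i in range(depth):      # decoder passes: mix in weighted skips
        skip = saved[depth - 1 - i]  # pair deepest save with first decoder pass
        x = [xi + skip_weights[i] * si for xi, si in zip(x, skip)]
        x = block(x)
    return x
```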
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"weight_decay_applied_to":"matrix params only"}
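The "matrix params only" note can be sketched as optimizer param grouping: 2-D weight matrices get `weight_decay=0.04`, everything 1-D (biases, norm gains) gets none. The shape-based rule and the example names are illustrative; the PR's actual grouping may differ.

```python
def split_param_groups(named_shapes, weight_decay=0.04):
    # Decay 2-D matrices; exempt 1-D params (biases, norm gains, etc.).
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        (decay if len(shape) >= 2 else no_decay).append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

groups = split_param_groups({"attn.w_q": (768, 768), "norm.gain": (768,)})
```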
Weight Averaging
SWA
parameters: {"start_step":32500,"snapshots":351,"freq":50}
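The SWA parameters above are internally consistent: snapshots at steps 32500, 32550, …, every 50 steps, give exactly 351 snapshots by step 50000. A sketch of the running average (incremental mean, with weights as flat lists):

```python
START, FREQ = 32500, 50  # from the listed SWA parameters

def swa_update(avg, weights, n_snapshots):
    # Incremental mean over snapshots: avg <- avg + (w - avg) / n.
    return [a + (w - a) / n_snapshots for a, w in zip(avg, weights)]

def run_swa(weight_at_step, last_step=50000):
    avg, n = None, 0
    for step in range(START, last_step + 1, FREQ):
        n += 1
        w = weight_at_step(step)
        avg = list(w) if avg is None else swa_update(avg, w, n)
    return avg, n
```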
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_start_step":4000,"warmdown_iters":41000}
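A sketch of the step-based trigger: full LR until `WARMDOWN_START_STEP`, then decay to zero over `WARMDOWN_ITERS` steps. The linear decay shape is an assumption; the step-based (rather than wallclock-based) trigger is the mechanism named in the contributions below.

```python
WARMDOWN_START_STEP = 4000   # from the listed parameters
WARMDOWN_ITERS = 41000

def lr_at(step: int, base_lr: float) -> float:
    if step < WARMDOWN_START_STEP:
        return base_lr  # constant-LR phase
    # Assumed linear warmdown to zero over WARMDOWN_ITERS steps.
    progress = min((step - WARMDOWN_START_STEP) / WARMDOWN_ITERS, 1.0)
    return base_lr * (1.0 - progress)
```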
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Training schedule that times the warmdown so the low-LR phase falls within the wallclock budget.
parameters: {"iterations":50000,"max_wallclock_seconds":86400}
Novel Contributions
- Shared-weight transformer architecture with a single block reused across depth
- U-Net-style encoder-decoder structure with learned skip connections
- Step-based warmdown trigger (`WARMDOWN_START_STEP`) decoupled from wallclock time
- Observation that most gains occur during the low-LR warmdown phase
- Use of a 4096-entry hash-based bigram embedding table
- Long-context training at sequence length 2048
- Application of SWA during the late training phase