PR #334

open

Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB)

by nathon-lee
val_bpb: 1.2207
Architecture: GPT
Optimizer: Muon
Artifact Size: 15.4 MB

Training Techniques

Architecture
Partial RoPE
Applies rotary position encoding to only a subset of head dimensions.
parameters: {"dimensions":16,"total_head_dims":64}
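A minimal numpy sketch of partial RoPE as configured here (rotary encoding applied to the first 16 of 64 head dimensions, the rest passed through; the frequency base of 10000 and the pair layout are assumptions):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` dimensions of each head;
    leave the remaining dimensions position-free.
    x: (seq_len, head_dim) activations for one head."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    # Rotation frequencies for the rotated pairs only (assumed base).
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]                # rotated pairs
    out = x.copy()
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rope_dims] = x1 * sin + x2 * cos
    return out
```

The untouched 48 dimensions carry position-independent content, which is the usual motivation for rotating only a subset.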
SmearGate
Per-dimension gate blending current and previous token embeddings.
parameters: null
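One plausible reading of SmearGate, sketched in numpy (the sigmoid parameterization and convex blend are assumptions; only "per-dimension gate blending current and previous token" is from the card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    """Blend each token with its predecessor via a learned per-dimension
    gate g in (0, 1):  y_t = (1 - g) * x_t + g * x_{t-1}.
    x: (seq_len, dim); gate_logits: (dim,) learned parameters."""
    g = sigmoid(gate_logits)          # (dim,)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                    # first token has no predecessor
    return (1.0 - g) * x + g * prev
```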
BigramHash
Hash-based bigram context embeddings.
parameters: {"buckets":2048,"dim":64}
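A sketch of hash-based bigram embeddings with the stated 2048 buckets and 64 dims (the multiplicative hash and zero-padding of the first bigram are illustrative assumptions):

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=2048):
    """Look up a hashed-bigram embedding for each position.
    tokens: (seq_len,) int token ids; table: (buckets, dim) learned table."""
    prev = np.roll(tokens, 1)
    prev[0] = 0                                  # pad the first bigram (assumed)
    # Simple multiplicative hash of the (prev, cur) pair (illustrative).
    h = (prev * 1000003 + tokens) % buckets
    return table[h]                              # (seq_len, dim)
```

Hashing avoids materializing a vocab^2 bigram table; collisions are accepted as noise.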
U-Net skip connections
Encoder-decoder style skip connections with learnable weights.
parameters: null
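The U-Net skip pattern can be sketched as a last-in-first-out pairing of encoder outputs with decoder inputs, each scaled by a learnable weight (the additive combination is an assumption):

```python
def unet_forward(x, enc_layers, dec_layers, skip_weights):
    """U-Net style skips: save each encoder layer's output and add it,
    scaled by a learnable weight, to the matching decoder layer's input,
    pairing first encoder layer with last decoder layer."""
    skips = []
    for f in enc_layers:
        x = f(x)
        skips.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * skips.pop())    # last-in, first-out pairing
    return x
```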
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
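With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads (grouped-query attention). A numpy sketch, with an assumed causal mask:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """GQA: 4 KV heads each serve 8 // 4 = 2 query heads.
    q: (seq, n_heads, d); k, v: (seq, n_kv_heads, d)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)          # (seq, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape[-2:]), k=1).astype(bool)
    scores[:, mask] = -1e9                   # causal mask (assumed)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)
```

Halving KV heads halves the KV cache with little quality loss, the usual GQA trade-off.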
MLP3x
Uses a 3x ReluSquared MLP.
parameters: null
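The 3x ReluSquared MLP as a sketch (3x hidden expansion instead of the conventional 4x, with the relu(x)^2 activation; bias-free projections are an assumption):

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with a 3x hidden expansion and ReluSquared activation:
    relu(x @ w_in)^2 @ w_out.
    x: (seq, dim); w_in: (dim, 3*dim); w_out: (3*dim, dim)."""
    h = x @ w_in
    h = np.maximum(h, 0.0) ** 2          # ReluSquared
    return h @ w_out
```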
Regularization
LN scale
parameters: {"formula":"RMSNorm damped by 1/sqrt(layer+1)"}
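The stated formula, RMSNorm damped by 1/sqrt(layer+1), as a sketch (a learnable gain is omitted as an assumption):

```python
import numpy as np

def ln_scale(x, layer, eps=1e-6):
    """RMSNorm whose output is damped by 1/sqrt(layer + 1), so deeper
    layers contribute progressively smaller residual updates."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) / np.sqrt(layer + 1)
```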
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":true}
Adam
weight_decay: 0.04
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"used_for":"scalars/embeddings"}
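The Muon side of this split (Newton-Schulz orthogonalization of 2D gradient matrices, with Adam handling scalars and embeddings) can be sketched as follows; the quintic coefficients are taken from the public Muon implementation and should be treated as an assumption here:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix with the quintic
    Newton-Schulz iteration used by Muon (coefficients assumed from the
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)       # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                              # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

The iteration drives all singular values toward 1, so the update direction depends on the gradient's row/column spaces rather than its magnitude.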
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last 40% of training"}
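Both averaging schemes in one sketch: EMA with the stated decay of 0.997, and SWA as a plain mean of checkpoints from the last 40% of training:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    """Exponential moving average of weights:
    avg <- decay * avg + (1 - decay) * params."""
    return decay * avg + (1.0 - decay) * params

def swa_average(checkpoints, start_frac=0.6):
    """Stochastic weight averaging: uniform mean of checkpoints taken
    during the last 40% of training (from the 60% mark onward)."""
    start = int(len(checkpoints) * start_frac)
    return np.mean(checkpoints[start:], axis=0)
```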
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
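Sliding-window evaluation with stride 64, sketched below; the window length of 512 and the "score only the final stride tokens" convention are assumptions (only the stride is from the card):

```python
def sliding_window_eval(nll_fn, tokens, window=512, stride=64):
    """Slide a fixed context window over the sequence in steps of
    `stride`, scoring only the final `stride` tokens of each window so
    every scored token sees near-full left context.
    nll_fn(ctx, n_scored) -> summed NLL of the last n_scored tokens of ctx."""
    total_nll, n_scored = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        start = max(0, end - window)
        ctx = tokens[start:end]
        n = min(stride, len(ctx))
        total_nll += nll_fn(ctx, n)
        n_scored += n
    return total_nll / n_scored          # mean NLL per scored token
```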
Test-Time Training
full TTT
parameters: {"epochs":3,"frozen_blocks":2}
Initialization
OrthoInit
Orthogonal initialization with muP output-projection scaling.
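A sketch of orthogonal initialization via QR of a Gaussian matrix; the exact muP output-projection scale rule is not stated on the card, so the 1/sqrt(fan_in) damping below is an illustrative assumption:

```python
import numpy as np

def ortho_init(shape, out_proj=False, rng=None):
    """Orthogonal init via QR decomposition of a Gaussian matrix.
    Output projections get extra damping (assumed 1/sqrt(fan_in) here,
    standing in for the muP scale rule)."""
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))                  # fix QR sign ambiguity
    w = q[:rows, :cols] if rows >= cols else q[:cols, :rows].T
    return w / np.sqrt(rows) if out_proj else w
```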
LR Schedule
cosine warmdown
parameters: {"warmdown_steps":3000}
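Cosine warmdown with the stated 3000 warmdown steps: constant LR, then a cosine decay over the final steps (whether the floor is exactly zero is an assumption):

```python
import math

def cosine_warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Hold base_lr constant, then cosine-decay to zero over the final
    `warmdown_steps` steps of training."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    t = (step - start) / warmdown_steps       # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```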

Novel Contributions

  • 11-layer 512-dim GPT architecture with 8 attention heads and 4 KV heads
  • Partial RoPE applied to only 16 of 64 head dimensions
  • LN Scale using RMSNorm damped by 1/sqrt(layer+1)
  • SmearGate token blending mechanism
  • BigramHash context embeddings with 2048 buckets and 64 dimensions
  • U-Net style skip connections with learnable weights
  • Muon optimizer combined with Adam for embeddings/scalars
  • EMA plus SWA weight averaging
  • Uniform int5 quantization with zstd-22 compression
  • Sliding-window evaluation and full-model test-time training