PR #223
Draft: SOTA+ TTT + RoPE50K + EMA + Curriculum (pending H100 run)
by 0xjaishy
val_bpb
1.1326
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7MB
Training Techniques
Quantization
mixed int6/int8
bits: null
scope: MLP+Attn int6, embeddings int8
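A minimal sketch of the mixed-precision scheme above: symmetric quantization of MLP/attention weights to 6 bits and embeddings to 8 bits. Per-tensor scaling is an assumption; the PR does not state the granularity.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric quantization to a signed `bits`-wide integer range.
    Per-tensor scale is an assumption (the PR may use per-channel)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# MLP/attention weights -> int6, embeddings -> int8, as stated in the card
w_mlp = np.random.randn(4, 4).astype(np.float32)
q6, s6 = quantize_symmetric(w_mlp, bits=6)
w_emb = np.random.randn(4, 4).astype(np.float32)
q8, s8 = quantize_symmetric(w_emb, bits=8)
```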
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
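Muon applies heavy-ball momentum to the gradient of each 2-D weight matrix, then orthogonalizes the update with a Newton-Schulz iteration. A sketch using the PR's momentum=0.99 and weight_decay=0.04; the quintic coefficients are from the public Muon recipe, and the decoupled weight-decay placement is an assumption.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate the nearest semi-orthogonal matrix to g via a quintic
    Newton-Schulz iteration (coefficients from the public Muon recipe)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # normalize spectral scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum first, then orthogonalize the update."""
    buf = momentum * buf + g
    update = newton_schulz_orthogonalize(buf)
    w = w * (1.0 - lr * weight_decay) - lr * update
    return w, buf
```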
Evaluation
sliding window eval
parameters: {"stride":64}
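Sliding-window evaluation advances a fixed-length context window by the stride and scores only the tokens not covered by the previous window, so nearly every token is predicted with close-to-full context. A sketch of the span bookkeeping under stride=64; the exact scheme beyond that parameter is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return ((context_begin, context_end), (score_begin, score_end))
    pairs for strided eval: each window scores only its new tokens."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append(((begin, end), (prev_end, end)))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The scored spans are disjoint and cover the whole sequence, so summing their losses gives an exact bpb over the eval set.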
Architecture
SmearGate
Learned per-dimension gate blending each token with its predecessor
parameters: null
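The gate description above can be sketched as a sigmoid-parameterized per-dimension mix between each token's embedding and the previous token's; the exact parameterization is an assumption since the card lists no parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, gate_logits):
    """Blend each token with its predecessor via a learned per-dimension
    gate. x: (seq, dim); gate_logits: (dim,) learned parameters."""
    g = sigmoid(gate_logits)            # per-dimension blend in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                       # first token has no predecessor
    return (1.0 - g) * x + g * prev
```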
BigramHash
Hash-based token-pair embeddings
parameters: {"buckets":2048}
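Hash-based pair embeddings map each (previous token, current token) pair into a fixed table, here with the card's buckets=2048. The mixing constants and the sentinel for position 0 are assumptions for illustration.

```python
import numpy as np

def bigram_bucket(prev_token: int, token: int, buckets: int = 2048) -> int:
    """Hash a (previous, current) token pair into one of `buckets` slots.
    Multiplicative mixing constants are illustrative assumptions."""
    h = (prev_token * 1_000_003 + token) * 2_654_435_761
    return (h & 0xFFFFFFFF) % buckets

def bigram_embeddings(tokens, table):
    """Look up a hashed pair embedding per position; position 0 pairs
    with a sentinel previous token of 0 (an assumption)."""
    buckets, dim = table.shape
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t, buckets) for p, t in zip(prev, tokens)]
    return table[idx]
```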
MLP3x
Wider feed-forward network with 3x hidden expansion
parameters: {"hidden":1536}
tied embeddings
Input and output embeddings are tied
parameters: null
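Weight tying means one matrix serves as both the input embedding table and the output projection, which cuts parameters and the compressed artifact size. A minimal sketch with illustrative vocab/dim sizes:

```python
import numpy as np

# One matrix is both input embedding and output projection (tied weights).
vocab, dim = 50257, 64
embed = np.random.randn(vocab, dim).astype(np.float32) * 0.02

def embed_tokens(tokens):
    return embed[tokens]                  # (seq, dim) input embeddings

def logits(hidden):
    return hidden @ embed.T               # (seq, vocab) via the same matrix
```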
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
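With heads=8 and kv_heads=4, each KV head is shared by two query heads. A sketch of the attention math (non-causal, no masking, for brevity):

```python
import numpy as np

def grouped_query_attention(q, k, v, heads=8, kv_heads=4):
    """GQA with the PR's 8 query / 4 KV heads: each KV head serves
    heads // kv_heads query heads. q: (heads, seq, hd); k, v: (kv_heads, seq, hd)."""
    group = heads // kv_heads
    k = np.repeat(k, group, axis=0)      # expand KV heads to match queries
    v = np.repeat(v, group, axis=0)
    hd = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v
```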
RoPE
Rotary position embeddings with increased base for smoother interpolation
parameters: {"base":50000}
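Raising the RoPE base to 50000 (versus the common default of 10000) slows the rotation frequencies, which smooths position interpolation at the eval length of 2048. A sketch of the rotation itself, applied to a (seq, dim) block:

```python
import numpy as np

def rope(x, base=50000.0):
    """Apply rotary position embeddings with base 50000 (the PR's setting).
    x: (seq, dim), dim even; adjacent pairs of dims are rotated."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Position 0 is rotated by angle zero, so it is left unchanged, and rotations preserve vector norms.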
U-Net skip connections
Skip connections with learned weights
parameters: null
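U-Net-style skips in a transformer stack save activations from the first half of the blocks and add them back into the mirrored second half. A learned scalar per skip is one plausible reading of "learned weights"; the card gives no parameters.

```python
import numpy as np

def unet_forward(x, encoder_blocks, decoder_blocks, skip_weights):
    """U-Net skips across a block stack: first-half activations are added
    into the mirrored second half, scaled by a learned scalar per skip
    (a sketch; the skip parameterization is an assumption)."""
    saved = []
    for block in encoder_blocks:          # first half: save activations
        x = block(x)
        saved.append(x)
    for block, w in zip(decoder_blocks, skip_weights):
        x = x + w * saved.pop()           # mirror: last saved pairs first
        x = block(x)
    return x
```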
Initialization
OrthoInit
Orthogonal weight initialization with output scaling
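Orthogonal initialization draws a Gaussian matrix and keeps only its orthogonal factor from a QR decomposition; an output `gain` rescales it. The exact scaling rule in the PR is not specified, so `gain` here is a generic stand-in.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix, scaled by `gain`.
    Sign correction on R's diagonal makes the distribution uniform."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))           # fix column signs
    if not tall:
        q = q.T                           # orthonormal rows instead
    return gain * q
```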
Weight Averaging
EMA
parameters: {"decay":0.995}
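The EMA weight average with decay=0.995 is a one-line update applied after each optimizer step; the averaged copy, not the live weights, is what gets evaluated.

```python
def ema_update(ema_params, params, decay=0.995):
    """One EMA step with the PR's decay=0.995:
    ema <- decay * ema + (1 - decay) * current."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```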
Test-Time Training
full TTT
parameters: {"learning_rate":0.0003,"epochs":1,"momentum":0.95}
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Other
other
Context-length curriculum: train at seq1024 for first 60% of wallclock, then switch to seq2048
parameters: {"phase1_fraction":0.6}
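The curriculum switch described above is a simple wall-clock threshold: train at sequence length 1024 until phase1_fraction=0.6 of the budget has elapsed, then at 2048.

```python
def current_seq_length(elapsed, total, phase1_fraction=0.6,
                       short_len=1024, long_len=2048):
    """Context-length curriculum from the card: seq 1024 for the first
    60% of wall-clock, then seq 2048. elapsed/total in the same units."""
    return short_len if elapsed < phase1_fraction * total else long_len
```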
Novel Contributions
- RoPE base 50K for smoother position interpolation at sequence length 2048
- LAWA-EMA replacing periodic SWA with stepwise exponential moving average
- Context-length curriculum from seq1024 to seq2048 during training
- Full-model SGD test-time training on validation data before scoring