PR #1325

open

Record Submission: Poly5 Softcap + Z-Loss + YaRN + Zstd-22 + Stride-16 (on PR #549 stack)

by monisha-max
val_bpb: 1.3868
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 7.0 MB

Training Techniques

Regularization
logit softcap
parameters: {"type":"poly5"}
z-loss
parameters: {"type":"z-loss","weight":0.0001}
adaptive focal loss
parameters: {"gamma":1}
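The z-loss and focal terms above can be sketched in a few lines. This is a minimal pure-Python illustration, not the submission's implementation: z-loss penalizes the squared log-partition function with the listed weight 1e-4, and the focal factor (1 - p)**gamma with gamma=1 down-weights easy tokens. How the submission makes gamma "adaptive", and the exact poly5 softcap polynomial, are not specified, so neither is shown.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits], lse

def regularized_loss(logits, target, z_weight=1e-4, gamma=1.0):
    """Focal cross-entropy plus z-loss (sketch).

    z-loss = z_weight * logsumexp(logits)**2 keeps logits from drifting;
    the focal factor (1 - p)**gamma shrinks the loss on confident tokens.
    """
    logp, lse = log_softmax(logits)
    p = math.exp(logp[target])
    focal_ce = (1.0 - p) ** gamma * (-logp[target])
    z_loss = z_weight * lse ** 2
    return focal_ce + z_loss
```

On a confidently-correct token the focal factor makes the contribution near zero, while the z-loss term still nudges the overall logit scale down.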
Architecture
RoPE
YaRN positional encoding for improved frequency interpolation
parameters: {"max_len":2048}
BigramHash
Bigram vocabulary embedding component
parameters: {"size":1536}
SmearGate
SmearGate embedding/attention component
parameters: null
U-Net skip connections
U-Net style encoder-decoder skip connections with learned skip weights
parameters: {"encoders":5,"decoders":6}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"slope":0.5}
XSA
Cross/self attention variant used in the last 4 layers
parameters: {"last_layers":4}
VE128
Value embeddings at later layers
parameters: {"layers":[9,10]}
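For the YaRN entry, the core idea is per-frequency interpolation of the RoPE inverse frequencies: dimensions that rotate many times within the original context window are left alone (extrapolation), slow dimensions are interpolated by 1/scale, and a linear ramp blends the two. The sketch below follows the standard YaRN "NTK-by-parts" recipe; `dim`, `base`, `scale`, and the ramp boundaries are illustrative defaults, not values from this submission.

```python
import math

def yarn_inv_freqs(dim=64, base=10000.0, orig_len=2048, scale=4.0,
                   beta_fast=32.0, beta_slow=1.0):
    """Sketch of YaRN frequency interpolation for RoPE.

    Returns one adjusted inverse frequency per rotary dimension pair.
    keep=1 -> high-frequency dim, left as-is; keep=0 -> low-frequency
    dim, fully interpolated (divided by `scale`); ramp in between.
    """
    out = []
    for i in range(dim // 2):
        inv_freq = base ** (-2 * i / dim)
        # Number of full rotations this dim pair completes over orig_len.
        rotations = orig_len * inv_freq / (2 * math.pi)
        t = (rotations - beta_slow) / (beta_fast - beta_slow)
        keep = min(1.0, max(0.0, t))
        out.append(inv_freq * (keep + (1.0 - keep) / scale))
    return out
```

The full YaRN method also rescales attention temperature by roughly 0.1*ln(scale) + 1, omitted here for brevity.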
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":16}
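Sliding-window eval with stride=16 means each forward pass re-uses almost a full window of left context but scores only the newest 16 tokens, so nearly every token is predicted with close-to-maximal context at roughly window/stride times the compute of a non-overlapping pass. A sketch of the index bookkeeping (the window/stride defaults match the listed eval_length=2048 and stride=16):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=16):
    """(window_start, window_end, score_start) triples for sliding eval.

    The first window scores all of its tokens; each later window slides
    forward by `stride` and scores only the tokens not yet scored, so
    every token is counted exactly once toward the bpb total.
    """
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```
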
Other
other
FA3/FA2/SDPA fallback for broader GPU compatibility
parameters: null
other
Residual vector quantization using int6 base plus int4 residual
parameters: null
other
Progressive depth warmup with staged layer freezing/unfreezing during training
parameters: {"stages":3}
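The progressive depth warmup entry (stages=3) suggests a schedule that trains a shallow prefix of the network first and unfreezes deeper layers in stages. The submission does not specify the grouping or ordering, so the following is only a plausible sketch: split the layer stack into `stages` equal groups and unfreeze one more group at each stage boundary.

```python
def trainable_layers(step, total_steps, n_layers=12, stages=3):
    """Hypothetical progressive depth warmup schedule.

    Stage k (0-based) is active for steps in [k/stages, (k+1)/stages) of
    training; stage k trains the first (k+1)*n_layers/stages layers, and
    the final stage always trains the full stack.
    """
    stage = min(stages - 1, step * stages // total_steps)
    if stage == stages - 1:
        unfrozen = n_layers
    else:
        unfrozen = (n_layers // stages) * (stage + 1)
    return list(range(unfrozen))
```
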
Quantization
mixed int6/int4
bits: 6
scope: weights
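The "mixed int6/int4" quantization and the residual-vector-quantization contribution fit together as: a coarse 6-bit code per value plus a finer 4-bit code for what the coarse grid missed. A scalar sketch (scales are caller-supplied here; a real implementation would derive them from per-tensor or per-channel statistics):

```python
def rvq_int6_int4(x, scale6, scale4):
    """Residual quantization sketch: int6 base code plus int4 residual.

    x is reconstructed as q6*scale6 + q4*scale4, with q6 clamped to the
    signed 6-bit range [-32, 31] and q4 to the signed 4-bit range [-8, 7].
    scale4 should be much smaller than scale6 so the residual code
    refines the coarse grid.
    """
    q6 = max(-32, min(31, round(x / scale6)))
    residual = x - q6 * scale6
    q4 = max(-8, min(7, round(residual / scale4)))
    return q6, q4, q6 * scale6 + q4 * scale4
```
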
Test-Time Training
score-first TTT
parameters: {"epochs":3}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
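The two averaging rules listed are standard: an exponential moving average with the stated decay 0.997, and an equal-weight running (SWA-style) average over checkpoints. How the submission combines the two is not specified; the updates themselves look like this:

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: shadow <- decay*shadow + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def swa_update(swa, weights, n_averaged):
    """Equal-weight running average after n_averaged checkpoints."""
    return [(s * n_averaged + w) / (n_averaged + 1)
            for s, w in zip(swa, weights)]
```
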
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
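A "warmdown" schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps=3500 steps. A minimal sketch (any initial warmup the run may have used is omitted):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to 0 over the last
    `warmdown_steps` steps of training."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```
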
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • Adaptive focal cross-entropy loss
  • Residual vector quantization
  • Progressive depth warmup
  • Poly5 softcap
  • Z-loss regularization
  • YaRN positional encoding
  • zstd-22 compression
  • Sliding eval stride=16
  • FA3/FA2/SDPA fallback