PR #1623
openRecord submission: Distill+IntraLoop SP1024 9x512 (val_bpb=1.1942)
by divagr18
val_bpb: 1.1942
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6MB
Training Techniques
Architecture
depth recurrence
Partial intra-loop recurrence where layers 3-4 are executed twice, yielding 11 effective layers from 9 physical layers.
parameters: {"layers":[3,4],"repeats":2,"physical_layers":9,"effective_layers":11}
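A minimal sketch of the partial intra-loop recurrence described above: two middle layers are executed twice per forward pass, so 9 physical layers give 11 effective layers with no new parameters. Whether "layers 3-4" is 0- or 1-indexed in the submission is not stated; 0-indexing is assumed here, and the toy layers exist only to count executions.

```python
# Partial intra-loop depth recurrence (sketch): layers in RECURRENT are
# applied REPEATS times, so 9 physical layers -> 11 effective layers.
# Index base (0 vs 1) for "layers 3-4" is an assumption.

PHYSICAL_LAYERS = 9
RECURRENT = {3, 4}  # layers executed more than once (assumed 0-indexed)
REPEATS = 2

def forward(x, layers):
    """Apply each layer; recurrent layers are applied REPEATS times."""
    effective = 0
    for i, layer in enumerate(layers):
        passes = REPEATS if i in RECURRENT else 1
        for _ in range(passes):
            x = layer(x)
            effective += 1
    return x, effective

# Toy layers: layer i just adds i, which makes executions countable.
layers = [lambda x, i=i: x + i for i in range(PHYSICAL_LAYERS)]
out, eff = forward(0, layers)
print(eff)  # 11 effective layer applications from 9 physical layers
```

The extra depth costs only compute, not parameters, since the repeated layers reuse their existing weights.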
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
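The head layout above can be sketched as a simple query-to-KV mapping: with 8 query heads and 4 KV heads, each KV head serves a contiguous group of 2 query heads, halving the KV cache versus full multi-head attention. The grouping rule below is the standard contiguous assignment, assumed rather than confirmed by the submission.

```python
# Grouped Query Attention head mapping (sketch): each KV head is shared
# by GROUP consecutive query heads. Contiguous grouping is assumed.

QUERY_HEADS = 8
KV_HEADS = 4
GROUP = QUERY_HEADS // KV_HEADS  # query heads per KV head

def kv_head_for(q_head):
    return q_head // GROUP

mapping = {q: kv_head_for(q) for q in range(QUERY_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```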
weight tying
Input and output embeddings share weights.
parameters: null
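Weight tying amounts to the input embedding table and the output (unembedding) projection being one and the same tensor, so gradients from both positions flow into a single matrix and its parameter count is paid once. A toy illustration with plain lists:

```python
# Weight tying (sketch): input embedding and output head are the same
# object, so an update through either view is visible in the other.

VOCAB, DIM = 4, 3
embedding = [[0.0] * DIM for _ in range(VOCAB)]  # vocab x dim table
unembedding = embedding                          # tied: same object

embedding[2][1] = 5.0     # "train" the embedding row for token 2
print(unembedding[2][1])  # 5.0 -- the output head sees the update
```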
U-Net skip connections
Skip connections between the first and second half of the layer stack.
parameters: null
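A sketch of the U-Net-style wiring: activations from the first half of the stack are saved and added back to the mirrored layers in the second half. The exact pairing (here, last-saved pairs with the first second-half layer, and the middle layer of an odd stack gets no skip) is an assumption.

```python
# U-Net skip connections over a layer stack (sketch): first-half
# outputs are stashed and added to mirrored second-half inputs.
# Pairing order is an assumption; odd stacks leave the middle
# layer without a skip.

def forward(x, layers):
    n = len(layers)
    half = n // 2
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - half and saved:
            x = x + saved.pop()  # skip from the mirrored early layer
        x = layer(x)
        if i < half:
            saved.append(x)      # stash first-half activations
    return x

# Four toy layers, each adding 1; two skips re-add earlier activations.
out = forward(0, [lambda x: x + 1] * 4)
print(out)  # 7
```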
BigramHash
Residual bigram head mixed with model logits at inference time.
parameters: {"rank":32}
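A hedged sketch of the residual bigram head: a hashed table keyed by the previous token stores logit corrections that are added to the model's logits at inference time. The table size, hash scheme, and additive mixing below are assumptions; the record's `rank: 32` suggests the real table is low-rank factorized, which is omitted here for brevity.

```python
# Residual bigram head (sketch): logits from a hashed bigram table,
# keyed by the previous token, are added to the model's logits.
# Table size and hashing are illustrative assumptions.

VOCAB, TABLE_SIZE = 8, 16

bigram_table = [[0.0] * VOCAB for _ in range(TABLE_SIZE)]
bigram_table[hash(3) % TABLE_SIZE][5] = 1.5  # "learned" offset: 3 -> 5

def mix_logits(model_logits, prev_token):
    row = bigram_table[hash(prev_token) % TABLE_SIZE]
    return [m + b for m, b in zip(model_logits, row)]

logits = mix_logits([0.0] * VOCAB, prev_token=3)
print(logits[5])  # 1.5
```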
Weight Averaging
EMA
Exponential moving average of model weights; also used as a self-distillation teacher during the final portion of training.
parameters: {"decay":0.999,"weight":0.08,"temp":2,"start_frac":0.7}
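A minimal sketch of the EMA average and its assumed use for self-distillation. The parameter reading here is an assumption: `decay=0.999` is the EMA decay, `weight=0.08` scales a distillation loss against the EMA teacher, `temp=2` would soften the logits for that loss, and `start_frac=0.7` gates when distillation begins. The KL term itself is stubbed out.

```python
# EMA weight averaging with late-start self-distillation (sketch).
# Parameter semantics are assumptions from the record's JSON; TEMP is
# noted but unused here since the soft-target loss is precomputed.

DECAY, DISTILL_WEIGHT, TEMP, START_FRAC = 0.999, 0.08, 2.0, 0.7

def ema_update(ema, weights, decay=DECAY):
    """One EMA step over a flat parameter list."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

def total_loss(ce_loss, distill_loss, step, total_steps):
    """Mix in the distillation loss only after 70% of training."""
    if step < START_FRAC * total_steps:
        return ce_loss
    return ce_loss + DISTILL_WEIGHT * distill_loss

ema = ema_update([1.0, 2.0], [2.0, 0.0])
print(ema)                                             # ~ [1.001, 1.998]
print(total_loss(2.0, 1.0, step=80, total_steps=100))  # 2.08
```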
SWA
Stochastic Weight Averaging over periodic weight snapshots.
parameters: {"snapshots":282}
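The snapshot average can be maintained as a running mean, so none of the 282 snapshots needs to be stored individually. A small sketch (uniform averaging is assumed):

```python
# Stochastic Weight Averaging (sketch): running uniform mean over
# periodic weight snapshots, updated in place.

class SWA:
    def __init__(self, n_params):
        self.avg = [0.0] * n_params
        self.count = 0

    def update(self, weights):
        """Fold one snapshot into the running mean."""
        self.count += 1
        self.avg = [a + (w - a) / self.count
                    for a, w in zip(self.avg, weights)]

swa = SWA(2)
for snap in ([1.0, 4.0], [3.0, 0.0]):
    swa.update(snap)
print(swa.avg)  # [2.0, 2.0]
```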
Quantization
GPTQ
bits: 8
scope: all
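GPTQ proper quantizes weights column by column with Hessian-based error compensation, which is too long to sketch here; the snippet below shows only the int8 round trip that the `bits: 8` setting implies, using plain symmetric per-row round-to-nearest as a simplified stand-in.

```python
# Simplified int8 round trip (round-to-nearest stand-in, NOT full
# GPTQ): symmetric per-row scale, clamp to [-127, 127], dequantize.

def quantize_int8(row):
    scale = max(abs(w) for w in row) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.4, -1.0, 0.25]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
print(q)  # [51, -127, 32]
```

The "low roundtrip penalty" claimed under Novel Contributions refers to the small gap between `row` and `restored` after this kind of quantize/dequantize cycle.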
Compression
zstd
level: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for":["embeddings","scalar parameters"]}
Other
other
QK-Gain initialization with learnable per-head gain parameters for query and key projections.
parameters: {"init":5}
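A hedged sketch of the QK-Gain idea: each attention head carries a learnable scalar gain applied to its query and key projections. Following the record's `init: 5`, the gains start at 5.0; whether that value is the raw gain or feeds some transform is an assumption.

```python
# QK-Gain (sketch): one learnable scalar per head scales that head's
# query and key vectors. Initial value 5.0 follows the record's
# "init": 5; its exact semantics are an assumption.

N_HEADS = 8
qk_gain = [5.0] * N_HEADS  # learnable per-head gains

def scaled_qk(q, k, head):
    g = qk_gain[head]
    return [g * x for x in q], [g * x for x in k]

q, k = scaled_qk([1.0, 2.0], [0.5, 0.0], head=0)
print(q, k)  # [5.0, 10.0] [2.5, 0.0]
```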
other
SwiGLU activation in the MLP.
parameters: null
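SwiGLU is the standard gated MLP activation: the hidden projection is split into a value half and a gate half, and the gate passes through SiLU (`x * sigmoid(x)`) before the elementwise product. A toy-dimension sketch:

```python
# SwiGLU activation (sketch): value * SiLU(gate), elementwise.

import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(value, gate):
    return [v * silu(g) for v, g in zip(value, gate)]

out = swiglu([1.0, 2.0], [0.0, 10.0])
print(out)  # ~ [0.0, 19.999]
```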
Novel Contributions
- Partial depth recurrence applied only to middle layers at near-zero parameter cost
- EMA self-distillation during the final portion of training
- GPTQ int8 post-training quantization with low roundtrip penalty
- Combination of SWA, QK-Gain, GQA, SwiGLU, Muon, tied embeddings, and residual bigram head