PR #1828 (open)

Non-record: ETD Hybrid (3 enc + 3×3 think + 4 dec) + Int5 GPTQ + MuonEqR + U-Net — 1.1169 BPB

val_bpb: 1.1169
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,865,354 bytes

Training Techniques

Architecture
Hybrid ETD Transformer
Encode-Think-Decode transformer with 3 unique encoder blocks, 3 shared think blocks looped for 3 passes, and 4 unique decoder blocks.
parameters: {"encoder_layers":3,"think_layers":3,"think_passes":3,"decoder_layers":4,"d_model":512,"heads":8,"kv_heads":4}
U-Net skip connections
Skip connections bridge encoder outputs to matching decoder layers with learned per-channel weights.
parameters: {"skips":3}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP, i.e. LeakyReLU(x, slope=0.5)^2 applied elementwise.
parameters: {"slope":0.5}
weight tying
Token embedding and LM head are tied.
parameters: null
RoPE
Rotary positional embeddings with base 10000.
parameters: {"base":10000}
logit softcap
Soft-caps the output logits as value * tanh(logits / value), keeping them in (-30, 30) to stabilize output scaling.
parameters: {"value":30}
pass embedding
Learned embedding added at each think pass to distinguish recurrence iterations.
parameters: {"num_passes":3}
Quantization
GPTQ
bits: 5
scope: matrices
int8
bits: 8
scope: embeddings
mixed int5/int8
bits: null
scope: matrices + embeddings
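GPTQ proper quantizes columns sequentially with Hessian-weighted error feedback, which does not fit in a short snippet; as an illustration of the int5 grid alone, a per-output-channel round-to-nearest sketch:

```python
import torch

def int5_rtn(w):
    # Illustrative symmetric round-to-nearest onto the int5 grid [-16, 15],
    # scaled per output channel. GPTQ replaces this naive rounding with
    # sequential, Hessian-weighted error correction; the grid is the same.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 15.0
    q = (w / scale).round().clamp(-16, 15).to(torch.int8)
    return q, scale                          # dequantize as q * scale
```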
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"row_norm":true,"lr":0.02,"momentum_warmup":[0.92,0.99],"warmup_steps":1500}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"tied_embed_lr":0.03,"scalar_lr":0.02,"betas":[0.9,0.95],"embed_weight_decay":0.085}
Compression
Brotli
level: 11
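With the Python brotli bindings (file names hypothetical):

```python
import brotli

with open("artifact.bin", "rb") as f:        # packed weight bytes
    blob = f.read()
with open("artifact.br", "wb") as f:
    f.write(brotli.compress(blob, quality=11))
```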
Other
other
QuIP-style Randomized Hadamard Transform applied before GPTQ to reduce quantization error.
parameters: null
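A sketch of the transform, assuming it is applied per weight matrix along the input dimension (which must be a power of two for a plain Hadamard matrix):

```python
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(W, seed=0):
    # Right-multiply by diag(signs) @ H / sqrt(n): random sign flips, then an
    # orthogonal Hadamard rotation. Outlier weights get spread across
    # coordinates, so a coarse int5 grid loses less; invert after
    # dequantization (the signs and H are their own inverses up to scaling).
    n = W.shape[1]                            # assumed to be a power of two
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)
    return (W * signs) @ H
```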
other
Byte-shuffle preprocessing before Brotli compression to improve entropy coding.
parameters: null
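A sketch in the spirit of the HDF5 shuffle filter, assuming 2-byte items purely for illustration (the actual grouping used here is not stated):

```python
import numpy as np

def byte_shuffle(buf, itemsize=2):
    # Regroup the k-th byte of every item together so the entropy coder
    # sees long runs of similar-significance bytes.
    a = np.frombuffer(buf, dtype=np.uint8)
    return a.reshape(-1, itemsize).T.tobytes()
```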
LR Schedule
warmdown
parameters: {"warmdown_frac":0.4}
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Hybrid Encode-Think-Decode transformer with unique encoder and decoder blocks plus shared looped think blocks
  • U-Net skip connections between encoder and decoder stages
  • Progressive recurrence with think-pass embedding
  • Int5 GPTQ for matrices with int8 embeddings
  • QuIP-style Randomized Hadamard Transform before quantization
  • Byte-shuffle plus Brotli-11 compression pipeline
  • MuonEqR training with EMA and warmdown