PR #1828 (open)

Non-record: ETD Hybrid (3 enc + 3×3 think + 4 dec) + Int5 GPTQ + MuonEqR + U-Net — 1.1169 BPB

val_bpb: 1.1169
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,865,354 bytes

Training Techniques

Architecture
Hybrid ETD Transformer
Encode-Think-Decode transformer with 3 unique encoder blocks, 3 shared think blocks looped for 3 passes, and 4 unique decoder blocks.
parameters: {"encoder_layers":3,"think_layers":3,"think_passes":3,"decoder_layers":4,"d_model":512,"heads":8,"kv_heads":4}
U-Net skip connections
Skip connections bridge encoder outputs to matching decoder layers with learned per-channel weights.
parameters: {"skips":3}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP, i.e. LeakyReLU(x, slope=0.5)^2 applied elementwise.
parameters: {"slope":0.5}
weight tying
Token embedding and LM head are tied.
parameters: null
RoPE
Rotary positional embeddings with base 10000.
parameters: {"base":10000}
logit softcap
Soft-caps the output logits as value * tanh(logits / value), keeping them in (-30, 30) to stabilize output scaling.
parameters: {"value":30}
pass embedding
Learned embedding added at each think pass to distinguish recurrence iterations.
parameters: {"num_passes":3}
Quantization
GPTQ
bits: 5
scope: matrices
int8
bits: 8
scope: embeddings
mixed int5/int8
bits: null
scope: matrices + embeddings
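GPTQ proper quantizes columns sequentially with Hessian-weighted error feedback, which does not fit in a short snippet; as an illustration of the int5 grid alone, a per-output-channel round-to-nearest sketch:

```python
import torch

def int5_rtn(w):
    # Illustrative symmetric round-to-nearest onto the int5 grid [-16, 15],
    # scaled per output channel. GPTQ replaces this naive rounding with
    # sequential, Hessian-weighted error correction; the grid is the same.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 15.0
    q = (w / scale).round().clamp(-16, 15).to(torch.int8)
    return q, scale                          # dequantize as q * scale
```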
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"row_norm":true,"lr":0.02,"momentum_warmup":[0.92,0.99],"warmup_steps":1500}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"tied_embed_lr":0.03,"scalar_lr":0.02,"betas":[0.9,0.95],"embed_weight_decay":0.085}
Compression
Brotli
level: 11
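With the Python brotli bindings (file names hypothetical):

```python
import brotli

with open("artifact.bin", "rb") as f:        # packed weight bytes
    blob = f.read()
with open("artifact.br", "wb") as f:
    f.write(brotli.compress(blob, quality=11))
```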
Other
other
QuIP-style Randomized Hadamard Transform applied before GPTQ to reduce quantization error.
parameters: null
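A sketch of the transform, assuming it is applied per weight matrix along the input dimension (which must be a power of two for a plain Hadamard matrix):

```python
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(W, seed=0):
    # Right-multiply by diag(signs) @ H / sqrt(n): random sign flips, then an
    # orthogonal Hadamard rotation. Outlier weights get spread across
    # coordinates, so a coarse int5 grid loses less; invert after
    # dequantization (the signs and H are their own inverses up to scaling).
    n = W.shape[1]                            # assumed to be a power of two
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)
    return (W * signs) @ H
```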
other
Byte-shuffle preprocessing before Brotli compression to improve entropy coding.
parameters: null
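A sketch in the spirit of the HDF5 shuffle filter, assuming 2-byte items purely for illustration (the actual grouping used here is not stated):

```python
import numpy as np

def byte_shuffle(buf, itemsize=2):
    # Regroup the k-th byte of every item together so the entropy coder
    # sees long runs of similar-significance bytes.
    a = np.frombuffer(buf, dtype=np.uint8)
    return a.reshape(-1, itemsize).T.tobytes()
```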
LR Schedule
warmdown
parameters: {"warmdown_frac":0.4}
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Hybrid Encode-Think-Decode transformer with unique encoder and decoder blocks plus shared looped think blocks
  • U-Net skip connections between encoder and decoder stages
  • Progressive recurrence with think-pass embedding
  • Int5 GPTQ for matrices with int8 embeddings
  • QuIP-style Randomized Hadamard Transform before quantization
  • Byte-shuffle plus Brotli-11 compression pipeline
  • MuonEqR training with EMA and warmdown