val_bpb: 1.1412
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,841,922 bytes
Training Techniques
Architecture
depth recurrence
Universal Transformer-style recurrence with 6 unique blocks applied 4 times each for 24 effective layers.
parameters: {"unique_blocks":6,"recurrence_steps":4,"effective_layers":24}
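A minimal sketch of the recurrence schedule, in plain Python; the toy block functions below stand in for full transformer blocks and are purely illustrative:

```python
def run_recurrent_stack(x, blocks, recurrence_steps=4):
    """Apply each unique block once per recurrence step.

    With 6 unique blocks and 4 steps this yields 6 * 4 = 24 effective
    layers while storing parameters for only 6 blocks (weight tying).
    """
    for _ in range(recurrence_steps):
        for block in blocks:      # the same 6 blocks are reused every step
            x = block(x)
    return x

# Toy blocks: each just perturbs the state; real blocks are transformers.
blocks = [lambda v, i=i: [e + 0.01 * i for e in v] for i in range(6)]
out = run_recurrent_stack([1.0, 2.0], blocks)
```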
weight tying
Shared block weights reused across recurrence steps.
parameters: null
BigramHash
Bigram context embeddings indexed by hashed token bigrams.
parameters: {"buckets":2048,"dim":512}
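A sketch of the hashed-bigram lookup with the bucket count and dimension above; the mixing hash is an assumption, as the card does not specify the hash function:

```python
import random

BUCKETS, DIM = 2048, 512  # from the parameters above
random.seed(0)
bigram_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                for _ in range(BUCKETS)]

def bigram_bucket(prev_token: int, token: int) -> int:
    """Hash a token bigram into one of BUCKETS embedding rows.
    Illustrative mixing hash; the real hash is not specified."""
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % BUCKETS

def bigram_embedding(tokens, pos):
    """Look up the bigram embedding for position `pos`."""
    prev_token = tokens[pos - 1] if pos > 0 else 0
    return bigram_table[bigram_bucket(prev_token, tokens[pos])]
```

The looked-up row would typically be added to the ordinary token embedding at each position, giving cheap local-context signal at a cost of only 2048 x 512 extra parameters.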
U-Net skip connections
Stored intermediate states are reused in reverse order during later recurrence steps.
parameters: {"steps":24}
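A minimal sketch of the skip pattern: states saved during the first half of the recurrence are added back in reverse (LIFO) order during the second half. Whether skips are applied before or after each block, and how they interleave with the 6-block schedule, is not specified here, so the details below are assumptions:

```python
def run_unet_recurrence(x, block, total_steps=24):
    """Recurrence loop with U-Net-style skips: states stored in the
    first half are reused in reverse order in the second half."""
    half = total_steps // 2
    saved = []
    for _ in range(half):
        saved.append(x)                    # store intermediate state
        x = block(x)
    for _ in range(half):
        x = block(x)
        skip = saved.pop()                 # reverse order (LIFO)
        x = [a + b for a, b in zip(x, skip)]
    return x
```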
LeakyReLU
Uses LeakyReLU(0.5) squared activation in the MLP.
parameters: {"alpha":0.5}
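The activation itself is one line: apply LeakyReLU with negative slope 0.5, then square the result.

```python
def leaky_relu_squared(x: float, alpha: float = 0.5) -> float:
    """Squared LeakyReLU: LeakyReLU(alpha)(x) ** 2.

    Negative inputs are damped by alpha before squaring, so they
    contribute alpha**2 * x**2 = 0.25 * x**2 to the output.
    """
    y = x if x > 0 else alpha * x
    return y * y

leaky_relu_squared(2.0)   # -> 4.0
leaky_relu_squared(-2.0)  # -> 1.0
```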
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"kv_heads":4}
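The head grouping implied by these parameters can be written out directly: consecutive query heads share one KV head, halving the KV cache relative to full multi-head attention.

```python
def kv_head_for_query_head(q_head: int, num_heads: int = 8, kv_heads: int = 4) -> int:
    """GQA head mapping: with 8 query heads and 4 KV heads, each KV
    head serves 8 // 4 = 2 consecutive query heads."""
    group_size = num_heads // kv_heads
    return q_head // group_size

# Query heads 0..7 map onto KV heads [0, 0, 1, 1, 2, 2, 3, 3].
mapping = [kv_head_for_query_head(h) for h in range(8)]
```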
Quantization
INT6 QAT
bits: 6
scope: block weights
STE QAT
bits: 6
scope: all linear weights
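A sketch of the symmetric INT6 fake-quantization forward pass used in QAT. The per-tensor absmax scaling is an assumption; in training, the straight-through estimator (STE) treats the round/clamp as identity so gradients flow to the latent full-precision weights.

```python
def fake_quant_int6(w):
    """Symmetric INT6 fake quantization (forward pass only).

    Weights are scaled by absmax / 31 (symmetric int6 uses +/-31),
    rounded, clamped, and rescaled back to floats on the int6 grid.
    """
    qmax = 31                                   # 2**(6 - 1) - 1
    absmax = max(abs(v) for v in w) or 1.0
    scale = absmax / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in w]
    return [qi * scale for qi in q]
```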
GPTQ
bits: 6
scope: linear layers
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.04}
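Muon accumulates momentum per 2-D weight matrix and then approximately orthogonalizes the update with a Newton-Schulz iteration. The sketch below uses the widely published quintic coefficients and a placeholder momentum value (the card leaves momentum unspecified); weight decay is applied in decoupled form, which is an assumption:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    the momentum matrix G (the core of Muon). Coefficients are the
    common quintic variant; the exact values used here are assumed."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.04, momentum=0.95, weight_decay=0.04):
    """One Muon update with decoupled weight decay (lr and weight_decay
    from the card; the momentum value is a placeholder)."""
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return w, buf
```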
Compression
zlib
level: 9
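The final artifact compression is stock `zlib` at its maximum level:

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    """Compress the serialized model bytes with zlib at level 9 (max)."""
    return zlib.compress(raw, level=9)

blob = bytes(range(256)) * 64            # stand-in for the model bytes
packed = compress_artifact(blob)
restored = zlib.decompress(packed)       # round-trips losslessly
```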
Evaluation
sliding window eval
parameters: {"stride":64}
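With stride 64 and a 1024-token window, each window scores only the tokens not covered by the previous window, so every scored token after the first window sees at least 1024 - 64 = 960 tokens of context. A sketch of the span bookkeeping (the exact implementation is not specified):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end); only tokens in
    [score_from, end) contribute to the loss, so every token is
    scored exactly once."""
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```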
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Universal Transformer-style depth recurrence for parameter sharing across layers
- FiLM conditioning to specialize shared blocks across recurrence steps
- U-Net skip connections adapted to the recurrence loop
- BigramHash embeddings for cheap local context modeling
- LeakyReLU(0.5)^2 activation for tied-weight recurrence
- INT6 QAT with STE and GPTQ-style per-row clipping
- Muon optimization with weight decay tuned for compression
- Sliding-window evaluation with stride 64