PR #927

open

Recursive Transformer 4B/7L + VE + QAT + TTT — val_bpb 1.1696 (3-seed mean)

by Tonyy1977
val_bpb
1.1696
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.85MB

Training Techniques

Architecture
depth recurrence
Four shared transformer blocks are looped 7 times, giving 28 block applications (effective depth) from a single set of 4 blocks' weights.
parameters: {"blocks":4,"loops":7,"dim":1024}
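The control flow of depth recurrence can be sketched in a few lines; this is a minimal stand-in (simple callables instead of real transformer blocks, names illustrative) showing how 4 blocks looped 7 times yield 28 applications of the same weights.

```python
BLOCKS = 4
LOOPS = 7

def run_recurrent(x, blocks, loops=LOOPS):
    """Apply the same shared block stack `loops` times in sequence."""
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x

# Toy blocks that just count how often the shared weights are reused.
calls = []
blocks = [lambda x, i=i: (calls.append(i), x + 1)[1] for i in range(BLOCKS)]

out = run_recurrent(0, blocks)
assert out == 28 and len(calls) == 28  # 4 blocks * 7 loops
```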
U-Net skip connections
Encoder-decoder skip connections across loop iterations with learnable skip weights.
parameters: {"encoder_loops":3,"decoder_loops":4}
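One plausible wiring for skips across loop iterations, sketched under the assumption that the first 3 "encoder" loop outputs are stacked and popped into the later "decoder" loops with a learnable scalar weight per skip; the exact wiring in the PR may differ.

```python
ENCODER_LOOPS, DECODER_LOOPS = 3, 4

def run_unet_loops(x, block, skip_weights):
    """U-Net-style skips across recursive loops (illustrative wiring)."""
    stack = []
    for _ in range(ENCODER_LOOPS):
        x = block(x)
        stack.append(x)               # save encoder-loop output
    for i in range(DECODER_LOOPS):
        if stack:                     # 3 skips feed the first 3 decoder loops
            x = x + skip_weights[i] * stack.pop()  # learnable blend
        x = block(x)
    return x
```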
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":32,"kv_heads":8}
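The head grouping implied by {"heads":32,"kv_heads":8} can be shown with simple index arithmetic: each group of 4 query heads shares one KV head, shrinking the KV cache 4x.

```python
HEADS, KV_HEADS = 32, 8
GROUP = HEADS // KV_HEADS  # 4 query heads per shared KV head

def kv_head_for(query_head: int) -> int:
    """Which shared KV head a given query head attends with."""
    return query_head // GROUP

assert [kv_head_for(h) for h in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]
assert kv_head_for(31) == 7
```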
XSA
Cross-Sequence Attention applied in the last 4 loops.
parameters: {"last_n":4}
VE128
ValueEmbedding reinjects token identity into later loops.
parameters: {"dim":128,"last_n":2}
SmearGate
Learned per-dimension gate blending current token with previous token information.
parameters: null
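A minimal sketch of the gate described above, assuming the per-dimension gate is a sigmoid of a learned vector (that parameterization is an assumption, not stated in the PR): each token's vector is blended with the previous token's.

```python
import math

def smear_gate(xs, gate_logits):
    """xs: list of per-token vectors. Blend each token with its predecessor
    using a learned per-dimension gate g = sigmoid(gate_logits)."""
    g = [1 / (1 + math.exp(-l)) for l in gate_logits]
    out = [xs[0]]  # first token has no predecessor; pass through
    for t in range(1, len(xs)):
        out.append([g[d] * xs[t][d] + (1 - g[d]) * xs[t - 1][d]
                    for d in range(len(g))])
    return out
```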
BigramHash
Hash-based bigram embedding using previous and current tokens.
parameters: {"buckets":10240,"dim":128}
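The bucketing step of a hash-based bigram embedding is just a deterministic hash of the (previous, current) pair into one of 10240 buckets, each bucket owning a 128-dim vector; the mixing constants below are illustrative, not taken from the PR.

```python
BUCKETS, BIGRAM_DIM = 10240, 128

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Deterministic hash of a token bigram into a bucket index."""
    h = (prev_id * 1_000_003 + cur_id) * 2_654_435_761  # illustrative mix
    return h % BUCKETS
```

In the model, the bucket index would look up a row of a (10240, 128) embedding table that is concatenated or added to the token's input features.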
Quantization
STE QAT
bits: 6
scope: large weight matrices
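Fake-quantization for int6 QAT can be sketched as symmetric per-tensor rounding to a 6-bit grid in the forward pass; with a straight-through estimator the rounding is treated as identity in the backward pass (no autograd here, so STE is only noted in a comment).

```python
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed 6-bit

def fake_quantize(ws):
    """Round a list of floats to the int6 grid and back (per-tensor scale).
    Under STE, gradients would flow through this as if it were identity."""
    scale = max(abs(w) for w in ws) / QMAX or 1.0  # avoid zero scale
    return [round(w / scale) * scale for w in ws]
```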
GPTQ-lite
bits: 8
scope: final artifact
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
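The index bookkeeping for sliding-window evaluation with stride 64 looks like this sketch: each step scores only the next `stride` tokens but conditions on up to `window` tokens of context, so every token is scored exactly once.

```python
def eval_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, score_from): condition on [start, end),
    score only tokens in [score_from, end)."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        yield (start, end, pos)
        pos = end
```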
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
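The "score-first" ordering is the key point: each 32768-token chunk is scored with the current weights before the model takes gradient steps on it, so evaluation never sees a chunk it has already trained on. A sketch with stand-in `score` and `train_steps` callables (the real model calls are not shown in the PR):

```python
CHUNK_TOKENS, EPOCHS, LR = 32768, 3, 2e-3

def ttt(chunks, score, train_steps):
    """Score each chunk first, then adapt on it before the next chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score(chunk)                 # score with frozen weights
        train_steps(chunk, epochs=EPOCHS, lr=LR)   # then fine-tune on it
    return total_loss
```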
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.01,"tied_embedding_lr":0.02,"grad_clip":0.3}
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
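With {"start_frac":0.2,"every":50}, SWA keeps a running average of the weights, updated every 50 steps once 20% of training has elapsed; the averaged weights become the final checkpoint. A sketch over plain float lists:

```python
def swa_average(weight_snapshots, total_steps, start_frac=0.2, every=50):
    """Running average over qualifying steps; snapshots is {step: [floats]}."""
    avg, n = None, 0
    for step in sorted(weight_snapshots):
        if step >= start_frac * total_steps and step % every == 0:
            w = weight_snapshots[step]
            n += 1
            avg = w[:] if avg is None else [a + (x - a) / n
                                            for a, x in zip(avg, w)]
    return avg
```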
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"warmup_steps":100}
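The schedule implied by {"warmup_steps":100,"warmdown_steps":3500} is a short linear warmup, a constant phase, and a linear decay ("warmdown") to zero over the final 3500 steps; sketched as a pure function of the step:

```python
WARMUP, WARMDOWN = 100, 3500

def lr_at(step, total_steps, base_lr):
    """Trapezoidal schedule: linear warmup, flat middle, linear warmdown."""
    if step < WARMUP:
        return base_lr * (step + 1) / WARMUP
    if step >= total_steps - WARMDOWN:
        return base_lr * (total_steps - step) / WARMDOWN  # decays to ~0
    return base_lr
```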
Regularization
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768

Novel Contributions

  • Recursive transformer with 4 shared blocks looped 7 times for 7x weight reuse
  • Width-over-depth design using dim=1024 while staying under the 16MB limit
  • U-Net encoder-decoder skip connections across recursive loops
  • Int6 QAT from step 0 to prevent compounding quantization error in recursive weight reuse
  • ValueEmbedding to reinject token identity in later loops
  • SmearGate, BigramHash, and XSA used in the later loops
  • Score-first test-time training combined with sliding window evaluation