PR #927

open

Recursive Transformer 4B/7L + VE + QAT + TTT — val_bpb 1.1696 (3-seed mean)

by Tonyy1977
val_bpb
1.1696
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.85MB

Training Techniques

Architecture
depth recurrence
Four shared transformer blocks are looped 7 times, giving 28 block applications (effective depth) from a single set of 4 blocks' weights.
parameters: {"blocks":4,"loops":7,"dim":1024}
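The control flow of depth recurrence can be sketched in a few lines; this is a minimal stand-in (simple callables instead of real transformer blocks, names illustrative) showing how 4 blocks looped 7 times yield 28 applications of the same weights.

```python
BLOCKS = 4
LOOPS = 7

def run_recurrent(x, blocks, loops=LOOPS):
    """Apply the same shared block stack `loops` times in sequence."""
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x

# Toy blocks that just count how often the shared weights are reused.
calls = []
blocks = [lambda x, i=i: (calls.append(i), x + 1)[1] for i in range(BLOCKS)]

out = run_recurrent(0, blocks)
assert out == 28 and len(calls) == 28  # 4 blocks * 7 loops
```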
U-Net skip connections
Encoder-decoder skip connections across loop iterations with learnable skip weights.
parameters: {"encoder_loops":3,"decoder_loops":4}
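One plausible wiring for skips across loop iterations, sketched under the assumption that the first 3 "encoder" loop outputs are stacked and popped into the later "decoder" loops with a learnable scalar weight per skip; the exact wiring in the PR may differ.

```python
ENCODER_LOOPS, DECODER_LOOPS = 3, 4

def run_unet_loops(x, block, skip_weights):
    """U-Net-style skips across recursive loops (illustrative wiring)."""
    stack = []
    for _ in range(ENCODER_LOOPS):
        x = block(x)
        stack.append(x)               # save encoder-loop output
    for i in range(DECODER_LOOPS):
        if stack:                     # 3 skips feed the first 3 decoder loops
            x = x + skip_weights[i] * stack.pop()  # learnable blend
        x = block(x)
    return x
```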
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":32,"kv_heads":8}
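The head grouping implied by {"heads":32,"kv_heads":8} can be shown with simple index arithmetic: each group of 4 query heads shares one KV head, shrinking the KV cache 4x.

```python
HEADS, KV_HEADS = 32, 8
GROUP = HEADS // KV_HEADS  # 4 query heads per shared KV head

def kv_head_for(query_head: int) -> int:
    """Which shared KV head a given query head attends with."""
    return query_head // GROUP

assert [kv_head_for(h) for h in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]
assert kv_head_for(31) == 7
```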
XSA
Cross-Sequence Attention applied in the last 4 loops.
parameters: {"last_n":4}
VE128
ValueEmbedding reinjects token identity into later loops.
parameters: {"dim":128,"last_n":2}
SmearGate
Learned per-dimension gate blending current token with previous token information.
parameters: null
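A minimal sketch of the gate described above, assuming the per-dimension gate is a sigmoid of a learned vector (that parameterization is an assumption, not stated in the PR): each token's vector is blended with the previous token's.

```python
import math

def smear_gate(xs, gate_logits):
    """xs: list of per-token vectors. Blend each token with its predecessor
    using a learned per-dimension gate g = sigmoid(gate_logits)."""
    g = [1 / (1 + math.exp(-l)) for l in gate_logits]
    out = [xs[0]]  # first token has no predecessor; pass through
    for t in range(1, len(xs)):
        out.append([g[d] * xs[t][d] + (1 - g[d]) * xs[t - 1][d]
                    for d in range(len(g))])
    return out
```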
BigramHash
Hash-based bigram embedding using previous and current tokens.
parameters: {"buckets":10240,"dim":128}
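The bucketing step of a hash-based bigram embedding is just a deterministic hash of the (previous, current) pair into one of 10240 buckets, each bucket owning a 128-dim vector; the mixing constants below are illustrative, not taken from the PR.

```python
BUCKETS, BIGRAM_DIM = 10240, 128

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Deterministic hash of a token bigram into a bucket index."""
    h = (prev_id * 1_000_003 + cur_id) * 2_654_435_761  # illustrative mix
    return h % BUCKETS
```

In the model, the bucket index would look up a row of a (10240, 128) embedding table that is concatenated or added to the token's input features.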
Quantization
STE QAT
bits: 6
scope: large weight matrices
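Fake-quantization for int6 QAT can be sketched as symmetric per-tensor rounding to a 6-bit grid in the forward pass; with a straight-through estimator the rounding is treated as identity in the backward pass (no autograd here, so STE is only noted in a comment).

```python
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for signed 6-bit

def fake_quantize(ws):
    """Round a list of floats to the int6 grid and back (per-tensor scale).
    Under STE, gradients would flow through this as if it were identity."""
    scale = max(abs(w) for w in ws) / QMAX or 1.0  # avoid zero scale
    return [round(w / scale) * scale for w in ws]
```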
GPTQ-lite
bits: 8
scope: final artifact
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
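The index bookkeeping for sliding-window evaluation with stride 64 looks like this sketch: each step scores only the next `stride` tokens but conditions on up to `window` tokens of context, so every token is scored exactly once.

```python
def eval_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, score_from): condition on [start, end),
    score only tokens in [score_from, end)."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        yield (start, end, pos)
        pos = end
```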
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
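The "score-first" ordering is the key point: each 32768-token chunk is scored with the current weights before the model takes gradient steps on it, so evaluation never sees a chunk it has already trained on. A sketch with stand-in `score` and `train_steps` callables (the real model calls are not shown in the PR):

```python
CHUNK_TOKENS, EPOCHS, LR = 32768, 3, 2e-3

def ttt(chunks, score, train_steps):
    """Score each chunk first, then adapt on it before the next chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score(chunk)                 # score with frozen weights
        train_steps(chunk, epochs=EPOCHS, lr=LR)   # then fine-tune on it
    return total_loss
```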
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.01,"tied_embedding_lr":0.02,"grad_clip":0.3}
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
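With {"start_frac":0.2,"every":50}, SWA keeps a running average of the weights, updated every 50 steps once 20% of training has elapsed; the averaged weights become the final checkpoint. A sketch over plain float lists:

```python
def swa_average(weight_snapshots, total_steps, start_frac=0.2, every=50):
    """Running average over qualifying steps; snapshots is {step: [floats]}."""
    avg, n = None, 0
    for step in sorted(weight_snapshots):
        if step >= start_frac * total_steps and step % every == 0:
            w = weight_snapshots[step]
            n += 1
            avg = w[:] if avg is None else [a + (x - a) / n
                                            for a, x in zip(avg, w)]
    return avg
```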
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"warmup_steps":100}
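The schedule implied by {"warmup_steps":100,"warmdown_steps":3500} is a short linear warmup, a constant phase, and a linear decay ("warmdown") to zero over the final 3500 steps; sketched as a pure function of the step:

```python
WARMUP, WARMDOWN = 100, 3500

def lr_at(step, total_steps, base_lr):
    """Trapezoidal schedule: linear warmup, flat middle, linear warmdown."""
    if step < WARMUP:
        return base_lr * (step + 1) / WARMUP
    if step >= total_steps - WARMDOWN:
        return base_lr * (total_steps - step) / WARMDOWN  # decays to ~0
    return base_lr
```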
Regularization
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768

Novel Contributions

  • Recursive transformer with 4 shared blocks looped 7 times for 7x weight reuse
  • Width-over-depth design using dim=1024 while staying under the 16MB limit
  • U-Net encoder-decoder skip connections across recursive loops
  • Int6 QAT from step 0 to prevent compounding quantization error in recursive weight reuse
  • ValueEmbedding to reinject token identity in later loops
  • SmearGate, BigramHash, and XSA used in the later loops
  • Score-first test-time training combined with sliding window evaluation