PR #1579

open

Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372

by Tonyy1977
val_bpb
1.1372
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.03 MB

Training Techniques

Architecture
depth recurrence
Crawler/recursive transformer with shared blocks applied in loops: 3 flat blocks plus 2 crawler blocks repeated twice, for an effective depth of 7.
parameters: {"blocks":4,"loops":7,"dimension":736}
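A minimal sketch of the looped schedule the description implies (hypothetical helper; block internals omitted): the flat blocks run once, the crawler blocks are reused each loop.

```python
def crawler_forward(x, flat_blocks, crawler_blocks, crawler_loops=2):
    """Apply each flat block once, then reuse the shared crawler blocks
    across several loops (weight tying). Effective depth is
    len(flat_blocks) + crawler_loops * len(crawler_blocks)."""
    for block in flat_blocks:            # 3 unique flat blocks
        x = block(x)
    for _ in range(crawler_loops):       # crawler blocks reused each loop
        for block in crawler_blocks:     # 2 shared crawler blocks
            x = block(x)
    return x
```

With 3 flat blocks and 2 crawler blocks looped twice, 7 block applications occur per forward pass while only 5 unique blocks hold parameters.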
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":16,"kv_heads":8}
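A sketch of the KV-head sharing, assuming the 16/8 split above (non-causal, single sequence, for brevity; the real attention would be masked):

```python
import numpy as np

def gqa(q, k, v, heads=16, kv_heads=8):
    """Grouped-query attention: each group of heads // kv_heads query
    heads shares one KV head. q: (T, heads*dh); k, v: (T, kv_heads*dh)."""
    T, dh = q.shape[0], q.shape[1] // heads
    group = heads // kv_heads
    q = q.reshape(T, heads, dh)
    k = k.reshape(T, kv_heads, dh).repeat(group, axis=1)  # share KV across groups
    v = v.reshape(T, kv_heads, dh).repeat(group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(dh)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)               # softmax over keys
    out = np.einsum('hts,shd->thd', att, v)
    return out.reshape(T, heads * dh)
```

Halving the KV heads halves the KV-cache size at a given head dimension, which matters at the 32768-token eval length.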
BigramHash
Hash-based bigram embedding for token pairs.
parameters: {"buckets":10240,"dimension":128}
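A sketch of a hashed bigram lookup, assuming the bucket/dimension sizes above; the mixing constant is illustrative, not the PR's choice:

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """Look up a hashed embedding for each (prev, cur) token pair;
    position 0 pairs with a sentinel token 0. table: (buckets, dim)."""
    buckets = table.shape[0]
    prev = np.concatenate(([0], tokens[:-1]))
    idx = (prev * 1000003 + tokens) % buckets   # illustrative pair hash
    return table[idx]
```

Hashing keeps the table at a fixed 10240 x 128 regardless of vocabulary size, at the cost of occasional bucket collisions between distinct bigrams.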
SmearGate
Learned gate blending current token information with previous state.
parameters: null
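One plausible parameterization of such a gate (the PR does not specify its exact form, so this is an assumption): a per-position scalar sigmoid blending each embedding with its predecessor.

```python
import numpy as np

def smear_gate(x, w, b):
    """Blend each position's embedding with the previous position's,
    weighted by a learned sigmoid gate. x: (T, D); w: (D,); b: scalar."""
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])  # shift right by one
    g = 1.0 / (1.0 + np.exp(-(x @ w + b)))            # gate in (0, 1), shape (T,)
    return g[:, None] * x + (1.0 - g[:, None]) * prev
```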
XSA
Cross-sequence attention applied in later loops.
parameters: {"last_n":4}
weight tying
Shared transformer blocks reused across loops.
parameters: null
U-Net skip connections
Encoder-decoder style skip connections across recursive loops.
parameters: null
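A sketch of skips across recursive loops, last-in first-out in the U-Net style; the PR's exact wiring may differ:

```python
def unet_loops(x, blocks, n_loops):
    """Activations saved in the first half of the loops are added back,
    in reverse order, in the second half."""
    stack = []
    half = n_loops // 2
    for i in range(n_loops):
        if i < half:
            stack.append(x)          # encoder side: remember activation
        elif stack:
            x = x + stack.pop()      # decoder side: add matching skip
        for block in blocks:         # shared blocks, reused every loop
            x = block(x)
    return x
```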
VE128
ValueEmbedding reinjecting token identity into the value projection.
parameters: {"dimension":128,"last_n":2}
Quantization
QAT
bits: 6
scope: large weight matrices
GPTQ
bits: 6
scope: all
GPTQ-lite
bits: 6
scope: all
int8
bits: 8
scope: embeddings
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
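A sketch of how stride-64 windows could be scheduled (the 2048 window size is an assumption matching the train length): after the first window, each window advances by the stride and scores only its new tokens.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Spans (start, end, n_new): score only the last n_new tokens of
    each window, so every token is predicted with near-full left
    context and no token is scored twice."""
    end = min(window, n_tokens)
    spans = [(0, end, end)]                    # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```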
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":1,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
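The "score-first" ordering can be sketched as follows (`score_fn` and `adapt_fn` are hypothetical stand-ins for the PR's eval and TTT update steps):

```python
def score_first_ttt(chunks, score_fn, adapt_fn):
    """Score each chunk with the current weights *before* adapting on
    it, so no chunk is ever evaluated by a model already trained on
    its own tokens."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score_fn(chunk)    # evaluate first
        total_loss += loss * n
        total_tokens += n
        adapt_fn(chunk)              # then take TTT gradient steps on it
    return total_loss / total_tokens
```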
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
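With the parameters above, a running checkpoint average could look like this sketch (incremental mean; `state` starts as `(None, 0)`):

```python
import numpy as np

def swa_step(step, total_steps, weights, state, start_frac=0.2, every=50):
    """Fold the current weights into a running average every `every`
    steps once `start_frac` of training has elapsed."""
    if step < start_frac * total_steps or step % every != 0:
        return state
    avg, n = state
    if avg is None:
        return weights.copy(), 1
    return avg + (weights - avg) / (n + 1), n + 1   # incremental mean
```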
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":100}
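A trapezoidal schedule matching these parameters, as a multiplier on the base LR (`total_steps` is an assumed input; the PR does not state it here):

```python
def lr_scale(step, total_steps, warmup_steps=100, warmdown_iters=3500):
    """Linear warmup, constant plateau, then linear warmdown to zero
    over the final `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps                 # linear warmup
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```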
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.01,"tied_embed_lr":0.02,"grad_clip_norm":0.3}
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • Crawler Transformer architecture with shared blocks and recursive looped depth
  • U-Net style skip connections across recursive loops
  • BigramHash, SmearGate, XSA, and ValueEmbedding combined in a compact transformer
  • QAT from step 0 for recursive models to reduce quantization compounding
  • Post-quantization test-time training on the deserialized GPTQ artifact
  • Score-first TTT with sliding-window evaluation