PR #1579

open

Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372

by Tonyy1977
val_bpb
1.1372
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.03 MB

Training Techniques

Architecture
depth recurrence
Crawler/recursive transformer with shared blocks applied in loops: 3 flat blocks plus 2 crawler blocks repeated twice, for an effective depth of 7.
parameters: {"blocks":4,"loops":7,"dimension":736}
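A minimal sketch of the looped schedule the description implies (hypothetical helper; block internals omitted): the flat blocks run once, the crawler blocks are reused each loop.

```python
def crawler_forward(x, flat_blocks, crawler_blocks, crawler_loops=2):
    """Apply each flat block once, then reuse the shared crawler blocks
    across several loops (weight tying). Effective depth is
    len(flat_blocks) + crawler_loops * len(crawler_blocks)."""
    for block in flat_blocks:            # 3 unique flat blocks
        x = block(x)
    for _ in range(crawler_loops):       # crawler blocks reused each loop
        for block in crawler_blocks:     # 2 shared crawler blocks
            x = block(x)
    return x
```

With 3 flat blocks and 2 crawler blocks looped twice, 7 block applications occur per forward pass while only 5 unique blocks hold parameters.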
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":16,"kv_heads":8}
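A sketch of the KV-head sharing, assuming the 16/8 split above (non-causal, single sequence, for brevity; the real attention would be masked):

```python
import numpy as np

def gqa(q, k, v, heads=16, kv_heads=8):
    """Grouped-query attention: each group of heads // kv_heads query
    heads shares one KV head. q: (T, heads*dh); k, v: (T, kv_heads*dh)."""
    T, dh = q.shape[0], q.shape[1] // heads
    group = heads // kv_heads
    q = q.reshape(T, heads, dh)
    k = k.reshape(T, kv_heads, dh).repeat(group, axis=1)  # share KV across groups
    v = v.reshape(T, kv_heads, dh).repeat(group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(dh)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)               # softmax over keys
    out = np.einsum('hts,shd->thd', att, v)
    return out.reshape(T, heads * dh)
```

Halving the KV heads halves the KV-cache size at a given head dimension, which matters at the 32768-token eval length.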
BigramHash
Hash-based bigram embedding for token pairs.
parameters: {"buckets":10240,"dimension":128}
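A sketch of a hashed bigram lookup, assuming the bucket/dimension sizes above; the mixing constant is illustrative, not the PR's choice:

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """Look up a hashed embedding for each (prev, cur) token pair;
    position 0 pairs with a sentinel token 0. table: (buckets, dim)."""
    buckets = table.shape[0]
    prev = np.concatenate(([0], tokens[:-1]))
    idx = (prev * 1000003 + tokens) % buckets   # illustrative pair hash
    return table[idx]
```

Hashing keeps the table at a fixed 10240 x 128 regardless of vocabulary size, at the cost of occasional bucket collisions between distinct bigrams.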
SmearGate
Learned gate blending current token information with previous state.
parameters: null
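One plausible parameterization of such a gate (the PR does not specify its exact form, so this is an assumption): a per-position scalar sigmoid blending each embedding with its predecessor.

```python
import numpy as np

def smear_gate(x, w, b):
    """Blend each position's embedding with the previous position's,
    weighted by a learned sigmoid gate. x: (T, D); w: (D,); b: scalar."""
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])  # shift right by one
    g = 1.0 / (1.0 + np.exp(-(x @ w + b)))            # gate in (0, 1), shape (T,)
    return g[:, None] * x + (1.0 - g[:, None]) * prev
```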
XSA
Cross-sequence attention applied in later loops.
parameters: {"last_n":4}
weight tying
Shared transformer blocks reused across loops.
parameters: null
U-Net skip connections
Encoder-decoder style skip connections across recursive loops.
parameters: null
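A sketch of skips across recursive loops, last-in first-out in the U-Net style; the PR's exact wiring may differ:

```python
def unet_loops(x, blocks, n_loops):
    """Activations saved in the first half of the loops are added back,
    in reverse order, in the second half."""
    stack = []
    half = n_loops // 2
    for i in range(n_loops):
        if i < half:
            stack.append(x)          # encoder side: remember activation
        elif stack:
            x = x + stack.pop()      # decoder side: add matching skip
        for block in blocks:         # shared blocks, reused every loop
            x = block(x)
    return x
```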
VE128
ValueEmbedding reinjecting token identity into the value projection.
parameters: {"dimension":128,"last_n":2}
Quantization
QAT
bits: 6
scope: large weight matrices
GPTQ
bits: 6
scope: all
GPTQ-lite
bits: 6
scope: all
int8
bits: 8
scope: embeddings
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
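A sketch of how stride-64 windows could be scheduled (the 2048 window size is an assumption matching the train length): after the first window, each window advances by the stride and scores only its new tokens.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Spans (start, end, n_new): score only the last n_new tokens of
    each window, so every token is predicted with near-full left
    context and no token is scored twice."""
    end = min(window, n_tokens)
    spans = [(0, end, end)]                    # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```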
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":1,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
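The "score-first" ordering can be sketched as follows (`score_fn` and `adapt_fn` are hypothetical stand-ins for the PR's eval and TTT update steps):

```python
def score_first_ttt(chunks, score_fn, adapt_fn):
    """Score each chunk with the current weights *before* adapting on
    it, so no chunk is ever evaluated by a model already trained on
    its own tokens."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score_fn(chunk)    # evaluate first
        total_loss += loss * n
        total_tokens += n
        adapt_fn(chunk)              # then take TTT gradient steps on it
    return total_loss / total_tokens
```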
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
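With the parameters above, a running checkpoint average could look like this sketch (incremental mean; `state` starts as `(None, 0)`):

```python
import numpy as np

def swa_step(step, total_steps, weights, state, start_frac=0.2, every=50):
    """Fold the current weights into a running average every `every`
    steps once `start_frac` of training has elapsed."""
    if step < start_frac * total_steps or step % every != 0:
        return state
    avg, n = state
    if avg is None:
        return weights.copy(), 1
    return avg + (weights - avg) / (n + 1), n + 1   # incremental mean
```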
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":100}
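A trapezoidal schedule matching these parameters, as a multiplier on the base LR (`total_steps` is an assumed input; the PR does not state it here):

```python
def lr_scale(step, total_steps, warmup_steps=100, warmdown_iters=3500):
    """Linear warmup, constant plateau, then linear warmdown to zero
    over the final `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps                 # linear warmup
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```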
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.01,"tied_embed_lr":0.02,"grad_clip_norm":0.3}
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • Crawler Transformer architecture with shared blocks and recursive looped depth
  • U-Net style skip connections across recursive loops
  • BigramHash, SmearGate, XSA, and ValueEmbedding combined in a compact transformer
  • QAT from step 0 for recursive models to reduce quantization compounding
  • Post-quantization test-time training on the deserialized GPTQ artifact
  • Score-first TTT with sliding-window evaluation