PR #1817 (open)
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster)
by Tonyy1977
val_bpb
1.0903
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.96 MB
Training Techniques
Weight Averaging
Stochastic weight averaging (SWA).
parameters: null
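
The card does not spell out how the weight averaging is wired; below is a minimal sketch of a standard SWA setup in PyTorch, where `model`, `train_loader`, `optimizer`, and the loss call are placeholders rather than the PR's actual objects.

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

def train_with_swa(model, train_loader, optimizer, swa_start_step, total_steps):
    swa_model = AveragedModel(model)            # running average of the weights
    step = 0
    for batch in train_loader:
        loss = model(batch).mean()              # placeholder loss computation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= swa_start_step:
            swa_model.update_parameters(model)  # fold current weights into the average
        step += 1
        if step >= total_steps:
            break
    update_bn(train_loader, swa_model)          # refresh BN stats for averaged weights
    return swa_model
```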
Architecture
BigramHash
Uses bigram hash embeddings in the crawler transformer.
parameters: null
SmearGate
Uses SmearGate in the architecture.
parameters: null
ValueEmbedding
Uses value embeddings in the last 2 layers.
parameters: {"layers":2}
XSA
Applies XSA across all layers.
parameters: {"layers":7}
GQA
Uses grouped query attention.
parameters: {"heads":16,"kv_heads":8}
depth recurrence
Crawler blocks are shared and looped through the network for effective depth recurrence.
parameters: {"effective_depth":7,"loops":2}
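
A minimal sketch of the crawler-style depth recurrence listed above: 3 dedicated flat blocks followed by 2 shared crawler blocks looped twice, giving 3 + 2*2 = 7 effective layers from 5 blocks' worth of parameters. Block internals (GQA with 16 query / 8 KV heads, SmearGate, XSA, value embeddings) are abstracted behind a generic `block_factory`; the class and argument names are illustrative, not the PR's.

```python
import torch.nn as nn

class CrawlerTransformer(nn.Module):
    """Illustrative skeleton: flat blocks run once, crawler blocks are reused per loop."""
    def __init__(self, block_factory, n_flat=3, n_crawler=2, loops=2):
        super().__init__()
        self.flat_blocks = nn.ModuleList([block_factory() for _ in range(n_flat)])
        # Crawler blocks are instantiated once and shared across loops, so the
        # parameter count stays at 5 blocks while effective depth is 7.
        self.crawler_blocks = nn.ModuleList([block_factory() for _ in range(n_crawler)])
        self.loops = loops

    def forward(self, x):
        for block in self.flat_blocks:
            x = block(x)
        for _ in range(self.loops):              # weight sharing across iterations
            for block in self.crawler_blocks:
                x = block(x)
        return x
```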
Quantization
mixed int5/int6 GPTQ
bits: null
scope: flat-block attention int5, rest int6, embeddings int8
QAT
bits: 6
scope: training
GPTQ
bits: null
scope: post-training quantization of the final artifact
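
The mixed-bit scope above boils down to a per-tensor bit-width assignment handed to the quantizer. A hedged sketch of that routing follows, assuming module names like `flat_blocks.*.attn` and using plain symmetric round-to-nearest as a stand-in for the actual GPTQ pass (GPTQ instead quantizes weights column by column while compensating the error).

```python
import torch

def bits_for(name: str) -> int:
    # Assumed naming convention, for illustration only.
    if "embed" in name:
        return 8
    if name.startswith("flat_blocks") and ".attn" in name:
        return 5
    return 6

def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor round-to-nearest; a stand-in, not GPTQ itself.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale

def quantize_model(model: torch.nn.Module) -> None:
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.ndim >= 2:                  # skip biases and scalar gains
                param.copy_(fake_quantize(param, bits_for(name)))
```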
Compression
Brotli
level: 11
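
Brotli at quality 11 is the maximum-compression setting; a minimal sketch of packing the quantized weights, with the file names purely hypothetical:

```python
import brotli  # pip install brotli

with open("artifact_mixed_int.bin", "rb") as f:   # hypothetical artifact path
    raw = f.read()

packed = brotli.compress(raw, quality=11)
with open("artifact_mixed_int.bin.br", "wb") as f:
    f.write(packed)

print(f"{len(raw) / 2**20:.2f} MB -> {len(packed) / 2**20:.2f} MB")
```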
Test-Time Training
full TTT
parameters: {"freeze":1,"stride":64,"chunk_tokens":32768,"epochs_per_chunk":3,"learning_rate":0.002,"momentum":0.9}
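
A hedged sketch of the full-TTT loop implied by those parameters: the eval stream is split into 32768-token chunks, each chunk is scored with the current weights and then trained on for 3 epochs with SGD (lr 0.002, momentum 0.9), and the first block stays frozen (freeze=1). The stride parameter and the model/loss interfaces are not specified on the card, so the names below are placeholders and the stride is omitted.

```python
import torch

def test_time_train(model, eval_tokens, chunk_tokens=32768, epochs_per_chunk=3,
                    lr=2e-3, momentum=0.9, freeze=1):
    """Illustrative TTT loop; `model.blocks` and `model.loss` are assumed interfaces."""
    # Freeze the first `freeze` blocks, adapt the rest at test time.
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad_(i >= freeze)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)

    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(eval_tokens), chunk_tokens):
        chunk = eval_tokens[start:start + chunk_tokens]
        # Score the chunk before adapting on it, so the metric never sees
        # weights that were trained on the tokens being scored.
        with torch.no_grad():
            total_loss += model.loss(chunk).item() * len(chunk)
            total_tokens += len(chunk)
        for _ in range(epochs_per_chunk):
            opt.zero_grad()
            model.loss(chunk).backward()
            opt.step()
    return total_loss / total_tokens
```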
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"fraction":0.6,"type":"linear"}
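
Read here as a constant LR followed by a linear decay to zero over the final 60% of steps (that interpretation of fraction=0.6 is an assumption); a small sketch of the multiplier, usable with torch.optim.lr_scheduler.LambdaLR:

```python
def warmdown_factor(step: int, total_steps: int, fraction: float = 0.6) -> float:
    """Constant LR, then linear warmdown to 0 over the last `fraction` of training."""
    decay_start = int(total_steps * (1.0 - fraction))
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - decay_start))

# e.g. scheduler = torch.optim.lr_scheduler.LambdaLR(
#          optimizer, lambda s: warmdown_factor(s, total_steps))
```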
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"adam_for_scalars":true}
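
adam_for_scalars follows the usual Muon pattern of routing 2-D weight matrices to Muon and everything else (scalars, vectors, embeddings) to Adam. A sketch of that parameter split follows, assuming the Muon class comes from the PR's codebase and accepts the keyword names shown; the base LR and the Adam/Muon LR split are not given on the card.

```python
import torch

def build_optimizers(model, muon_cls, lr, weight_decay=0.085, momentum=0.99):
    """Matrices -> Muon, scalars/vectors/embeddings -> AdamW (adam_for_scalars)."""
    matrix_params, scalar_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            matrix_params.append(p)
        else:
            scalar_params.append(p)
    # Keyword names for `muon_cls` are assumed; check the actual implementation.
    muon = muon_cls(matrix_params, lr=lr, momentum=momentum, weight_decay=weight_decay)
    adam = torch.optim.AdamW(scalar_params, lr=lr, weight_decay=weight_decay)
    return muon, adam
```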
Regularization
logit softcap
parameters: {"value":30}
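
Logit softcapping at 30 squashes the output logits through a scaled tanh so none exceeds ±30 in magnitude; a one-function sketch:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap).
    return cap * torch.tanh(logits / cap)
```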
Novel Contributions
- Crawler Transformer architecture with 3 flat blocks and 2 crawler blocks looped twice for an effective depth of 7
- Mixed-int quantization scheme using int5 for flat-block attention and int6 for the rest, avoiding the need for pruning
- Post-quantization TTT recovery on the GPTQ artifact
- Demonstration that more pre-quant training compute substantially improves final BPB
- Zero-pruning artifact that fits under the 16 MB budget