PR #1817 (open)
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster)
by Tonyy1977
val_bpb
1.0903
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.96 MB
Training Techniques
Weight Averaging
Stochastic weight averaging (SWA).
parameters: null
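
The card does not spell out how the weight averaging is wired; below is a minimal sketch of a standard SWA setup in PyTorch, where `model`, `train_loader`, `optimizer`, and the loss call are placeholders rather than the PR's actual objects.

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

def train_with_swa(model, train_loader, optimizer, swa_start_step, total_steps):
    swa_model = AveragedModel(model)            # running average of the weights
    step = 0
    for batch in train_loader:
        loss = model(batch).mean()              # placeholder loss computation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= swa_start_step:
            swa_model.update_parameters(model)  # fold current weights into the average
        step += 1
        if step >= total_steps:
            break
    update_bn(train_loader, swa_model)          # refresh BN stats for averaged weights
    return swa_model
```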
Architecture
BigramHash
Uses bigram hash embeddings in the crawler transformer.
parameters: null
SmearGate
Uses SmearGate in the architecture.
parameters: null
ValueEmbedding
Uses value embeddings in the last 2 layers.
parameters: {"layers":2}
XSA
Applies XSA across all layers.
parameters: {"layers":7}
GQA
Uses grouped query attention.
parameters: {"heads":16,"kv_heads":8}
depth recurrence
Crawler blocks are shared and looped through the network for effective depth recurrence.
parameters: {"effective_depth":7,"loops":2}
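
A minimal sketch of the crawler-style depth recurrence listed above: 3 dedicated flat blocks followed by 2 shared crawler blocks looped twice, giving 3 + 2*2 = 7 effective layers from 5 blocks' worth of parameters. Block internals (GQA with 16 query / 8 KV heads, SmearGate, XSA, value embeddings) are abstracted behind a generic `block_factory`; the class and argument names are illustrative, not the PR's.

```python
import torch.nn as nn

class CrawlerTransformer(nn.Module):
    """Illustrative skeleton: flat blocks run once, crawler blocks are reused per loop."""
    def __init__(self, block_factory, n_flat=3, n_crawler=2, loops=2):
        super().__init__()
        self.flat_blocks = nn.ModuleList([block_factory() for _ in range(n_flat)])
        # Crawler blocks are instantiated once and shared across loops, so the
        # parameter count stays at 5 blocks while effective depth is 7.
        self.crawler_blocks = nn.ModuleList([block_factory() for _ in range(n_crawler)])
        self.loops = loops

    def forward(self, x):
        for block in self.flat_blocks:
            x = block(x)
        for _ in range(self.loops):              # weight sharing across iterations
            for block in self.crawler_blocks:
                x = block(x)
        return x
```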
Quantization
mixed int5/int6 GPTQ
bits: null
scope: flat-block attention int5, rest int6, embeddings int8
QAT
bits: 6
scope: training
GPTQ
bits: null
scope: post-training quantization of the final artifact
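
The mixed-bit scope above boils down to a per-tensor bit-width assignment handed to the quantizer. A hedged sketch of that routing follows, assuming module names like `flat_blocks.*.attn` and using plain symmetric round-to-nearest as a stand-in for the actual GPTQ pass (GPTQ instead quantizes weights column by column while compensating the error).

```python
import torch

def bits_for(name: str) -> int:
    # Assumed naming convention, for illustration only.
    if "embed" in name:
        return 8
    if name.startswith("flat_blocks") and ".attn" in name:
        return 5
    return 6

def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor round-to-nearest; a stand-in, not GPTQ itself.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax - 1, qmax) * scale

def quantize_model(model: torch.nn.Module) -> None:
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.ndim >= 2:                  # skip biases and scalar gains
                param.copy_(fake_quantize(param, bits_for(name)))
```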
Compression
Brotli
level: 11
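
Brotli at quality 11 is the maximum-compression setting; a minimal sketch of packing the quantized weights, with the file names purely hypothetical:

```python
import brotli  # pip install brotli

with open("artifact_mixed_int.bin", "rb") as f:   # hypothetical artifact path
    raw = f.read()

packed = brotli.compress(raw, quality=11)
with open("artifact_mixed_int.bin.br", "wb") as f:
    f.write(packed)

print(f"{len(raw) / 2**20:.2f} MB -> {len(packed) / 2**20:.2f} MB")
```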
Test-Time Training
full TTT
parameters: {"freeze":1,"stride":64,"chunk_tokens":32768,"epochs_per_chunk":3,"learning_rate":0.002,"momentum":0.9}
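
A hedged sketch of the full-TTT loop implied by those parameters: the eval stream is split into 32768-token chunks, each chunk is scored with the current weights and then trained on for 3 epochs with SGD (lr 0.002, momentum 0.9), and the first block stays frozen (freeze=1). The stride parameter and the model/loss interfaces are not specified on the card, so the names below are placeholders and the stride is omitted.

```python
import torch

def test_time_train(model, eval_tokens, chunk_tokens=32768, epochs_per_chunk=3,
                    lr=2e-3, momentum=0.9, freeze=1):
    """Illustrative TTT loop; `model.blocks` and `model.loss` are assumed interfaces."""
    # Freeze the first `freeze` blocks, adapt the rest at test time.
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad_(i >= freeze)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr, momentum=momentum)

    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(eval_tokens), chunk_tokens):
        chunk = eval_tokens[start:start + chunk_tokens]
        # Score the chunk before adapting on it, so the metric never sees
        # weights that were trained on the tokens being scored.
        with torch.no_grad():
            total_loss += model.loss(chunk).item() * len(chunk)
            total_tokens += len(chunk)
        for _ in range(epochs_per_chunk):
            opt.zero_grad()
            model.loss(chunk).backward()
            opt.step()
    return total_loss / total_tokens
```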
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"fraction":0.6,"type":"linear"}
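
Read here as a constant LR followed by a linear decay to zero over the final 60% of steps (that interpretation of fraction=0.6 is an assumption); a small sketch of the multiplier, usable with torch.optim.lr_scheduler.LambdaLR:

```python
def warmdown_factor(step: int, total_steps: int, fraction: float = 0.6) -> float:
    """Constant LR, then linear warmdown to 0 over the last `fraction` of training."""
    decay_start = int(total_steps * (1.0 - fraction))
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - decay_start))

# e.g. scheduler = torch.optim.lr_scheduler.LambdaLR(
#          optimizer, lambda s: warmdown_factor(s, total_steps))
```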
Optimizer
Muon
weight_decay: 0.085
momentum: 0.99
other_params: {"adam_for_scalars":true}
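
adam_for_scalars follows the usual Muon pattern of routing 2-D weight matrices to Muon and everything else (scalars, vectors, embeddings) to Adam. A sketch of that parameter split follows, assuming the Muon class comes from the PR's codebase and accepts the keyword names shown; the base LR and the Adam/Muon LR split are not given on the card.

```python
import torch

def build_optimizers(model, muon_cls, lr, weight_decay=0.085, momentum=0.99):
    """Matrices -> Muon, scalars/vectors/embeddings -> AdamW (adam_for_scalars)."""
    matrix_params, scalar_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            matrix_params.append(p)
        else:
            scalar_params.append(p)
    # Keyword names for `muon_cls` are assumed; check the actual implementation.
    muon = muon_cls(matrix_params, lr=lr, momentum=momentum, weight_decay=weight_decay)
    adam = torch.optim.AdamW(scalar_params, lr=lr, weight_decay=weight_decay)
    return muon, adam
```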
Regularization
logit softcap
parameters: {"value":30}
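
Logit softcapping at 30 squashes the output logits through a scaled tanh so none exceeds ±30 in magnitude; a one-function sketch:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap).
    return cap * torch.tanh(logits / cap)
```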
Novel Contributions
- Crawler Transformer architecture with 3 flat blocks and 2 crawler blocks looped twice for an effective depth of 7
- Mixed-int quantization scheme using int5 for flat-block attention and int6 for the rest, avoiding the need for pruning
- Post-quantization TTT recovery on the GPTQ artifact
- Demonstration that more pre-quant training compute substantially improves final BPB
- Zero-pruning artifact that fits under the 16 MB budget